
Conversation

Mantisus
Collaborator

@Mantisus Mantisus commented Aug 1, 2025

Description

  • Add SQLStorageClient, which can accept a database connection string or a pre-configured AsyncEngine, or create a default crawlee.db database in Configuration.storage_dir (see the sketch below).
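
For illustration, a rough sketch of the three configuration modes described above. The connection_string parameter appears later in this thread; the engine parameter name for the pre-configured AsyncEngine case is an assumption.

from sqlalchemy.ext.asyncio import create_async_engine

from crawlee.storage_clients import SQLStorageClient

# 1. Default: a crawlee.db SQLite database is created in Configuration.storage_dir.
client = SQLStorageClient()

# 2. From a database connection string.
client = SQLStorageClient(
    connection_string='postgresql+asyncpg://user:password@localhost:5432/crawlee',
)

# 3. From a pre-configured AsyncEngine (parameter name assumed here for illustration).
engine = create_async_engine('sqlite+aiosqlite:///crawlee.db')
client = SQLStorageClient(engine=engine)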

Issues

@Mantisus Mantisus self-assigned this Aug 1, 2025
@Mantisus Mantisus added this to the 1.0 milestone Aug 1, 2025
@Mantisus Mantisus requested a review from Copilot August 1, 2025 21:23
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements a new SQL-based storage client (SQLStorageClient) that provides persistent data storage using SQLAlchemy v2+ for datasets, key-value stores, and request queues.

Key changes:

  • Adds SQLStorageClient with support for connection strings, pre-configured engines, or default SQLite database
  • Implements SQL-based clients for all three storage types with database schema management and transaction handling
  • Updates storage model configurations to support SQLAlchemy ORM mapping with from_attributes=True

Reviewed Changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/crawlee/storage_clients/_sql/: New SQL storage implementation with database models, clients, and schema management
  • tests/unit/storage_clients/_sql/: Comprehensive test suite for SQL storage functionality
  • tests/unit/storages/: Updates to test fixtures to include SQL storage client testing
  • src/crawlee/storage_clients/models.py: Adds from_attributes=True to model configs for SQLAlchemy ORM compatibility
  • pyproject.toml: Adds new sql optional dependency group
  • src/crawlee/storage_clients/__init__.py: Adds conditional import for SQLStorageClient
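
For context, from_attributes=True is what allows the existing Pydantic models to be populated directly from SQLAlchemy ORM row objects. A minimal illustration follows; the fields shown are illustrative rather than the model's actual definition.

from typing import Optional

from pydantic import BaseModel, ConfigDict


class DatasetMetadata(BaseModel):
    # from_attributes=True lets Pydantic validate from any object exposing matching
    # attributes (such as a SQLAlchemy ORM instance), not only from dicts.
    model_config = ConfigDict(from_attributes=True)

    id: str
    name: Optional[str] = None


# Given an ORM row `orm_row` carrying `id` and `name` attributes:
# metadata = DatasetMetadata.model_validate(orm_row)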
Comments suppressed due to low confidence (1)

tests/unit/storages/test_request_queue.py:23

  • The test fixture only tests the 'sql' storage client, but the removed 'memory' and 'file_system' parameters suggest this may have unintentionally reduced test coverage. Consider including all storage client types to ensure comprehensive testing.
@pytest.fixture(params=['sql'])

Mantisus and others added 2 commits August 2, 2025 00:25
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@Mantisus
Collaborator Author

Mantisus commented Aug 1, 2025

When implementing, I opted out of SQLModel for several reasons:

  • Poor library support. As of today, SQLModel has a huge backlog of open PRs and feature requests, some of which are several years old, and the latest releases have been mostly cosmetic (dependency updates, documentation, builds, checks, etc.).
  • Model hierarchy issue: using SQLModel would mean deriving the existing Pydantic models from it, which greatly increases the base dependencies (SQLModel, SQLAlchemy, aiosqlite). I don't think we should do this (see the last point).
  • It doesn't support optimization constraints for database tables, such as string length limits.
  • Poor typing when using anything other than select (see fastapi/sqlmodel#909, "Add an overload to the exec method with _Executable statement for update and delete statements").
  • Overall, we can achieve the same behavior using only SQLAlchemy v2+ (https://docs.sqlalchemy.org/en/20/orm/dataclasses.html#integrating-with-alternate-dataclass-providers-such-as-pydantic), but this retains the inheritance hierarchy and dependency issues.
  • I think the data models for SQL can be simpler and better adapted to SQL than the models used in the framework, so we can optimize each data model for its task (a sketch of this approach follows below).
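
As an illustration of that last point, a minimal sketch of a standalone SQLAlchemy v2+ declarative model kept separate from the framework's Pydantic models; the class, table, and column definitions here are hypothetical.

from datetime import datetime
from typing import Optional

from sqlalchemy import DateTime, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class DatasetMetadataDb(Base):
    """Hypothetical metadata table, independent of the framework's Pydantic models."""

    __tablename__ = 'dataset_metadata'

    # A short string primary key with an explicit length limit.
    id: Mapped[str] = mapped_column(String(20), primary_key=True)
    # String length constraints are expressible directly, which SQLModel does not support.
    name: Mapped[Optional[str]] = mapped_column(String(100), unique=True)
    accessed_at: Mapped[datetime] = mapped_column(DateTime(timezone=True))

Keeping such models behind the optional sql dependency group also avoids adding SQLAlchemy and aiosqlite to the base dependencies.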

@Mantisus
Collaborator Author

Mantisus commented Aug 1, 2025

The storage client has been repeatedly tested with SQLite and a local PostgreSQL (a simple container installation without fine-tuning).
Code for testing:

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import SQLStorageClient
from crawlee.storages import RequestQueue, KeyValueStore
from crawlee import service_locator
from crawlee import ConcurrencySettings


LOCAL_POSTGRE = None  # 'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres'
USE_STATE = True
KVS = True
DATASET = True
CRAWLERS = 1
REQUESTS = 10000
DROP_STORAGES = True


async def main() -> None:
    service_locator.set_storage_client(
        SQLStorageClient(
            connection_string=LOCAL_POSTGRE if LOCAL_POSTGRE else None,
        )
    )

    kvs = await KeyValueStore.open()
    queue_1 = await RequestQueue.open(name='test_queue_1')
    queue_2 = await RequestQueue.open(name='test_queue_2')
    queue_3 = await RequestQueue.open(name='test_queue_3')

    urls = [f'https://crawlee.dev/page/{i}' for i in range(REQUESTS)]

    await queue_1.add_requests(urls)
    await queue_2.add_requests(urls)
    await queue_3.add_requests(urls)

    crawler_1 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_1)
    crawler_2 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_2)
    crawler_3 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_3)

    # Define the default request handler
    @crawler_1.router.default_handler
    @crawler_2.router.default_handler
    @crawler_3.router.default_handler
    async def request_handler(context: BasicCrawlingContext) -> None:
        if USE_STATE:
            # Use state to store data
            state_data = await context.use_state()
            state_data['a'] = context.request.url

        if KVS:
            # Use KeyValueStore to store data
            await kvs.set_value(context.request.url, {'url': context.request.url, 'title': 'Example Title'})
        if DATASET:
            await context.push_data({'url': context.request.url, 'title': 'Example Title'})

    crawlers = [crawler_1]
    if CRAWLERS > 1:
        crawlers.append(crawler_2)
    if CRAWLERS > 2:
        crawlers.append(crawler_3)

    # Run the crawler
    data = await asyncio.gather(*[crawler.run() for crawler in crawlers])

    print(data)

    if DROP_STORAGES:
        # Drop all storages
        await queue_1.drop()
        await queue_2.drop()
        await queue_3.drop()
        await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())

This allows you to put load on the storage without making real requests.

@Mantisus
Collaborator Author

Mantisus commented Aug 1, 2025

The use of accessed_modified_update_interval is an optimization: updating metadata on every operation just to bump the access time can overload the database.
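
As a rough sketch of the throttling idea (the helper below is hypothetical, not the actual implementation):

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class MetadataUpdateThrottle:
    """Hypothetical helper: skip metadata UPDATEs that only bump the access time."""

    interval: timedelta = timedelta(seconds=1)
    _last_write: datetime = field(default_factory=lambda: datetime.min.replace(tzinfo=timezone.utc))

    def should_write(self, now: datetime) -> bool:
        # Persist accessed_at/modified_at only if the interval has elapsed since the
        # last write; `now` is expected to be a timezone-aware datetime.
        if now - self._last_write < self.interval:
            return False
        self._last_write = now
        return True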

@Mantisus Mantisus removed this from the 1.0 milestone Aug 4, 2025
Collaborator

@Pijukatel Pijukatel left a comment

First part of the review. I will do the RQ and tests in a second part.

I have only minor comments. My main suggestion is to extract more of the code that is shared by all three clients. The clients are easier to understand once the reader can see which parts of the code are exactly the same in every client and which parts are unique to a specific client. It also makes the code easier to maintain.

The drawback would be that understanding just one class in isolation would be a little harder. But who wants to understand just one client?

@Pijukatel
Collaborator

It would also be good to mention it in the docs and maybe show an example of its use.

Collaborator

@Pijukatel Pijukatel left a comment

I will continue with the review later. There are many ways to approach the RQ client implementation, and I have somewhat different expectations in mind (I am not saying they are correct :D). Maybe we should define the expectations first, so that I can base the review on them.

My initial expectations for the RQ client:

  • Can be used on the Apify platform and outside it as well
  • Supports any persistence
  • Supports parallel consumers/producers (the use case being speeding up crawlers on the Apify platform with multiprocessing to fully utilize the available resources; for example, a Parsel-based Actor could run multiple ParselCrawlers under the hood, all working on the same RQ, while reducing costs by avoiding ApifyRQClient)

Most typical use case:

  • Crawlee outside of the Apify platform
  • Crawlee on the Apify platform, but avoiding the expensive ApifyRQClient

@Mantisus Mantisus requested a review from vdusek August 30, 2025 12:35
@Mantisus Mantisus requested a review from vdusek September 1, 2025 12:08
Collaborator

@vdusek vdusek left a comment

Have you done any performance comparisons with the memory and file-system storage clients? If not, could you please run some? For example, you could run the Parsel crawler on crawlee.dev, enqueue all links, and store the URL + title to the dataset.

@Mantisus
Collaborator Author

Mantisus commented Sep 1, 2025

Have you done any performance comparisons with the memory and file-system storage clients? If not, could you please run some? For example, you could run the Parsel crawler on crawlee.dev, enqueue all links, and store the URL + title to the dataset.

Great idea. I'll try to do that once we resolve these two issues.

#1382
#1383

@Mantisus
Collaborator Author

Mantisus commented Sep 3, 2025

Have you done any performance comparisons with the memory and file-system storage clients? If not, could you please run some? For example, you could run the Parsel crawler on crawlee.dev, enqueue all links, and store the URL + title to the dataset.

MemoryStorageClient:

┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 224.7ms    │
│ requests_finished_per_minute  │ 4484       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 8min 51.0s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 31.62s     │
└───────────────────────────────┴────────────┘

FileSystemStorageClient:

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 2363        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [2363]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 379.8ms     │
│ requests_finished_per_minute  │ 2489        │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 14min 57.4s │
│ requests_total                │ 2363        │
│ crawler_runtime               │ 56.96s      │
└───────────────────────────────┴─────────────┘

SqlStorageClient:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import SqlStorageClient

CONNECTION = None


async def main() -> None:
    async with SqlStorageClient(connection_string=CONNECTION) as storage_client:
        http_client = HttpxHttpClient()

        crawler = ParselCrawler(
            storage_client=storage_client,
            http_client=http_client,
            concurrency_settings=ConcurrencySettings(desired_concurrency=20),
        )

        @crawler.router.default_handler
        async def request_handler(context: ParselCrawlingContext) -> None:
            context.log.info(f'Processing URL: {context.request.url}...')
            data = {
                'url': context.request.url,
                'title': context.selector.css('title::text').get(),
            }
            await context.push_data(data)
            await context.enqueue_links()

        await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

SQLite

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 2363        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [2363]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 727.0ms     │
│ requests_finished_per_minute  │ 1460        │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 28min 37.9s │
│ requests_total                │ 2363        │
│ crawler_runtime               │ 1min 37.1s  │
└───────────────────────────────┴─────────────┘

PostgreSQL (standard installation in Docker, without database settings optimization)

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 2363        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [2363]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 503.8ms     │
│ requests_finished_per_minute  │ 2144        │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 19min 50.5s │
│ requests_total                │ 2363        │
│ crawler_runtime               │ 1min 6.1s   │
└───────────────────────────────┴─────────────┘

SqlStorageClient (3 processes)

import asyncio
from concurrent.futures import ProcessPoolExecutor

from crawlee import ConcurrencySettings, service_locator
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import SqlStorageClient
from crawlee.storages import RequestQueue

CONNECTION = None

async def run(queue_name: str) -> None:
    async with SqlStorageClient(connection_string=CONNECTION) as storage_client:
        service_locator.set_storage_client(storage_client)
        queue = await RequestQueue.open(name=queue_name)

        http_client = HttpxHttpClient()

        crawler = ParselCrawler(
            http_client=http_client,
            request_manager=queue,
            concurrency_settings=ConcurrencySettings(desired_concurrency=20),
        )

        @crawler.router.default_handler
        async def request_handler(context: ParselCrawlingContext) -> None:
            context.log.info(f'Processing URL: {context.request.url}...')
            data = {
                'url': context.request.url,
                'title': context.selector.css('title::text').get(),
            }
            await context.push_data(data)
            await context.enqueue_links()

        await crawler.run(['https://crawlee.dev'])

def process_run(queue_name: str) -> None:
    asyncio.run(run(queue_name))


def multi_run(queue_name: str = 'multi') -> None:
    workers = 3
    with ProcessPoolExecutor(max_workers=workers) as executor:
        executor.map(process_run, [queue_name for _ in range(workers)])


if __name__ == '__main__':
    multi_run()

SQLite

[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 811        │
│ requests_failed               │ 0          │
│ retry_histogram               │ [811]      │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 964.6ms    │
│ requests_finished_per_minute  │ 669        │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 13min 2.2s │
│ requests_total                │ 811        │
│ crawler_runtime               │ 1min 12.8s │
└───────────────────────────────┴────────────┘
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 735         │
│ requests_failed               │ 0           │
│ retry_histogram               │ [735]       │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 930.9ms     │
│ requests_finished_per_minute  │ 606         │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 11min 24.2s │
│ requests_total                │ 735         │
│ crawler_runtime               │ 1min 12.8s  │
└───────────────────────────────┴─────────────┘
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 817         │
│ requests_failed               │ 0           │
│ retry_histogram               │ [817]       │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 992.7ms     │
│ requests_finished_per_minute  │ 669         │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 13min 31.0s │
│ requests_total                │ 817         │
│ crawler_runtime               │ 1min 13.3s  │
└───────────────────────────────┴─────────────┘

PostgreSQL (standard installation in Docker, without database settings optimization)

[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 787        │
│ requests_failed               │ 0          │
│ retry_histogram               │ [787]      │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 609.6ms    │
│ requests_finished_per_minute  │ 1527       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 7min 59.7s │
│ requests_total                │ 787        │
│ crawler_runtime               │ 30.92s     │
└───────────────────────────────┴────────────┘
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 783       │
│ requests_failed               │ 0         │
│ retry_histogram               │ [783]     │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 625.0ms   │
│ requests_finished_per_minute  │ 1494      │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 8min 9.4s │
│ requests_total                │ 783       │
│ crawler_runtime               │ 31.45s    │
└───────────────────────────────┴───────────┘
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 793        │
│ requests_failed               │ 0          │
│ retry_histogram               │ [793]      │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 604.0ms    │
│ requests_finished_per_minute  │ 1512       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 7min 58.9s │
│ requests_total                │ 793        │
│ crawler_runtime               │ 31.47s     │
└───────────────────────────────┴────────────┘

@Mantisus Mantisus requested a review from vdusek September 3, 2025 23:17
@vdusek vdusek mentioned this pull request Sep 9, 2025
Collaborator

@vdusek vdusek left a comment

Nice, LGTM! Keep in mind that this will need to be slightly adjusted once #1175 is merged.

@janbuchar
Collaborator

@Mantisus I still see a bunch of unresolved comments, mainly from @vdusek - can you take care of those please?

@Mantisus
Collaborator Author

Mantisus commented Sep 11, 2025

@Mantisus I still see a bunch of unresolved comments, mainly from @vdusek - can you take care of those please?

I apologize, I missed that.

I marked them as resolved, since everything has already been implemented.

        return values_to_set

    @staticmethod
    def _get_int_id_from_unique_key(unique_key: str) -> int:
Collaborator

Can't we just use the lru_cache decorator from functools instead of managing the cache manually? Also, do we really need to manage integer IDs? Maybe we could just make the unique_key a primary key. I can imagine that some DBMS could have a problem with that though...

Collaborator Author

Maybe we could just make the unique_key a primary key. I can imagine that some DBMS could have a problem with that though...

For POST requests, it can be really large. I would prefer to avoid that. 🙂

Great idea with the lru_cache decorator, thanks for reminding me about it.
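
For reference, a minimal sketch of the lru_cache variant being discussed; the hashing scheme below is illustrative and not necessarily what the PR implements.

from functools import lru_cache
from hashlib import sha256


@lru_cache(maxsize=10_000)
def get_int_id_from_unique_key(unique_key: str) -> int:
    # Map an arbitrarily long unique key (for example one derived from a POST payload)
    # to a stable 63-bit integer that fits a signed BIGINT column; lru_cache memoizes
    # the result instead of managing a cache manually.
    digest = sha256(unique_key.encode()).digest()
    return int.from_bytes(digest[:8], byteorder='big') & 0x7FFF_FFFF_FFFF_FFFF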

for request_id, request in sorted(unique_requests.items()):
    existing_req_db = existing_requests.get(request_id)
    # New Request, add it
    if existing_req_db is None:
Collaborator

This is not a 100% guarantee though - the request could appear in the DB between the two SQL queries. Or an existing request might get handled.

Maybe we can deal with that later, but it should be clearly stated in the PR description, preferably with follow-up issues.

Collaborator Author

@Mantisus Mantisus Sep 11, 2025

Yes, that's why we use _build_upsert_stmt for forefront requests and _build_insert_stmt_with_ignore for regular ones.

For a regular request, a conflict means the duplicate is simply discarded.

For a forefront request, we only update the sequence_number field, shifting it to the left.
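
For illustration, a sketch of what the two statement builders might look like using SQLAlchemy's SQLite dialect; the table and column names are assumptions, not the PR's actual code.

from sqlalchemy import Table
from sqlalchemy.dialects.sqlite import Insert, insert


def build_insert_stmt_with_ignore(requests_table: Table, values: list[dict]) -> Insert:
    # Regular requests: if the request already exists, the duplicate is silently discarded.
    return insert(requests_table).values(values).on_conflict_do_nothing(index_elements=['request_id'])


def build_upsert_stmt(requests_table: Table, values: list[dict]) -> Insert:
    # Forefront requests: on conflict, only sequence_number is updated (shifted left),
    # moving the existing request towards the front of the queue.
    stmt = insert(requests_table).values(values)
    return stmt.on_conflict_do_update(
        index_elements=['request_id'],
        set_={'sequence_number': stmt.excluded.sequence_number},
    )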

@Mantisus Mantisus requested a review from janbuchar September 11, 2025 23:30
Development

Successfully merging this pull request may close these issues.

Add support for SQLite storage client