Storage clients

Storage clients provide a unified interface for interacting with Dataset, KeyValueStore, and RequestQueue, regardless of the underlying implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup. This abstraction makes it easy to switch between different environments, such as local development and cloud production setups.

Built-in storage clients

Crawlee provides three main storage client implementations:

File system storage client

The FileSystemStorageClient provides persistent storage by writing data directly to the file system. It uses intelligent caching and batch processing for better performance while storing data in human-readable JSON format. This is the default storage client used by Crawlee when no other storage client is specified.

Concurrency limitation

The FileSystemStorageClient is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time.

This storage client is ideal for large datasets and long-running operations where data persistence is required. Because data is stored as human-readable JSON, it can be easily inspected and shared with other tools.

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import FileSystemStorageClient

# Create a new instance of storage client.
storage_client = FileSystemStorageClient()

# And pass it to the crawler.
crawler = ParselCrawler(storage_client=storage_client)

Configuration options for the FileSystemStorageClient can be set through environment variables or the Configuration class:

  • storage_dir (env: CRAWLEE_STORAGE_DIR, default: './storage') - The root directory for all storage data.
  • purge_on_start (env: CRAWLEE_PURGE_ON_START, default: True) - Whether to purge default storages on start.
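Both options can also be set through environment variables instead of code. Below is a minimal sketch of that approach; it assumes the variables are read when Crawlee first creates its configuration, so they are set before any storage client or crawler is instantiated:

import os

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import FileSystemStorageClient

# Assumption: these variables are read when the configuration is created,
# so set them before creating the storage client or crawler.
os.environ['CRAWLEE_STORAGE_DIR'] = './my_storage'
os.environ['CRAWLEE_PURGE_ON_START'] = 'false'

crawler = ParselCrawler(storage_client=FileSystemStorageClient())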

Data is stored using the following directory structure:

{CRAWLEE_STORAGE_DIR}/
├── datasets/
│   └── {DATASET_NAME}/
│       ├── __metadata__.json
│       ├── 000000001.json
│       └── 000000002.json
├── key_value_stores/
│   └── {KVS_NAME}/
│       ├── __metadata__.json
│       ├── key1.json
│       ├── key2.txt
│       └── key3.json
└── request_queues/
    └── {RQ_NAME}/
        ├── __metadata__.json
        ├── {REQUEST_ID_1}.json
        └── {REQUEST_ID_2}.json

Where:

  • {CRAWLEE_STORAGE_DIR} - The root directory for local storage.
  • {DATASET_NAME}, {KVS_NAME}, {RQ_NAME} - The unique names for each storage instance (defaults to "default").
  • Each storage directory contains a __metadata__.json file with metadata about that storage instance; the remaining files hold the data itself.
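
As a quick illustration of this layout, the following sketch pushes an item to a named dataset. With the default file system storage client, the item lands under ./storage/datasets/products/ as a numbered JSON file next to __metadata__.json (the 'products' name is just an example):

import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Open (or create) a named dataset; with the default FileSystemStorageClient
    # this maps to the ./storage/datasets/products/ directory.
    dataset = await Dataset.open(name='products')

    # Each pushed item is written as a numbered JSON file, e.g. 000000001.json.
    await dataset.push_data({'title': 'Example product', 'price': 9.99})


if __name__ == '__main__':
    asyncio.run(main())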

Here is an example of how to configure the FileSystemStorageClient:

from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import FileSystemStorageClient

# Create a new instance of storage client.
storage_client = FileSystemStorageClient()

# Create a configuration with custom settings.
configuration = Configuration(
    storage_dir='./my_storage',
    purge_on_start=False,
)

# And pass them to the crawler.
crawler = ParselCrawler(
    storage_client=storage_client,
    configuration=configuration,
)

Memory storage client

The MemoryStorageClient stores all data in memory using Python data structures. It provides fast access but does not persist data between runs, meaning all data is lost when the program terminates. This storage client is primarily suitable for testing and development and is usually not a good fit for production use. However, it can make sense in cases where speed is prioritized over persistence.

Persistence limitation

The MemoryStorageClient does not persist data between runs. All data is lost when the program terminates.

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient

# Create a new instance of storage client.
storage_client = MemoryStorageClient()

# And pass it to the crawler.
crawler = ParselCrawler(storage_client=storage_client)

SQL storage client

Experimental feature

The SqlStorageClient is experimental. Its API and behavior may change in future releases.

The SqlStorageClient provides persistent storage using a SQL database (SQLite by default, or PostgreSQL). It supports all Crawlee storage types and enables concurrent access from multiple independent clients or processes.

Dependencies

The SqlStorageClient is not included in the core Crawlee package. To use it, you need to install Crawlee with the appropriate extra dependency:

  • For SQLite support, run: pip install 'crawlee[sql_sqlite]'
  • For PostgreSQL support, run: pip install 'crawlee[sql_postgres]'

By default, SqlStorageClient uses SQLite. To use PostgreSQL instead, provide a PostgreSQL connection string via the connection_string parameter. No other code changes are needed; the same client works with both databases.

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import SqlStorageClient


async def main() -> None:
    # Create a new instance of storage client.
    # This will create an SQLite database file crawlee.db, or create tables
    # in your database if you pass `connection_string` or `engine`.
    # Use the context manager to ensure that connections are properly cleaned up.
    async with SqlStorageClient() as storage_client:
        # And pass it to the crawler.
        crawler = ParselCrawler(storage_client=storage_client)

Data is organized in relational tables, with a separate set of tables for each storage type.

Configuration options for the SqlStorageClient can be set through environment variables or the Configuration class:

  • storage_dir (env: CRAWLEE_STORAGE_DIR, default: './storage') - The root directory where the default SQLite database will be created if no connection string is provided.
  • purge_on_start (env: CRAWLEE_PURGE_ON_START, default: True) - Whether to purge default storages on start.

Additional options can be set via SqlStorageClient constructor arguments:

  • connection_string (default: SQLite in Configuration storage dir) – SQLAlchemy connection string, e.g. sqlite+aiosqlite:///my.db or postgresql+asyncpg://user:pass@host/db.
  • engine – Pre-configured SQLAlchemy AsyncEngine (optional).
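
For example, switching the same crawler to PostgreSQL is just a matter of passing a connection string. This is a minimal sketch with placeholder credentials; it assumes a reachable PostgreSQL server and the asyncpg driver installed:

from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import SqlStorageClient


async def main() -> None:
    # Placeholder credentials; substitute your own user, password, host, and database.
    async with SqlStorageClient(
        connection_string='postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
    ) as storage_client:
        crawler = ParselCrawler(storage_client=storage_client)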

For advanced scenarios, you can configure SqlStorageClient with a custom SQLAlchemy engine and additional options via the Configuration class. This is useful, for example, when connecting to an external PostgreSQL database or customizing connection pooling.

from sqlalchemy.ext.asyncio import create_async_engine

from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import SqlStorageClient


async def main() -> None:
    # Create a new instance of storage client.
    # On first run, this also creates tables in your PostgreSQL database.
    # Use the context manager to ensure that connections are properly cleaned up.
    async with SqlStorageClient(
        # Create an `engine` with the desired configuration.
        engine=create_async_engine(
            'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
            future=True,
            pool_size=5,
            max_overflow=10,
            pool_recycle=3600,
            pool_pre_ping=True,
            echo=False,
        )
    ) as storage_client:
        # Create a configuration with custom settings.
        configuration = Configuration(
            purge_on_start=False,
        )

        # And pass them to the crawler.
        crawler = ParselCrawler(
            storage_client=storage_client,
            configuration=configuration,
        )

Creating a custom storage client

A storage client consists of two parts: the storage client factory and individual storage type clients. The StorageClient acts as a factory that creates specific clients (DatasetClient, KeyValueStoreClient, RequestQueueClient) where the actual storage logic is implemented.

Here is an example of a custom storage client that implements the StorageClient interface:

from __future__ import annotations

from typing import TYPE_CHECKING

from crawlee.storage_clients import StorageClient
from crawlee.storage_clients._base import (
    DatasetClient,
    KeyValueStoreClient,
    RequestQueueClient,
)

if TYPE_CHECKING:
    from crawlee.configuration import Configuration

# Implement the storage type clients with your backend logic.


class CustomDatasetClient(DatasetClient):
    # Implement methods like push_data, get_data, iterate_items, etc.
    pass


class CustomKeyValueStoreClient(KeyValueStoreClient):
    # Implement methods like get_value, set_value, delete, etc.
    pass


class CustomRequestQueueClient(RequestQueueClient):
    # Implement methods like add_request, fetch_next_request, etc.
    pass


# Implement the storage client factory.


class CustomStorageClient(StorageClient):
    async def create_dataset_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomDatasetClient:
        # Create and return your custom dataset client.
        pass

    async def create_kvs_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomKeyValueStoreClient:
        # Create and return your custom key-value store client.
        pass

    async def create_rq_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomRequestQueueClient:
        # Create and return your custom request queue client.
        pass

Custom storage clients can implement any storage logic, such as connecting to a database, using a cloud storage service, or integrating with other systems. They must implement the required methods for creating, reading, updating, and deleting data in the respective storages.

Registering storage clients

Storage clients can be registered in multiple ways:

import asyncio

from crawlee import service_locator
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    # Create a custom storage client, MemoryStorageClient for example.
    storage_client = MemoryStorageClient()

    # Register it globally via the service locator.
    service_locator.set_storage_client(storage_client)

    # Or pass it directly to the crawler; it will be registered globally
    # in the service locator under the hood.
    crawler = ParselCrawler(storage_client=storage_client)

    # Or just provide it when opening a storage (e.g. a dataset); it will be
    # used for this storage only, not globally.
    dataset = await Dataset.open(
        name='my_dataset',
        storage_client=storage_client,
    )


if __name__ == '__main__':
    asyncio.run(main())

You can also register different storage clients for each storage instance, allowing you to use different backends for different storages. This is useful when you want to use a fast in-memory storage for RequestQueue while persisting scraping results in Dataset or KeyValueStore.
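
For instance, the sketch below keeps the request queue in memory for speed while persisting results to the file system. It assumes RequestQueue.open accepts a storage_client argument the same way Dataset.open does in the example above:

import asyncio

from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient
from crawlee.storages import Dataset, RequestQueue


async def main() -> None:
    # Fast, non-persistent backend for the request queue.
    request_queue = await RequestQueue.open(storage_client=MemoryStorageClient())

    # Persistent backend for the scraped results.
    dataset = await Dataset.open(storage_client=FileSystemStorageClient())

    await dataset.push_data({'url': 'https://example.com'})


if __name__ == '__main__':
    asyncio.run(main())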

Conclusion

Storage clients in Crawlee provide different backends for data storage. Use MemoryStorageClient for testing and fast operations without persistence, FileSystemStorageClient for environments where data needs to persist, or the experimental SqlStorageClient when multiple processes need concurrent access to shared storage. You can also create custom storage clients for specialized backends by implementing the StorageClient interface.

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!