Upgrading to v0.x
This page summarizes the breaking changes between the zero-based (0.x) versions of Crawlee for Python.
Upgrading to v0.6
This section summarizes the breaking changes between v0.5.x and v0.6.0.
HttpCrawlerOptions
- Removed HttpCrawlerOptions, which contained the options from BasicCrawlerOptions plus the unique options additional_http_error_status_codes and ignore_http_error_status_codes. Both of the unique options were added to BasicCrawlerOptions instead.
HttpClient
- The signature of the HttpClient class has been updated. The constructor parameters additional_http_error_status_codes and ignore_http_error_status_codes have been removed and are now only available in BasicCrawlerOptions (see the example below).
- The method _raise_for_error_status_code has been removed from HttpClient. Its logic has been moved to the BasicCrawler class.
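A minimal sketch of where the options go now; the HttpxHttpClient client and the specific status codes are illustrative, not required:

```python
from crawlee.crawlers import HttpCrawler
from crawlee.http_clients import HttpxHttpClient

# The status-code options are now crawler options (part of BasicCrawlerOptions),
# not HttpClient constructor arguments. The specific codes are illustrative.
crawler = HttpCrawler(
    http_client=HttpxHttpClient(),
    additional_http_error_status_codes=[403],
    ignore_http_error_status_codes=[404],
)
```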
SessionCookies
- Replaced the dict used for cookie storage in Session.cookies with a new SessionCookies class. SessionCookies uses CookieJar, which enables support for multiple domains.
PlaywrightCrawler and PlaywrightBrowserPlugin
- PlaywrightCrawler now uses a persistent browser context instead of the standard browser context.
- Added a user_data_dir parameter to PlaywrightCrawler and PlaywrightBrowserPlugin to specify the directory for the persistent context (see the example below). If not provided, a temporary directory is created automatically.
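A minimal sketch of the new parameter; the directory path is just an example:

```python
from crawlee.crawlers import PlaywrightCrawler

# Persistent context data will be stored in this directory (illustrative path).
crawler = PlaywrightCrawler(user_data_dir='./browser-profile')
```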
Configuration
The Configuration fields chrome_executable_path, xvfb, and verbose_log have been removed. The chrome_executable_path and xvfb fields were unused, while verbose_log can be replaced by setting log_level to DEBUG.
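A rough sketch of the verbose_log replacement, assuming log_level accepts the standard level names:

```python
from crawlee.configuration import Configuration

# Instead of the removed verbose_log flag, raise the log level.
config = Configuration(log_level='DEBUG')
```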
CLI dependencies
CLI dependencies have been moved to optional dependencies. If you need the CLI, install crawlee[cli].
Abstract base classes
We decided to move away from Hungarian notation and remove all the Base prefixes from the abstract classes. This includes the following public classes (an example import update follows):
- BaseStorageClient -> StorageClient
- BaseBrowserController -> BrowserController
- BaseBrowserPlugin -> BrowserPlugin
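For example, a type-hint import of the abstract storage client changes roughly as follows; the crawlee.storage_clients import path is an assumption based on the v0.5 layout and should be verified against your installed version:

```python
# v0.5: from crawlee.storage_clients import BaseStorageClient
from crawlee.storage_clients import StorageClient
```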
EnqueueStrategy
The EnqueueStrategy has been changed from an enum to a string literal type. All its values and their meaning remain unchanged.
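A sketch of what the change looks like in a request handler; the 'same-domain' literal is assumed to mirror the old enum value:

```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    # Previously: strategy=EnqueueStrategy.SAME_DOMAIN (an enum member).
    await context.enqueue_links(strategy='same-domain')
```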
Upgrading to v0.5
This section summarizes the breaking changes between v0.4.x and v0.5.0.
Crawlers & CrawlingContexts
- All crawler and crawling context classes have been consolidated into a single sub-package called crawlers.
- The affected classes include: AbstractHttpCrawler, AbstractHttpParser, BasicCrawler, BasicCrawlerOptions, BasicCrawlingContext, BeautifulSoupCrawler, BeautifulSoupCrawlingContext, BeautifulSoupParserType, ContextPipeline, HttpCrawler, HttpCrawlerOptions, HttpCrawlingContext, HttpCrawlingResult, ParsedHttpCrawlingContext, ParselCrawler, ParselCrawlingContext, PlaywrightCrawler, PlaywrightCrawlingContext, PlaywrightPreNavCrawlingContext.
Example update:
- from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+ from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
Storage clients
- All storage client classes have been moved into a single sub-package called storage_clients.
- The affected classes include: MemoryStorageClient, BaseStorageClient.
Example update:
- from crawlee.memory_storage_client import MemoryStorageClient
+ from crawlee.storage_clients import MemoryStorageClient
CurlImpersonateHttpClient
- The CurlImpersonateHttpClient changed its import location.
Example update:
- from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
+ from crawlee.http_clients import CurlImpersonateHttpClient
BeautifulSoupParser
- Renamed BeautifulSoupParser to BeautifulSoupParserType. It was most likely used only in type hints; replace any previous usages of BeautifulSoupParser with BeautifulSoupParserType (see the example below).
- BeautifulSoupParser is now a new class used by the refactored BeautifulSoupCrawler.
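A small sketch of the rename in a type hint, assuming BeautifulSoupParserType is a string literal type and 'lxml' is one of its values:

```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupParserType

# Previously annotated as BeautifulSoupParser.
parser: BeautifulSoupParserType = 'lxml'
crawler = BeautifulSoupCrawler(parser=parser)
```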
Service locator
- The crawlee.service_container was completely refactored and renamed to crawlee.service_locator.
- You can use it to set the configuration, event manager or storage client globally (see the sketch below). Or you can pass them to your crawler instance directly and it will use the service locator under the hood.
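A rough sketch of setting a global configuration via the service locator; the set_configuration method name and the purge_on_start field reflect our reading of the v0.5 API and should be verified against your installed version:

```python
from crawlee import service_locator
from crawlee.configuration import Configuration

# Register the configuration globally before creating any crawler.
service_locator.set_configuration(Configuration(purge_on_start=False))
```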
Statistics
- The crawlee.statistics.Statistics class no longer accepts an event manager as an input argument; it uses the default, global one.
- If you want to use a custom event manager, set it via the service locator or pass it to the crawler (see the sketch below).
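A sketch of passing a custom event manager to the crawler instead of to Statistics; the LocalEventManager class and the event_manager parameter name are assumptions based on the v0.5 API:

```python
from crawlee.crawlers import HttpCrawler
from crawlee.events import LocalEventManager

# Pass the event manager to the crawler rather than to Statistics.
crawler = HttpCrawler(event_manager=LocalEventManager())
```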
Request
- The properties json_ and order_no were removed. They existed only for the internal purposes of the memory storage client; you should not need them.
Request storages and loaders
- The request_provider parameter of BasicCrawler.__init__ has been renamed to request_manager
- The BasicCrawler.get_request_provider method has been renamed to BasicCrawler.get_request_manager and it no longer accepts the id and name arguments
  - If using a specific request queue is desired, pass it as the request_manager on BasicCrawler creation
- The RequestProvider interface has been renamed to RequestManager and moved to the crawlee.request_loaders package
- RequestList has been moved to the crawlee.request_loaders package
- RequestList no longer supports .drop(), .reclaim_request(), .add_request() and .add_requests_batched()
  - It implements the new RequestLoader interface instead of RequestManager
  - RequestManagerTandem with a RequestQueue should be used to enable passing a RequestList (or any other RequestLoader implementation) as a request_manager; await list.to_tandem() can be used as a shortcut (a sketch follows this list)
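A sketch of the new pattern for combining a static RequestList with a RequestQueue via to_tandem(); the start URL is illustrative:

```python
import asyncio

from crawlee.crawlers import HttpCrawler
from crawlee.request_loaders import RequestList


async def main() -> None:
    # A read-only loader with the start URLs.
    request_list = RequestList(['https://crawlee.dev'])

    # Wrap it together with a RequestQueue so it can act as a request_manager.
    request_manager = await request_list.to_tandem()

    crawler = HttpCrawler(request_manager=request_manager)
    await crawler.run()


asyncio.run(main())
```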
PlaywrightCrawler
- The PlaywrightPreNavigationContext was renamed to PlaywrightPreNavCrawlingContext.
- The input arguments in PlaywrightCrawler.__init__ have been renamed: browser_options is now browser_launch_options, page_options is now browser_new_context_options (see the example below).
- These argument renames have also been applied to BrowserPool, PlaywrightBrowserPlugin, and PlaywrightBrowserController.
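A sketch of the renamed arguments; the option values are illustrative:

```python
from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    browser_launch_options={'headless': True},        # formerly browser_options
    browser_new_context_options={'locale': 'en-US'},  # formerly page_options
)
```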
Upgrading to v0.4
This section summarizes the breaking changes between v0.3.x and v0.4.0.
Request model
- The Request.query_params field has been removed. Please add query parameters directly to the URL, which was possible before as well and is now the only supported approach.
- The Request.payload and Request.data fields have been consolidated. Now only Request.payload remains, and it should be used for all payload data in requests (see the example below).
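A sketch of building a request under the new model; the URL, method, and payload value are illustrative, and the exact payload type accepted by your version should be checked:

```python
from crawlee import Request

# Query parameters go directly into the URL string.
request = Request.from_url(
    'https://example.com/search?q=crawlee&page=2',
    method='POST',
    payload=b'{"sort": "recent"}',
)
```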
Extended unique key computation
- The computation of extended_unique_key now includes HTTP headers. While this change impacts the behavior, the interface remains the same.
Upgrading to v0.3
This section summarizes the breaking changes between v0.2.x and v0.3.0.
Public and private interface declaration
In previous versions, the majority of the package was fully public, including many elements intended for internal use only. With the release of v0.3, we have clearly defined the public and private interface of the package. As a result, some imports have been updated (see below). If you are importing something now designated as private, we recommend reconsidering its use or discussing your use case with us in the discussions/issues.
Here is a list of the updated public imports:
- from crawlee.enqueue_strategy import EnqueueStrategy
+ from crawlee import EnqueueStrategy
- from crawlee.models import Request
+ from crawlee import Request
- from crawlee.basic_crawler import Router
+ from crawlee.router import Router
Request queue
There were internal changes that should not affect the intended usage:
- The unused BaseRequestQueueClient.list_requests() method was removed
- RequestQueue internals were updated to match the "Request Queue V2" implementation in Crawlee for JS
Service container
A new module, crawlee.service_container, was added to allow management of "global instances" - currently it contains Configuration, EventManager and BaseStorageClient. The module also replaces the StorageClientManager static class. It is likely that its interface will change in the future. If your use case requires working with it, please get in touch - we'll be glad to hear any feedback.