Version: 1.3

PuppeteerCrawlerOptions

Properties

handlePageFunction

Type: PuppeteerHandlePage

Function that is called to process each request. It is passed an object with the following fields:

{
    request: Request,
    response: Response,
    page: Page,
    session: Session,
    browserController: BrowserController,
    proxyInfo: ProxyInfo,
    crawler: PuppeteerCrawler,
}

request is an instance of the Request object with details about the URL to open, HTTP method etc.; page is an instance of the Puppeteer Page; browserController is an instance of the BrowserController; and response is an instance of the Puppeteer Response, which is the main resource response as returned by page.goto(request.url). The function must return a promise, which is then awaited by the crawler.

If the function throws an exception, the crawler will try to re-crawl the request later, up to maxRequestRetries times. If all the retries fail, the crawler calls the function provided via the handleFailedRequestFunction option. To make this work, you should always let your function throw exceptions rather than catch them. The exceptions are logged to the request using the Request.pushErrorMessage() function.
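A minimal sketch of a handlePageFunction that extracts the page title and stores it; Apify.pushData() is the standard SDK helper for the default dataset, and the fields extracted here are illustrative:

handlePageFunction: async ({ request, page }) => {
    // Runs in Node.js; page.title() evaluates in the browser context.
    const title = await page.title();

    // Store the result to the default dataset.
    await Apify.pushData({ url: request.url, title });
},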


navigationTimeoutSecs

Type: number = 60

Timeout in which page navigation needs to finish, in seconds.


handleFailedRequestFunction

Type: HandleFailedRequest

A function to handle requests that have failed more than maxRequestRetries times.

The function receives the following object as an argument:

{
    request: Request,
    response: Response,
    page: Page,
    session: Session,
    browserController: BrowserController,
    proxyInfo: ProxyInfo,
    crawler: PuppeteerCrawler,
    error: Error,
}

The Request instance corresponds to the failed request, and the Error instance represents the last error thrown during its processing.
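A minimal sketch of a failure handler that logs the last error and records the failed request to the default dataset:

handleFailedRequestFunction: async ({ request, error }) => {
    // All retries are exhausted at this point.
    console.error(`Request ${request.url} failed: ${error.message}`);
    await Apify.pushData({
        url: request.url,
        errorMessages: request.errorMessages,
    });
},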


launchContext

Type: PuppeteerLaunchContext

Options used by Apify.launchPuppeteer() to start new Puppeteer instances.
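For example, a launch context might look like the sketch below; useChrome and launchOptions are standard PuppeteerLaunchContext fields, but treat the specific values as illustrative:

launchContext: {
    // Use the full Chrome browser instead of the bundled Chromium.
    useChrome: true,
    // Options passed directly to puppeteer.launch().
    launchOptions: {
        headless: true,
        args: ['--disable-gpu'],
    },
},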


handlePageTimeoutSecs

Type: number = 60

Timeout in which the function passed as handlePageFunction needs to finish, in seconds.


browserPoolOptions

Type: BrowserPoolOptions

Custom options passed to the underlying BrowserPool constructor. You can tweak those to fine-tune browser management.
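As a sketch, two commonly tuned pool options are shown below; the field names come from the browser-pool library, so verify them against the version you use:

browserPoolOptions: {
    // Maximum number of pages a single browser may have open at once.
    maxOpenPagesPerBrowser: 20,
    // Retire a browser after it has served this many pages.
    retireBrowserAfterPageCount: 100,
},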


persistCookiesPerSession

Type: boolean = true

Automatically saves cookies to the Session. Works only when the Session Pool is used.


proxyConfiguration

Type: ProxyConfiguration

If set, PuppeteerCrawler will be configured so that all connections use Apify Proxy or your own proxy URLs, as provided and rotated according to the configuration. For more information, see the documentation.
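A sketch of wiring a proxy configuration into the crawler using the standard Apify.createProxyConfiguration() helper; the proxy URLs are placeholders:

const Apify = require('apify');

Apify.main(async () => {
    const proxyConfiguration = await Apify.createProxyConfiguration({
        // Rotate between your own proxy servers (placeholder URLs).
        proxyUrls: [
            'http://proxy-1.example.com:8000',
            'http://proxy-2.example.com:8000',
        ],
    });

    const crawler = new Apify.PuppeteerCrawler({
        proxyConfiguration,
        handlePageFunction: async ({ page }) => { /* ... */ },
    });

    await crawler.run();
});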


preNavigationHooks

Type: Array<PuppeteerHook>

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. Each function accepts two parameters, crawlingContext and gotoOptions; the latter is passed to the page.goto() call the crawler uses to navigate. Example:

preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        await page.evaluate((attr) => { window.foo = attr; }, 'bar');
    },
]
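Because the hook also receives gotoOptions, it can adjust navigation behavior; a sketch with illustrative values:

preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        // Wait until the network is mostly idle and allow more time for slow pages.
        gotoOptions.waitUntil = 'networkidle2';
        gotoOptions.timeout = 90000;
    },
]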

postNavigationHooks

Type: Array<PuppeteerHook>

Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts crawlingContext as the only parameter. Example:

postNavigationHooks: [
    async (crawlingContext) => {
        const { page } = crawlingContext;
        if (hasCaptcha(page)) {
            await solveCaptcha(page);
        }
    },
]

requestList

Type: RequestList

Static list of URLs to be processed. Either requestList or requestQueue option must be provided (or both).


requestQueue

Type: RequestQueue

Dynamic queue of URLs to be processed. This is useful for recursive crawling of websites. Either requestList or requestQueue option must be provided (or both).
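A sketch showing both sources wired into one crawler, using the standard Apify.openRequestList() and Apify.openRequestQueue() helpers; the URLs are placeholders:

const Apify = require('apify');

Apify.main(async () => {
    // Static list of start URLs.
    const requestList = await Apify.openRequestList('start-urls', [
        'http://example.com/page-1',
        'http://example.com/page-2',
    ]);

    // Dynamic queue for URLs discovered while crawling.
    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        requestQueue,
        handlePageFunction: async ({ page }) => { /* ... */ },
    });

    await crawler.run();
});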


maxRequestRetries

Type: number = 3

Indicates how many times the request is retried if PuppeteerCrawlerOptions.handlePageFunction fails.


maxRequestsPerCrawl

Type: number

Maximum number of pages that the crawler will open. The crawl will stop when this limit is reached. Always set this value in order to prevent infinite loops in misconfigured crawlers. Note that in cases of parallel crawling, the actual number of pages visited might be slightly higher than this value.


autoscaledPoolOptions

Type: AutoscaledPoolOptions

Custom options passed to the underlying AutoscaledPool constructor. Note that the runTaskFunction and isTaskReadyFunction options are provided by the crawler and cannot be overridden. However, you can provide a custom implementation of isFinishedFunction.
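A sketch of supplying a custom isFinishedFunction; the stopping condition (and the shouldStopCrawl() helper) is purely hypothetical:

autoscaledPoolOptions: {
    isFinishedFunction: async () => {
        // Hypothetical helper: finish the crawl once some external condition is met.
        return shouldStopCrawl();
    },
},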


minConcurrency

Type: number = 1

Sets the minimum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option.

WARNING: If you set this value too high with respect to the available system memory and CPU, your crawler will run extremely slowly or crash. If you're not sure, keep the default value and the concurrency will scale up automatically.


maxConcurrency

Type: number = 1000

Sets the maximum concurrency (parallelism) for the crawl. Shortcut to the corresponding AutoscaledPool option.


useSessionPool

Type: boolean = true

The Puppeteer crawler will initialize the SessionPool with the corresponding sessionPoolOptions. The Session instance will then be available in the handlePageFunction.


sessionPoolOptions

Type: SessionPoolOptions

The configuration options for SessionPool to use.
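A sketch of common session pool settings; maxPoolSize and sessionOptions are standard SessionPoolOptions fields, though the values here are illustrative:

sessionPoolOptions: {
    // Keep at most 25 sessions in the pool.
    maxPoolSize: 25,
    sessionOptions: {
        // Retire a session after 50 uses.
        maxUsageCount: 50,
    },
},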