Version: Next

Session Management

SessionPool is a class that allows you to handle the rotation of proxy IP addresses along with cookies and other custom settings in the Apify SDK.

The main benefit of a Session pool is that it filters out blocked or non-working proxies, so your actor does not retry requests over proxies that are already known to be blocked or broken. A second benefit is that you can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Using your cookies and other identifiers only with a specific IP reduces the chance of being blocked. Finally, SessionPool rotates IP addresses evenly: it picks sessions at random, which should prevent burning out a small pool of available IPs.
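To make these mechanics concrete, here is a minimal, self-contained sketch of the idea in plain JavaScript. It is not the SDK's implementation: `ToySession`, `ToySessionPool`, the error threshold of 3, and the decrement of 0.5 are illustrative assumptions chosen only for this example.

```javascript
// Toy model of a session pool. NOT the Apify SDK implementation;
// all names and thresholds here are illustrative assumptions.
class ToySession {
    constructor(id, maxErrorScore = 3) {
        this.id = id;
        this.maxErrorScore = maxErrorScore;
        this.errorScore = 0;
        this.cookies = {}; // state tied to this session (and thus to one IP)
    }

    isUsable() {
        return this.errorScore < this.maxErrorScore;
    }

    markBad() {
        this.errorScore += 1;
    }

    markGood() {
        this.errorScore = Math.max(0, this.errorScore - 0.5);
    }

    retire() {
        this.errorScore = this.maxErrorScore;
    }
}

class ToySessionPool {
    constructor(maxPoolSize) {
        this.maxPoolSize = maxPoolSize;
        this.sessions = [];
        this.nextId = 0;
    }

    createSession() {
        const session = new ToySession(`session_${this.nextId++}`);
        this.sessions.push(session);
        return session;
    }

    getSession() {
        // Fill the pool first, then rotate randomly over what is there.
        if (this.sessions.length < this.maxPoolSize) return this.createSession();
        const picked = this.sessions[Math.floor(Math.random() * this.sessions.length)];
        if (picked.isUsable()) return picked;
        // The picked session is unusable: drop all unusable ones and create a fresh one.
        this.sessions = this.sessions.filter((s) => s.isUsable());
        return this.createSession();
    }
}

const pool = new ToySessionPool(3);
const first = pool.getSession();
first.cookies.token = 'abc'; // stays with this session only
first.retire(); // e.g. the page title said "Blocked"

// Requests issued from now on never receive the retired session again.
const laterSessions = Array.from({ length: 50 }, () => pool.getSession());
console.log(laterSessions.every((s) => s.isUsable()));
```

The random pick is what spreads load across sessions, and the usability check is what keeps a known-bad session from being handed out again.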

Now let's take a look at how to use a Session pool.

Example usage in PuppeteerCrawler


const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PuppeteerCrawler({
    requestQueue,
    // To use the proxy IP session rotation logic, you must turn the proxy usage on.
    proxyConfiguration,
    // Activates the Session pool.
    useSessionPool: true,
    // Overrides the default Session pool configuration.
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
    // Set to true if you want the crawler to save cookies per session
    // and set the cookies on the page automatically before navigation.
    persistCookiesPerSession: true,
    handlePageFunction: async ({ request, page, session }) => {
        const title = await page.title();

        if (title === "Blocked") {
            session.retire();
        } else if (title === "Not sure if blocked, might also be a connection error") {
            session.markBad();
        } else {
            // session.markGood() - this step is done automatically in puppeteer pool.
        }
    },
});

Example usage in CheerioCrawler

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    requestQueue,
    // To use the proxy IP session rotation logic, you must turn the proxy usage on.
    proxyConfiguration,
    // Activates the Session pool.
    useSessionPool: true,
    // Overrides the default Session pool configuration.
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
    // Set to true if you want the crawler to save cookies per session
    // and set the Cookie header on the request automatically.
    persistCookiesPerSession: true,
    handlePageFunction: async ({ request, $, session }) => {
        // $("title") is a Cheerio selection, so read its text before comparing.
        const title = $("title").text();

        if (title === "Blocked") {
            session.retire();
        } else if (title === "Not sure if blocked, might also be a connection error") {
            session.markBad();
        } else {
            // session.markGood() - this step is done automatically in BasicCrawler.
        }
    },
});

Example usage in BasicCrawler

const { gotScraping } = require('got-scraping');
const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new BasicCrawler({
    requestQueue,
    // Allows access to the proxyInfo object in handleRequestFunction.
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
    handleRequestFunction: async ({ request, session, proxyInfo }) => {
        // To use the proxy IP session rotation logic, you must turn the proxy usage on.
        const proxyUrl = proxyInfo.url;
        const requestOptions = {
            url: request.url,
            proxyUrl,
            throwHttpErrors: false,
            headers: {
                // If you use the session's cookie jar, this is how you get
                // the Cookie header string for the request URL from the session.
                Cookie: session.getCookieString(request.url),
            },
        };
        let response;

        try {
            response = await gotScraping(requestOptions);
        } catch (e) {
            // Example check only; adjust it to the network errors you expect.
            if (e.code === 'ETIMEDOUT' || e.code === 'ECONNRESET') {
                // If a network error happens, such as a timeout or socket hangup,
                // there is usually a chance that it was just bad luck and the proxy works.
                // No need to throw it away.
                session.markBad();
            }
            throw e;
        }

        // Automatically retires the session based on the response HTTP status code.
        session.retireOnBlockedStatusCodes(response.statusCode);

        // Placeholder check; replace it with your own block detection.
        if (response.body.blocked) {
            // You are sure it is blocked.
            // This will throw away the session.
            session.retire();
        }

        // Everything is ok, you can get the data.
        // No need to call session.markGood -> BasicCrawler calls it for you.

        // If you want to use the CookieJar in the session,
        // you need to store the cookies from the response.
        session.setCookiesFromResponse(response);
    },
});
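The status-code check in the BasicCrawler example can be pictured as follows. This is a hedged sketch, not the SDK source: the helper names `shouldRetire` and `handleStatus` are hypothetical, and the default list of 401, 403, and 429 is an assumption about which codes count as blocking. Consult the SDK reference for the actual defaults.

```javascript
// Hypothetical helper illustrating status-code-based retirement.
// The default codes below are an assumption, not taken from this page.
const ASSUMED_BLOCKED_STATUS_CODES = [401, 403, 429];

function shouldRetire(statusCode, extraBlockedCodes = []) {
    return [...ASSUMED_BLOCKED_STATUS_CODES, ...extraBlockedCodes].includes(statusCode);
}

// A minimal stand-in for a session, just enough to show the flow.
const session = {
    retired: false,
    retire() {
        this.retired = true;
    },
};

// Retires the session when the status code signals blocking,
// then reports whether the session is retired.
function handleStatus(sess, statusCode) {
    if (shouldRetire(statusCode)) sess.retire();
    return sess.retired;
}

console.log(handleStatus(session, 200)); // false
console.log(handleStatus(session, 403)); // true
```

Checking the status code first handles the unambiguous cases cheaply; inspecting the response body is then only needed for sites that report blocking with a 200 response.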

Example solo usage

Actor.main(async () => {
    const sessionPoolOptions = {
        maxPoolSize: 100,
    };
    const sessionPool = await SessionPool.open(sessionPoolOptions);

    // Gets a session.
    const session = sessionPool.getSession();

    // Increases the errorScore.
    session.markBad();

    // Throws away the session.
    session.retire();

    // Lowers the errorScore and marks the session good.
    session.markGood();
});

These are the basics of configuring SessionPool. Bear in mind that a Session pool needs time to find working IPs and build up the pool, so you will probably see a lot of errors until it stabilizes.
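The warm-up effect can be illustrated with a small simulation. This is a toy model under stated assumptions, not the SDK's algorithm: every new session is blocked with a fixed probability, a blocked session always fails, and a failed session is retired at once. The `simulate` function, its parameters, and the seeded `mulberry32` generator are all introduced here for illustration.

```javascript
// Toy simulation of pool warm-up. Assumptions (not from the SDK):
// each new session is blocked with probability `blockedProbability`,
// a blocked session always fails, and a failed session is retired immediately.
function mulberry32(seed) {
    // Small seeded PRNG so the run is reproducible.
    let a = seed >>> 0;
    return () => {
        a = (a + 0x6d2b79f5) >>> 0;
        let t = a;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

function simulate({ maxPoolSize, requests, blockedProbability, seed }) {
    const random = mulberry32(seed);
    let sessions = []; // each entry: { blocked: boolean, retired: boolean }
    const failures = []; // failures[i] is true if request i failed

    const createSession = () => {
        const session = { blocked: random() < blockedProbability, retired: false };
        sessions.push(session);
        return session;
    };

    for (let i = 0; i < requests; i++) {
        let session;
        if (sessions.length < maxPoolSize) {
            // The pool still has space: try a fresh session.
            session = createSession();
        } else {
            // The pool is full: pick at random and replace a retired pick.
            session = sessions[Math.floor(random() * sessions.length)];
            if (session.retired) {
                sessions = sessions.filter((s) => !s.retired);
                session = createSession();
            }
        }
        if (session.blocked) session.retired = true;
        failures.push(session.blocked);
    }
    return failures;
}

const failures = simulate({ maxPoolSize: 20, requests: 1000, blockedProbability: 0.5, seed: 42 });
const countFailures = (from, to) => failures.slice(from, to).filter(Boolean).length;

console.log('failures in first 100 requests:', countFailures(0, 100));
console.log('failures in last 100 requests:', countFailures(900, 1000));
```

Under these assumptions, failures cluster at the start of the run and fade as blocked sessions are retired and working ones accumulate. A real pool is less static than this model, so treat the numbers as an illustration of the trend only.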