Crawler configuration

Parameters

appId

The ID of the Algolia application where the crawler stores the records it extracts.

apiKey

The API key the crawler uses to write to the target application.

indexPrefix

Prefix added to the names of all indices defined in the crawler’s configuration.
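
Taken together, these three properties anchor the rest of the configuration. A minimal sketch, assuming the `new Crawler({ … })` form used in the configuration editor and placeholder credentials:

```js
// Minimal sketch of a crawler configuration (placeholder credentials).
new Crawler({
  appId: "YOUR_APP_ID",
  apiKey: "YOUR_API_KEY",
  // With this prefix, an action writing to "docs" produces an index named "crawler_docs".
  indexPrefix: "crawler_",
  // ...plus startUrls, actions, and the other parameters described below.
});
```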

rateLimit

The maximum number of crawl tasks that can run concurrently each second for this configuration, effectively capping how fast the crawler fetches pages.

schedule

How often a complete crawl should be performed.
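
For example, combined with rateLimit (illustrative values; the exact scheduling grammar is the crawler's own):

```js
// Partial crawler configuration (illustrative values).
const config = {
  rateLimit: 8,                       // at most 8 tasks running at once per second
  schedule: "every 1 day at 3:00 am"  // complete recrawl once a day
};
```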

startUrls

The crawler uses these URLs as entry points to start crawling.

sitemaps

Sitemap URLs. Every URL listed in these sitemaps is treated like a startUrl: the crawler uses it as a starting point for the crawl.
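
A sketch combining both entry-point parameters (the example.com URLs are placeholders):

```js
// Partial crawler configuration (placeholder URLs).
const config = {
  startUrls: ["https://www.example.com/"],
  sitemaps: ["https://www.example.com/sitemap.xml"]
};
```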

ignoreRobotsTxtRules

When set to true, the crawler will ignore rules set in your robots.txt.

ignoreNoIndex

Whether the crawler should extract records from a page whose robots meta tag contains noindex or none.

ignoreNoFollowTo

Whether the crawler should follow links marked as nofollow (that is, with the rel="nofollow" attribute) and extract links from a page whose robots meta tag contains nofollow or none.

ignoreCanonicalTo

Whether the crawler should extract records from a page that declares a canonical URL.
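
Together, ignoreRobotsTxtRules, ignoreNoIndex, ignoreNoFollowTo, and ignoreCanonicalTo control how strictly the crawler honors page-level directives. A sketch leaving every check active:

```js
// Partial crawler configuration: all directive checks left active.
const config = {
  ignoreRobotsTxtRules: false, // honor robots.txt
  ignoreNoIndex: false,        // don't extract records from pages marked noindex/none
  ignoreNoFollowTo: false,     // don't follow rel="nofollow" links
  ignoreCanonicalTo: false     // skip pages that declare a different canonical URL
};
```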

extraUrls

Additional URLs treated like startUrls: the crawler uses them as extra starting points for the crawl.

maxDepth

Limits URL processing to the specified depth, inclusive.

maxUrls

Limits the number of URLs your crawler can process.
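
A sketch combining these scope controls (illustrative values; the depth accounting here, with start URLs at depth 1, is an assumption):

```js
// Partial crawler configuration (illustrative limits, placeholder URL).
const config = {
  extraUrls: ["https://www.example.com/orphan-page"], // a page no other page links to
  maxDepth: 5,    // assuming startUrls sit at depth 1, follow links at most 4 hops away
  maxUrls: 10000  // stop scheduling new URLs once this many have been processed
};
```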

saveBackup

Whether to save a backup of your production index before it is overwritten by the index generated during a crawl.

renderJavaScript

When true, every web page is rendered in a headless Chrome browser, and the crawler extracts records from the rendered HTML.
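
A sketch enabling both; note that rendering every page in a headless browser makes crawls noticeably slower:

```js
// Partial crawler configuration.
const config = {
  saveBackup: true,      // keep a copy of the previous index before overwriting it
  renderJavaScript: true // render every page in headless Chrome before extraction
};
```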

initialIndexSettings

Defines the initial settings for the indices that the crawler updates.
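
A minimal sketch, assuming the value is keyed by index name and holds regular Algolia index settings (the index and attribute names here are hypothetical):

```js
// Partial crawler configuration (hypothetical index name and attributes).
const config = {
  initialIndexSettings: {
    crawler_docs: {
      searchableAttributes: ["title", "description", "content"],
      customRanking: ["desc(popularity)"]
    }
  }
};
```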

exclusionPatterns

Patterns that tell the crawler which URLs to exclude from the crawl.

ignoreQueryParams

Filters out specified query parameters from crawled URLs. This can help you avoid indexing duplicate URLs.
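
A sketch of both filters, assuming glob-style patterns (the values are illustrative):

```js
// Partial crawler configuration (illustrative patterns).
const config = {
  exclusionPatterns: ["https://www.example.com/admin/**", "**.pdf"],
  ignoreQueryParams: ["utm_source", "utm_medium", "ref"] // treat ?ref=… URLs as duplicates
};
```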

requestOptions

Modifies the behavior of all the crawler's HTTP requests.
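
A sketch assuming the options cover common HTTP concerns such as extra headers and timeouts (both option names here are assumptions; check the crawler's reference for the supported set):

```js
// Partial crawler configuration (assumed option names, illustrative values).
const config = {
  requestOptions: {
    headers: { "X-Custom-Header": "my-value" }, // sent with every request
    timeout: 30000                              // per-request timeout in milliseconds
  }
};
```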

linkExtractor

Overrides the default logic used to extract URLs from crawled pages.
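
A sketch assuming the callback receives the page handle, its URL, and the default extractor, and returns the list of URLs to follow (the URL filter is a placeholder):

```js
// Partial crawler configuration (assumed callback signature, placeholder URL).
const config = {
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // Start from the default extraction, then keep only documentation links.
    return defaultExtractor().filter((link) =>
      link.startsWith("https://www.example.com/docs/")
    );
  }
};
```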

externalDataSources

Defines external data sources you want to retrieve during every crawl and make available to your extractor function.
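
A sketch assuming a CSV data source (the identifier and URL are placeholders):

```js
// Partial crawler configuration (placeholder data source).
const config = {
  externalDataSources: [
    {
      dataSourceId: "pageviews", // the name used to look the data up in the extractor
      type: "csv",
      url: "https://www.example.com/data/pageviews.csv"
    }
  ]
};
```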

login

Defines how the crawler acquires a session cookie so it can crawl pages behind a login.
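
A sketch assuming a fetch-style login request whose response sets the session cookie (the endpoint and credentials are placeholders):

```js
// Partial crawler configuration (placeholder endpoint and credentials).
const config = {
  login: {
    fetchRequest: {
      url: "https://www.example.com/login",
      options: {
        method: "POST",
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
        body: "username=crawler&password=s3cret"
      }
    }
  }
};
```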

safetyChecks

A configurable collection of safety checks to make sure the crawl was successful.
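
For example, a check that blocks publishing when too many records disappear between crawls (the threshold name is an assumption based on the commonly documented check):

```js
// Partial crawler configuration.
const config = {
  safetyChecks: {
    beforeIndexPublishing: {
      // Fail the crawl instead of publishing if the new index has lost
      // more than 10% of the records the previous crawl produced.
      maxLostRecordsPercentage: 10
    }
  }
};
```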

actions

Determines which web pages are turned into Algolia records, and how.
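
A sketch of a single action: pages matching pathsToMatch are passed to recordExtractor, which returns the records to index (the index name, paths, and selectors are placeholders):

```js
// Partial crawler configuration (placeholder index name, paths, and selectors).
const config = {
  actions: [
    {
      indexName: "docs", // combined with indexPrefix "crawler_", writes to "crawler_docs"
      pathsToMatch: ["https://www.example.com/docs/**"],
      recordExtractor: ({ url, $ }) => {
        // $ is a Cheerio-like handle on the fetched page.
        return [
          {
            objectID: url.href,
            title: $("head title").text(),
            content: $("main").text()
          }
        ];
      }
    }
  ]
};
```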

discoveryPatterns

Indicates additional web pages the crawler should visit to discover links.

hostnameAliases

Defines mappings that replace one hostname with another in the URLs the crawler processes.
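
A sketch of both (the hostnames are placeholders):

```js
// Partial crawler configuration (placeholder hostnames).
const config = {
  // Visit blog pages to discover links, even though no action indexes them.
  discoveryPatterns: ["https://blog.example.com/**"],
  // Rewrite the staging hostname so extracted URLs point at production.
  hostnameAliases: { "staging.example.com": "www.example.com" }
};
```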

cache

Turns the crawler's cache on or off.
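
A sketch that disables it, assuming the cache takes an enabled flag:

```js
// Partial crawler configuration.
const config = {
  cache: { enabled: false } // re-fetch and re-process every page on each crawl
};
```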
