actions
Action[]
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
      },
    },
  ],
}
About this parameter
Determines which web pages are extracted into Algolia records, and how.
A single action defines:
- the subset of your crawler’s websites it targets,
- the extraction process for those websites,
- and the indices to which the extracted records are pushed.
A single web page can match multiple actions. In this case, your crawler creates one record for each matched action, as in the sketch below.
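For instance, two actions can target the same pages and each push its own records to a different index. A minimal sketch (the URLs, index names, and attributes are illustrative, not from the original):

{
  actions: [
    {
      indexName: 'articles',
      pathsToMatch: ['https://example.com/blog/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, title: $('head > title').text() },
      ],
    },
    {
      indexName: 'article_headings',
      pathsToMatch: ['https://example.com/blog/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, heading: $('h1').first().text() },
      ],
    },
  ],
}

A page under https://example.com/blog/ matches both actions, so the crawler produces one record per action for it.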
Examples
{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        ...
      }
    },
  ],
}
Parameters
Action
name
type: string
Optional
The unique identifier of this action (useful for debugging). Required if schedule is set.
indexName
type: string
Required
The index name targeted by this action. This value is appended to the indexPrefix, when one is specified.
schedule
type: string
Optional
How often to perform a complete crawl for this action. See the top-level schedule property for more information.
pathsToMatch
type: string[]
Required
Determines which webpages match for this action. This list is checked against the URL of webpages using micromatch. You can use negation, wildcards, and more, as in the sketch below.
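For example, the following patterns (the URLs are illustrative) match every page under /blog/ except pages under /blog/drafts/:

pathsToMatch: [
  'https://example.com/blog/**',          // any page below /blog/
  '!https://example.com/blog/drafts/**',  // negation: exclude drafts
],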
selectorsToMatch
type: string[]
Optional
Checks for the presence or absence of DOM nodes: pages only match this action if the listed selectors are satisfied (see the sketch below).
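A minimal sketch (the selectors are illustrative); assuming ! expresses absence, this action only applies to pages that contain a .product node and don't contain a .draft node:

selectorsToMatch: [
  '.product',  // the page must contain a product node
  '!.draft',   // and must not contain a draft marker
],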
fileTypesToMatch
type: string[]
default: html
Optional
Set this value if you want to index documents. Chosen file types are converted to HTML using Tika, then treated as normal HTML pages. See the documents guide for the list of available file types.
autoGenerateObjectIDs
type: bool
default: true
Generate an objectID for records that don't have one.
recordExtractor
type: function
Required
A function that extracts the records to index from each matched page. It receives the page's URL, a Cheerio instance, and other metadata (see the parameters below), and must return an array of JSON objects. Returning an empty array skips the page. A sketch follows.
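A minimal sketch of an extractor (the attribute names are illustrative):

recordExtractor: ({ url, $ }) => {
  return [
    {
      objectID: url.href,              // explicit objectID, required when autoGenerateObjectIDs is false
      title: $('head > title').text(), // page title
      description: $('meta[name="description"]').attr('content'), // meta description, if any
    },
  ];
},

Returning [] instead would skip the page without creating any record.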
action ➔ recordExtractor
$
type: object (Cheerio instance)
Optional
A Cheerio instance containing the HTML of the crawled page.
url
type: Location object
Optional
A Location object describing the URL of the crawled page.
fileType
type: string
Optional
The file type of the crawled page (for example, html or pdf).
contentLength
type: number
Optional
The number of bytes in the crawled page.
dataSources
type: object
Optional
The external data sources associated with the crawled page.
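Assuming a data source with ID myCSV has been configured (the ID and its attributes here are hypothetical), its values for the current URL can be merged into a record like this:

recordExtractor: ({ url, $, dataSources }) => {
  return [
    {
      objectID: url.href,
      title: $('head > title').text(),
      ...dataSources.myCSV, // merge the external attributes matching this URL
    },
  ];
},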
helpers
type: object
Optional
Collection of functions to help you extract content and generate records.
recordExtractor ➔ helpers
splitContentIntoRecords
type: function
Optional
A helper that splits the textual content of a long page into several smaller records, so that each stays under the record size limit. It returns an array of records that your recordExtractor can return directly, as in the sketch below.
Assuming that the automatic generation of objectIDs is enabled, each resulting record gets its own objectID. In order to prevent duplicate results when searching for a word that appears in multiple records belonging to the same resource (page), we recommend that you enable distinct on your index, using an attribute shared by all records of a page (such as its URL) as the distinct key.
Please be aware that using splitContentIntoRecords can significantly increase the number of records in your index, since every long page produces several of them.
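A minimal sketch of splitting a page into records (the attribute values are illustrative):

recordExtractor: ({ url, $, helpers }) => {
  return helpers.splitContentIntoRecords({
    $elements: $('article'),        // only split the article content
    baseRecord: { url: url.href },  // shared attribute, usable as the distinct key
    maxRecordBytes: 10000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',  // stores each chunk's position within the page
  });
},

With distinct enabled on the url attribute (attributeForDistinct: 'url'), a query matching several chunks of the same page returns only one of them.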
helpers ➔ splitContentIntoRecords
$elements
type: object (Cheerio instance)
default: $("body")
A Cheerio instance that determines from which element(s) textual content is extracted and turned into records.
baseRecord
type: object
default: {}
Attributes (and their values) to add to all resulting records.
maxRecordBytes
type: number
default: 10000
Maximum number of bytes allowed per record in the resulting Algolia index. Refer to the record size limits for your plan to prevent errors regarding record size.
textAttributeName
type: string
default: text
Name of the attribute in which to store the text of each record.
orderingAttributeName
type: string
Optional
Name of the attribute in which to store each record's sequence number, that is, its position within the split page.