
Jan. 08, 2021

actions

Type: Action[]

Required

Parameter syntax

{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources })  => {
      }
    },
  ],
}


About this parameter

Determines which web pages are translated into Algolia records and in what way.

A single action defines:

the subset of your crawler’s websites it targets,
the extraction process for those websites,
and the indices to which the extracted records are pushed.

A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
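The multiple-match behavior can be sketched locally. This is a minimal illustration, not the crawler's implementation: the index names, the `example.com` URLs, and the `matches` helper are assumptions made for the example (the real crawler matches URLs with micromatch).

```javascript
// Hypothetical configuration: both actions target the same blog pages.
const config = {
  actions: [
    {
      indexName: 'blog_posts',
      pathsToMatch: ['https://blog.example.com/**'],
      recordExtractor: ({ url }) => [{ objectID: url.href, kind: 'post' }],
    },
    {
      indexName: 'blog_suggestions',
      pathsToMatch: ['https://blog.example.com/**'],
      recordExtractor: ({ url }) => [{ objectID: url.href, kind: 'suggestion' }],
    },
  ],
};

// Simplified stand-in for the matching step: a trailing '/**' is treated
// as "any URL starting with this prefix"; anything else must match exactly.
const matches = (pattern, href) =>
  pattern.endsWith('/**') ? href.startsWith(pattern.slice(0, -2)) : pattern === href;

// Runs every action whose patterns match the page and collects its records.
function extractAll(href) {
  const url = new URL(href);
  return config.actions
    .filter((action) => action.pathsToMatch.some((pattern) => matches(pattern, href)))
    .map((action) => ({
      indexName: action.indexName,
      records: action.recordExtractor({ url }),
    }));
}

const result = extractAll('https://blog.example.com/2021/hello');
console.log(result.map((entry) => entry.indexName));
```

Because the page matches both actions, the sketch produces one set of records per action, each destined for its own index.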

Examples

{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources })  => {
        ...
      }
    },
  ],
}

Parameters

Action


            name

type: string

Optional

The unique identifier of this action (useful for debugging). Required if schedule is set.
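As a rough illustration of the rule that name is required once schedule is set, the check below flags offending actions before a configuration is saved. `findScheduledActionsWithoutName` is a hypothetical helper written for this example and is not part of the Crawler API.

```javascript
// Hypothetical helper: returns the actions that set `schedule` without `name`.
function findScheduledActionsWithoutName(actions) {
  return actions.filter((action) => action.schedule !== undefined && !action.name);
}

const actions = [
  { name: 'daily-blog', indexName: 'blog', schedule: 'every 1 day' }, // valid
  { indexName: 'docs', schedule: 'every 1 day' }, // schedule without name
  { indexName: 'pricing' }, // valid: neither is set
];

const invalid = findScheduledActionsWithoutName(actions);
console.log(invalid.map((action) => action.indexName));
```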


            indexName

type: string

Required

The index name targeted by this action. This value is appended to the indexPrefix, when specified.
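A small sketch of how the final index name is composed, assuming a top-level indexPrefix of `crawler_` (an illustrative value). `resolveIndexName` is a hypothetical helper that only mirrors the rule stated above.

```javascript
const config = {
  indexPrefix: 'crawler_',
  actions: [{ indexName: 'dev_blog_algolia' /* other action fields omitted */ }],
};

// The prefix, when specified, comes first and the action's indexName follows it.
const resolveIndexName = ({ indexPrefix = '' }, action) =>
  `${indexPrefix}${action.indexName}`;

const resolved = resolveIndexName(config, config.actions[0]);
const withoutPrefix = resolveIndexName({}, config.actions[0]);
console.log(resolved, withoutPrefix);
```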


            schedule

type: string

Optional

How often to perform a complete crawl for this action. See main property schedule for more information.


            pathsToMatch

type: string[]

Required

Determines which web pages this action applies to. Each pattern in the list is checked against the URL of a web page using micromatch, so you can use negation, wildcards, and more.
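A hedged sketch of wildcard patterns in practice, using placeholder `example.com` URLs. In micromatch syntax, `*` stays within one path segment and `**` spans any number of segments; a leading `!` negates a pattern, but how negated patterns combine with the others in this list is best verified against your own crawler before relying on it.

```javascript
const action = {
  indexName: 'docs',
  pathsToMatch: [
    'https://www.example.com/docs/**', // every page under /docs/, at any depth
    'https://www.example.com/blog/*.html', // .html pages directly under /blog/
  ],
  recordExtractor: ({ url }) => [{ objectID: url.href }],
};

console.log(action.pathsToMatch);
```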


            selectorsToMatch

type: string[]

Optional

Checks for the presence or absence of DOM nodes.


            fileTypesToMatch

type: string[]

default: ['html']

Optional

Set this value if you want to index documents. Chosen file types will be converted to HTML using Tika, then treated as a normal HTML page. See the documents guide for a list of available fileTypes.


            autoGenerateObjectIDs

type: bool

default: true

Generate an objectID for records that don’t have one. Setting this parameter to false means we’ll raise an error if an extracted record doesn’t have an objectID.
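With automatic generation turned off, each returned record has to carry its own objectID. The sketch below uses the page URL as the identifier, which is an assumption made for this example; the dry run at the end passes a standard `URL` object as a stand-in for the crawler's `url` argument.

```javascript
const action = {
  indexName: 'dev_blog_algolia',
  pathsToMatch: ['https://blog.algolia.com/**'],
  autoGenerateObjectIDs: false,
  // Every record sets objectID explicitly, so nothing is left to generate.
  recordExtractor: ({ url }) => [
    { objectID: url.href, url: url.href },
  ],
};

// Local dry run outside the crawler.
const records = action.recordExtractor({
  url: new URL('https://blog.algolia.com/post'),
});
console.log(records);
```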


            recordExtractor

type: function

Required

A recordExtractor is a custom JavaScript function that lets you run your own code and extract what you want from a page. It should return an array of JSON objects (the records). If it returns an empty array, the page is skipped.

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $('p').html(),
      // ...anything else you want
    }
  ];
  // return []; skips the page
}
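Returning an empty array can also be made conditional. The following sketch skips very small pages; `MIN_BYTES` and the `example.com` URLs are hypothetical values chosen for illustration, and the calls at the end are a local dry run rather than a real crawl.

```javascript
// Hypothetical threshold below which a page is considered not worth indexing.
const MIN_BYTES = 500;

const recordExtractor = ({ url, contentLength }) => {
  if (contentLength < MIN_BYTES) {
    return []; // empty array: the page is skipped
  }
  return [{ url: url.href, bytes: contentLength }];
};

const skipped = recordExtractor({
  url: new URL('https://example.com/stub'),
  contentLength: 120,
});
const indexed = recordExtractor({
  url: new URL('https://example.com/guide'),
  contentLength: 4096,
});
console.log(skipped.length, indexed.length);
```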

action ➔ recordExtractor

`$`	type: object (Cheerio instance) Optional A Cheerio instance containing the HTML of the crawled page.
`url`	type: Location object Optional A `Location` object containing the URL and metadata for the crawled page.
`fileType`	type: string Optional The fileType of the crawled page (e.g.: html, pdf, …).
`contentLength`	type: number Optional The number of bytes in the crawled page.
`dataSources`	type: object Optional Array of external data sources.
`helpers`	type: object Optional Collection of functions to help you extract content and generate records.
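As a sketch of how these arguments might be combined, the extractor below labels each record according to its fileType and stores the path taken from the url object. The `kind` and `sizeInBytes` attribute names are assumptions made for this example, not attributes the crawler expects.

```javascript
const recordExtractor = ({ url, fileType, contentLength }) => [
  {
    url: url.href,
    path: url.pathname,
    // Assumed convention: anything that is not HTML is labeled a document.
    kind: fileType === 'html' ? 'page' : 'document',
    sizeInBytes: contentLength,
  },
];

// Local dry run with a standard URL object standing in for the crawler's argument.
const [record] = recordExtractor({
  url: new URL('https://example.com/files/report.pdf'),
  fileType: 'pdf',
  contentLength: 20480,
});
console.log(record);
```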

recordExtractor ➔ helpers


            splitContentIntoRecords

type: function

Optional

The helpers.splitContentIntoRecords() function is callable from your recordExtractor. It extracts the textual content of the resource (an HTML page or a document) and splits it into one or more records. Use it to index the textual content exhaustively while preventing record_too_big errors.

recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // You can still alter produced records
  // afterwards, if needed.
  return records;
}

In the example recordExtractor() function above, crawling a long HTML page returns an array of records, none of which exceeds the limit of 1000 bytes per record. The records extracted by the splitContentIntoRecords method would look similar to this:

[
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 0,
    text: 'Welcome on test.com, the best resource to',
  },
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 1,
    text: 'find interesting content online.',
  }
]
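To make the splitting idea concrete, here is an illustrative stand-in that runs outside the crawler. It is not the helper's actual implementation: `splitTextIntoRecords` is a hypothetical function that takes plain text instead of Cheerio elements, splits on whitespace, and measures each record as serialized JSON, all of which are assumptions made for this sketch.

```javascript
// Illustrative stand-in, NOT the crawler's implementation.
function splitTextIntoRecords({
  baseRecord = {},
  text,
  maxRecordBytes = 10000,
  textAttributeName = 'text',
  orderingAttributeName,
}) {
  // Size of a record once serialized, in bytes.
  const sizeOf = (record) => Buffer.byteLength(JSON.stringify(record), 'utf8');
  const build = (chunk, part) => {
    const record = { ...baseRecord, [textAttributeName]: chunk };
    if (orderingAttributeName) record[orderingAttributeName] = part;
    return record;
  };

  const records = [];
  let current = '';
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word;
    if (current && sizeOf(build(candidate, records.length)) > maxRecordBytes) {
      // Adding this word would exceed the limit: close the current record.
      records.push(build(current, records.length));
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) records.push(build(current, records.length));
  // Limitation of this sketch: a single word larger than the limit is not split.
  return records;
}

const sampleText =
  'Welcome on test.com, the best resource to find interesting content online.';
const records = splitTextIntoRecords({
  baseRecord: { url: 'http://test.com/index.html', title: 'Home - Test.com' },
  text: sampleText,
  maxRecordBytes: 120,
  orderingAttributeName: 'part',
});
console.log(records);
```

Each record repeats the base attributes, so the usable space for text is the byte limit minus the size of those shared attributes.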

Assuming that the automatic generation of objectIDs is enabled in your configuration, the crawler generates an objectID for each of the generated records.

To prevent duplicate results when a search term appears in several records of the same resource (page), we recommend enabling distinct in your index settings, setting attributeForDistinct and searchableAttributes, and adding a custom ranking that orders records from the first part of the page to the last:

initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}
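The effect of these settings can be approximated locally. The function below is a simplified simulation, not Algolia's ranking engine: it keeps one hit per url and prefers the lowest part, whereas the real engine applies its textual ranking criteria first and uses the custom ranking as a tie-breaker. `simulateDistinct` and the sample hits are hypothetical.

```javascript
// Simplified simulation: one hit per group, lowest ordering value wins.
function simulateDistinct(hits, attributeForDistinct, orderingAttribute) {
  const best = new Map();
  for (const hit of hits) {
    const key = hit[attributeForDistinct];
    const kept = best.get(key);
    if (!kept || hit[orderingAttribute] < kept[orderingAttribute]) {
      best.set(key, hit);
    }
  }
  return [...best.values()];
}

// Hypothetical hits for a query that matches several records of the same page.
const hits = [
  { url: 'http://test.com/index.html', part: 1, text: 'test content online' },
  { url: 'http://test.com/index.html', part: 0, text: 'Welcome on test.com' },
  { url: 'http://test.com/about.html', part: 0, text: 'About test.com' },
];

const deduplicated = simulateDistinct(hits, 'url', 'part');
console.log(deduplicated);
```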

Be aware that using distinct comes with some caveats of its own.

helpers ➔ splitContentIntoRecords

`$elements`	type: object (Cheerio instance) default: $("body") A Cheerio instance that determines from which element(s) textual content will be extracted and turned into records.
`baseRecord`	type: object default: {} Attributes (and their values) to add to all resulting records.
`maxRecordBytes`	type: number default: 10000 Maximum number of bytes allowed per record, on the resulting Algolia index. You can refer to the record size limits for your plan to prevent any errors regarding record size.
`textAttributeName`	type: string default: text Name of the attribute in which to store the text of each record.
`orderingAttributeName`	type: string Optional Name of the attribute in which to store the sequence number (position) of each record.
