Jan. 08, 2021

linkExtractor

Type: function

Parameter syntax

linkExtractor: ({ $, url, defaultExtractor }) => {
  ...
  // return ['https://...']
}

About this parameter

Override the default logic used to extract URLs from pages.

By default, the Crawler queues all URLs that comply with pathsToMatch, fileTypesToMatch, and exclusions. You can override this default logic by providing a custom function that runs on each crawled page and returns the URLs to queue.

The expected return value is an array of URLs (as strings).
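As an illustration of overriding the default logic while still returning an array of URL strings, here is a minimal sketch that post-filters the result of defaultExtractor. The name sameHostLinkExtractor is hypothetical, and the same-hostname rule and fragment stripping are assumptions chosen for the example, not documented Crawler behavior.

```javascript
// Hypothetical helper (not part of the Crawler API): keep only links that
// share the hostname of the page being crawled, without duplicates.
const sameHostLinkExtractor = ({ url, defaultExtractor }) => {
  const unique = new Set();
  for (const link of defaultExtractor()) {
    try {
      // Resolve against the current page in case a link is relative.
      const parsed = new URL(link, url.href);
      if (parsed.hostname === url.hostname) {
        // Assumption for this sketch: fragments do not identify new pages.
        parsed.hash = '';
        unique.add(parsed.href);
      }
    } catch (err) {
      // Skip anything that cannot be parsed as a URL.
    }
  }
  // The return value is an array of URLs (as strings).
  return [...unique];
};

// Sketch of how it would be plugged into a crawler configuration object.
const config = {
  linkExtractor: sameHostLinkExtractor,
};
```

Because the function only reads url and defaultExtractor, it can be exercised outside the Crawler by passing a URL object and a stub function.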

Examples

{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    if (/example\.com\/doc\//.test(url.href)) {
      // For all pages under /doc, only queue the first found link
      return defaultExtractor().slice(0, 1);
    }
    // Otherwise, use the default logic (queue all found links)
    return defaultExtractor();
  },
}

{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // This turns off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}

{
  linkExtractor: ({ $ }) => {
    // Access the DOM and queue the href of every element matching your selector
    return $('.my-link').map((i, el) => $(el).attr('href')).get();
  },
}

Parameters

`url` (type: URL, optional): URL of the resource that was just crawled.
`defaultExtractor` (type: function, optional): Default function used internally by the Crawler to discover URLs from a resource’s content. It returns an array of strings containing all URLs found on the current resource (if they match the configuration).
`$` (type: object, Cheerio instance, optional): A Cheerio instance containing the HTML of the crawled page.
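The sketch below shows one way `$` and `url` might be combined: read href attributes from the DOM, then resolve them against the crawled page so that only absolute URL strings are returned. The helper resolveHrefs and the selector a.my-link are hypothetical names introduced for illustration, and whether the Crawler resolves relative links itself is not stated here, so the resolution step is an assumption.

```javascript
// Hypothetical helper: turn raw href values into absolute URL strings,
// dropping missing, empty, or unparsable entries.
const resolveHrefs = (hrefs, baseHref) => {
  const resolved = [];
  for (const href of hrefs) {
    if (typeof href !== 'string' || href.trim() === '') continue;
    try {
      resolved.push(new URL(href, baseHref).href);
    } catch (err) {
      // Ignore values that are not valid URLs.
    }
  }
  return resolved;
};

const config = {
  linkExtractor: ({ $, url }) => {
    // Collect the href of every matching element (selector is an assumption).
    const hrefs = $('a.my-link')
      .map((i, el) => $(el).attr('href'))
      .get();
    // Resolve relative links against the URL of the crawled page.
    return resolveHrefs(hrefs, url.href);
  },
};
```

Keeping the URL handling in a separate pure function makes it easy to check without a Cheerio instance.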
