API Reference / REST API / Crawler REST API

Overview

The Crawler REST API is only available if you have access to Crawler.

The Crawler REST API lets your interact directly with your crawlers.

All API calls go through the https://crawler.algolia.com/api/1 domain.

Request Format

Authentication is done via basic authentication.

Response Format

The response format for all requests is a JSON object.

The success or failure of an API call is indicated by its HTTP status code. A 2xx status code indicates success, whereas a 4xx status code indicates failure. When a request fails, the response body is still JSON, but contains a message field which you can use for debugging.

Crawler Endpoints

Quick Reference

Verb Path Method

POST

/crawlers

Create a crawler

GET

/crawlers

Get available crawlers

GET

/crawlers/{id}

Get a crawler

PATCH

/crawlers/{id}

Partially update a crawler

PATCH

/crawlers/{id}/config

Partially update a crawler's configuration

POST

/crawlers/{id}/run

Run a crawler

POST

/crawlers/{id}/pause

Pause a crawler's run

POST

/crawlers/{id}/reindex

Reindex with a crawler

GET

/crawlers/{id}/stats/urls

Get statistics on a crawler

POST

/crawlers/{id}/urls/crawl

Crawl specific URLs

POST

/crawlers/{id}/test

Test a URL on a crawler

GET

/crawlers/{id}/tasks/{tid}

Get status of a task

POST

/crawlers/{id}/tasks/{tid}/cancel

Cancel a blocking task

Create a crawler

A

Path: /crawlers
HTTP Verb: POST

Description:

Create a new crawler with the given configuration.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
2
3
4
curl -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d @create_crawler.json

This is the create_crawler.json file used in the curl call.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
{
  "name": "Algolia Website",
  "config": {
    "appId": "YOUR_APP_ID",
    "apiKey": "YOUR_API_KEY",
    "indexPrefix": "crawler_",
    "rateLimit": 8,
    "startUrls": [
      "https://www.algolia.com"
    ],
    "actions": [
      {
        "indexName": "algolia_website",
        "pathsToMatch": [
          "https://www.algolia.com/**"
        ],
        "selectorsToMatch": [
          ".products",
          "!.featured"
        ],
        "fileTypesToMatch": [
          "html",
          "pdf"
        ],
        "recordExtractor": {
          "__type": "function",
          "source": "() => {}"
        }
      }
    ]
  }
}

When the request is successful, the HTTP response is a 200 OK and returns the ID of the created crawler:

1
2
3
{
  "id": "x000x000-x00x-00xx-x000-000000000x00"
}

Get available crawlers

A

Path: /crawlers
HTTP Verb: GET

Description:

Get a list of your available crawlers and pagination information

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers"

When the request is successful, the HTTP response is 200 OK and returns a list of your available crawlers:

1
2
3
4
5
6
7
8
9
10
11
{
  "items": [
    {
      "id": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809",
      "name": "My Crawler"
    }
  ],
  "itemsPerPage": 20,
  "page": 1,
  "total": 100
}

Get a crawler

A

Path: /crawlers/{id}
HTTP Verb: GET

Description:

Get information about the specified crawler and its settings.

Parameters:

id
type: string
Required

The ID of your targeted crawler.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}?withConfig=true"

When the request is successful, the HTTP response is 200 OK and returns information on the specified crawler:

1
2
3
4
5
6
7
8
9
10
11
12
13
{
   "name": "Algolia Blog",
   "createdAt": "2019-08-22T08:17:09.680Z",
   "updatedAt": "2019-09-02T09:57:33.968Z",
   "running": false,
   "reindexing": false,
   "blocked": false,
   "lastReindexStartedAt": "2019-08-29T07:56:51.682Z",
   "lastReindexEndedAt": "2019-08-29T07:57:35.991Z"
   "config": {
     // YOUR CRAWLER'S CONFIGURATION
   }
}

Partially update a crawler

A

Path: /crawlers/{id}
HTTP Verb: PATCH

Description:

Update a crawler’s configuration or change its name.

Parameters:

id
type: string
Required

The ID of your targeted crawler.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
2
3
curl -X PATCH --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/" \
  -H "Content-Type: application/json" \
  -d @crawler_patch.json

The crawler_patch.json file:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
{
  "name": "New config name",
  "config": {
    "appId": "YOUR_APP_ID",
    "apiKey": "YOUR_API_KEY",
    "indexPrefix": "crawler_",
    "rateLimit": 8,
    "startUrls": [
      "https://www.algolia.com"
    ],
    "actions": [
      {
        "indexName": "algolia_website",
        "pathsToMatch": [
          "https://www.algolia.com/**"
        ],
        "selectorsToMatch": [
          ".products",
          "!.featured"
        ],
        "fileTypesToMatch": [
          "html",
          "pdf"
        ],
        "recordExtractor": {
          "__type": "function",
          "source": "() => {}"
        }
      }
    ]
  }
}

When the request is successful, the HTTP response is a 200 OK and returns the taskId of your request:

1
2
3
{
  "taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Partially update a crawler’s configuration

A

Path: /crawlers/{id}/config
HTTP Verb: PATCH

Description:

Update parts of a crawler’s configuration.

The partial update endpoint currently only supports top-level configuration properties. For example, to update a recordExtractor, you need to pass the complete actions property.

Parameters:

id
type: string
Required

The ID of your targeted crawler.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
2
3
curl -X PATCH --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/config" \
  -H "Content-Type: application/json" \
  -d @crawler_config_patch.json

The crawler_config_patch.json file:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
{
  "rateLimit": 8,
  "startUrls": [
    "https://www.algolia.com",
    "https://www.algolia.com/blog"
  ],
  "actions": [
    {
      "indexName": "algolia_website",
      "pathsToMatch": [
        "https://www.algolia.com/**"
      ],
      "recordExtractor": {
        "__type": "function",
        "source": "() => {}"
      }
    }
  ]
}

When the request is successful, the HTTP response is a 200 OK and returns the taskId of your request:

1
2
3
{
  "taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Run a crawler

A

Path: /crawlers/{id}/run
HTTP Verb: POST

Description:

Request the specified crawler to run.

Parameters:

id
type: string
Required

The ID of your targeted crawler.

Errors:

  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/run"

When the request is successful, the HTTP response is a 200 OK and returns the task ID of the crawl:

1
2
3
{
  "taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Pause a crawler’s run

A

Path: /crawlers/{id}/pause
HTTP Verb: POST

Description:

Request the specified crawler to pause its eecution.

Parameters:

id
type: string
Required

The ID of your targeted crawler.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/pause"

When the request is successful, the HTTP response is a 200 OK and returns the task ID of your request:

1
2
3
{
  "taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Reindex with a crawler

A

Path: /crawlers/{id}/reindex
HTTP Verb: POST

Description:

Request the specified crawler to start (or restart) crawling.

Parameters:

id
type: string
Required

The ID of your targeted crawler.

Errors:

  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/reindex"

When the request is successful, the HTTP response is a 200 OK and returns the task ID of the crawl:

1
2
3
{
  "taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Get statistics on a crawler

A

Path: /crawlers/{id}/stats/urls
HTTP Verb: GET

Description:

Get a summary of the current status of crawled URLs for the specified crawler.

Parameters:

id
type: string
Required

The ID of your targeted crawler.

Errors:

  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/stats/urls"

When the request is successful, the HTTP response is a 200 OK and returns URL statistics for your specified crawler’s last crawl:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
{
  "count":129,
  "data":[
    {
      "reason":"success",
      "status":"DONE",
      "category":"success",
      "readable":"Success",
      "count":127
    },
    {
      "reason":"http_not_found",
      "status":"SKIPPED",
      "category":"fetch",
      "readable":"HTTP Not Found (404)",
      "count":1
    },
    {
      "reason":"network_error",
      "status":"FAILED",
      "category":"fetch",
      "readable":"Network error",
      "count":1
    }
  ]
}

Crawl specific URLs

A

Path: /crawlers/{id}/urls/crawl
HTTP Verb: POST

Description:

Immediately crawl the given URLs. The generated records are pushed to the live index if there’s no ongoing reindex, and to the temporary index otherwise.

There’s a rate limit of 200 calls every 24h for this endpoint.

Parameters:

path

id
type: string
Required

The ID of your targeted crawler.

body

urls
type: string[]
Required

The URLs to crawl (maximum 50 per call).

save
type: boolean
Optional
  • When true, the given URLs are added to the extraUrls list of your configuration (unless already present in startUrls or sitemaps).
  • When false, the URLs aren’t saved in the configuration.
  • When unspecified, the URLs are added to the extraUrls list of your configuration, but only if they haven’t been indexed during the last reindex, and they aren’t already present in startUrls or sitemaps.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
2
3
curl -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/urls/crawl" \
  -H "Content-Type: application/json" \
  -d @crawl_urls.json

The crawl_urls.json file:

1
2
3
4
5
6
7
{
  "urls": [
    "https://www.algolia.com/blog/article-42.html",
    "https://www.algolia.com/blog/article-101.html"
  ],
  "save": false
}

When the request is successful, the HTTP response is a 200 OK and returns the taskId of your request:

1
2
3
{
  "taskId": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Test a URL on a crawler

A

Path: /crawlers/{id}/test
HTTP Verb: POST

Description:

Test an URL against the given crawler’s configuration and see what will be processed. You can also override parts of the configuration to try your changes before updating the configuration.

Parameters:

id
type: string
Required

The ID of your targeted crawler

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
2
3
4
curl -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/test" \                                           [ INSERT
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d @test_crawler.json

This is the test_crawler.json file used in curl call.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
{
  "url": "https://www.algolia.com/blog",
  "config": {
    "appId": "YOUR_APP_ID",
    "apiKey": "YOUR_API_KEY",
    "indexPrefix": "crawler_",
    "rateLimit": 8,
    "startUrls": [
      "https://www.algolia.com"
    ],
    "actions": [
      {
        "indexName": "algolia_website",
        "pathsToMatch": [
          "https://www.algolia.com/**"
        ],
        "selectorsToMatch": [
          ".products",
          "!.featured"
        ],
        "fileTypesToMatch": [
          "html",
          "pdf"
        ],
        "recordExtractor": {
          "__type": "function",
          "source": "() => {}"
        }
      }
    ]
  }
}

When the request is successful, the HTTP response is a 200 OK and returns the ID of the created crawler:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
{
  "startDate": "2019-05-21T09:04:33.742Z",
  "endDate": "2019-05-21T09:04:33.923Z",
  "logs": [
    [
      "Processing url 'https://www.algolia.com/blog'"
    ]
  ],
  "records": [
    {
      "indexName": "testIndex",
      "records": [
        {
          "objectID": "https://www.algolia.com/blog",
          "numberOfLinks": 2
        }
      ],
      "recordsPerExtractor": [
        {
          "index": 0,
          "type": "custom",
          "records": [
            {
              "objectID": "https://www.algolia.com/blog"
            }
          ]
        }
      ]
    }
  ],
  "links": [
    "https://blog.algolia.com/challenging-migration-heroku-google-kubernetes-engine/",
    "https://blog.algolia.com/tale-two-engines-algolia-unity/"
  ],
  "externalData": {
    "dataSourceId1": {
      "data1": "val1",
      "data2": "val2"
    },
    "dataSourceId2": {
      "data1": "val1",
      "data2": "val2"
    }
  },
  "error": {}
}

Get status of a task

A

Path: /crawlers/{id}/tasks/{tid}
HTTP Verb: GET

Description:

Get the status of a specific task

Parameters:

id
type: string
Required

The ID of the targeted crawler.

tid
type: string
Required

The ID of the targeted task.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -X GET --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/tasks/${TASK_ID}"

When the request is successful, the HTTP response is 200 OK and returns whether the task is pending or not:

1
2
3
{
  "pending": false
}

Cancel a blocking task

A

Path: /crawlers/{id}/tasks/{tid}/cancel
HTTP Verb: POST

Description:

Cancel a specific task that is currently blocking a crawler. The blocking task ID to use for{tid} is available in the Get a Crawler response in the blockingTaskId field. This field is only present if the queried crawler is blocked. You can’t cancel non-blocking tasks. Using the endpoint on a non-blocking task returns an error.

Parameters:

id
type: string
Required

The ID of the targeted crawler.

tid
type: string
Required

The ID of the blocking task.

Errors:

  • 400: Bad request or request argument.
  • 401: Authorization information is missing or invalid.
  • 403: The user doesn’t have access rights to the specified crawler, or the crawler doesn’t exist.

Example:

A

1
curl -H "Content-Type: application/json" -X POST --user ${CRAWLER_USER_ID}:${CRAWLER_API_KEY} "https://crawler.algolia.com/api/1/crawlers/${CRAWLER_ID}/tasks/${TASK_ID}/cancel"

When the request is successful, the HTTP response is 200 OK.

Did you find this page helpful?