Apr. 07, 2021

Scaling to Larger Datasets

Preparation

If possible, keep Algolia in the loop when you plan to index a massive quantity of data. With advance notice, we can help by monitoring the infrastructure and engine, and by optimizing the configuration of the machines and indices. For example, we might need to manually fine-tune the internal sharding of the indices for your specific data.

Contact either your dedicated Solutions Engineer or support@algolia.com to prepare for massive indexing operations.

Configure your indices before pushing data

It’s best to configure your index before pushing the records. You only need to do this once per index.

Setting searchableAttributes before you push data is particularly important for indexing performance. By default, Algolia indexes all attributes, which requires more processing power than indexing only the attributes you need to search.
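As a minimal sketch of what such a configuration could look like, the snippet below builds a settings payload that restricts searchable attributes. The attribute names (title, brand, description) are hypothetical placeholders, not taken from this guide. With an official API client, a dictionary like this would typically be passed to the client's settings method before the first record is pushed; the exact method name depends on the client and version you use.

```python
import json

# Hypothetical attribute names: replace them with the attributes
# your users actually search in.
settings = {
    "searchableAttributes": ["title", "brand", "description"],
}

# Serialized form of the settings, as it would be sent to the API.
payload = json.dumps(settings)
print(payload)
```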

Ensure the data fits on your machine

Plans with a dedicated cluster come with servers with 128 GB of RAM. For optimal performance, you should keep the total size of indices below 80% of the total allocated RAM, since Algolia stores all indices in memory. The remaining 20% is for other tasks, such as indexing. When the data size exceeds the RAM capacity, the indices swap back and forth between the SSD and the RAM as operations are performed, which severely degrades performance.

Since Algolia processes your data, the actual size of an index is often larger than the size of your raw data. The exact factor depends heavily on the structure of your data and the configuration of your index. Usually, an index is two to three times as large as the raw data.
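The figures above can be turned into a rough capacity check. The sketch below assumes the 128 GB of RAM and 80% guideline mentioned in this section, and uses the upper end of the two-to-three-times range as a default expansion factor. The function names are illustrative, and the real expansion factor for your data may differ.

```python
def index_budget_gb(ram_gb: float = 128.0, usable_fraction: float = 0.8) -> float:
    """RAM available for indices; the remainder is left for indexing and other tasks."""
    return ram_gb * usable_fraction


def estimated_index_size_gb(raw_data_gb: float, expansion_factor: float = 3.0) -> float:
    """Rough index size, assuming the index is `expansion_factor` times the raw data."""
    return raw_data_gb * expansion_factor


def fits_in_ram(raw_data_gb: float, expansion_factor: float = 3.0,
                ram_gb: float = 128.0, usable_fraction: float = 0.8) -> bool:
    """True if the estimated index size stays within the RAM budget."""
    estimated = estimated_index_size_gb(raw_data_gb, expansion_factor)
    return estimated <= index_budget_gb(ram_gb, usable_fraction)


# 30 GB of raw data at a factor of 3 is about 90 GB, within the ~102.4 GB budget.
print(fits_in_ram(30))
# 40 GB of raw data at a factor of 3 is about 120 GB, which exceeds the budget.
print(fits_in_ram(40))
```

Under these assumptions, a 128 GB machine leaves roughly 102 GB for indices, which corresponds to about 34 GB of raw data at a factor of three, or about 51 GB at a factor of two.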

Pushing data

Use the API clients

It’s best to use the official API clients for pushing data, as opposed to using the REST API directly, a custom wrapper, or an unofficial client that Algolia doesn’t maintain internally. The official API clients follow strict specifications that contain optimizations for both performance and reliability. These optimizations are required when performing bulk imports.

Batch indexing jobs

All official API clients have a Save objects method that lets you push records in batches. Pushing records one by one is much harder for the engine to process, because it needs to keep track of the progress of each job. Batching reduces this overhead.

Batches between 1 and 100K records tend to be optimal, depending on the average record size. Each batch should remain below ~10 MB of data for optimal performance. The API can technically handle batches up to 1 GB, but sending much smaller batches yields better indexing performance.
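One way to respect both limits is to group records by count and by serialized size before sending them. The helper below is a sketch under stated assumptions: it measures each record by the length of its JSON encoding, treats 10 MB as 10,000,000 bytes, and the function name make_batches is illustrative rather than part of any API client. Each yielded batch could then be passed to the client's Save objects method.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

MAX_BATCH_BYTES = 10 * 1000 * 1000   # ~10 MB per batch, per the guideline above
MAX_BATCH_RECORDS = 100_000          # upper bound on records per batch


def make_batches(records: Iterable[Dict[str, Any]],
                 max_bytes: int = MAX_BATCH_BYTES,
                 max_records: int = MAX_BATCH_RECORDS) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of records that stay under both the size and the count limit."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        # Close the current batch if adding this record would exceed a limit.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_records):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


# Hypothetical records, split into batches of at most 3 for demonstration.
records = [{"objectID": str(i), "name": "item %d" % i} for i in range(10)]
for batch in make_batches(records, max_records=3):
    print(len(batch))
```

Note that a single record larger than max_bytes still ends up in a batch of its own; this sketch does not reject or split oversized records.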

Multi-thread your indexing

You can push the data from multiple servers, or multiple parallel workers.
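As an illustration of parallel workers on a single machine, the sketch below sends batches through a thread pool. The push_batch callable is a hypothetical stand-in for whatever call actually sends a batch (for example, a wrapper around the client's Save objects method); here it is assumed to return the number of records it pushed. The worker count of 4 is an arbitrary starting point, not a recommendation from this guide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List

Batch = List[Dict[str, Any]]


def push_in_parallel(batches: Iterable[Batch],
                     push_batch: Callable[[Batch], int],
                     workers: int = 4) -> int:
    """Push batches concurrently and return the total number of records pushed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map submits every batch; consuming the results re-raises
        # the first exception raised by a worker, if any.
        return sum(pool.map(push_batch, batches))


def fake_push(batch: Batch) -> int:
    """Stand-in for a real network call: only counts the records."""
    return len(batch)


batches = [[{"objectID": "1"}, {"objectID": "2"}], [{"objectID": "3"}]]
print(push_in_parallel(batches, fake_push, workers=2))
```

Threads suit this workload because each worker spends most of its time waiting on the network. The same function signature would work with separate processes or servers, as long as each one receives its own share of the batches.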
