Searching Confluence Data
On this page
Introduction
Using Algolia to search through your Confluence space brings powerful relevance tools and customization in displaying the results. It’s also helpful when you want to centralize all your online resources (Google Drive, Dropbox, Salesforce…) under a single search experience.
The self-hosted solution and cloud service of Confluence have distinct APIs, but here we’ll focus on Confluence Cloud.
This tutorial guides you into indexing the pages from Confluence Cloud, following these steps, using Node.js:
- Getting your Confluence credentials
- Fetching and indexing documents
- Looping over the pagination
- Setting up incremental updates
Prerequisites
Familiar with Node.js
This tutorial assumes you are familiar with Node.js, how it works, and how to create and run Node.js scripts. If you want to learn more before going further, we recommend you read the following resources:
Also, you need to install Node.js in your environment.
Have a Confluence and Algolia account
For this tutorial, we assume that you:
- already have a Confluence account,
- already created an Algolia account. If not, you can create an account.
Install dependencies
You need to connect to your Algolia account. For that, you can use our Algolia Search library. We’ll also install Request-Promise-Native to avoid using callbacks, and striptags to sanitize HTML data.
Let’s add these dependencies to your project by running the following command in your terminal:
1
npm install algoliasearch request-promise-native striptags
Fetching data
Getting your Confluence credentials
Confluence provides two ways to authenticate: JSON Web Tokens and Basic Auth. For the sake of simplicity, we will go for Basic Auth. For more information on JWT authentication, you can read Authentication for Apps on the Confluence Cloud documentation.
On the Confluence Cloud admin panel, go to Users › Invite user, and add the new user’s email address. We recommend using a service account email because you’ll have to store the credentials, plus you’ll be able to tweak the space and group permissions regardless of your employees’ access levels. See Invite, edit, and remove users on the Confluence Cloud documentation to know more.
On the first login, you’ll be asked to create a password for the email you just invited. The email and password of this account are the credentials for Basic Auth.
Building the query
We’ll first create a helpers.js
file and add a reusable function for querying resources.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
const rp = require('request-promise-native');
const CONFLUENCE_HOST = 'https://yourdomain.atlassian.net/wiki';
const CONFLUENCE_USERNAME = 'user@example.com';
const CONFLUENCE_PASSWORD = 'user_password';
module.exports = {
confluenceGet(uri) {
return rp({
url: CONFLUENCE_HOST + uri,
// GET parameters
qs: {
limit: 20, // number of item per page
orderBy: 'history.lastUpdated', // sort them by last updated
expand: [
// fields to retrieve
'history.lastUpdated',
'ancestors.page',
'descendants.page',
'body.view',
'space'
].join(',')
},
headers: {
// auth headers
Authorization: `Basic ${Buffer.from(
`${CONFLUENCE_USERNAME}:${CONFLUENCE_PASSWORD}`
).toString('base64')}`
},
json: true
});
}
};
In a new index.js
file, which will be our main job, we can make our first API call:
1
2
3
4
5
const { confluenceGet } = require('./helpers.js');
const run = () => confluenceGet('/rest/api/content');
run();
Preparing the records
Before sending content to Algolia, we want to make sure we only include the attributes we need. We also need to keep our data as flat as possible for performance reasons. We’ll therefore create a parseDocuments
function that takes an array of documents and returns properly formatted records.
First, we need to create two internal functions: buildURL
to convert a Confluence URI into a full URL, and parseContent
to clean up the HTML content.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// helpers.js
const rp = require('request-promise-native');
const striptags = require('striptags');
const buildURL = uri =>
uri ? CONFLUENCE_HOST + uri.replace(/^\/wiki/, '') : false;
const parseContent = html =>
html
? striptags(html)
.replace(/(\r\n?)+/g, ' ')
.replace(/\s+/g, ' ')
: '';
Then, we can add our parseDocuments
function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
module.exports = {
confluenceGet(uri) {
// ...
},
parseDocuments(documents) {
return documents.map(doc => ({
objectID: doc.id,
name: doc.title,
url: buildURL(doc._links.webui),
space: doc.space.name,
spaceMeta: {
id: doc.space.id,
key: doc.space.key,
url: buildURL(doc.space._links.webui)
},
lastUpdatedAt: doc.history.lastUpdated.when,
lastUpdatedBy: doc.history.lastUpdated.by.displayName,
lastUpdatedByPicture: buildURL(
doc.history.lastUpdated.by.profilePicture.path.replace(
/(\?[^\?]*)?$/,
'?s=200'
)
),
createdAt: doc.history.createdDate,
createdBy: doc.history.createdBy.displayName,
createdByPicture: buildURL(
doc.history.createdBy.profilePicture.path.replace(
/(\?[^\?]*)?$/,
'?s=200'
)
),
path: doc.ancestors.map(({title}) => title).join(' › '),
level: doc.ancestors.length,
ancestors: doc.ancestors.map(({id, title, _links}) => ({
id: id,
name: title,
url: buildURL(_links.webui)
})),
children: doc.descendants
? doc.descendants.page.results.map(({id, title, _links}) => ({
id: id,
name: title,
url: buildURL(_links.webui)
}))
: [],
content: parseContent(doc.body.view.value)
}));
}
};
Let’s break down what’s happening here:
- We deliberately set an
objectID
, so Algolia doesn’t generate one for us. This will allow us to avoid creating duplicates every time we run the script. - We parse the
lastUpdatedByPicture
to remove unnecessary parameters, and appends=200
to set the desired size of the output image (in pixels). - We set a
path
attribute to make ancestors searchable, and have subpages showing when searching for a document. - We set a
level
attribute that represents the depth of the document in the wiki tree. It could be useful in the tie-breaking strategy. - We keep track of
ancestors
andchildren
for presentation purposes, in case you needed to display clickable breadcrumbs and dependencies. - We strip HTML and line breaks from our
content
to avoid noise during search.
Indexing content
Setting up your index
In your Algolia Dashboard, create a new confluence
index and add an API key with all the “records” permissions checked.
Sending to Algolia
The Confluence API paginates results. To get them all, you need to keep on looping as long as there is a next
link available in the response.
Confluence space could get quite crowded. For that reason, it would be better for performance reasons to upload documents to Algolia in several batches, instead of sending one big payload.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// index.js
const algoliasearch = require('algoliasearch');
const { confluenceGet, parseDocuments } = require('./helpers.js');
const client = algoliasearch('YourApplicationID', 'YourAdminAPIKey');
const index = client.initIndex('confluence');
const run = () => {
const saveObjects = (link = '/rest/api/content') => confluenceGet(link).then(({results, _links}) => {
index.saveObjects(parseDocuments(results)).then(res => {
if (_links.next) saveObjects(_links.next);
});
});
saveObjects();
};
run();
Dealing with large documents
As we index the content of each document and not only their metadata, we may exceed the record size limit. At Algolia, we recommend that you keep records small in order to get more relevant and faster search. To prevent this, we can chunk the data and use the distinct feature. This means we’re going to save multiple records with the same metadata, each carrying a chunk of the content, but we’ll set our search results to only show one document at a time.
We need to update parseDocuments
to split a document into several records, based on content length.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// helpers.js
module.exports = {
confluenceGet(uri) {
// ...
},
parseDocuments(documents) {
return documents.map(({body}) => {
const record = {
// ...
content: null // initialize with null value instead
};
let content = parseContent(body.view.value);
while (content.length) {
// extract the first 600 characters (without splitting words)
const chunk = content.replace(/^(.{600}[^\s]*).*/, '$1');
// remove chunk from the original content
content = content.substring(chunk.length);
// inject the content chunk into a copy of the initial record
return Object.assign({}, record, { content: chunk });
}
});
}
};
We also need to set the attribute name
as attributeForDistinct
in our Algolia index. You can do this either programmatically or from your Dashboard.
1
2
3
index.setSettings({ attributeForDistinct: 'name' })).then(() => {
// done
})
When you build your search, you’ll need to set the distinct
parameter to true
in the search query, to make sure you only get one hit for each chunked document.
Incremental sync
Incrementally synchronizing your data is about only indexing pages that were recently updated within a certain period of time (for example, the last 10 minutes) and running the script at a fixed interval. When your Confluence space has a large amount of pages, this prevents you from hitting the API rate limits and indexation will be fast.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// index.js
const run = (from = 0) => {
const saveObjects = (link = '/rest/api/content') => {
let lastUpdatedAt = 0;
return confluenceGet(link).then(({results, _links}) => {
index.saveObjects(parseDocuments(results)).then(res => {
lastUpdatedAt = new Date(
results[results.length - 1].lastUpdatedAt
).getTime();
if (_links.next && lastUpdatedAt >= from)
saveObjects(_links.next);
});
});
};
saveObjects();
};
const args = process.argv.slice(2);
const from = args.length
? new Date(args[0]).getTime()
: Date.now() - 10 * 60 * 1000;
run(from);
From now on, you only need to schedule a job that will run node index.js
every 10 minutes. If you need to run it from a specific date, run the script with the date as an ISO 8601 string in arguments (for example, node index.js 2000-01-01T00:00:00+00:00
).
You can also run node index.js 0
to perform a full sync of your space.