Set Up a Content Source with an API

SearchUnify supports more than 30 content sources out of the box. Any data repository that is not on the list can be turned into a content source through its API. The connection is secure because the connector supports OAuth 1.0 authentication. This article walks you through the process.

PREREQUISITES

  • Content source with a working API and structured data
  • Familiarity with essential REST API terminology
  • Access to the API of the platform where the content resides

Establish a Connection

  1. Navigate to Content Sources and click Add New Content Sources.

  2. Find API and click Add.

  3. Enter the following details for authentication:
    • Name. Enter a label for your content source.

    • Client URL. Enter the base URL of your API. It might resemble https://mycompany.platform.com/api/v2.

    • Language. Select the language of your API platform. English-en is the default selection.

    • Permission. Leave this field blank.

  4. Click Connect. If the connection fails, a quick reachability check is shown below.
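
Before or after clicking Connect, you can sanity-check that the base URL is reachable with a one-off request from a script. A minimal sketch in Python, assuming the hypothetical base URL from above:

    import requests

    # Hypothetical base URL; substitute your own Client URL.
    resp = requests.get("https://mycompany.platform.com/api/v2", timeout=10)
    print(resp.status_code)  # any HTTP status (even 401) proves the URL resolves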

Configure the Calls

The Configuration section covers how to set up crawling and indexing of your content through APIs.

  1. Select either GET or POST from Method Type.

  2. Select an Authentication Type. You can pick:
    • No Auth. No details are needed to authenticate the connection with your content source.
    • Basic Auth. Enter the user name and password required to make calls to your REST API.
    • OAuth1. Enter the Consumer Key, Consumer Secret, Token, Token Secret, and Realm to make calls to your REST API.

  3. In Header, write all the key-value pairs in JSON. To keep the field empty, write a pair of curly braces ({}).

  4. Open Request Parameters.

  5. Enter query parameters in JSON to consume your REST API.

  6. Select a pagination method (the sketch after these steps illustrates two of them):
    • By Page No. Almost no platform returns all of its data in a single call, so a caller has to make multiple calls to fetch every document, and each call differs slightly. The difference is usually a field in your query parameters, which you select from the Page No. Tag dropdown. Page No. Tag appears when you select By Page No. as your pagination method; it is incremented with each call and fetches a new batch of documents on each run. Default Page Size Tag, which also appears when you select By Start Offset, determines the size of each page. If the value of its parameter is 10, each page holds 10 results.
    • By Start Offset. It is similar to By Page No., except that the variable used to differentiate calls is Page Start Offset Tag.
    • By Next Link. The response contains an object with two key-value pairs: a Boolean and a link. The crawler uses the value stored in the link as the differentiating variable as long as the Boolean is true. In Next Link Tag, select the URL key used to move to the next page.
    • No Pagination. All the data is indexed in one go. It is extremely rare to encounter a content source that accepts this method.
    • Has More & Next ID. The response contains an object with two key-value pairs: a Boolean and a tag. The crawler uses the value stored in the tag as the differentiating variable as long as the Boolean is true. In the Next ID field, select the response key that holds the ID of the next page.

  7. Open Response Parameters and, in Response, enter a JSON snippet that contains all the response parameters.

  8. Select a Result Iteration Tag and an Index Tag. Result Iteration Tag is the response field that contains the data you wish to index. Index Tag holds a unique ID that distinguishes documents from one another.

  9. SearchUnify stops making calls to your content source once all your documents have been indexed. Total Results Method keeps count of the total and indexed documents in one of two ways: By Total Count or By Has More Flag. Select one.

  10. From Total Count Tag, select the parameter that returns the total count. Then click Set.
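
Taken together, these settings describe how the crawler calls your API. The Python sketch below shows the same flow for a hypothetical endpoint; the header, parameter, and response field names (Authorization, page, pageSize, results, id, total, hasMore, nextLink) are illustrative assumptions, not required names.

    import requests

    # All names below are illustrative assumptions, not SearchUnify requirements.
    BASE_URL = "https://mycompany.platform.com/api/v2"   # the Client URL

    HEADERS = {                       # key-value pairs entered in Header (step 3)
        "Authorization": "Bearer <access-token>",
        "Content-Type": "application/json",
    }

    PARAMS = {"pageSize": 100}        # query parameters entered in step 5


    def index(doc_id, doc):
        """Stand-in for whatever indexing work the crawler performs."""
        print(f"indexed {doc_id}")


    def crawl_by_page_no():
        """By Page No.: increment a page counter until the total count is reached."""
        page, fetched = 1, 0
        while True:
            resp = requests.get(f"{BASE_URL}/articles", headers=HEADERS,
                                params={**PARAMS, "page": page},  # "page" = Page No. Tag
                                timeout=30)
            resp.raise_for_status()
            body = resp.json()
            for doc in body["results"]:        # "results" = Result Iteration Tag
                index(doc["id"], doc)          # "id" = Index Tag (unique per document)
            fetched += len(body["results"])
            if not body["results"] or fetched >= body["total"]:  # "total" = Total Count Tag
                break
            page += 1


    def crawl_by_next_link():
        """By Next Link: follow the link field while the Boolean flag stays true."""
        url = f"{BASE_URL}/articles"
        while url:
            resp = requests.get(url, headers=HEADERS, params=PARAMS, timeout=30)
            resp.raise_for_status()
            body = resp.json()
            for doc in body["results"]:
                index(doc["id"], doc)
            # "hasMore" and "nextLink" = the Boolean and the Next Link Tag
            url = body["nextLink"] if body.get("hasMore") else None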

Re-Connect

The Authentication screen is displayed when an existing Content Source is opened for editing. An admin can edit a Content Source for multiple reasons, including:

  • To reauthenticate

  • To fix a crawl error

  • To change frequency

  • To add or remove an object or a field for crawling

When you edit a Content Source, one of two cases applies:

Case 1: There are no crawl errors and the Content Source authentication is valid. A Connected message is displayed.

Case 2: There is a crawl error or the authentication details have changed. A Re-Connect button is displayed. In both situations, the Content Source connection must be authenticated again. To re-authenticate, enter the authentication details and click Re-Connect.

Set Up Crawl Frequency

The first crawl is always manual and is performed after configuring the content source. In Choose A Date, select a date to start crawling; only the data created after the selected date is crawled. For now, keep the frequency at its default value, Never, click Set, and move to the next section.

Add the Objects and Fields for Indexing

  1. Enter an object where your data is stored, give it a label, and click Add Object. (An illustrative example follows these steps.)

  2. Repeat the previous step if you wish to add more than one object.

  3. Click the icon to manage fields.

  4. Add all the fields that you wish to index and click Save.

  5. Click Save.
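
For instance, if your API serves knowledge articles, the object and its indexable fields might map out as below. All names are illustrative assumptions; the actual objects and fields depend entirely on your API.

    # Illustrative only: object and field names depend on your API.
    objects = {
        "articles": {                       # the object where the data is stored
            "label": "Knowledge Articles",  # the label given in step 1
            "fields": ["title", "body", "url", "updated_at"],  # fields to index
        }
    }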

After the First Crawl

Return to the Content Sources screen and start a crawl from the Actions column. The number of indexed documents is updated after the crawl is complete. You can view crawl progress from the Actions column; documentation on crawl progress is in View Crawl Logs.

Once the first crawl is complete, open the content source for editing from the Actions column and set a crawl frequency.

  1. In Choose a Date, click the field to open a calendar and select a date. Only the data created after the selected date is indexed.

  2. The following options are available for the Frequency field:

    • When Never is selected, the content source is not crawled until an admin opts for a manual crawl on the Content Sources screen.

    • When Minutes is selected, a new dropdown appears where the admin can choose from three values: 15, 20, and 30. Picking 20 means that content source crawling starts every 20 minutes.

    • When Hours is selected, a new dropdown is displayed where the admin can choose from eight values: 1, 2, 3, 4, 6, 8, 12, and 24. Picking 8 means that content source crawling starts every 8 hours.

    • When Daily is selected, a new dropdown is displayed where the admin can pick a value between 0 and 23. If 15 is chosen, then content source crawling starts at 3 p.m. (1500 hours) every day.

    • When Day of Week is selected, a new dropdown is displayed where the admin can pick a day of the week. If Tuesday is chosen, then content source crawling starts at 0000 hours every Tuesday.

    • When Day of Month is selected, a new dropdown appears where the admin can select a value between 1 and 30. If 20 is chosen, then content source crawling starts on the 20th of each month.

      It's recommended to pick a date in the range 1-28. If 30 is chosen, then the crawler may throw an error in February: "Chosen date will not work for this month."

    • When Yearly is selected, the content source crawling starts at midnight on 1 January each year.

  3. Click Set to save the crawl frequency settings; you are then taken to the Rules tab.

Smart Crawls

Smart Crawls reduce indexing time. When the feature is toggled on, the search index is not created from scratch during each frequency crawl. Instead, it is updated to reflect the data changes in your content source. Say two new docs have been created in your content source since the previous crawl; only those two docs are added to the index.

When Smart Crawls is toggled off, the search index is rebuilt during each frequency crawl: instead of merely adding the docs created in your content source since the last crawl, the current index is deleted entirely and a new one is generated from scratch.
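
The idea behind a smart crawl can be sketched as a delta fetch. The snippet below is a minimal illustration, assuming a hypothetical updated_since query parameter and a hasMore/page-style response; it is not SearchUnify's internal implementation.

    from datetime import datetime, timedelta, timezone

    import requests


    def smart_crawl(base_url, last_crawl_time):
        """Fetch only the documents changed since the previous crawl.

        Hypothetical: assumes the API accepts an `updated_since` query
        parameter; field names are illustrative.
        """
        params = {"updated_since": last_crawl_time.isoformat(), "page": 1}
        changed = []
        while True:
            resp = requests.get(f"{base_url}/articles", params=params, timeout=30)
            resp.raise_for_status()
            body = resp.json()
            changed.extend(body["results"])
            if not body.get("hasMore"):
                break
            params["page"] += 1
        return changed  # merged into the existing index instead of a full rebuild


    # Example: pick up everything changed in the last day.
    docs = smart_crawl("https://mycompany.platform.com/api/v2",
                       datetime.now(timezone.utc) - timedelta(days=1))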