Crawl and Index CSV, JSON, and Excel Files

If your company data is stored in Excel sheets, CSV files, or JSON files, you can now make it searchable on any platform where SearchUnify-powered search is deployed. The newly introduced File content source crawls these files and adds their data to the SearchUnify index.

Establish a Connection

  1. Navigate to Content Sources and click Add New Content Sources.

  2. Find File content source and click Add.

  3. Enter the required details:

    • Name: Give your content source a name.

    • Add File URL or Upload Xlsx/Csv/Json files: Either enter the URL of the file or upload the file whose data you want to crawl into the SearchUnify index.

    • Primary ID: Select the parent column from the uploaded file that you want to keep as the ID field.

      Note: If you leave this field empty, the system will automatically pick the primary ID from the first three columns of the file.

    • Language: Select the language of the contents of your uploaded file.

  4. After you have uploaded the file, click Connect.
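
Before choosing a Primary ID, it can help to check which columns could serve as one. The following is a minimal sketch, not SearchUnify's own selection logic (which is not documented here): it assumes a good ID column has a non-empty, unique value in every row, and it inspects only the first three columns of a CSV file. The function name and the sample data are hypothetical.

```python
import csv
import io


def unique_id_candidates(csv_text, max_columns=3):
    """Return the names of the first `max_columns` columns whose values
    are non-empty and unique in every row (an assumed ID criterion)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = (reader.fieldnames or [])[:max_columns]
    seen = {name: set() for name in columns}
    valid = {name: True for name in columns}
    for row in reader:
        for name in columns:
            value = (row.get(name) or "").strip()
            # An empty or repeated value disqualifies the column.
            if not value or value in seen[name]:
                valid[name] = False
            seen[name].add(value)
    return [name for name in columns if valid[name]]


# Hypothetical sample file contents.
SAMPLE = """article_id,title,category
101,Reset a password,Account
102,Export a report,Reports
103,Reset a password,Account
"""

candidates = unique_id_candidates(SAMPLE)
print(candidates)
```

In this sample only article_id qualifies, because title and category both contain repeated values.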

Re-Connect

An admin can edit a Content Source for multiple reasons, including:

  • To reauthenticate

  • To fix a crawl error

  • To change frequency

  • To add or remove an object or a field for crawling

When a Content Source is edited, either a Connect or a Re-Connect button is displayed.

Case 1: When the Connect button is displayed:

The Connect button is displayed if the Content Source authentication is successful. Along with the button, the following message is displayed: "There are no crawl errors and the Content Source authentication is valid."

Fig. The Connect button is displayed on the Authentication tab.

Case 2: When the Re-Connect button is displayed:

The Re-Connect button is displayed when the authentication details change or the authentication fails for any reason.

In both cases, the Content Source connection must be authenticated again. To reauthenticate a Content Source, enter the authentication details, and click Re-Connect.

Fig. The Re-Connect button is displayed on the Authentication tab.

Set Up Crawl Frequency

The first crawl is always performed manually after configuring the content source. In the Choose a Date field, select a date to start the crawl; only data created after the selected date will be crawled. For now, leave the frequency set to its default value, Never, and click Set.

Fig. The Frequency tab when "Frequency" is set to "Never".

Add Objects and Fields for Indexing

Navigate to the Rules tab to open the By Content Type subtab. Because the files that admins upload may contain different sets of data, SearchUnify lets admins add multiple Objects so that each set is crawled and indexed separately.

  1. To add an object, enter object details in the Name and Search Label fields, and click Add Object. The name and the search label do not have to be valid HTML tags. You can add different objects for different sections of the file.

  2. After adding the required object(s), click Manage Fields to add the fields whose data you want to crawl.

  3. To add a field for crawling, add the required values as listed below:

    1. Give the selector a Name and Label.

    2. The Selector is the most important field. The values are pre-configured from the uploaded files. Select the field values you want to crawl.

    3. Assign the Selector a Type. The default type is string. Use the default unless the value is a date or a time.

    4. The Single and Multiple field options determine how data from repeated elements is managed.

      1. Single Field

        Use this option when you want to consolidate repeated content into a single entry in the search index.

        Example:

        Imagine an internal knowledge base article where the content is divided into multiple sections, such as an introduction, benefits, and a conclusion. If you select "Single," all sections will be combined into one searchable entry. This ensures users can find the entire article as a single result, making it easier to access the full context of the information.

      2. Multiple Field

        Use this option when you want to capture each repeated piece of information as a separate entry in the search index.

        Example:

        Imagine a corporate document where each key point is listed as a bullet point, and each bullet point is wrapped inside a <li> HTML tag. If you select Multiple, each key point (e.g., Increase productivity, Streamline workflows, Enhance collaboration) will be stored as a separate searchable entry. This ensures that users can search for and find individual points directly, rather than retrieving the entire document for a single query.

    5. Use IsMerged to mark fields that are combined.

  4. Click Add and then Apply.

  5. Save the settings and start a crawl to index your data in SearchUnify.
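
The difference between the Single and Multiple field options described above can be sketched in a few lines. This is an illustration only: the function name, the mode strings, and the choice to join values with a space are assumptions, not SearchUnify's actual storage format.

```python
def index_field(values, mode):
    """Illustrate how repeated values might be stored for one field.

    mode "single"   -> one consolidated entry
    mode "multiple" -> one entry per repeated value
    """
    cleaned = [value.strip() for value in values if value.strip()]
    if mode == "single":
        # Join with a space; the real separator is not documented.
        return [" ".join(cleaned)]
    if mode == "multiple":
        return cleaned
    raise ValueError("mode must be 'single' or 'multiple'")


key_points = [
    "Increase productivity",
    "Streamline workflows",
    "Enhance collaboration",
]

single_entry = index_field(key_points, "single")
multiple_entries = index_field(key_points, "multiple")

print(len(single_entry))      # one consolidated entry
print(len(multiple_entries))  # one entry per key point
```

With Single, a search matches the combined text as one result; with Multiple, each key point can be matched on its own.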

Find and Replace

Users on the Q2 '24 release or a later version will notice a new button next to each object on the Rules screen. It resembles a magnifying glass and is labeled "Find and Replace." You can use this feature to find and replace values in a single field or across all fields. The changes will occur in the search index and not in your content source.

Fig. The "Find and Replace" button on the Rules tab in the Actions column.

Find and Replace proves valuable in various scenarios. A common use case is when a product name is altered. Suppose your product name has changed from "SearchUnify" to "SUnify," and you wish for the search result titles to immediately reflect this change.

  1. To make the change, click the Find and Replace button.

  2. Now, choose either "All" or a specific content source field from the "Enter Name" dropdown. When "All" is selected, any value in the "Find" column is replaced with the corresponding value in the "Replace" column across all content source fields. If a particular field is chosen, the old value is replaced with the new value solely within the selected field.

  3. Enter the value to be replaced in the Find column and the new value in the Replace column. Both columns accept regular expressions.

    Fig. Snapshot of Find and Replace.

  4. Click Add. You will see a warning if you are replacing a value in all fields.

  5. Click Save to apply the settings.

  6. Run a crawl so that the updated values are reflected in the search results.
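
The effect of choosing "All" versus a single field can be sketched as follows. This is a hedged illustration using Python's re module; the regular expression dialect that SearchUnify accepts is not specified here, and the function name, the dictionary-based document, and the field names are hypothetical. As in the product, the original content is left untouched and only the indexed copy changes.

```python
import re


def find_and_replace(document, find, replace, field="All"):
    """Return a copy of `document` with `find` (a regular expression)
    replaced by `replace` in one field, or in every field for "All"."""
    pattern = re.compile(find)
    updated = dict(document)  # the source document is not modified
    for name, value in document.items():
        if field in ("All", name) and isinstance(value, str):
            updated[name] = pattern.sub(replace, value)
    return updated


# Hypothetical indexed document.
doc = {
    "title": "SearchUnify Release Notes",
    "body": "SearchUnify adds a File content source.",
}

title_only = find_and_replace(doc, r"SearchUnify", "SUnify", field="title")
all_fields = find_and_replace(doc, r"SearchUnify", "SUnify")

print(title_only["title"])
print(all_fields["body"])
```

Restricting the rule to one field is the safer choice when the old value may legitimately appear elsewhere, which is likely why a warning is shown for "All".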

After the First Crawl

Return to the Content Sources screen and click in Actions. The number of indexed documents is updated after the crawl is complete. You can view crawl progress in Actions. Documentation on crawl progress is in View Crawl Logs.

Once the first crawl is complete, click in Actions to open the content source for editing, and set a crawl frequency.

  1. In Choose a Date, click to open a calendar and select a date. Only the data created or updated after the selected date is indexed.

  2. The following options are available for the Frequency field:

    • When Never is selected, the content source is not crawled until an admin opts for a manual crawl on the Content Sources screen.

    • When Minutes is selected, a new dropdown appears where the admin can choose between three values: 15, 20, and 30. Picking 20 means that the content source crawling starts every 20 minutes.

    • When Hours is selected, a new dropdown is displayed where the admin can choose one of eight values: 1, 2, 3, 4, 6, 8, 12, and 24. Selecting 8 initiates content crawling every 8 hours.

    • When Daily is selected, a new dropdown is displayed where the admin can pick a value between 0 and 23. If 15 is selected, the content source crawling starts at 3:00 p.m. (1500 hours) each day.

    • When Day of Week is selected, a new dropdown is displayed where the admin can pick a day of the week. If Tuesday is chosen, then content source crawling starts at 0000 hours on every Tuesday.

    • When Day of Month is selected, a new dropdown appears where the admin can select a value between 1 and 30. If 20 is chosen, then content source crawling starts on the 20th of each month.

      It is recommended to pick a date between the 1st and 28th of the month. If 30 is chosen, then the crawler may throw an error in February. The error will be “Chosen date will not work for this month.”

    • When Yearly is selected, the content source crawling starts at midnight on 1 January each year.

    Fig. The content source crawling starts at 00:00 on each Tuesday.

  3. Click Set to save the crawl frequency settings.

  4. Click Save.
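
The frequency options above, and the reason for the 1st-to-28th recommendation, can be checked with a short sketch. The allowed values are taken from the list above; the function names are hypothetical, and how the crawler itself schedules runs is not documented here.

```python
import calendar

# Values offered in each Frequency dropdown, as listed above.
ALLOWED_VALUES = {
    "Minutes": {15, 20, 30},
    "Hours": {1, 2, 3, 4, 6, 8, 12, 24},
    "Daily": set(range(0, 24)),         # hour of the day, 0 to 23
    "Day of Month": set(range(1, 31)),  # 1 to 30
}


def is_listed_value(frequency, value):
    """Return True if `value` appears in the dropdown for `frequency`."""
    return value in ALLOWED_VALUES.get(frequency, set())


def months_missing_day(year, day_of_month):
    """Return the month numbers in `year` that are shorter than
    `day_of_month`, so a crawl set for that day cannot run in them."""
    return [
        month
        for month in range(1, 13)
        if calendar.monthrange(year, month)[1] < day_of_month
    ]


skipped_for_30 = months_missing_day(2025, 30)
skipped_for_28 = months_missing_day(2025, 28)

print(is_listed_value("Hours", 5))
print(skipped_for_30)
print(skipped_for_28)
```

Any day from 1 to 28 exists in every month, so no month is skipped; the 30th is missing in February even in a leap year.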