Use Google Drive As a Content Source

Use SearchUnify-powered search to crawl, index, and search the data in your Google Drive instance. This article explains how to start using Google Drive as a content repository for your search clients.

PERMISSIONS

The person authenticating Google Drive can index only those files to which they have view-access.

SearchUnify respects user permissions during searches. A user can find a file named company_file_x through search only if the user has access to company_file_x.
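The permission rule above can be illustrated with a minimal sketch. It is not SearchUnify's implementation: the IndexedFile record, the search function, and the sample email addresses are hypothetical names introduced only to show how a result is returned when it both matches the query and is viewable by the searching user.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedFile:
    """Hypothetical index record: a file name plus the users allowed to view it."""
    name: str
    viewers: set[str] = field(default_factory=set)


def search(index: list[IndexedFile], query: str, user: str) -> list[str]:
    """Return names of files that match the query AND that the user can view."""
    return [
        f.name
        for f in index
        if query.lower() in f.name.lower() and user in f.viewers
    ]


# Hypothetical sample data: only alice can view company_file_x.
index = [
    IndexedFile("company_file_x", {"alice@example.com"}),
    IndexedFile("company_file_y", {"alice@example.com", "bob@example.com"}),
]

print(search(index, "company_file", "alice@example.com"))  # ['company_file_x', 'company_file_y']
print(search(index, "company_file", "bob@example.com"))    # ['company_file_y']
```

In this sketch, the same query returns different results for the two users because the access check is applied per result.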

SearchUnify needs read-only access to Google Drive. In Google's parlance, that means view and download access.

Files on which export is disabled aren't crawled.

If file permissions are altered, the changes are reflected in the index after the next frequency crawl. Modifications to shared drive and folder permissions are also updated in the SearchUnify index.

Suppose a person creates a folder with multiple files and shares it with a teammate. If the teammate removes a file from the folder, the file data is not deleted from the SearchUnify index. However, if the creator removes the file, an event is triggered and the data is removed from the SearchUnify index.

With Google OAuth 2.0-based authentication in new SearchUnify instances, an API rate limit of 10,000 grants per day is applicable. This limit might impact crawling in SearchUnify.

Establish a Connection

  1. Navigate to Content Sources and click Add New Content Sources.

  2. Find Google Drive and click Add.

  3. Under the Authentication tab, enter the required details:

    A) SearchUnify instances on Q1 '24 or newer versions. Give your content source a Name and select the Language. Also, enter the Client ID and Client Secret of your Google account.

    For instructions on getting the Google Client ID and Client Secret, refer to Obtain Google Client ID and Client Secret.

    B) SearchUnify instances older than Q1 '24. Give your content source a Name, select the Language, and click Connect.

  4. A permissions window pops up asking for the required permissions. Click Allow.

The connection has been set up successfully if a connection successful message appears. Click Next to proceed to setting the frequency.

Google hasn't verified this app

If you encounter an error where Google indicates that the app has not been verified, please disregard the warning and continue with the authentication process. This issue is slated for resolution in future updates.

Re-Connect

The Authentication screen is displayed when an already-created Content Source is opened for editing. An admin can edit a Content Source for multiple reasons, including:

  • To reauthenticate

  • To fix a crawl error

  • To change frequency

  • To add or remove an object or a field for crawling

When a Content Source is edited, either a Connect or a Re-Connect button is displayed.

Case 1: When the Connect button is displayed:

The Connect button is displayed if the Content Source authentication is successful. Along with the button, the following message is displayed: "There are no crawl errors and the Content Source authentication is valid."

Fig. The Connect button is displayed on the Authentication tab.

Case 2: When the Re-Connect button is displayed:

The Re-Connect button is displayed when the authentication details change or the authentication fails for any reason.

In both cases, the Content Source connection must be authenticated again. To reauthenticate a Content Source, enter the authentication details, and click Re-Connect.

Fig. The Re-Connect button is displayed on the Authentication tab.

Set Up Crawl Frequency

The first crawl is always manual and is performed after configuring the content source. In Choose A Date, select a date to start crawling; only the data created after the selected date is crawled. For now, keep the frequency at its default value, Never. Click Set.

Fig. The Frequency tab when "Frequency" is set to "Never".

Select Types and Fields for Indexing

Google Drive supports only the file content type. The following file types are not crawled:

  • application/vnd.google-apps.drive-sdk

  • application/vnd.google-apps.audio

  • application/vnd.google-apps.map

  • application/vnd.google-apps.photo

  • application/vnd.debian.binary-package

  • image/png

  • application/vnd.google-apps.site

  • application/vnd.google-apps.shortcut

  • application/gzip

  • application/zip

  • image/jpeg

  • application/vnd.google-apps.video

  • application/x-gzip
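To show how the exclusion list plays out in practice, the following minimal sketch filters a hypothetical list of files by MIME type. The is_crawlable helper and the sample files are assumptions made for illustration; only the set of unsupported MIME types comes from this article.

```python
# MIME types that this article lists as not crawled.
UNSUPPORTED_MIME_TYPES = frozenset({
    "application/vnd.google-apps.drive-sdk",
    "application/vnd.google-apps.audio",
    "application/vnd.google-apps.map",
    "application/vnd.google-apps.photo",
    "application/vnd.debian.binary-package",
    "image/png",
    "application/vnd.google-apps.site",
    "application/vnd.google-apps.shortcut",
    "application/gzip",
    "application/zip",
    "image/jpeg",
    "application/vnd.google-apps.video",
    "application/x-gzip",
})


def is_crawlable(mime_type: str) -> bool:
    """Hypothetical helper: True when the MIME type is not on the exclusion list."""
    return mime_type not in UNSUPPORTED_MIME_TYPES


# Hypothetical sample files, shaped like entries with a name and a mimeType.
files = [
    {"name": "handbook.pdf", "mimeType": "application/pdf"},
    {"name": "logo.png", "mimeType": "image/png"},
    {"name": "backup.zip", "mimeType": "application/zip"},
    {"name": "roadmap", "mimeType": "application/vnd.google-apps.document"},
]

crawlable = [f["name"] for f in files if is_crawlable(f["mimeType"])]
print(crawlable)  # ['handbook.pdf', 'roadmap']
```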

  1. Under the Rules tab, you will land on the By Content Type subtab.

  2. Click to select a content type. You can see the list of pre-configured fields here.

    NOTE. You can add or delete content fields. However, it is recommended that only Admins make changes to the fields.

  3. Switch to the By Folders subtab.

  4. From My Folders, Shared Folders, and Shared Drive, select the directories and move them to the Select Files and Folders section for indexing.

  5. After selecting the directories, click Save.

You have successfully added Google Drive as a content source in SearchUnify. Perform a manual crawl to start indexing the Drive data in SearchUnify.

Related

Difference between Manual and Frequency Crawls

Find and Replace

Users on the Q2 '24 release or a later version will notice a new button next to each object on the Rules screen. It resembles a magnifying glass and is labeled "Find and Replace." You can use this feature to find and replace values in a single field or across all fields. The changes will occur in the search index and not in your content source.

Fig. The "Find and Replace" button on the Rules tab in the Actions column.

Find and Replace proves valuable in various scenarios. A common use case is when a product name is altered. Suppose your product name has changed from "SearchUnify" to "SUnify," and you wish for the search result titles to immediately reflect this change.

  1. To make the change, click the Find and Replace button.

  2. Now, choose either "All" or a specific content source field from the "Enter Name" dropdown. When "All" is selected, any value in the "Find" column is replaced with the corresponding value in the "Replace" column across all content source fields. If a particular field is chosen, the old value is replaced with the new value solely within the selected field.

  3. Enter the value to be replaced in the Find column and the new value in the Replace column. Both columns accept regular expressions.

    Fig. Snapshot of Find and Replace.

  4. Click Add. You will see a warning if you are replacing a value in all fields.

  5. Click Save to apply the settings.

  6. Run a crawl for the updated values to reflect in the search results.
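The difference between choosing "All" and choosing a specific field can be sketched as follows. This is an illustration under assumptions, not the product's implementation: find_and_replace and the sample document are hypothetical, and the sketch uses Python's regular expression syntax, which may differ from the syntax that the Find and Replace columns accept.

```python
import re


def find_and_replace(doc: dict[str, str], find: str, replace: str,
                     field_name: str = "All") -> dict[str, str]:
    """Return a copy of an indexed document with the `find` pattern replaced.

    field_name="All" applies the rule to every field; any other value
    applies it only to the field with that name.
    """
    updated = dict(doc)  # the original document is left untouched
    for name, value in doc.items():
        if field_name == "All" or name == field_name:
            updated[name] = re.sub(find, replace, value)
    return updated


# Hypothetical indexed document with two fields.
doc = {"title": "SearchUnify Setup Guide", "body": "Install SearchUnify first."}

only_title = find_and_replace(doc, r"SearchUnify", "SUnify", field_name="title")
everywhere = find_and_replace(doc, r"SearchUnify", "SUnify")

print(only_title)  # {'title': 'SUnify Setup Guide', 'body': 'Install SearchUnify first.'}
print(everywhere)  # {'title': 'SUnify Setup Guide', 'body': 'Install SUnify first.'}
```

As in the feature described above, the replacement in this sketch affects only the copy that stands in for the search index, not the source document.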

After the First Crawl

Return to the Content Sources screen and start the crawl from Actions. The number of indexed documents is updated after the crawl is complete. You can view crawl progress from Actions. Documentation on crawl progress is in View Crawl Logs.

Once the first crawl is complete, open the content source for editing from Actions, and set a crawl frequency.

  1. In Choose a Date, click to open a calendar and select a date. Only the data created or updated after the selected date is indexed.

  2. The following options are available for the Frequency field:

    • When Never is selected, the content source is not crawled until an admin opts for a manual crawl on the Content Sources screen.

    • When Minutes is selected, a new dropdown appears where the admin can choose between three values: 15, 20, and 30. Picking 20 means that the content source crawling starts every 20 minutes.

    • When Hours is selected, a new dropdown is displayed where the admin can choose from eight values: 1, 2, 3, 4, 6, 8, 12, and 24. Selecting 8 initiates content crawling every 8 hours.

    • When Daily is selected, a new dropdown is displayed where the admin can pick a value between 0 and 23. If 15 is selected, the content source crawling starts at 3:00 p.m. (1500 hours) each day.

    • When Day of Week is selected, a new dropdown is displayed where the admin can pick a day of the week. If Tuesday is chosen, then content source crawling starts at 0000 hours on every Tuesday.

    • When Day of Month is selected, a new dropdown appears where the admin can select a value between 1 and 30. If 20 is chosen, then content source crawling starts on the 20th of each month.

      It is recommended to pick a date between the 1st and 28th of the month. If 30 is chosen, then the crawler may throw an error in February. The error will be “Chosen date will not work for this month.”

    • When Yearly is selected, the content source crawling starts at midnight on 1 January each year.

    Fig. The content source crawling starts at 00:00 on each Tuesday.

  3. Click Set to save the crawl frequency settings.

  4. Click Save.
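The frequency options above can be read as rules for computing the next crawl time. The sketch below is one possible interpretation, not the scheduler that SearchUnify uses. The next_crawl function, the numeric encoding of weekdays (0 for Monday through 6 for Sunday), and the use of timezone-naive timestamps are assumptions. It also shows why a Day of Month value of 30 can fail in February.

```python
import calendar
from datetime import datetime, timedelta
from typing import Optional


def next_crawl(frequency: str, value: int, last: datetime) -> Optional[datetime]:
    """Return the first scheduled crawl strictly after `last` (None for Never)."""
    midnight = last.replace(hour=0, minute=0, second=0, microsecond=0)
    if frequency == "Never":
        return None  # only manual crawls run
    if frequency == "Minutes":
        return last + timedelta(minutes=value)
    if frequency == "Hours":
        return last + timedelta(hours=value)
    if frequency == "Daily":  # value is the hour of the day, 0-23
        candidate = midnight + timedelta(hours=value)
        return candidate if candidate > last else candidate + timedelta(days=1)
    if frequency == "Day of Week":  # assumed encoding: 0 = Monday ... 6 = Sunday
        candidate = midnight + timedelta(days=(value - last.weekday()) % 7)
        return candidate if candidate > last else candidate + timedelta(days=7)
    if frequency == "Day of Month":  # value is the day of the month, 1-30
        year, month = last.year, last.month
        while True:
            if value > calendar.monthrange(year, month)[1]:
                raise ValueError("Chosen date will not work for this month.")
            candidate = datetime(year, month, value)
            if candidate > last:
                return candidate
            year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    if frequency == "Yearly":  # midnight on 1 January
        return datetime(last.year + 1, 1, 1)
    raise ValueError(f"Unknown frequency: {frequency}")


last = datetime(2024, 1, 15, 16, 30)  # a Monday afternoon
print(next_crawl("Daily", 15, last))         # 2024-01-16 15:00:00
print(next_crawl("Day of Week", 1, last))    # 2024-01-16 00:00:00
print(next_crawl("Day of Month", 20, last))  # 2024-01-20 00:00:00
```

Calling next_crawl("Day of Month", 30, datetime(2024, 1, 31)) raises the ValueError in this sketch because February 2024 has only 29 days, which mirrors the recommendation to pick a date between the 1st and the 28th.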

Data Deletion and SU Index

All the data deleted from your Drive content source is removed from the SearchUnify index with every frequency crawl.
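As a rough illustration of this behavior, the sketch below removes index entries whose files no longer exist in the Drive. The article does not describe the actual mechanism, so the sync_deletions function, the file IDs, and the comparison-based approach are all assumptions.

```python
def sync_deletions(index: dict[str, str], drive_file_ids: set[str]) -> list[str]:
    """Remove index entries whose file is no longer in Drive; return the removed IDs."""
    removed = sorted(set(index) - drive_file_ids)
    for file_id in removed:
        del index[file_id]
    return removed


# Hypothetical index (file ID -> title) and the IDs still present in Drive.
index = {"f1": "Q1 report", "f2": "Old draft", "f3": "Handbook"}
still_in_drive = {"f1", "f3"}

removed = sync_deletions(index, still_in_drive)
print(removed)        # ['f2']
print(sorted(index))  # ['f1', 'f3']
```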

OAuth2.0 Setup Pending

If you are an existing SearchUnify user and you migrate your instance to Q1 '21 or newer versions, your YouTube and Google Drive content sources will continue to work. However, you will see an error on your YouTube and Google Drive content sources if you haven't authenticated them with OAuth 2.0.

We recommend you set up OAuth 2.0 on your Google account and re-authenticate your content sources using the client ID and client secret.

Help Article - Google OAuth 2.0 Setup.