Use a Website as a Content Source

Index your website's data in SearchUnify by adding your website as a content source. SearchUnify webmasters (Super Admin and Admin) can upload a sitemap to index a website or configure a highly customized crawling plan that allows them to:

  • Crawl and index multiple public and gated websites simultaneously (except those behind a CAPTCHA wall)
  • Crawl only selected objects (sections or pages) of their websites
  • Control the depth and path of the crawl
  • Crawl and index websites in more than 20 different languages
  • Crawl and index JavaScript-dependent websites

Establish a Connection

  1. Navigate to Content Sources and click Add New Content Source.

  2. Find Website and click Add.

  3. On the Authentication screen, enter the following details:

    Name. Give your content source a label.

    Website URL/Sitemap, or Upload Sitemap. Enter your website or sitemap URL, or upload a .txt or .xml sitemap file.

    Depth. Select the depth to which you want to crawl the website. When Depth is one, SearchUnify crawls the links in your sitemap. When Depth is two, SearchUnify also follows the hyperlinks on those webpages and crawls them as well. The recommended value of Depth is less than 10.

    Language. Select the language of your website content.

    JavaScript Enabled Crawling. Some websites rely on JavaScript to function, while others, such as Wikipedia, work without it. If your website depends on JavaScript, toggle this option on so that the website can be crawled and indexed.

    NOTE. A way to find out whether a website depends on JavaScript is to go to chrome://settings/content/javascript and turn off JavaScript. Then reload the website. If you receive an error message or the website starts behaving erratically, turn on JavaScript Enabled Crawling. A scripted version of this check is sketched after these steps.

  4. Select an Authentication Method:

    • No Authentication

    This authentication method is adequate if your website is public, such as Wikipedia or SearchUnify Docs. Public websites are accessible to everyone with an Internet connection. In this case, select No Authentication as the Authentication method and click Connect.

    However, if your website requires users to log in before they can view an article, a video, or other content, you have to choose between the Basic and Form authentication methods.

    ⚠ IMPORTANT.

    Gated websites cannot be crawled and indexed until Basic or Form authentication is set up.

    • Basic Authentication

    Select Basic from the Authentication Method dropdown if the website requires users to enter a username and password. When you select Basic, two new fields appear. Enter a valid username or email and password, and click Connect.

    ⚠ IMPORTANT.

    If there is a CAPTCHA on the login form, then your website will not be crawled.

    • Form

    NOTE. Form authentication works only when JavaScript Enabled Crawling is turned on.

    Form is an advanced version of Basic authentication and requires some familiarity with CSS selectors to set up. This authentication type is used when a website is gated and requires users to fill out a login form to access its content.

    You can think of CSS selectors as guideposts that point to specific elements on a page. Consider the next image, where a pair of <h1> tags tells the browser to render the enclosed text as a heading and a pair of <p> tags tells it to render the enclosed text as a paragraph. The selectors h1 and p target those elements.

    When you select Form-based authentication, you have to specify, besides the username and password:

    1. Login URL
    2. CSS selector of the username field
    3. CSS selector of the password field
    4. CSS selector of the login button

    Login URL is straightforward. Chrome users can find the CSS selectors by pressing Ctrl+Shift+I and hovering the cursor over each field and button, one at a time. In the next image, you can see the CSS selector of the username field, which is #username. A quick way to verify your selectors is sketched after these steps.

  5. After entering all the required details, click Connect.
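If you prefer a scripted check over toggling JavaScript in Chrome, the sketch below is one rough way to do it in Python: fetch a page without executing any JavaScript and see whether the content you care about appears in the raw HTML. The URL and the content selector are placeholders for illustration, not values from this documentation.

```python
# Rough check for JavaScript dependence: fetch a page without executing any
# JavaScript and look for the main content in the raw HTML. If nothing is
# found, the site likely needs JavaScript Enabled Crawling turned on.
# The URL and selector below are placeholders; replace them with your own.
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

url = "https://yourwebsite.com/some-article"  # hypothetical page
content_selector = "article p"                # hypothetical content selector

html = requests.get(url, timeout=30).text     # plain HTTP fetch, no JS executed
matches = BeautifulSoup(html, "html.parser").select(content_selector)

if matches:
    print("Content is present without JavaScript.")
else:
    print("No content in the raw HTML; enable JavaScript Enabled Crawling.")
```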
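Before saving a Form configuration, it can also help to confirm that each CSS selector matches exactly one element on the login page. The sketch below is a minimal, assumption-based example in Python with BeautifulSoup; the login markup and the #password and #login-btn selectors are hypothetical, while #username mirrors the selector shown in the image above.

```python
# Minimal sketch for sanity-checking Form authentication selectors.
# The markup below stands in for a hypothetical login page; replace it with
# the HTML of your own Login URL.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

login_page = """
<form action="/login" method="post">
  <input type="text" id="username" name="user">
  <input type="password" id="password" name="pass">
  <button id="login-btn" type="submit">Log in</button>
</form>
"""

soup = BeautifulSoup(login_page, "html.parser")

# Each selector entered in the Form fields should match exactly one element.
for selector in ("#username", "#password", "#login-btn"):
    matches = soup.select(selector)
    print(f"{selector}: {len(matches)} match(es)")
```

If a selector matches zero or several elements, refine it in the browser's developer tools before entering it in SearchUnify.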

Set Up Crawl Frequency

The first crawl is always manual and is performed after configuring the content source. In Choose A Date, select a date to start crawling; only the data created after the selected date will be crawled. For now, keep Frequency at its default value, Never, click Set, and move to the next section.

Add Objects and Fields for Indexing

SearchUnify indexes a website by capturing and storing the data inside HTML elements. Because websites may have different sections or pages that require separate crawling and indexing, SearchUnify gives users the flexibility to add multiple content types for different sections or pages of the website.

⚠ IMPORTANT

A website is not indexed if no HTML element is specified.
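To make the idea of capturing data inside HTML elements concrete, here is a small illustration in Python with BeautifulSoup. It is not SearchUnify's internal crawler; the page markup is made up, and it simply shows how tag and ID selectors pull out values, including the difference between collating matches and keeping them separate (the Single/Multiple choice described in step 7 below).

```python
# Illustration only (not SearchUnify's internal crawler): how CSS selectors
# pull data out of HTML elements. The page markup is made up.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

page = """
<html>
  <head><title>Crawling Basics</title></head>
  <body>
    <h1 id="hometoc">Crawling Basics</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(page, "html.parser")

title = soup.select_one("title").get_text()            # tag selector: title
heading = soup.select_one("#hometoc").get_text()       # ID selector: #hometoc
paragraphs = [p.get_text() for p in soup.select("p")]  # tag selector: p

# Step 7 below offers a Single/Multiple choice for repeating elements:
single_value = " ".join(paragraphs)   # Single: matches collated into one value
multiple_values = paragraphs          # Multiple: each match stored separately

print(title, heading)
print(single_value)
print(multiple_values)
```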

Admins write CSS selectors to specify the elements for indexing. The CSS selectors are stored in an Object, which we will create next. To begin, navigate to the Rules tab; you will land on the By Content Type subtab.

  1. Enter object details in Name, Label, and Path. The name and the label do not have to be valid HTML tags. You can add different objects for different sections and pages of a website content source.

    Note. If you leave the Path field empty, the entire website is crawled. Only one object with an empty path can be added.

  2. Click Add Object to add more objects, that is, to specify the sections or pages that you want to index. Add the paths in the format shown in the image below.

    To effectively crawl specific sections of your website, you can use regular expressions (regex) to define the paths of interest. For example, if your website contains sections like https://yourwebsite.com/java, https://yourwebsite.com/css, and https://yourwebsite.com/node, you can specify these sections for individual crawling by using regex patterns such as /java/.*, /css/.*, and /node/.*. Each .* in the regex pattern matches any sequence of characters within that directory, ensuring that all related documents are included in the crawl.

    To apply these patterns, add them as separate entries in your crawler's configuration. This allows you to see document counts and data for each section individually, facilitating more targeted analysis or content management.

    Remember to adjust the regex patterns according to the specific URLs of your website's communities or other sections you wish to include in the crawl. By doing so, you can fine-tune the scope of your crawl to match your needs. A small sketch of this pattern matching follows this list.

  3. Click EDIT to add content fields for indexing.

  4. Selector is the most important field. It's used to find content for indexing. On a webpage, all HTML tags, classes, and IDs are valid selectors. To index paragraphs on a website, enter p in the Selector field. Don't use angle brackets around HTML selectors: the right way is p, h1, ol, and title instead of <p>, <h1>, <ol>, and <title>. The right way to enter IDs is to prefix them with a # (hash), so <hometoc> is incorrect but #hometoc is correct. As for classes, prefix each class name with a . (dot) in Selector.

  5. Assign the Selector a Type. The default Type is string. Always stick with the default unless the value is time or date.

  6. Give the selector a Name and Label. Both are used in the backend. You can be creative here.

  7. An HTML element can occur once or several times on a webpage. In almost all cases, the title appears only once and is wrapped in a pair of h1 tags, while a page usually has multiple paragraphs surrounded by p tags. Single/Multiple is for elements that occur more than once, such as p, h2, and i. When Single is selected, the data stored in the matching HTML elements is collated into one value. Think of a five-paragraph article: if you pick Single, all five paragraphs are collated, as they should be; if you pick Multiple, each paragraph is stored separately. Multiple suits pages that list items, for example, a list of books where each title is surrounded by the same pair of i tags. In that case, Multiple is the way to go.

  8. Press Add and then Save.

  9. Switch to the By Filter subtab. The four fields in URL Filter Configuration let admins customize crawling precisely. A simplified sketch of how these filters can interact follows this list.

    • Should crawl. This feature sends the crawler to selected pages. It is extremely useful if you have thousands of pages in the sitemap but don't want to crawl them all. List your webpages in Should Crawl or enter a regular expression specifying the pages to crawl. Should Crawl accelerates crawling when it is used properly. When using it, make sure you enter both the secure (https) and insecure (http) addresses of the webpage mentioned in Authentication.

    • Should not crawl. This feature specifies the webpages where the crawler shouldn't go; it is the exact opposite of Should Crawl. Use it when it's simpler to exclude URLs than to include them. Think of it this way: out of 10,000 URLs, you don't want the crawler to visit 20. You can either specify the remaining 9,980 URLs in Should Crawl or the 20 URLs in Should Not Crawl. In this scenario, Should Not Crawl saves a lot of time.

    • Outlink filters for URL. This feature is used to crawl selected pages at depth two. Think of it this way: a home page links to an Our Products page, and Our Products links to 50 different pages, but you don't want them all in your index. One way to configure this is to first turn off Limit Crawling to Sitemap in Authentication and then enter the URLs of the products to be indexed. Instead of listing dozens of URLs, you can insert a regular expression that covers them all; https://su.com/product1 and https://su.com/product2 are both covered by https://su.com/product* in Outlink Filters for URL. NOTE: Use this feature only if you have entered a sitemap in Authentication.

    • Indexing filters for URL. This feature is used to index selected pages. While Should Crawl determines which pages are crawled, Indexing Filters for URL determines which pages are stored in SearchUnify's database, or index. Ensure that you have specified Should Crawl correctly: if the crawler cannot get to page.html, then mentioning page.html in Indexing Filters for URL will not add it to the SearchUnify database. You can enter the URLs of the pages to be indexed or a regular expression.

  10. Press Save to save the settings.
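The object paths in step 2 are regular expressions, so you can test them outside SearchUnify before saving. The sketch below is a rough approximation using Python's re module with hypothetical URLs; SearchUnify's own matching may differ in detail.

```python
# Rough approximation of the object Path patterns from step 2, tested with
# Python's re module. The URLs are hypothetical examples.
import re

patterns = [r"/java/.*", r"/css/.*", r"/node/.*"]   # one pattern per object Path

urls = [
    "https://yourwebsite.com/java/streams-tutorial",
    "https://yourwebsite.com/css/flexbox-guide",
    "https://yourwebsite.com/node/event-loop",
    "https://yourwebsite.com/pricing",
]

for url in urls:
    matched = [p for p in patterns if re.search(p, url)]
    print(url, "->", matched or "no object path matches")
```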
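The next sketch is an assumption-based illustration of how the four URL filters in step 9 can interact; it is not SearchUnify's actual crawler logic, and the patterns and URLs are hypothetical.

```python
# Assumption-based illustration of how the URL filters in step 9 might be
# combined. This is not SearchUnify's actual crawler logic.
import re

should_crawl = [r"https?://su\.com/docs/.*"]    # allowlist: crawl these pages
should_not_crawl = [r".*/archive/.*"]           # blocklist: never crawl these
indexing_filters = [r".*/docs/how-to/.*"]       # store only these in the index

def matches_any(patterns, url):
    return any(re.search(p, url) for p in patterns)

def decide(url):
    if matches_any(should_not_crawl, url):
        return "skipped (Should Not Crawl)"
    if should_crawl and not matches_any(should_crawl, url):
        return "skipped (not in Should Crawl)"
    if indexing_filters and not matches_any(indexing_filters, url):
        return "crawled but not indexed"
    return "crawled and indexed"

for url in [
    "https://su.com/docs/how-to/add-a-content-source",
    "https://su.com/docs/archive/old-guide",
    "https://su.com/blog/launch-post",
]:
    print(url, "->", decide(url))
```

Note that the https? pattern covers both the secure and insecure addresses of a page, echoing the advice under Should Crawl.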

You have successfully added Website as a content source. Perform a manual crawl to start indexing the website data in SearchUnify.

Related

Difference between Manual and Frequency Crawls

After the First Crawl

Return to the Content Sources screen. The number of indexed documents is updated after the crawl is complete. You can view crawl progress from Actions. Documentation on crawl progress is in View Crawl Logs.

NOTE 1

Review the settings in Rules if there is no progress in Crawl Logs.

NOTE 2

For Mamba '22 and newer instances, search isn't impacted during a crawl. However, in older instances, some documents remain inaccessible while a crawl is going on.

Once the first crawl is complete, open the content source for editing from Actions and set a crawl frequency.

  1. In Choose a Date, click to open a calendar and select a date. Only the data created after the selected date is indexed.

  2. Use the Frequency dropdown to select how often SearchUnify should index the data. For illustration, the frequency has been set to Weekly and Tuesday has been chosen as the crawling day. Whenever Frequency is set to anything other than Never, a third dropdown appears where you can specify the interval. Also, when Frequency is set to Hourly, manual crawls are disabled.

  3. Click Set to save the crawl frequency settings. You are then taken to the Rules tab.