Use a Website as a Content Source

Submitting a website to a web search engine is relatively straightforward. All a webmaster has to do is submit a sitemap to Google, Bing, DuckDuckGo, or another search engine. The simplicity, however, comes at a cost.

Besides a robots.txt file, which tells search engines what not to index, there is little a webmaster can do to customize crawling and indexing. Advanced configurations, such as limiting crawling and indexing to only titles, headings, or the content in a particular div, are simply not available.
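For reference, a robots.txt file is only a short list of exclusion rules. A generic illustration (the paths are hypothetical):

    User-agent: *
    Disallow: /private/
    Disallow: /drafts/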

SearchUnify is different. Webmasters, or admins, as they are known in the SearchUnify realm, can upload a sitemap to index a website or configure a highly customized crawling plan that allows them to:

  • Crawl and index multiple public and gated websites simultaneously (although not the ones behind a captcha wall)
  • Control the speed and depth of the crawl
  • Crawl and index websites in languages other than English (more than 20 languages supported)
  • Crawl and index websites that rely on JavaScript to function properly

Three Stages of Website Crawling

This article covers how SearchUnify can be used to crawl and index almost any website. The entire process consists of three stages, which are first summarized and then laid out in detail.

  1. Establish a connection with the website. It's here that an admin
    1. Uploads a sitemap or inserts a website address in the URL field to specify the website(s) to be crawled
    2. Inserts an integer in Depth and toggles Limit Crawling to Sitemap to define crawl depth
    3. Inserts the number of server cores in the Thread field to define crawl speed
    4. Specifies whether a website is public or gated through the Authentication Method dropdown
    5. Defines the role of JavaScript in the website's functionality in JavaScript Enabled Crawling
    6. Gives an internal (to be used in the SearchUnify instance) name to the website in Name
    7. Toggles the Session Initializer on or off; this field is discussed in the section Establish a Connection
  2. Set up crawl frequency. In contrast to mainstream web search engines, whose crawl rate depends on how often updates are pushed to the website under consideration, SearchUnify admins can set up a crawl frequency ranging from every 24 hours to once a year. More adventurous admins can opt for Manual as well, which enables them to run a crawl anytime they wish.
  3. Define what is to be crawled. On top of specifying which URLs to crawl and which to avoid, admins can state precisely what data to crawl by specifying the tags in which that data is stored. Using this feature requires some familiarity with HTML and CSS selectors.

Stage 1: Establish a Connection

  1. From Content Sources in the main navigation, go to Add a Content Source.

  2. In the grid of available content sources, select Website.

  3. Selecting Website takes you to the Authorization screen, where 11 connection settings (13 if you count each authentication method separately) are available. Each field on the screen is explained next.

Understand the Authentication Screen

Name

Each content source in an instance has a name, which helps admins distinguish one from another.

A good practice is to enter a descriptive term, such as "Public Setup Tutorials on YouTube" or "Promotional Videos on YouTube", in the Name field.

⚠ IMPORTANT

In the Name field, the only characters allowed are lowercase letters, uppercase letters, and spaces. Note that a space at the end of a name will throw an error. "Public Setup Tutorials on YouTube" is correct, but "Public Setup Tutorials on YouTube " (notice the trailing space) is not.
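The rule is roughly equivalent to the following pattern (an illustration only; the product's actual validation may differ):

    ^[A-Za-z ]*[A-Za-z]$    (letters and spaces only; the last character must not be a space)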

URL

Right next to Name is the URL field, which can be used in two ways:

  • An admin can insert the address of the website to be crawled or the URL of the sitemap. A typical website address will be of the form https://searchunify.com/, and a sitemap might look like https://docs.searchunify.com/Sitemap.xml. Don't forget to enter http (if the website is insecure) or https (if the website is secure).

  • An admin can upload a TXT or an XML file using the upload button on the right end of the URL field. The uploaded file will contain one of the following:
    • Links, one per line (see the example below)

    • Sitemap.xml
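For the first option, a links file contains nothing but absolute URLs, one per line (the addresses are illustrative):

    https://docs.searchunify.com/
    https://docs.searchunify.com/getting-started.html
    https://docs.searchunify.com/faq.html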

Sitemap.xml Capabilities

Uploading a sitemap offers several advantages. The two primary ones are:

  • Accelerated Recrawls. The SearchUnify crawler identifies the URLs in a sitemap that carry the <lastmod> tag and crawls only the pages updated or created since the last crawl. This type of crawling can be several times faster on a large website.
  • Custom Filters. Admins can add custom attributes, such as <author> or <priority>, for each URL in the sitemap. After crawling, these custom attributes can be used as filters on a search client. To index a filter, use the format {{sitemap}}{{filterName}} in Rules.

SearchUnify is capable of handling complex sitemaps that contain more than just a list of plain URLs; the <lastmod> tag is one such feature.
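A minimal sitemap illustrating both capabilities might look like the following sketch. The <author> tag is an illustrative custom attribute, not part of the standard sitemap schema; such a filter would be indexed as {{sitemap}}{{author}} in Rules:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://docs.searchunify.com/getting-started.html</loc>
        <!-- Only pages updated since the last crawl are recrawled -->
        <lastmod>2020-09-20</lastmod>
        <!-- Custom attribute; usable as a search filter after crawling -->
        <author>Docs Team</author>
      </url>
    </urlset>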

⚠ WARNING

SearchUnify doesn't crawl a website when the links in the sitemap.xml or TXT file don't match real web addresses. Five common issues are:

  1. Redirects. If a link redirects, it will not be crawled.
  2. HTTPS and HTTP. If a website is secure, then crawling will proceed only when the sitemap.xml or TXT file contains secure links, that is, links starting with https. If a website is insecure, then http links will work. In both cases, the web addresses in the file must match the real addresses exactly.
  3. Trailing Spaces. Any space after the end of a URL causes the indexing to stop.
  4. Comments. If there are any comments in the TXT file or sitemap.xml, remove them before the upload.
  5. Empty Lines. Indexing can stop abruptly if empty lines are found at the beginning, middle, or end of a TXT file or sitemap.xml.

If You Are Inserting a Link or Uploading a TXT File with Links

Then, specify the following:

Depth

When Depth is one, SearchUnify crawls only the links in the URL field or the TXT file. When Depth is two, SearchUnify also follows the hyperlinks on those webpages and crawls the linked pages as well.

The recommended value of Depth is less than 10.
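A plain-text sketch of what each value covers, assuming the URL field contains a single page and that the pattern continues at higher depths:

    Depth 1:  the seed page itself
    Depth 2:  the seed page + every page it links to
    Depth 3:  all of the above + every page those pages link to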

Thread

Admins have complete control over crawling speed. If the server on which their instance is hosted has n cores, then admins can put all but one core to work by entering n-1 in the Thread field; for example, on an 8-core server the maximum sensible value is 7. If there is only one core on the server, then the value of Thread is 1. In short, the maximum value of Thread is the number of cores on the server minus one. A quick way to find out how many cores the server has is to enter an improbably high number, such as 32; the error message will display the maximum cores available.

  • NOTE 1. If the Thread value is set to one, then turn on Session Initializer by toggling it to the right.

    For the geeks among you, turning on Session Initializer improves crawling speed on a single core. Instead of launching a new Chrome window to test each crawled page in the background, SearchUnify opens pages in new tabs, which saves time.
  • NOTE 2. It's not recommended to use a Thread value higher than 1 with Session Initializer turned on, because the combination accelerates crawling to the extent that the server, on which the content source is hosted, can block SearchUnify's crawler.

Batch

Crawling and indexing on SearchUnify occur in stages. The crawler finds pages, stores them in the index, then finds the next batch of pages and stores them in the index. As an admin, you can set the batch size in the Batch field; a recommended value is 50.

JavaScript Enabled Crawling

Some websites rely on JavaScript to function while others, such as Wikipedia, can function without it. If the content source website depends on JavaScript, then toggle JavaScript Enabled Crawling on; otherwise, keep it turned off. A way to find out whether a website depends on JavaScript for its regular functions is to go to chrome://settings/content/javascript and turn off JavaScript.

Then reload the website. If you receive an error message or the website starts behaving erratically, turn on JavaScript Enabled Crawling.

Else, leave the setting in its default state.
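To illustrate why this matters, consider the following hypothetical page. Without JavaScript its body is empty, so a crawler with JavaScript Enabled Crawling turned off would index no content at all:

    <!DOCTYPE html>
    <html>
      <body>
        <!-- Empty until the script below runs -->
        <div id="root"></div>
        <script>
          document.getElementById('root').innerHTML =
            '<h1>Docs</h1><p>This content is rendered client-side.</p>';
        </script>
      </body>
    </html>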

If You Are Uploading Sitemap.xml

Then read the previous section to learn the behavior of Depth, Thread, Batch, and JavaScript Enabled Crawling. All of them behave identically, except Depth, which remains one irrespective of the integer entered when Limit Crawling to Sitemap is turned on.

Authentication Method

The settings described in the first two sections are adequate if the content source is a public website, such as Wikipedia or SearchUnify Docs. A public website is any website that is accessible to anyone with an Internet connection and the web address.

If your website is public, click Connect and jump to the section Frequency.

However, if your content source is a website that requires users to log in before they can view an article, a video, or other content, you will need to configure security settings as well. The configuration is done through the Authentication Method dropdown, which can be set to Base or Form.

⚠ IMPORTANT

Gated websites cannot be crawled and indexed until Base or Form is set up.

Base

Select Base from the Authentication Method dropdown if the website requires users to enter their username and password. Two new fields will appear. Enter a valid ID and password and click Connect.

⚠ IMPORTANT

If there is a captcha on the login form, then your website will not be crawled.

Form

Form is an advanced version of Base and requires some acquaintance, on the part of the SearchUnify admin, with CSS selectors. You can think of CSS selectors as guideposts that point at elements on a page. For instance, a pair of <h1> tags tells a browser to interpret the text contained within them as a heading, and a pair of <p> tags tells the browser to interpret the text contained within them as a paragraph; the corresponding CSS selectors, h1 and p, are what you use to target those elements.

When Form is selected, then an admin has to specify, besides the username and password:

  1. Login URL
  2. CSS selector of the username field
  3. CSS selector of the password field
  4. CSS selector of the login button

Login URL is straightforward. A quick way to find the CSS selectors is to press Ctrl+Shift+I if you are using Chrome and hover the cursor over each field and button, one at a time. For instance, the CSS selector of a username field is often an ID such as #username.
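As an illustration, here is a hypothetical login page and the values an admin might enter for it (the URL, IDs, and class are invented for the example):

    <!-- Login URL: https://example.com/login -->
    <form action="/session" method="post">
      <input type="text" id="username">                        <!-- username selector: #username -->
      <input type="password" id="password">                    <!-- password selector: #password -->
      <button type="submit" class="login-btn">Log In</button>  <!-- login button selector: .login-btn -->
    </form>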

Stage 2: Set up Crawl Frequency

Setting up crawl frequency has no impact whatsoever on indexing. Click Set and move to the next screen.

Stage 3: Select Fields for Indexing

SearchUnify indexes a website by capturing and storing the data inside HTML elements.

⚠ IMPORTANT

A website is not indexed if no HTML element is specified.

Admins can write CSS selectors to specify the elements for indexing. The CSS selectors are stored in an Object, which we will create next.

  1. Give your Object a Name and a Label. The name and the label do not have to be valid HTML tags.

  2. Click Add Object to create an empty object.

  3. Click the add button to add content fields for indexing.

  4. Selector is the most important field. HTML tags, classes, and IDs are valid selectors. IDs are preceded by a # (octothorpe) and classes by a . (dot), while standard HTML tags are written without the angle brackets. (An example follows these steps.)

  5. Assign the Selector a Type.

  6. Give the selector a Name and Label.

  7. Select Multiple to treat each instance of a matched HTML element separately in search, or Single to combine their data into one field.

  8. Select isFilterable to use the field as a facet, or leave the default value to find it in search results.

  9. Press Add and then Save.

  10. Switch to By Filter.

  11. Next, the four fields in URL Filter Configuration give admins fine-grained control over crawling. (An example follows these steps.)

    Should crawl. Enter both the secure and insecure addresses of the webpage that was entered on the Authentication screen.

    Should not crawl. Enter the secure and insecure addresses of the webpages that should not be crawled.

    Outlink filters for URL. These websites are crawled to the second Depth, meaning that the webpages they link to are crawled as well.

    Indexing filters for URL. Same as Should crawl, but with a caret (^) appended to the left end of each address.

  12. Press Save.
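For the Selector fields in steps 4 through 8, a hypothetical article page and the selectors that would capture its parts might look like this:

    <article class="doc-body">           <!-- class selector: .doc-body -->
      <h1 id="title">Using Sitemaps</h1> <!-- ID selector: #title; tag selector: h1 -->
      <p>Body text to be indexed.</p>    <!-- tag selector: p -->
    </article>

And for step 11, a sketch of URL Filter Configuration values, assuming the illustrative website https://docs.searchunify.com/:

    Should crawl:              https://docs.searchunify.com/, http://docs.searchunify.com/
    Should not crawl:          https://docs.searchunify.com/archive/
    Outlink filters for URL:   https://docs.searchunify.com/guides/
    Indexing filters for URL:  ^https://docs.searchunify.com/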

You have successfully added Website as a content source.

Last updated: Friday, September 25, 2020