Use a Website as a Content Source
Submitting a website to a web search engine is straightforward. All it takes for webmasters is to submit a sitemap to Google, Bing, DuckDuckGo, or another search engine. The simplicity, however, comes at a cost.
Besides a robots.txt file, which tells search engines what not to index, there is little webmasters can do to customize crawling and indexing. Advanced configurations, such as limiting crawling and indexing to titles, headings, or the content of a particular div, are not available.
SearchUnify is different. Webmasters (known as Admins in the SearchUnify realm) can upload a sitemap to index a website or configure a highly customized crawling plan that allows them to:
- Crawl and index multiple public and gated websites simultaneously (although not the ones behind a captcha wall)
- Control the depth of crawl
- Crawl and index websites in languages other than English (more than 20 languages supported)
- Crawl and index JavaScript-dependent websites
Three Stages of Website Crawling
This article covers how SearchUnify can be used to crawl and index websites. The entire process consists of three stages, which are first summarized and then laid out.
- Establish a connection with the website. It's here that an admin
- Uploads a sitemap or inserts a website address in the URL field to specify the website(s) to be crawled
- Inserts an integer in Depth and toggles Limit Crawling to SiteMap to define crawl depth
- Specifies whether a website is public or gated through the Authentication Method dropdown
- Defines the role of JavaScript in the website's functionality through JavaScript Enabled Crawling
- Gives an internal (to be used in the SearchUnify instance) name to the website in Name
- Set up crawl frequency. Mainstream web search engines crawl a website at a rate that depends on how often updates are pushed to it; SearchUnify admins, in contrast, can set a crawl frequency ranging from every 24 hours to once a year. Irrespective of the Frequency, the first crawl is manual.
- Define what is to be crawled. On top of specifying URLs to crawl and URLs to avoid, admins can state precisely what data to index by specifying the HTML elements in which that data is stored. This requires some familiarity with HTML and CSS selectors.
Stage 1: Establish a Connection
- From Content Sources in the main navigation, go to Add a Content Source.
- Find Website using the search box and click Add.
- Selecting Website takes you to the Authentication screen, where seven connection settings (nine if you count each authentication method separately) are available. Each field on the screen is explained next.
Understand the Authentication Screen
Name
Each content source in an instance has a name, which helps admins distinguish one from another.
A good practice is to enter a descriptive term, such as "Public Setup Tutorials on YouTube" or "Promotional Videos on YouTube", in the Name field.
⚠ IMPORTANT
In the Name field, the only characters allowed are lowercase letters, uppercase letters, and spaces. Note that a trailing space throws an error: "Public Setup Tutorials on YouTube" is correct, but "Public Setup Tutorials on YouTube " (notice the trailing space) is not.
URL
Right next to Name is the URL field, which can be used in two ways:
- An admin can insert the address of the website to be crawled or the URL of the sitemap. A typical website address looks like https://searchunify.com/ and a sitemap looks like https://docs.searchunify.com/Sitemap.xml. Don't forget to enter http (if the website is insecure) or https (if the website is secure).
- An admin can upload a TXT or an XML file using the upload button on the right end of the URL field. The uploaded file should contain either a plain list of webpage URLs, one per line (TXT), or a sitemap (XML).
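For illustration, a TXT upload would simply list the pages to crawl, one URL per line. The addresses below are placeholders:

    https://docs.searchunify.com/getting-started.html
    https://docs.searchunify.com/release-notes.html
    https://docs.searchunify.com/faq.html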
Sitemap.xml Capabilities
Uploading a sitemap offers several advantages. The primary two are:
- Accelerated Recrawls. The SearchUnify crawler identifies the URLs in a sitemap that carry the <lastmod> tag and crawls only the pages updated or created since the last crawl. This type of crawling can be several times faster on a large website. Related: Generate a Sitemap from URL (loc) and Last Modified (lastmod) Attributes (requires a login to the SearchUnify Community)
- Custom Filters. Admins can add custom attributes, such as <author> and <priority>, for each URL in the sitemap. After crawling, these custom attributes can be used as filters on a search client. To index a filter, use the format {{sitemap}}{{filterName}} in Rules.
SearchUnify is capable of handling complex sitemaps which contain more than just a list of plain URLs. One such feature is the <lastmod> tag.
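For illustration, a sitemap entry that uses <lastmod> along with a custom <author> tag might look like the sketch below. The URL, date, and author value are placeholders; after a crawl, such a custom tag would be referenced as {{sitemap}}{{author}} in Rules, following the format given above.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://docs.searchunify.com/example-article.html</loc>
        <lastmod>2021-06-15</lastmod>
        <author>Jane Doe</author>
      </url>
    </urlset>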
⚠ WARNING.
SearchUnify doesn't crawl a website when the links in the sitemap.xml or TXT file don't match the real web addresses. Five common issues are:
- Redirects. If a link redirects, it will not be crawled.
- HTTPS and HTTP. If a website is secure, then crawling proceeds only when the sitemap.xml or TXT file contains secure links, starting with https. If a website is insecure, then use http links. In either case, the specified and actual web addresses must match.
- Trailing Spaces. Any space after the end of a URL causes the indexing to stop.
- Comments. If there are any comments in the TXT file or sitemap.xml, remove them before the upload.
- Empty Lines. Indexing can stop abruptly if empty lines are found in the beginning, middle, or end of a TXT file or sitemap.xml.
Depth
Depth works on sitemaps. When Depth is one, SearchUnify crawls the links in your sitemap. When Depth is two, SearchUnify follows the hyperlinks on the webpages and crawls them as well. The next image outlines the function.
The recommended value of Depth is less than 10.
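As a textual stand-in for that image (the URLs are placeholders):

    Sitemap lists:    https://example.com/a.html
    a.html links to:  https://example.com/b.html
    Depth 1 crawls:   a.html only
    Depth 2 crawls:   a.html and, through its hyperlink, b.html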
JavaScript Enabled Crawling
Some websites rely on JavaScript to function while others, such as Wikipedia, can function without it. If the content source website depends on JavaScript, toggle JavaScript Enabled Crawling on. Otherwise, keep it turned off.
A way to find out whether a website depends on JavaScript for its regular functions is to go to chrome://settings/content/javascript and turn off JavaScript. Then reload the website. If you receive an error message or the website starts behaving erratically, turn on JavaScript Enabled Crawling. Otherwise, leave the setting in its default state.
Authentication Method
The settings described in the first two sections are adequate if the content source is a public website, such as Wikipedia or SearchUnify Docs. Public websites are accessible to everyone with an Internet connection.
If your website is public, click Connect and jump to the section Frequency.
However, if your website requires users to log in before they can view an article, a video, or other content, you have to configure security settings. The configurations, found in the Authentication Method dropdown, are of two kinds: Basic and Form.
⚠ IMPORTANT.
Gated websites cannot be crawled and indexed until Basic or Form is set up.
Basic
Select Basic from the Authentication Method dropdown if the website requires users to enter their username and password. On selection, two new fields appear. Enter a valid ID and password and click Connect.
⚠ IMPORTANT.
If there is a captcha on the login form, then your website will not be crawled.
Form
Form is an advanced version of Basic and requires an acquaintance with CSS selectors to set up. You can think of CSS selectors as guideposts that point to specific elements on a page.
Consider the next image, where a pair of <h1> tags tells the browser to interpret the enclosed text as a heading and a pair of <p> tags tells the browser to interpret the enclosed text as a paragraph. The corresponding CSS selectors are simply h1 and p.
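As a plain-text stand-in for that image, the markup in question might look like this (the text is a placeholder):

    <h1>Getting Started</h1>
    <p>This paragraph walks you through the first step.</p>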
When Form is selected, then an admin has to specify, besides the username and password:
- Login URL
- CSS selectors of the username field
- CSS selector of the password field
- CSS selector of the login button
Login URL is straightforward. Chrome users can find the CSS selectors by pressing Ctrl+Shift+I and hovering the cursor over each field and button, one at a time. In the next image, you can see the CSS selector of the username field, which is #username.
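For illustration, a login page might contain a form like the sketch below. The ids, names, and action URL are hypothetical; inspect your own login page for the actual values.

    <form action="https://example.com/login" method="post">
      <input type="text" id="username" name="username">
      <input type="password" id="password" name="password">
      <button type="submit" id="loginBtn">Log In</button>
    </form>

For this markup, the selectors to enter would be #username for the username field, #password for the password field, and #loginBtn for the login button.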
Stage 2: Set up Crawl Frequency
The first crawl is always manual. After that, you can set up a crawl frequency. When a sitemap.xml with a <lastmod> tag is used, crawls are faster because only the web pages updated or added since the last crawl are indexed. In the case of a website URL or a TXT file, the entire website is indexed during each crawl.
Stage 3: Select Fields for Indexing
SearchUnify indexes a website by capturing and storing the data inside HTML elements.
⚠ IMPORTANT
A website is not indexed if no HTML element is specified.
The admins can write CSS selectors to specify the elements for indexing. The CSS selectors are stored in an Object, which we will create next.
- To start, enter object details in Name and Label. The name and the label do not have to be valid HTML tags. You can add only one object for a website content source.
- Click Add Object to create an empty object.
- Click the add button to add content fields for indexing.
- Selector is the most important field. It's used to find content for indexing. On a webpage, all HTML tags, classes, and IDs are valid selectors. To index paragraphs on a website, enter p in the Selector field. Don't use angle brackets in HTML selectors: the right way is p, h1, ol, and title instead of <p>, <h1>, <ol>, and <title>. The right way to enter IDs is to prefix them with a # (octothorpe), so hometoc alone is incorrect but #hometoc is correct. As for classes, each class name is prefixed with a . (dot) in Selector. For a worked example, see the sketch after this procedure.
- Assign the Selector a Type. The default Type is string. Always stick with the default unless the value is time or date.
- Give the selector a Name and Label. Both are used in the backend. You can be creative here.
- An HTML element can occur once or several times on a webpage. In almost all cases, the title is found only once and wrapped in a pair of h1 tags, while a page usually has multiple paragraphs surrounded by p tags. Single/Multiple is for those elements that occur more than once, such as p, h2, and i. When Single is selected, the data stored in multiple HTML elements is collated. Think of a webpage with five paragraphs: if you pick Single, then all five paragraphs are collated, as they should be for running text. But if you pick Multiple, then each paragraph is stored separately. Multiple suits scenarios where a web page lists items, for example a list of books where each title is surrounded by the same pair of i tags. In that case, Multiple is the way to go, as illustrated in the sketch after this procedure.
- Press Add and then Save.
- Switch to By Filter.
- Next, the four fields in URL Filter Configuration enable admins to fine-tune crawling.
Should Crawl. This feature is used to send the crawler to select pages. It is extremely useful if you have thousands of pages in a sitemap but don't want to crawl them all. List your webpages in Should Crawl or enter a regular expression specifying the pages to crawl. Should Crawl accelerates crawling when used properly. While using it, make sure that you enter both the secure and insecure addresses of the webpages mentioned in Authentication.
Should Not Crawl. This feature specifies the webpages where the crawler shouldn't go; it does the exact opposite of Should Crawl. Use it when it's simpler to exclude URLs than to include them. Think of it this way: out of 10,000 URLs, you don't want the crawler to visit 20. You can either specify the remaining 9,980 URLs in Should Crawl or the 20 URLs in Should Not Crawl. In this particular scenario, Should Not Crawl saves a lot of time.
Outlink Filters for URL. This feature is used to crawl selected pages at depth two. Think of it this way: a home page links to an Our Products page, and Our Products links to 50 different pages. You don't want them all in your index. A way to configure indexing is to first turn off Limit Crawling to Sitemap in Authentication and then enter the URLs of the product pages to be indexed. Instead of entering dozens of URLs, you can insert a regular expression that covers them all; https://su.com/product1 and https://su.com/product2 are both covered by https://su.com/product* in Outlink Filters for URL. NOTE: This feature is to be used only if you have entered a sitemap in Authentication.
Indexing Filters for URL. This feature is used to index select pages. While Should Crawl determines which pages are crawled, Indexing Filters for URL determines which pages are stored in SearchUnify's database, or index. Ensure that you have specified Should Crawl correctly: if the crawler cannot get to page.html, then mentioning page.html in Indexing Filters for URL will not add it to the SearchUnify database. You can enter the URLs of the pages to be indexed or a regular expression.
- Press Save.
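As a worked example of the Selector and Single/Multiple settings described above, consider this hypothetical page fragment; all tags, ids, classes, and text are placeholders:

    <h1 id="hometoc">Beginner's Guide</h1>
    <p class="intro">SearchUnify indexes the data stored inside HTML elements.</p>
    <p>Each additional paragraph sits in its own pair of p tags.</p>

Here the selector h1 (or #hometoc) captures the title, which occurs once, so Single fits. The selector p matches more than one element: picking Single collates both paragraphs into one field, while Multiple stores each paragraph separately. A class is entered with a leading dot, such as .intro, to capture only the first paragraph.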
You have successfully added Website as a content source.