The Web knows no bounds. With a seemingly infinite amount of data at our fingertips, effective navigation through this unending maze of information becomes as important as its comprehension. The easiest way to find the proverbial needle in the haystack? The search engine.
For all the complexity behind the search engine, it has two primary functions: crawling and indexing the Web, and serving results ranked by their calculated relevance to a query.
What does it mean when we say Google has “indexed” a site? Colloquially, we mean that we see the site in a [site:www.site.com] search on Google. Such a search shows the pages that have been added to Google’s database, but technically those pages have not necessarily been crawled, which is why you can see this from time to time.
Indexing is something entirely different. If you want to simplify it, think of it this way: URLs have to be discovered before they can be crawled, and they have to be crawled before they can be “indexed” or, more accurately, have some of the words in them associated with the words in Google’s index.
Google learns about URLs and adds them to its crawl scheduling system. It dedupes the list, rearranges the URLs in priority order, and crawls them in that order. Once a page is crawled, Google goes through another algorithmic process to determine whether to store the page in its index. What this means is that Google doesn’t crawl every page it knows about and doesn’t index every page it crawls.
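The discover, schedule, crawl, index sequence above can be sketched in a few lines. This is only an illustration of the flow, not how Google implements it: the `priority` function, the `crawl_budget` limit, the `should_index` check, and the example URLs are all hypothetical stand-ins for signals the search engine computes internally.

```python
import heapq

def schedule_crawl(discovered_urls, priority):
    """Dedupe discovered URLs and return them highest priority first."""
    unique = set(discovered_urls)  # the same URL may be discovered many times
    # heapq is a min-heap, so negate the priority to pop the largest first.
    heap = [(-priority(url), url) for url in unique]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

def run_pipeline(discovered_urls, priority, crawl_budget, should_index):
    """Crawl only the top `crawl_budget` URLs, then index a subset of those."""
    crawled = schedule_crawl(discovered_urls, priority)[:crawl_budget]
    indexed = [url for url in crawled if should_index(url)]
    return crawled, indexed

# Hypothetical data: five discoveries, one of them a duplicate.
urls = ["a.example/", "b.example/", "a.example/", "c.example/thin", "d.example/"]
scores = {"a.example/": 4, "b.example/": 3, "c.example/thin": 2, "d.example/": 1}

crawled, indexed = run_pipeline(
    urls,
    priority=lambda url: scores[url],
    crawl_budget=3,                               # assumed limit on pages crawled
    should_index=lambda url: "thin" not in url,   # assumed quality check
)
print(crawled)  # known URLs that made it into the crawl
print(indexed)  # crawled URLs that were also stored in the index
```

In this toy run, `d.example/` is known but never crawled, and `c.example/thin` is crawled but never indexed, which mirrors the two gaps the paragraph describes.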
Which brings us to how these pages are ranked. At first glance, it seems reasonable to believe that a search engine keeps an index of all these web pages and, when a user types in a search query, browses through its index and counts the occurrences of the keywords in each web file. The winners are the pages with the highest number of occurrences of the keywords, and these get displayed back to the user. Indeed, this was how things were done in early search engines, with their text-based ranking systems. This approach leads to a host of issues. For example, if one searches for “ACM”, one would expect www.acm.org to be the most relevant result. However, there may be millions of pages on the web using the term “ACM”. Suppose one were to write nothing but the term “ACM” a billion times on a web page. Since the search engine simply counts the occurrences of the words in the query, such a page would invariably make it to the top of the results.
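To make the weakness concrete, here is a minimal sketch of such a text-based ranker. The page contents and the `spam.example` address are made up for illustration, and the spam page repeats the term a thousand times rather than a billion to keep the example small.

```python
def term_count_score(query, text):
    """Count how many times the query terms occur in the page text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def rank_by_term_count(query, pages):
    """Return page URLs ordered by raw term count, highest first."""
    return sorted(
        pages,
        key=lambda url: term_count_score(query, pages[url]),
        reverse=True,
    )

# Hypothetical page contents.
pages = {
    "www.acm.org": "ACM is the Association for Computing Machinery",
    "spam.example": "ACM " * 1000,  # nothing but the keyword, repeated
}

results = rank_by_term_count("ACM", pages)
print(results)  # the keyword-stuffed page outranks www.acm.org
```

Because the score depends only on what a page says about itself, whoever repeats the keyword most wins, regardless of relevance.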
The usefulness of a search engine depends on the relevance of the result set it gives back. There may of course be millions of web pages that include a particular word or phrase; however, some of them will be more relevant, popular, or authoritative than others. A user has neither the ability nor the patience to scan through all the pages that contain the given query words. One expects the relevant pages to be displayed within the top 20-30 pages returned by the search engine.
One of the most well-known algorithms for computing the relevance of web pages is Google’s PageRank algorithm. The idea behind PageRank is that the importance of any web page can be judged by looking at the pages that link to it. If we create a web page i and include a hyperlink to the web page j, this means that we consider j important and relevant for our topic. If a lot of pages link to j, the common belief is that page j is important. If, on the other hand, j has only one backlink, but it comes from an authoritative site k (like www.google.com or www.cnn.com), we say that k transfers its authority to j; in other words, k asserts that j is important. Whether we talk about popularity or authority, we can iteratively assign a rank to each web page, based on the ranks of the pages that point to it.
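One common way to write this iterative assignment down is shown here. The notation is an assumption introduced for illustration and does not come from the text above: $B_j$ is the set of pages that link to $j$, $L(i)$ is the number of outbound links on page $i$, $N$ is the total number of pages, and $d$ is a damping factor (often quoted as roughly 0.85) that is part of the commonly published formulation.

```latex
PR(j) = \frac{1 - d}{N} + d \sum_{i \in B_j} \frac{PR(i)}{L(i)}
```

Each page $i$ that links to $j$ passes along its own rank divided by the number of links it sends out, so a backlink from a highly ranked page contributes more than one from an obscure page.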
A quick overview of PageRank:
- The higher the page’s score, the further up the search results list it will appear.
- Scores are partially determined by the number of other Web pages that link to the target page. Each link is counted as a vote for the target. The logic behind this is that pages with high-quality content will be linked to more often than mediocre pages.
- Not all votes are equal. Votes from a high-ranking Web page count more than votes from low-ranking sites. You can’t really boost one Web page’s rank by making a bunch of empty Web sites linking back to the target page.
- The more links a Web page sends out, the more diluted its voting power becomes. In other words, if a high-ranking page links to hundreds of other pages, each individual vote won’t count as much as it would if the page only linked to a few sites.
- Other factors that might affect scoring include how long the site has been around, the strength of the domain name, how and where the keywords appear on the site, and the age of the links going to and from the site. Google tends to place more value on sites that have been around for a while.
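The voting and dilution rules in the list above can be sketched as a short iterative computation. This is a simplified illustration based on the commonly published PageRank formulation, not Google's production ranking: the `damping` value, the fixed iteration count, the handling of pages with no outbound links, and the tiny example graph are all assumptions, and the other factors from the last bullet are ignored.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages; `links` maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page receives a small baseline share each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = set(links.get(page, ()))
            if targets:
                # Dilution: the page's vote is split across all its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed handling: a page with no links votes for everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: a, b and c all vote for j; c also votes for a.
links = {"a": ["j"], "b": ["j"], "c": ["j", "a"], "j": ["a"]}
ranks = pagerank(links)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph j collects the most votes and ends up on top, while a ranks second even though it has fewer backlinks than j, because one of them comes from the high-ranking j. Page c splits its vote between j and a, so each of those links carries half the weight of b's single link.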