Tuesday, August 12, 2008

How Google Works

If you aren’t interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

  • Googlebot, a web crawler that finds and fetches web pages.
  • The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
  • The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

Let’s take a closer look at each part.

1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.
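The per-server slowdown described above can be pictured as a small bookkeeping table. The following is only a minimal sketch, not Googlebot's actual mechanism: the class name PolitenessGate, the fixed delay, and the example.com-style hosts are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent.

    Hypothetical helper: a fixed minimum delay per host is an assumption,
    not something the text above specifies.
    """

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest permitted request time

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching url at time now."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that a request to url's host was sent at time now."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


gate = PolitenessGate(min_delay_seconds=2.0)
gate.record_request("http://a.example/page1", now=100.0)
# A second request to the same host must wait; a different host need not.
same_host_wait = gate.wait_time("http://a.example/page2", now=100.5)
other_host_wait = gate.wait_time("http://b.example/", now=100.5)
```

Because the delay is tracked per host, many hosts can still be fetched in parallel while each individual server sees only a trickle of requests.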

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

Screen shot of web page for adding a URL to Google.

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers”; it asks you to enter the letters you see — something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
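As a rough illustration of culling links from a fetched page and queuing them for later crawling, here is a minimal sketch that uses only Python's standard library. The names LinkExtractor, harvest_links, and crawl_queue, and the sample page, are hypothetical; a real crawler would also handle malformed HTML, robots rules, and much more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))


def harvest_links(base_url, html):
    """Return all absolute link targets found in the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Illustrative page content; no network access is involved.
page = '<a href="/about.html">About</a> <a href="http://other.example/">Other</a>'
crawl_queue = deque()
crawl_queue.extend(harvest_links("http://site.example/index.html", page))
```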

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
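The duplicate check on the “visit soon” queue can be sketched with a set alongside the queue. This is an assumption-laden illustration, not Google's design: the class name Frontier is made up, and comparing URLs as exact strings ignores the normalization a real crawler would need.

```python
from collections import deque


class Frontier:
    """A "visit soon" queue that refuses URLs it has already seen."""

    def __init__(self, already_indexed=()):
        # URLs already in the index count as seen from the start.
        self.seen = set(already_indexed)
        self.queue = deque()

    def add(self, url):
        """Queue url unless it is a duplicate; return True if it was queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


frontier = Frontier(already_indexed=["http://a.example/"])
first_try = frontier.add("http://a.example/")   # already indexed
second_try = frontier.add("http://b.example/")  # new, so it is queued
third_try = frontier.add("http://b.example/")   # duplicate in the queue
```

Set membership tests take roughly constant time, which matters when the queue is checked on every harvested link.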

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. These crawls are known as fresh crawls. Newspaper pages are downloaded daily, and pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google both to make efficient use of its resources and to keep its index reasonably current.
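One simple way to make the revisit rate track how often a page changes is to shorten the interval when a change is seen and lengthen it otherwise. The text does not say how Google actually does this; the halving and doubling rule, the function name next_interval, and the one-hour and thirty-day bounds below are all assumptions chosen for illustration.

```python
def next_interval(current_hours, page_changed, min_hours=1.0, max_hours=720.0):
    """Return the next revisit interval in hours.

    Assumed heuristic: halve the interval after a change, double it after
    an unchanged fetch, and clamp to [min_hours, max_hours].
    """
    if page_changed:
        proposed = current_hours / 2
    else:
        proposed = current_hours * 2
    return max(min_hours, min(max_hours, proposed))


# A page fetched daily that keeps changing gets visited more often,
# while a static page drifts toward the monthly upper bound.
busy_page = next_interval(24, page_changed=True)
quiet_page = next_interval(24, page_changed=False)
```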

2. Google’s Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
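The data structure described above is commonly called an inverted index. Here is a minimal sketch of one, assuming whitespace-separated words and small in-memory documents; the function name build_index and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {doc_id: [word positions]}, sorted by term.

    documents is a dict of doc_id -> text. Splitting on whitespace is a
    simplification of real tokenization.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    # Sort alphabetically by term, as the description above suggests.
    return dict(sorted(index.items()))


docs = {1: "the cat sat", 2: "the cat ran to the cat"}
index = build_index(docs)
cat_entry = index["cat"]  # which documents contain "cat", and where
```

Looking up a query term is then a single dictionary access rather than a scan of every document.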

To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s performance.
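The clean-up steps above (lowercasing, ignoring punctuation and extra spaces, dropping stop words) might look like the following sketch. The stop-word set contains only the examples named in the text, and the regular expression and function name index_terms are assumptions, not Google's actual rules.

```python
import re

# Only the stop words named in the text; the real list is not given.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def index_terms(text):
    """Lowercase text, split it into words, and drop stop words."""
    # Runs of letters and digits count as words, so punctuation and
    # repeated spaces disappear in this step.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


terms = index_terms("Why is the   Sky Blue?")
```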

3. Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.

Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for an interpretation of the concepts and the practical applications contained in Google’s patent application.

Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page. These options are offered by Google’s Advanced Search form and by its search operators (advanced operators).
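Storing word positions in the index is what makes phrase matching possible. The sketch below checks whether the terms of a phrase occur consecutively and in order within one document. It is a simplified illustration; the function name phrase_positions and the sample postings are hypothetical.

```python
def phrase_positions(postings, phrase_terms):
    """Return start positions where phrase_terms occur consecutively.

    postings maps each term to its list of word positions within ONE
    document, as a positional index would store them.
    """
    if not phrase_terms or any(term not in postings for term in phrase_terms):
        return []
    starts = []
    for start in postings[phrase_terms[0]]:
        # Every later term must sit exactly offset words after the start.
        if all(start + offset in postings[term]
               for offset, term in enumerate(phrase_terms)):
            starts.append(start)
    return starts


# Positions for the text "new york city is not york new".
postings = {"new": [0, 6], "york": [1, 5], "city": [2], "is": [3], "not": [4]}
in_order = phrase_positions(postings, ["new", "york"])
reversed_order = phrase_positions(postings, ["york", "new"])
```

The same position lists could be used to score proximity, for example by measuring the gap between the nearest occurrences of two terms.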

Let’s see how Google processes a query.

  1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match any particular query term.
  2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.
  3. The search results are returned to the user in a fraction of a second.
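The steps above can be sketched in a few lines, with plain dictionaries standing in for the index servers and doc servers. Everything here is an assumption for illustration: the function name search, the rule that a page must contain every query term, and the way the snippet is cut are not taken from Google.

```python
def search(query, index, doc_store, snippet_width=40):
    """Return (doc_id, snippet) pairs for documents matching every term.

    index maps term -> set of doc_ids (the "index servers");
    doc_store maps doc_id -> full text (the "doc servers").
    """
    terms = query.lower().split()
    if not terms:
        return []
    # Step 1: ask the index which documents contain all query terms.
    matching = set(index.get(terms[0], ()))
    for term in terms[1:]:
        matching &= set(index.get(term, ()))
    # Step 2: fetch each stored document and cut a snippet around the
    # first occurrence of the first query term.
    results = []
    for doc_id in sorted(matching):
        text = doc_store[doc_id]
        at = text.lower().find(terms[0])
        start = max(0, at - snippet_width // 2)
        results.append((doc_id, text[start:start + snippet_width]))
    # Step 3: the caller receives the formatted results.
    return results


index = {"web": {1, 2}, "crawler": {2}}
doc_store = {1: "The web is vast.", 2: "A web crawler fetches pages."}
hits = search("web crawler", index, doc_store)
```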

This page was last modified on: Friday February 2, 2007

Results Page

The results page is filled with information and links, most of which relate to your query.

Screen shot indicating what is shown on a Google results page.

  • Google Logo: Click on the Google logo to go to Google’s home page.
  • Statistics Bar: Describes your search, includes the number of results on the current results page and an estimate of the total number of results, as well as the time your search took. For the sake of efficiency, Google estimates the number of results; it would take considerably longer to compute the exact number. This estimate is unreliable.

    Every underlined term in the statistics bar is linked to its dictionary definition. Queries that are linked to just one definition are followed by a definition link.

  • Tips: Sometimes Google displays a tip in a box just below the statistics bar.
    Screen shot of a Google tip
  • Search Results: Ordered by relevance to your query, with the result that Google considers the most relevant listed first. Consequently you are likely to find what you’re seeking quickly by looking at the results in the order in which they appear. Google assesses relevance by considering over a hundred factors, including how many other pages link to the page, the positions of the search terms within the page, and the proximity of the search terms to one another.

    Below are descriptions of some search-result components. These components appear in fonts of different colors on the result page to make it easier to distinguish them from one another.

    • Page Title: (blue) The web page’s title, if the page has one, or its URL if the page has no title or if Google has not indexed all of the page’s content. Click on the page title (e.g., The History of the Brassiere - Mary Phelps Jacob) to display the corresponding page.
    • Snippets: (black) Each search result usually includes one or more short excerpts of the text that matches your query with your search terms in boldface type. Each distinct excerpt or snippet is separated by an ellipsis (…). These snippets, which appear in a black font, may provide you with

      • The information you are seeking
      • What you might find on the linked page
      • Ideas of terms to use in your subsequent searches

      When Google hasn’t crawled a page, it doesn’t include a snippet. A page might not be crawled because its publisher requested no crawling, or because the page was written in such a way that it was too difficult to crawl.

    • URL of Result: (green) Web address of the search result. In the screen shot, the URL of the first result is inventors.about.com/library/weekly/aa042597.htm.
    • Size: (green) The size of the text portion of the web page. It is omitted for sites not yet indexed. In the screen shot, “5k” means that the text portion of the web page is 5 kilobytes. One kilobyte is 1,024 (2^10) bytes. One byte typically holds one character. In general, the average size of a word is six characters. So each 1k of text is about 170 words. A page containing 5k characters thus is about 850 words long.

      Large web pages are far less likely to be relevant to your query than smaller pages. For the sake of efficiency, Google searches only the first 101 kilobytes (approximately 17,000 words) of a web page and the first 120 kilobytes of a PDF file. Assuming 15 words per line and 50 lines per page, Google searches the first 22 pages of a web page and the first 26 pages of a PDF file. If a page is larger, Google lists its size as 101 kilobytes (or 120 kilobytes for a PDF file). This means that Google’s results won’t reference any part of a web page beyond its first 101 kilobytes or any part of a PDF file beyond the first 120 kilobytes.

    • Date: (green) Sometimes the date Google crawled a page appears just after the size of the page. The date tells you the freshness of Google’s copy of the page. Dates are included for pages that have recently had a fresh crawl.
    • Indented Result: When Google finds multiple results from the same website, it lists the most relevant result first with the second most relevant page from that same site indented below it. In the screen shot, the indented result and the one above it are both from the site inventors.about.com.

      Limiting the number of results from a given site to two ensures that pages from one site will not dominate your search results and that Google provides pages from a variety of sites.

    • More Results: When there are more than two results from the same site, access the remaining results from the “More results from…” link.

      When Google returns more than one page of results, you can view subsequent pages by clicking either a page number or one of the “o”s in the whimsical “Gooooogle” that appears below the last search result on the page.

      Click on a number or an "o" to see another page of results.

      If you find yourself scrolling through pages of results, consider increasing the number of results Google displays on each results page by changing your global preferences.

      In practice, however, if pages of interest to you aren’t within the first 10 results, consider refining your query instead of sifting through pages of irrelevant results. To simplify such refinements, Google includes a search box at the bottom of the page you can use to enter your refined query.

  • Sponsored Links: Your results may include some clearly identified sponsored links (advertisements) relevant to your search. If any of your search terms appear in the ads, Google displays them in boldface type.
  • Spelling Corrections, Dictionary Definition, Cached, Similar Pages, News, Product Information, Translation, Book results: Your results may include these links, which are described in the next few chapters.
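The two-results-per-site rule described under Indented Result and More Results can be sketched as a single pass over the ranked list. This is only an illustration under stated assumptions: the function name cluster_by_site is made up, a “site” is taken to be the URL's host name, and the sketch does not move the second result up to sit indented beneath the first.

```python
from urllib.parse import urlsplit


def cluster_by_site(ranked_urls, per_site=2):
    """Split ranked results into those shown and those held back per site.

    Returns (shown, more), where more maps a host name to the results
    that would sit behind a "More results from..." link.
    """
    shown = []
    more = {}
    counts = {}
    for url in ranked_urls:
        site = urlsplit(url).netloc
        counts[site] = counts.get(site, 0) + 1
        if counts[site] <= per_site:
            shown.append(url)
        else:
            more.setdefault(site, []).append(url)
    return shown, more


ranked = [
    "http://a.example/1",
    "http://b.example/1",
    "http://a.example/2",
    "http://a.example/3",
]
shown, more = cluster_by_site(ranked)
```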

For more on what’s included on Google’s results page, visit www.google.com/help/interpret.html.
