Thursday, July 31, 2008

What is "The Invisible Web"?

Many untrained users have the naive expectation that they can locate anything on the World Wide Web by using Google or Yahoo or Ask.com. But as powerful as these search engines are, they do not index everything on the World Wide Web. In fact, search engines index less than 10% of the entire web! The remaining 90% is called the "Invisible Web", also known as "The Cloaked Web" or "The Deep Web". This is the massive body of content that is publicly available but hidden from regular search engines.
Indeed, this is a tough concept to grasp: that billions of web pages cannot be found by Google. But it's true; billions of pages are beyond the reach of search engine cataloging. The robot "spiders" which scan and catalog the World Wide Web are limited: they can neither see nor index everything.
To better visualize this concept, let's start with some size estimates from Google.com, Yahoo.com, Cyberatlas, and MIT. These stats are current as of Fall 2007:

  • Google.com indexes 12.5 billion public web pages.
  • 71 billion static web pages are publicly available. These pages can easily be found by Google and other search engines. (e.g. www.honda.com, www.australia.gov.au)
  • 6.5 billion static pages are hidden from the public. As private intranet content, these are the corporate pages that are only open to employees of specific companies. (e.g. employees.honda.com, secure.australia.gov.au)
  • 220+ billion database-driven pages are completely invisible to Google. These invisible pages are not the regular web pages you and I can make. Rather, these are dynamic database reports that exist only when called from large databases. (e.g. custom online car quote for Shelly, Australian government discussion on aboriginal taxation)

Google, considered the best search database today, can only catalog a fraction of this monstrous content. Even with electronic spiders cataloging millions of web pages each week, Google currently indexes only 12.5 billion out of the 220+ billion pages out there: less than 6% of all available internet content.
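To check that arithmetic, here is a minimal Python sketch that recomputes the share from the estimates quoted above. The variable names are just labels chosen for this illustration, and the figures are the Fall 2007 estimates from this post, not fresh measurements.

```python
# Page-count estimates quoted in this post (Fall 2007), in billions.
google_indexed = 12.5    # pages Google indexes
static_public = 71.0     # publicly available static pages
static_private = 6.5     # private intranet static pages
database_driven = 220.0  # database-driven ("invisible") pages


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Google's index measured against the database-driven pages alone.
vs_database = share(google_indexed, database_driven)

# Google's index measured against every category added together.
total_pages = static_public + static_private + database_driven
vs_total = share(google_indexed, total_pages)

print(f"Versus database-driven pages: {vs_database:.1f}%")
print(f"Versus all listed pages:      {vs_total:.1f}%")
```

Either way you slice it, the result stays under 6%, which is the point of the comparison.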
So if Google only catalogs 6% of the World Wide Web, and other search engines catalog even less, then where is the remaining 90%-plus of web content hidden?

The "Invisible Web" (aka "Deep Web" or "Cloaked Web") is content that is closed off to search engines. To be specific: the Invisible Web consists of 220+ billion web pages that are not stored as static web pages. Instead, the Invisible Web is made of on-demand database content: pages which exist only as reports of changing data. As of August 2007, robot spiders are not advanced enough to read these private databases. Only a human reader can see these "invisible pages" by directly visiting these sites and making direct database requests.
Technical terminology:

  • "Spider": an automated program, or robot, that is sent out weekly to scour the public Internet and read millions of static web pages. The spider reports back to its mother database with its results, and those results get collated into search engine catalogs for public use.
  • "Database-Driven Web Content": web pages that exist only temporarily, and are generated only when readers request answers from a large database. These temporary web pages are dynamic, usually cannot be bookmarked, and commonly have extremely long URL addresses.
  • Examples of databased web content: Today's job postings in Honolulu, apartments available in Singapore, today's weather report for Dublin, flights available to Istanbul, stock quotes for the NYSE, houses for sale in Winnipeg, reviews on the movie "Bourne Ultimatum", leather jackets for sale on eBay, hard drives for sale at Best Buy, your current savings account balance.
  • These temporary web pages, once displayed to the reader, cease to exist moments later. Minute-by-minute, these databased web pages are re-created to reflect updated information on the database.
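To make the spider's blind spot concrete, here is a rough Python sketch of a link-following crawler running over a tiny in-memory "web". Everything in it is made up for illustration: the page paths, the `dynamic_job_report` function standing in for a database, and the `crawl` helper. Real spiders are far more elaborate; this only shows why a page that exists solely as the answer to a form query never gets reached by following links.

```python
from html.parser import HTMLParser

# A toy "web" held in memory. Every path and page here is hypothetical.
STATIC_PAGES = {
    "/": '<a href="/about">About</a> <a href="/jobs">Jobs</a>',
    "/about": '<a href="/">Home</a>',
    # The jobs page offers only a search form, no links to the results.
    "/jobs": '<form action="/jobs/search"><input name="city"></form>',
}


def dynamic_job_report(city):
    """Stand-in for a database query: the page exists only when requested."""
    return f"<h1>Jobs in {city}</h1>"


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag; forms are ignored on purpose."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start):
    """Follow links from start and return the set of static pages reached."""
    seen, queue = set(), [start]
    while queue:
        url = queue.pop()
        if url in seen or url not in STATIC_PAGES:
            continue
        seen.add(url)
        parser = LinkCollector()
        parser.feed(STATIC_PAGES[url])
        queue.extend(parser.links)
    return seen


indexed = crawl("/")
print("Spider indexed:", sorted(indexed))

# A human who types a city into the form gets a page the spider never saw.
print("Human request:", dynamic_job_report("Honolulu"))
```

The spider ends up with the three static pages, while the Honolulu report only comes into being when someone asks for it.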

"OK. So I think I understand this now. 'Invisible Web Pages' are really 'Dynamic Web Pages'. That's when a database builds me a temporary page to answer my database question! Neat! So how do I find these thousands of 'Invisible Web' databases?"

Finding and reading "Invisible Web" pages is neither difficult nor prohibited, but it is time-consuming. Because Invisible Web ("Deep Web") content is very specialized and unique to a specific topic of changing content, you will need to search twice:

  1. First: you need to use primary search engines (Google, Yahoo, MSN, A9, Vivisimo, Dogpile, Ask) to locate the database you want. Expect this to take at least 30 minutes of searching (e.g. search for "jobs in honolulu", "weather reports dublin", "houses for sale winnipeg", "hard drive sales at best buy").
  2. Second: once you find the specific database you want, you then need to search within that database. For example: you may find www.monster.com helpful for searching databased jobs in Honolulu. Perhaps you will read www.theweathernetwork.com to find your weather reports, or www.expedia.com to search for available flights. Note: expect this "invisible web" database searching to take hours; these databases overflow with interesting and useful content!
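For the curious, the second step usually boils down to your browser sending your search criteria to the database inside the URL. The Python sketch below builds such a request address and reads it back apart. The site `jobs.example.com` and the parameter names are hypothetical, not the real address or fields of any of the sites named above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical job database; not a real site's search address.
BASE = "https://jobs.example.com/search"


def build_query_url(base, **criteria):
    """Attach the search criteria to the base address as a query string."""
    return f"{base}?{urlencode(criteria)}"


url = build_query_url(
    BASE,
    city="Honolulu",
    category="nursing",
    posted_within_days=1,
    sort="date",
)
print(url)

# The database reads the same criteria back out of the address.
params = parse_qs(urlsplit(url).query)
print(params["city"], params["category"])
```

Each extra criterion makes the address longer, which is why these database-driven pages tend to have such unwieldy URLs.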
