The deep web is the portion of the internet that cannot be found via search engines. This could be because the data is not linked from anywhere, the data is hidden behind a paywall or password, the data is stuck in a database with no HTML to render its contents, or the pages explicitly tell search engines not to crawl them via “nocrawl” directives and robots.txt files. The deep web is also known as the dark web, invisible web, etc. Each term has its own subtle meaning, but they’re often used interchangeably.
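To make the robots.txt case concrete, here is a minimal sketch of how a well-behaved crawler might decide whether it is allowed to fetch a page. It uses Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` user agent, the example.com URLs, and the `is_crawlable` helper are all made up for illustration; real crawlers do considerably more than this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/report.html",
    ):
        print(url, is_crawlable(ROBOTS_TXT, "ExampleBot", url))
```

Under these assumed rules, anything under /private/ is off limits to a compliant crawler, so it would never show up in that crawler's index even though the page is publicly reachable.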
I find it incredibly ironic that in the process of trying to learn more about the deep web, I have been hindered by the issues that cause the deep web. For example, I think this paper on Accessing the Deep Web would be a terrific read. When I click on the link for the full-text article, I’m faced with an ACM login page. The deep web is hindering my access to knowledge about the deep web.
I’m particularly interested in the methodology of estimating the size of the deep web. I keep hearing widely varying estimates, and the methodologies behind them vary even more. I’d be curious to learn more about research projects and/or companies’ efforts to make this body of information more accessible. If you’re barking up this tree, ping me or comment below.



Andrew hey - When we started Quigo, our focus was specifically on exposing the deep web. I even found an old page we were tinkering with a bunch of years ago… - http://www.deepweb.com/
Anyhow - ping when you have a chance, and I’d be happy to share whatever knowledge we accumulated in our deep web days… ;-)