What Is a Web Crawler?


The term "web crawler" is used quite often in web design and search engine optimization articles, but what exactly is it? And why is it essential to the functioning of the internet?

Definition of Web Crawler

While images of robotic spiders clambering over websites come to mind, a better metaphor for this program is a librarian. Any website is made up of files (.php, .html, .asp, and so on), and the crawler is simply an automated data-gathering program that reads those files and collects only the data its creator needs.

How Does a Crawler Find Your Site?

The science of search engine optimization is designed almost entirely around making websites attractive to web crawlers, also known as "bots". People try to have their site linked from as many other sites as possible because the bot finds your site through those hyperlinks, often while "crawling" another site (a minimal sketch of that link-following appears below). The more links to your site exist out there, the more likely it is that more than one bot will find it, and that is the first piece of information the search engines (which operate most of these programs) get back: this is a popular site.
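As a rough illustration of that discovery process, here is a minimal sketch, using only the Python standard library, of how a crawler can harvest new candidate URLs from the hyperlinks on a page it already knows about. The seed URL is just a placeholder; a real crawler starts from many known pages and keeps a queue of what it finds.

```python
# Minimal link-discovery sketch: fetch one page, collect the URLs it links to.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag encountered on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover_links(page_url):
    """Fetch one page and return the absolute URLs it links to."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.links]


if __name__ == "__main__":
    # Placeholder seed page; each printed URL is a candidate for the bot to visit next.
    for url in discover_links("https://example.com/"):
        print(url)
```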

Crawlers are designed to gather a lot more information as well. Search engines are keenly interested in the content of websites, so the text and images on your site also matter. This is why things like "alt" tags and image and video descriptions are essential to the SEO of any site: crawlers can't actually "see" images or videos. All of the information they gather is textual, whether it's a paragraph about the vampires in Twilight: Eclipse or the size of the picture of Jacob available for download.
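The sketch below shows the purely textual view a crawler gets of a page: the visible text plus attributes such as "alt". The sample HTML fragment is invented for illustration; an image with an empty alt attribute contributes nothing the bot can index.

```python
# Sketch of a crawler's textual view of a page: visible text plus alt text.
from html.parser import HTMLParser


class TextAndAltExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.alt_texts = []

    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # An empty alt string means the image is effectively invisible to the bot.
            self.alt_texts.append(dict(attrs).get("alt", ""))


sample = '<p>Team Jacob wallpapers</p><img src="jacob.jpg" alt="Jacob poster, 1024x768">'
parser = TextAndAltExtractor()
parser.feed(sample)
print(parser.text_chunks)  # ['Team Jacob wallpapers']
print(parser.alt_texts)    # ['Jacob poster, 1024x768']
```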

Search engines use this information to compile statistics on keywords as well as site popularity, which is how Google, for example, is able to market its AdSense program so successfully. If you use AdSense on a site that is popular (found by many crawlers) and has a high percentage of keywords in the content, the odds are that advertisers will have better luck selling to your audience. Without bots to "spider" the web, it would be a hit-or-miss proposition.
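As a toy sketch of what "statistics on keywords" means at its simplest, the snippet below counts how often candidate keywords appear in text a crawler has gathered. The sample text and keyword list are invented; real engines score relevance far more elaborately.

```python
# Toy keyword statistics over text a crawler has gathered from a page.
from collections import Counter
import re

page_text = "Twilight Eclipse vampires, Eclipse wallpapers, vampires and werewolves"
words = re.findall(r"[a-z]+", page_text.lower())
counts = Counter(words)

for keyword in ("eclipse", "vampires", "werewolves"):
    density = counts[keyword] / len(words)  # share of all words on the page
    print(keyword, counts[keyword], round(density, 2))
```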

Not Just Web Stats

While the vast majority of these indexing "bot" scripts are used by search engines (such as Yahoo's SLURP, Microsoft's MSNBOT, and the eponymous WebCrawler, which was used to build the first full-text index of the web), programmers can gather more than keywords and links with them. Some of these scripts are used to archive the web, or to keep track of which parts of a site have changed. Linguists can use them to see what kind of language people are using on blogs, forums, or Twitter, for example. In fact, anyone can configure their own web crawler using open-source applications such as Aspseek. You could use one to check your own site for broken hyperlinks, or to make sure all the images have proper alt tags, as in the sketch below.
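Here is a minimal sketch of that kind of do-it-yourself audit, written against the Python standard library rather than any particular crawler package: it fetches one page of your site, reports images with no alt text, and flags links that return an error. The site URL is a placeholder for your own.

```python
# Sketch of a single-page site audit: missing alt text and broken links.
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen


class SiteAuditor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self.images_missing_alt = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag == "img" and not attrs.get("alt"):
            self.images_missing_alt.append(attrs.get("src", "(no src)"))


def audit(page_url):
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    auditor = SiteAuditor()
    auditor.feed(html)
    for src in auditor.images_missing_alt:
        print("missing alt text:", src)
    for href in auditor.links:
        target = urljoin(page_url, href)
        if not target.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, and similar links
        try:
            urlopen(target)
        except (HTTPError, URLError) as err:
            print("broken link:", target, "->", err)


if __name__ == "__main__":
    audit("https://example.com/")  # placeholder; point this at your own site
```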

Crawling Problems

Unfortunately, criminal minds have also configured crawlers to gather less legitimate information, such as Social Security numbers, bank account numbers, and other data useful for "phishing". And since crawlers have to ask the server to deliver information, they can be configured to be "impolite", requesting pages at a rate that cripples the server and causes it to crash or expose a weakness. This occasionally happens even with legitimate bots that catalogue a website, so conventions such as the robots.txt exclusion file have been put in place to keep every web crawler "polite" when asking for server information.
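A polite crawl loop, sketched below under the same standard-library assumptions as the earlier examples, does two things: it consults the site's robots.txt file before fetching a page, and it pauses between requests so the server is never flooded. The URLs, the bot name, and the two-second delay are placeholders; well-behaved crawlers honor whatever limits the site itself specifies.

```python
# Sketch of a "polite" crawl loop: respect robots.txt and pause between requests.
import time
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

CRAWL_DELAY_SECONDS = 2  # assumed default; real bots honor the site's own limits


def polite_fetch(urls, user_agent="ExampleBot"):
    robots_cache = {}
    for url in urls:
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        if root not in robots_cache:
            rp = RobotFileParser(root + "/robots.txt")
            rp.read()
            robots_cache[root] = rp
        if not robots_cache[root].can_fetch(user_agent, url):
            print("disallowed by robots.txt:", url)
            continue
        print("fetching:", url)
        urlopen(url)
        time.sleep(CRAWL_DELAY_SECONDS)  # the pause that keeps the bot from overwhelming the server


if __name__ == "__main__":
    polite_fetch(["https://example.com/", "https://example.com/about"])
```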

Another problem with these kinds of indexing bots lies simply in the fact that the web is huge and constantly changing. The odds are that by the time a bot has finished looking at the last page of a site, the first page will already have changed. While crawlers are essential to the web, they are an inefficient method and can effectively cover only a fraction of the full Internet. At some point it is likely that future web users will look at web crawlers the way people now look at card catalogues in the library: as a quaint artifact.
