Friday, June 10, 2011

So How Do Search Engines Work, Anyway?

Everyone knows that when you’re going to look for something on the web, you first visit a search engine website, such as Google or Bing, and type in what you’re looking for. Ta-dah! You instantly have results, and likely the information you were looking for. It all seems rather magical, but search engines work on real-world principles.

A search engine is basically the index you use to find information on the World Wide Web. The web itself is far too large, and individual websites too narrow to navigate without using tools to filter and organize information. There are two main types of search engines: crawler-based and human-based.

Crawler-based engines, such as Google, create their listings automatically. If you make any changes to your website, a crawler-based engine will eventually find these changes, and they will affect how you are listed. Elements such as page titles and body text also affect your listings.

A crawler-based engine works by sending a computer program called a spider (also called a crawler) to visit a website. The spider reads each page and follows the links within the website. It periodically returns to the website, usually every two weeks to a month, to look for any changes. Everything that the spider finds on the website goes into the search engine’s index. The index, or catalog, contains a copy of every webpage that the spider visits, and these are massive databases. Google’s index contains around 38 billion pages, as estimated on June 3rd, 2011 by World Wide Web Size (the actual number is constantly changing). If a webpage changes, the new version is entered into the index. There can be a delay between when a webpage is “spidered” and when it is “indexed,” and until a webpage is indexed, it cannot be found by the search engine. This is why new businesses are advised to build and list their websites before opening their doors; it gives search engines time to get the site indexed and listed.
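The crawl-and-index loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real spider: the “web” here is a hard-coded dictionary of made-up pages and links standing in for actual HTTP requests and HTML parsing, and the URLs are invented for the example.

```python
from collections import deque

# A tiny simulated "web": each URL maps to its text and outgoing links.
# In a real spider, these would come from fetching and parsing live pages.
PAGES = {
    "example.com": {"text": "welcome to our small business",
                    "links": ["example.com/about"]},
    "example.com/about": {"text": "about our business and services",
                          "links": ["example.com"]},
}

def crawl(start_url):
    """Visit a page, follow its links, and index every word found."""
    index = {}                  # word -> set of URLs containing it
    seen = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        for word in PAGES[url]["text"].split():
            index.setdefault(word, set()).add(url)   # record page in the index
        queue.extend(PAGES[url]["links"])            # follow links on the page
    return index

index = crawl("example.com")
print(sorted(index["business"]))   # → ['example.com', 'example.com/about']
```

Starting from one page, the spider discovers the second page through a link, and every word it sees ends up in the index pointing back to the pages that contain it.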

The last part of a crawler-based search engine is its software. The program sifts through millions and millions of webpages to find matches to your query, and then lists them from most relevant to least. This is the familiar results page that you see when using a search engine.
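A toy version of that ranking step might look like the sketch below, assuming an index that records how many times each word appears on each page. The page names and counts are invented for illustration; real engines weigh hundreds of signals, not just word counts.

```python
# word -> {page: number of times the word appears on that page}
INDEX = {
    "coffee": {"brew.example": 5, "shop.example": 2},
    "shop":   {"shop.example": 4},
}

def search(query):
    """Score each matching page, then list pages from most to least relevant."""
    scores = {}
    for word in query.lower().split():
        for url, count in INDEX.get(word, {}).items():
            scores[url] = scores.get(url, 0) + count   # sum word counts per page
    # Sort by score, highest first -- the familiar results page.
    return sorted(scores, key=scores.get, reverse=True)

print(search("coffee shop"))   # → ['shop.example', 'brew.example']
```

Note that the ordering depends entirely on how the scores are computed; change the scoring rule and the same query returns a different ranking, which is exactly why different engines return different results.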

All search engines have these components, but each one applies a different value to certain areas through its computer algorithms. That is why you can search for the same thing in one engine, and get an entirely different result from another.

When a spider visits a webpage, it records information such as how many times certain words are repeated and where those words are located, such as in the header or body of the page, and it weighs how relevant the page is to the subject, based on the surrounding words. This last part is the hardest job for a search engine, and it is what makes an engine effective, and thus popular. HowStuffWorks has a great article comparing some of the largest search engines’ spiders. The Google spider, Googlebot, was built to index every significant word on a page, leaving out the articles “a,” “an” and “the.” The Lycos spider is said to use the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. AltaVista’s spider indexes every single word on a page, including “a,” “an,” “the” and other insignificant words.
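Here is a sketch of how a spider in the Googlebot style might record words, skipping the articles “a,” “an” and “the” and weighing where each word appears. The specific weights (title words counting triple) are invented for this example; real engines keep their weighting secret.

```python
STOP_WORDS = {"a", "an", "the"}   # articles the spider leaves out

def index_page(title, body):
    """Record each significant word with a weight based on where it appears."""
    weights = {}
    for word in title.lower().split():
        if word not in STOP_WORDS:
            weights[word] = weights.get(word, 0) + 3   # title words count more
    for word in body.lower().split():
        if word not in STOP_WORDS:
            weights[word] = weights.get(word, 0) + 1   # body words count once
    return weights

w = index_page("The Best Coffee", "We roast the best coffee in town")
print(w["coffee"])   # title hit (3) + body hit (1) = 4
print("the" in w)    # → False: articles are left out entirely
```

A word that appears in the page title ends up weighted more heavily than one buried in the body, so pages whose titles match your query float toward the top of the results.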
The actual methods that these algorithms use are trade secrets. WiseGEEK has this to say: “The algorithms that various search engines use are well protected, to prevent people from specifically creating pages to get better ranks, or at least to limit the degree to which they can do that. This difference is why different search engines yield different results for the same terms. Google might determine that one page is the best result for a search term, and Ask might determine that same page is not even in the top 50. This is all just based on how they value inbound and outbound links, the density of the keywords they find important, how they value different placement of words, and any number of smaller factors.”

A human-based directory, such as Mahalo, uses editors to write descriptions of websites, or depends on users to submit their own listings. A search returns only listings whose descriptions match your query. If you change anything on your website, it will not affect your listing at all. The benefit of human-based directories is that listings are evaluated by actual people, so they tend to list websites of higher value and relevance to search users. However, listings tend to go ‘stale’ due to a lack of updates, and result sets are much smaller than with crawler-based engines.

Many modern search engines use a hybrid of these two types, according to Search Engine Watch, such as MSN Search blending results from human-powered LookSmart and crawler-based Inktomi for more obscure requests.

The cutting edge in search engine technology is concept-based algorithms, as opposed to keyword-based ones. This means that the search engine will suggest webpages that are topically related to your search term, as well as direct matches to your query, by using statistical analysis (think Netflix’s widely touted recommendations algorithm). Research is also being done on natural-language queries, that is, typing in a question exactly the way you would ask it of another person. The most popular natural-language search engine is Ask Jeeves, but it is presently restricted to only simple questions.
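A very simplified stand-in for that kind of statistical matching is to represent each page as a word-count vector and compare it to the query with cosine similarity, so pages sharing more of the query’s vocabulary score as topically closer. Real concept-based engines go much further (for example, latent semantic analysis), and the page names and text below are invented for illustration.

```python
import math

def vectorize(text):
    """Turn text into a word -> count vector."""
    v = {}
    for word in text.lower().split():
        v[word] = v.get(word, 0) + 1
    return v

def cosine(a, b):
    """Cosine similarity between two word-count vectors (1.0 = identical topic mix)."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

pages = {
    "cafe.example":   "espresso latte coffee beans roast",
    "garage.example": "engine oil brake repair",
}
query = vectorize("coffee roast espresso")
best = max(pages, key=lambda url: cosine(query, vectorize(pages[url])))
print(best)   # → cafe.example
```

The café page shares three of the query’s words and scores well above the garage page, which shares none, so it comes back as the best match.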

Search Engine Optimization (SEO) works to maximize the placement of your website for different search engine results, so you will get more ‘hits’ from visitors using that search engine. We’ll talk more about this in later articles, but this is a tool that most businesses will want to use to promote their business online.

Search engines are marvelous everyday tools of computer technology, and a key to making your business’ website successful.



Janet Houck is the Senior Editor for the Small Business Online Toolkit. She has been a freelance editor and published author for almost a decade. Read more about her on her SBwebtoolkit Bio page.
