Search Engine Technology: A Primer

Robots
Keyword Searching
Problems With Keyword Searching
Concept-Based Searching
Refining Your Search
Phrases and Capitals
Frequency and Relevance
Meta Tags

Robots
Search engines* like Google, AltaVista, and HotBot use specialized software "robots" to survey the Web and build their databases. These robots run continuously, retrieving web documents (pages) and indexing them. When you enter a keyword query at a search engine, your input is checked against the engine's keyword indices. The best matches are then returned to you as web hits, that is, as clickable listings.
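
As a rough illustration of the retrieve, index and look-up cycle described above, here is a minimal Python sketch of a keyword index. The page URLs and text are hypothetical stand-ins for documents a robot has retrieved, and the functions build_index and lookup are names invented for this example; real engines are far more elaborate.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical pages standing in for documents a robot has retrieved.
pages = {
    "example.com/a": "Italian gourmet coffee beans",
    "example.com/b": "gourmet tea and coffee",
    "example.com/c": "garden tools",
}
index = build_index(pages)
print(sorted(lookup(index, "gourmet coffee")))
```

The query is answered from the index alone; the pages themselves are not rescanned at search time.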

There are two primary methods of text searching: keyword and concept.

Keyword Searching

This is the most common form of text search on the Web. Most search engines use keywords for text query and retrieval. A keyword is sometimes a single word, e.g., "coffee," but more often it's a phrase, e.g., "Italian gourmet coffee." Today, surfers tend to use phrases because a single word will bring back too many hits -- many of which may be irrelevant.

Unless the author of a web document identifies the keywords in the page's HTML code (for example, in meta tags), it's up to the search engine to decide what the important words are. Essentially this means that search engines pull out and index words that they believe are significant. Words found near the top of a document and words repeated throughout the page tend to be those that the engines see as significant.

But this can get very complex. Some engines index every word on every page. Others index only part of the document. For example, Lycos indexes the title, headings, subheadings and the hyperlinks to other sites, along with the first 20 lines of text.

Full-text indexing systems generally pick up every word in the text except commonly occurring stop words such as "a," "an," "the," "is," "and," "or," and "www." AltaVista claims to index all words, even the articles, "a," "an," and "the." Some of the search engines discriminate upper case from lower case. Others store all words without reference to capitalization.
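
To make the stop-word and capitalization choices concrete, here is a small Python sketch. The stop-word list is copied from the examples above; the function name index_terms and its two switches are assumptions made for illustration and do not describe any particular engine.

```python
import re

# Stop words taken from the examples in the text; real lists are longer.
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "www"}

def index_terms(text, drop_stop_words=True, fold_case=True):
    """Split text into the words a full-text indexer might store."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    terms = []
    for word in words:
        if drop_stop_words and word.lower() in STOP_WORDS:
            continue
        terms.append(word.lower() if fold_case else word)
    return terms

# An engine that drops stop words and ignores capitalization.
print(index_terms("The Lotus is a flower"))
# An engine that keeps every word and preserves case.
print(index_terms("The Lotus is a flower", drop_stop_words=False, fold_case=False))
```

With both switches on, "Lotus" and "lotus" become the same index entry; with both off, the articles are kept and the two spellings stay distinct.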

Problems With Keyword Searching

Engines often have a hard time distinguishing between words that are spelled the same way, but have different meanings. For instance, consider the difficulties in interpreting such words as: meet, hard, stake, high, break, or string. Such words can have a variety of meanings depending on context, and can produce web hits that are completely irrelevant to a query.

Some search engines also have trouble with "word stemming." For instance, if you enter the word "dog," do they return a hit on the word "dogged"? What about singular and plural words? What about verb forms that differ from the word you entered by only an "s" or an "ed"?
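
The stemming question can be illustrated with a deliberately crude Python sketch. The rule below (strip a trailing "ed" or "s" if at least three letters remain) is an assumption chosen for this example; it is not how any named engine works, and production stemmers use much richer rules.

```python
def crude_stem(word):
    """Strip a trailing "ed" or "s" so simple variants share one index key."""
    word = word.lower()
    for suffix in ("ed", "s"):
        # Keep at least three letters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("dog", "dogs", "dogged", "walked", "walks"):
    print(w, "->", crude_stem(w))
```

Under this rule "dogs" reduces to "dog" and both "walked" and "walks" reduce to "walk", but "dogged" becomes "dogg" and so still fails to match "dog". That gap is the kind of difficulty the paragraph above describes.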

Search engines also tend not to return hits on synonyms, that is, words that mean the same as your keywords but aren't entered in your query. Your search for "light" would not return a document that used the word "lamp," even though the latter is what you were looking for.

Concept-Based Searching

Unlike keyword search systems, concept-based search systems try to determine what you mean, not just what you say. In the best cases, a concept-based search returns hits on documents that are "about" the subject or theme you're exploring, even if the words in the document don't precisely match the words you enter into the query.

Excite is currently the best-known general-purpose search engine site on the Web that relies on concept-based searching. This type of search is also known as clustering -- which means that words are examined in relation to other words found nearby. When several words or phrases connected to a particular concept appear close to each other, the search engine concludes that the piece is "about" a certain subject.

But, again, context is critical. For example, the word "glove," when used in a sports context, would be likely to appear with such words as catcher, pitcher, base, baseball, umpire, outfield, etc. If the same word appears in a document with words such as sleeve, mitten, coat, hat, etc., the search engine would return hits on the subject of clothing or fashion.
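
The "glove" example can be sketched in Python as a simple overlap count. The two concept vocabularies below are built from the word lists in the paragraph above; the names CONCEPTS and guess_concept are hypothetical, and counting shared words is only one simple stand-in for whatever a real concept-based engine does.

```python
# Hypothetical concept vocabularies built from the word lists in the text.
CONCEPTS = {
    "baseball": {"catcher", "pitcher", "base", "baseball", "umpire", "outfield"},
    "clothing": {"sleeve", "mitten", "coat", "hat"},
}

def guess_concept(text):
    """Return the concept whose vocabulary overlaps the text most, or None."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    scores = {name: len(words & vocab) for name, vocab in CONCEPTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_concept("The catcher dropped his glove near the umpire."))
print(guess_concept("A wool glove, a mitten and a warm coat."))
```

The word "glove" itself is in neither vocabulary; the surrounding words alone decide which subject the document is assigned to.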

Concept search often works better in theory than in practice: it's a good idea, but one that's far from perfect. The results are best when you enter a lot of words, all of which roughly refer to the concept you're pursuing.

Refining Your Search

Most sites offer two different types of searches--"basic" and "refined." In a basic search, you enter a keyword without sifting through pull-down menus of options. Depending on the engine, these "basic" searches can be quite complex.

Refined search options differ from engine to engine, but some of the possibilities include the ability to search on more than one word, to give more weight to one search term than to another, and to exclude words that could create confusion. You might also be able to search on proper names, on phrases, and on words that are found within a certain proximity to other search terms.

Many, but not all, search engines allow you to use so-called Boolean operators to refine your search. These are the logical terms AND, OR, NOT, and the so-called proximal locators, NEAR and FOLLOWED BY.

The Boolean AND means that all the terms you specify must appear in the documents, e.g., "orange" AND "juice." You would use this if you wanted to exclude hits related to the color orange or to juice in general, and concentrate (pun intended) on the breakfast drink.

The Boolean OR means that at least one of the terms you specify must appear in the documents, e.g., Poe, poems OR biography.

The Boolean NOT means that the term you specify must not appear in the found pages. You might use this approach if you anticipated results that would be off-target, e.g., program AND tv, NOT computer.
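
The three Boolean operators map naturally onto set operations. The Python sketch below assumes a hypothetical index in which each term points to the set of document ids that contain it; the ids and the helper name docs are made up for this example.

```python
# Hypothetical index: each term maps to the set of document ids containing it.
index = {
    "orange": {1, 2, 3},
    "juice": {2, 3, 4},
    "program": {5, 6},
    "tv": {5, 6, 7},
    "computer": {6},
}

def docs(term):
    """Documents containing the term (an empty set if it is not indexed)."""
    return index.get(term, set())

both = docs("orange") & docs("juice")                        # orange AND juice
either = docs("orange") | docs("juice")                      # orange OR juice
tv_only = (docs("program") & docs("tv")) - docs("computer")  # program AND tv, NOT computer

print(both, either, tv_only)
```

AND is set intersection, OR is set union, and NOT is set difference, so NOT can only narrow a result that other terms have already produced.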

NEAR means that the terms you enter should be within a certain number of words of each other. FOLLOWED BY means that one term must directly follow the other. ADJ, for adjacent, serves the same function. A search engine that will allow you to search on phrases uses, essentially, the same method (i.e., determining adjacency of keywords).
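
Proximity operators need word positions, not just word presence. Here is a minimal Python sketch; the function names and the default distance of five words are assumptions made for illustration, since engines differ on how close NEAR has to be.

```python
def positions(words, term):
    """Positions at which a term occurs in a list of lowercased words."""
    return [i for i, w in enumerate(words) if w == term]

def near(text, a, b, distance=5):
    """True if a and b occur within `distance` words of each other, in either order."""
    words = text.lower().split()
    return any(abs(i - j) <= distance
               for i in positions(words, a) for j in positions(words, b))

def followed_by(text, a, b):
    """True if b occurs immediately after a."""
    words = text.lower().split()
    return any(j - i == 1
               for i in positions(words, a) for j in positions(words, b))

# Hypothetical sample sentence.
text = "kids in bristol do the bristol stomp every night"
print(followed_by(text, "bristol", "stomp"))
print(followed_by(text, "stomp", "bristol"))
print(near(text, "kids", "stomp"))
```

A phrase search can be built from the same adjacency test by requiring each word of the phrase to be followed directly by the next.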

Phrases and Capitals

Given the total number of possible hits, the ability to query on phrases is critical. Traditionally, this meant enclosing the phrase in quotation marks, e.g., "The Bristol Stomp." Today, however, many engines default to the phrase; that is, they assume you meant to use quotation marks.

Capitalization is essential for searching on proper names of people, companies or products. Unfortunately, many words in English are used both as proper and common nouns--Bill, bill, Chuck, chuck, Gates, gates, Times Square, times, square, Lotus, lotus -- and many more.

Each of the search engines has a different method for refining queries. The best way to learn these methods is to read the sites' help files and practice.

Most of the search engines return results with confidence or relevancy rankings. In other words, they list the hits according to how closely they think the results match the query. However, these lists often leave users shaking their heads in confusion, since, to the user, the results often seem completely irrelevant.

Why so many irrelevant results? Basically it's because search engine technology has not yet reached the point where humans and computers understand each other well enough to communicate clearly.

Frequency and Relevance

Most search engines use search term frequency as a primary way of determining whether a document is relevant. If you're researching catarrh and the word "catarrh" appears multiple times in a Web document, it's reasonable to assume that the document will contain useful information. Therefore, a document that repeats the word "catarrh" over and over is likely to turn up near the top of your list.

If your keyword is a common one, or if it has multiple other meanings, you could end up with a lot of irrelevant hits. And if your keyword is a subject about which you desire information, you don't need to see it repeated over and over--it's the information about that word that you're interested in, not the word itself.

Some search engines consider both the frequency and the positioning of keywords to determine relevancy, reasoning that if the keywords appear early in the document, or in the headers, this increases the likelihood that the document is on target. For example, Lycos ranks hits according to how many times your keywords appear in their indices of the document and in which fields they appear (i.e., in headers, titles or text). It also takes into consideration whether the documents that emerge as hits are frequently linked to by other documents on the Web, reasoning that if other folks consider them important, you should, too.
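
A ranking that combines frequency with position can be sketched in Python as a weighted count. The field weights below are invented for illustration and do not reproduce the formula of Lycos or any other engine; the sample documents are hypothetical as well.

```python
# Hypothetical field weights: a hit in the title counts more than one in body text.
FIELD_WEIGHTS = {"title": 3.0, "headers": 2.0, "text": 1.0}

def score(document, keyword):
    """Sum keyword occurrences across fields, weighted by where they appear."""
    keyword = keyword.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        total += weight * words.count(keyword)
    return total

docs = [
    {"url": "example.com/a", "title": "Catarrh remedies", "headers": "",
     "text": "catarrh is common in winter"},
    {"url": "example.com/b", "title": "Winter health", "headers": "",
     "text": "colds and catarrh and catarrh again"},
]
ranked = sorted(docs, key=lambda d: score(d, "catarrh"), reverse=True)
print([(d["url"], score(d, "catarrh")) for d in ranked])
```

Both sample pages contain the keyword twice, yet the first ranks higher because one of its occurrences is in the title.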

As far as the user is concerned, relevancy ranking is critical, and becomes more so as the sheer volume of information on the Web grows. Most of us don't have the time to sift through scores of hits to determine which hyperlinks we should actually explore. The more clearly relevant the results are, the more we're likely to value the search engine.

Meta Tags

Some search engines still index web documents by the meta tags in the documents' HTML (code). These tags appear at the beginning of the document, inside the "head" element.

These "tags" give the website author influence over which keywords are used to index the document. This is obviously very important if you are trying to draw people to your website based on your rank in the engines.
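
To show what an engine that honors meta tags might read, here is a Python sketch that pulls the "keywords" meta tag out of a page using the standard library's html.parser module. The sample page, its keyword values and the class name MetaKeywordParser are hypothetical; how a given engine weighs these terms, if at all, is not shown.

```python
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated terms of any <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]

# Hypothetical page head; the keyword values are illustrative only.
page = """<html><head>
<title>Italian Gourmet Coffee</title>
<meta name="keywords" content="coffee, espresso, Italian gourmet coffee">
<meta name="description" content="Hand-roasted coffee beans.">
</head><body></body></html>"""

parser = MetaKeywordParser()
parser.feed(page)
print(parser.keywords)
```

The parser reads only what the author declared; it makes no check that the keywords match the visible content of the page.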

There is much confusion about meta-tagging in general. We know that different search engines look at meta-tags in different ways. Some rely heavily on meta-tags; others don't use them at all. The general opinion in 2002 is that meta-tags are less useful than they were a few years ago. This is because of the high rate of "spamdexing," in which web authors use false and misleading keywords in the meta tags.

The "keyword" meta tag has been abused by some webmasters. For example, a recent tactic has been to put such words as "sex" or "mp3" into keyword meta tags, in hopes of luring searchers to one's website by using such popular keywords. But the engines are aware of such deceptive tactics, and have devised various methods to counter them. SearchCoach does NOT use tactics that are likely to create problems of this kind -- (see "What").

There is no perfect way to ensure that you'll receive a high ranking. Even if you do get a good ranking, there's no assurance that you'll keep it for long without professional assistance. You can achieve some certainty by using a pay-per-click service such as Overture, but, obviously, at a cost.

The best general advice for do-it-yourself search engine optimizers: study, study, study, and practice, and measure your results. Then try again until you get it right.

*Note that Yahoo, Looksmart, and others are "directories," not search engines. This means that real live human beings create the listings.

Search Coach is a wholly-owned subsidiary of
Novation, Inc.
190 Mt. Vernon Ave.
Rochester, NY 14620
info@searchcoach.com

© Novation, 2002