Robots
Search engines like Google, AltaVista, and
HotBot use specialized software "robots"
to survey the Web and build their databases. These robots
run around continuously, retrieving web documents (pages),
and indexing them. When you enter a keyword query at
a search engine, your input is checked against the engine's
keyword indices. The best matches are then returned
to you as web hits - clickable listings.
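To make the idea of a keyword index concrete, here is a minimal sketch in Python. It is not how any particular engine works; the function names (build_index, lookup) and the three sample pages are invented for illustration. The index maps each word to the set of pages that contain it, and a query is answered by looking words up in that index rather than by rereading the pages.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Trim surrounding punctuation so "coffee," and "coffee" match.
            index[word.strip('.,;:!?"')].add(doc_id)
    return index

def lookup(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical sample pages, standing in for documents a robot retrieved.
docs = {
    "page1": "Italian gourmet coffee, roasted daily.",
    "page2": "Coffee shops in Seattle.",
    "page3": "Italian leather shoes.",
}
keyword_index = build_index(docs)
print(lookup(keyword_index, "italian coffee"))  # {'page1'}
```

Real engines store far more than this (word positions, fields, rankings), but the basic lookup step is the same in spirit.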
There are two primary methods of text
searching: keyword and concept.
Keyword searching is the most common form of text
search on the Web. Most search engines use keywords
to do their text query and retrieval. A keyword is sometimes
a single word, e.g. "coffee," but more often
it's a phrase, "Italian gourmet coffee." Today,
surfers tend to use phrases because a single word will
bring back too many hits -- many of which may be irrelevant.
Unless the author of a web document
identifies the keywords in the background code (and
other places), it's up to the search engine to decide
what the important words are. Essentially this means
that search engines pull out and index words that they
believe are significant. Words found near the top of
a document and words repeated throughout the page tend
to be those that the engines see as significant.
But this can get very complex. Some
sites index every word on every page. Others index only
part of the document. For example, Lycos
indexes the title, headings, subheadings and the hyperlinks
to other sites, along with the first 20 lines of text.
Full-text indexing systems generally
pick up every word in the text except commonly occurring
stop words such as "a," "an," "the,"
"is," "and," "or," and
"www." AltaVista claims to
index all words, even the articles, "a," "an,"
and "the." Some of the search engines discriminate
upper case from lower case. Others store all words without
reference to capitalization.
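The two choices described above, whether to drop stop words and whether to keep capitalization, can be sketched as options on a single function. This is only an illustration: the stop list is the one quoted in the text, and the function name index_terms and its parameters are made up for this example.

```python
import re

# Stop words taken from the examples in the text above.
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "www"}

def index_terms(text, drop_stop_words=True, case_sensitive=False):
    """Return the words an engine might index, under the chosen policy."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not case_sensitive:
        words = [w.lower() for w in words]
    if drop_stop_words:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    return words

sample = "The Lotus is a flower and Lotus is a product"

# Stop words dropped, case ignored.
print(index_terms(sample))
# ['lotus', 'flower', 'lotus', 'product']

# Every word kept, capitalization preserved.
print(index_terms(sample, drop_stop_words=False, case_sensitive=True))
```

An engine that keeps capitalization can tell "Lotus" from "lotus"; one that folds case cannot.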
Engines often have a hard time distinguishing
between words that are spelled the same way, but have
different meanings. For instance, consider the difficulties
in interpreting such words as: meet, hard, stake,
high, break, or string. Such words can
have a variety of meanings depending on context, and
can produce web hits that are completely irrelevant
to a query.
Some search engines also have trouble
with "word stemming." For instance, if you
enter the word "dog," do they return a hit
on the word, "dogged?" What about singular
and plural words? What about verb tenses that differ
from the word you entered by only an "s,"
or an "ed"?
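A deliberately crude stemmer shows why this is hard. The sketch below strips a few common English suffixes; it is an assumption-laden toy (the name naive_stem and its rules are invented here), not the algorithm any real engine uses.

```python
def naive_stem(word):
    """Strip one common suffix. Crude on purpose: it makes mistakes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, as in "dogged" -> "dogg" -> "dog".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ("dog", "dogs", "dogged", "walked"):
    print(w, "->", naive_stem(w))
```

With these rules, "dogs" and "dogged" both reduce to "dog". That is helpful for the plural but arguably wrong for "dogged," which is exactly the kind of judgment call the questions above point to.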
Search engines also tend not to return
hits on keywords that mean the same, but aren't entered
in your query. Your search for "light" would
not return a document that used the word "lamp,"
even though the latter is what you were looking for.
Unlike keyword search systems, concept-based
search systems try to determine what you mean, not just
what you say. In the best cases, a concept-based search
returns hits on documents that are "about"
the subject or theme you're exploring, even if the words
in the document don't precisely match the words you
enter into the query.
Excite is currently
the best-known general-purpose search engine site on
the Web that relies on concept-based searching. This
type of search is also known as clustering -- which
means that words are examined in relation to other words
found nearby. When several words or phrases connected
to a particular concept appear close to each other,
the search engine concludes that the piece is "about"
a certain subject.
But, again, context is critical. For
example, the word "glove," when used in a sports
context, would be likely to appear with such words as
catcher, pitcher, base, baseball, umpire, and outfield.
If the same word appears in a document with words
such as sleeve, mitten, coat, and hat, the
search engine would return hits on the subject of clothing
or fashion.
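The "glove" example can be sketched as a simple count of concept words found near the target word. The word lists come from the example above; the window size, the names CONCEPT_WORDS and guess_concept, and the scoring rule are assumptions made for illustration, not a description of Excite's actual method.

```python
# Word lists taken from the "glove" example in the text.
CONCEPT_WORDS = {
    "baseball": {"catcher", "pitcher", "base", "baseball", "umpire", "outfield"},
    "clothing": {"sleeve", "mitten", "coat", "hat"},
}

def guess_concept(text, target="glove", window=10):
    """Guess which concept the target word belongs to from nearby words."""
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    scores = {concept: 0 for concept in CONCEPT_WORDS}
    for i, word in enumerate(words):
        if word != target:
            continue
        # Look at the words within `window` positions of the target.
        nearby = words[max(0, i - window): i + window + 1]
        for concept, vocabulary in CONCEPT_WORDS.items():
            scores[concept] += sum(1 for w in nearby if w in vocabulary)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_concept("The catcher dropped his glove near the umpire."))  # baseball
print(guess_concept("A wool glove, a mitten and a coat."))              # clothing
```

When no concept words appear nearby, the function returns None, which mirrors the situation where the engine has no context to go on.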
Concept search often works better in
theory than in practice: a good idea, but one that's
far from perfect. The results are best when you enter
a lot of words, all of which roughly refer to the concept
you're pursuing.
Most sites offer two different types
of searches--"basic" and "refined."
In a basic search, you enter a keyword without sifting
through pull-down menus of options. Depending on the
engine, these "basic" searches can be quite
complex.
Refined search options differ from engine
to engine, but some of the possibilities include the
ability to search on more than one word, to give more
weight to one search term than to another, and to exclude
words that could create confusion. You might also be
able to search on proper names, on phrases, and on words
that are found within a certain proximity to other search
terms.
Many, but not all, search engines allow
you to use so-called Boolean operators to refine
your search. These are the logical terms AND, OR, NOT,
and the so-called proximal locators, NEAR and FOLLOWED
BY.
The Boolean AND means that all the terms
you specify must appear in the documents, e.g., "orange"
AND "juice." You would use this if you wanted
to exclude hits related to the color orange or to juice
in general, and concentrate (pun intended) on the breakfast
drink.
The Boolean OR means that at least one
of the terms you specify must appear in the documents,
e.g., Poe, poems OR biography.
The Boolean NOT means that the term
you specify must not appear in the found pages. You
might use this approach if you anticipated results that
would be off-target, e.g., program AND tv, NOT computer.
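Underneath, the three Boolean operators behave like operations on sets of documents. Here is a small sketch using Python sets; the toy index and document ids (d1, d2, ...) are invented, and real engines differ in their query syntax.

```python
# Hypothetical index: each keyword maps to the documents that contain it.
toy_index = {
    "orange": {"d1", "d2", "d3"},
    "juice": {"d1", "d4"},
    "program": {"d5", "d6"},
    "tv": {"d5", "d6"},
    "computer": {"d6"},
}

def docs_for(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return toy_index.get(term, set())

# AND: documents that contain both terms (set intersection).
orange_and_juice = docs_for("orange") & docs_for("juice")

# OR: documents that contain at least one term (set union).
orange_or_juice = docs_for("orange") | docs_for("juice")

# NOT: remove documents that contain the unwanted term (set difference).
tv_program_not_computer = (docs_for("program") & docs_for("tv")) - docs_for("computer")

print(orange_and_juice)          # {'d1'}
print(tv_program_not_computer)   # {'d5'}
```

AND narrows a search, OR widens it, and NOT trims away a known source of off-target hits.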
NEAR means that the terms you enter
should be within a certain number of words of each other.
FOLLOWED BY means that one term must directly follow
the other. ADJ, for adjacent, serves the same function.
A search engine that will allow you to search on phrases
uses, essentially, the same method (i.e., determining
adjacency of keywords).
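Proximity operators depend on knowing where each word sits in a document, not just whether it appears. The following sketch checks word positions directly; the function names and the default distance of five words are assumptions for illustration, since each engine defines NEAR in its own way.

```python
def positions(words, term):
    """Indexes at which the term occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]

def near(text, a, b, distance=5):
    """True if a and b occur within `distance` words of each other."""
    words = text.lower().split()
    return any(abs(i - j) <= distance
               for i in positions(words, a) for j in positions(words, b))

def followed_by(text, a, b):
    """True if b occurs immediately after a (the ADJ case)."""
    words = text.lower().split()
    return any(j - i == 1
               for i in positions(words, a) for j in positions(words, b))

def phrase(text, quoted):
    """True if the quoted words occur consecutively, in order."""
    terms = quoted.lower().split()
    words = text.lower().split()
    if not terms:
        return False
    return any(words[i:i + len(terms)] == terms
               for i in range(len(words) - len(terms) + 1))

text = "they danced the bristol stomp all night"
print(followed_by(text, "bristol", "stomp"))   # True
print(phrase(text, "The Bristol Stomp"))       # True
print(near(text, "danced", "night", 3))        # False
```

A phrase search is just FOLLOWED BY applied to each pair of neighboring words in the phrase.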
Given the total number of possible
hits, the ability to query on phrases is critical. Traditionally,
this meant enclosing the phrase in quotation marks,
e.g., "The Bristol Stomp." Today, however,
many engines default to phrase searching; that is, they assume
you meant to use quotation marks.
Capitalization is essential for searching
on proper names of people, companies or products. Unfortunately,
many words in English are used both as proper and common
nouns--Bill, bill, Chuck, chuck, Gates, gates, Times
Square, times, square, Lotus, lotus -- and many more.
Each of the search engines has a different
method for refining queries. The best way to learn these
methods is to read the sites' help files and practice.
Most of the search engines return results
with confidence or relevancy rankings. In other words,
they list the hits according to how closely they think
the results match the query. However, these lists often
leave users shaking their heads in confusion, since,
to the user, the results often seem completely irrelevant.
Why so many irrelevant results? Basically
it's because search engine technology has not yet reached
the point where humans and computers understand each
other well enough to communicate clearly.
Most search engines use search term
frequency as a primary way of determining whether a
document is relevant. If you're researching catarrh
and the word "catarrh" appears multiple times
in a Web document, it's reasonable to assume that the
document will contain useful information. Therefore,
a document that repeats the word "catarrh"
over and over is likely to turn up near the top of your
list.
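Ranking by term frequency alone is easy to sketch, and the sketch also shows its weakness. The sample documents and the function name rank_by_frequency are invented for this example; real engines combine frequency with other signals.

```python
from collections import Counter

def rank_by_frequency(documents, keyword):
    """Rank documents by how many times the keyword appears in each."""
    keyword = keyword.lower()
    scored = []
    for doc_id, text in documents.items():
        count = Counter(text.lower().split())[keyword]
        if count:
            scored.append((count, doc_id))
    # Highest count first; ties broken alphabetically by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical documents.
pages = {
    "article": "catarrh remedies and catarrh symptoms explained",
    "note": "a note on catarrh",
    "garden": "gardening tips",
    "stuffed": "catarrh catarrh catarrh catarrh",
}
print(rank_by_frequency(pages, "catarrh"))
# ['stuffed', 'article', 'note']
```

Notice that the page that merely repeats the word outranks the one that actually explains it.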
If your keyword is a common one, or
if it has multiple other meanings, you could end up
with a lot of irrelevant hits. And if your keyword is
a subject about which you desire information, you don't
need to see it repeated over and over--it's the information
about that word that you're interested in, not the word
itself.
Some search engines consider both the
frequency and the positioning of keywords to determine
relevancy, reasoning that if the keywords appear early
in the document, or in the headers, this increases the
likelihood that the document is on target. For example,
Lycos ranks hits according to how many
times your keywords appear in their indices of the document
and in which fields they appear (i.e., in headers, titles
or text). It also takes into consideration whether the
documents that emerge as hits are frequently linked
to by other documents on the Web, reasoning that if other
folks consider them important, you should, too.
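A scoring scheme along these lines might weight a keyword by the field it appears in and then add a bonus for inbound links. The sketch below is a guess at the general shape only: the field names, the weights, the link bonus, and the function name score are all assumptions, not the formula Lycos or any other engine uses.

```python
# Assumed weights: a hit in the title counts more than one in the body text.
FIELD_WEIGHTS = {"title": 3.0, "headers": 2.0, "text": 1.0}
LINK_BONUS = 0.5  # assumed bonus per inbound link

def score(page, keyword):
    """Score a page by keyword frequency, field position, and inbound links."""
    keyword = keyword.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        occurrences = page.get(field, "").lower().split().count(keyword)
        total += weight * occurrences
    # Links only help a page that matched the keyword in the first place.
    if total > 0:
        total += LINK_BONUS * page.get("inbound_links", 0)
    return total

# Hypothetical pages.
guide = {"title": "Coffee guide", "headers": "",
         "text": "coffee beans and coffee roasts", "inbound_links": 4}
travel = {"title": "Travel", "text": "good coffee in Rome", "inbound_links": 0}

print(score(guide, "coffee"))   # 7.0
print(score(travel, "coffee"))  # 1.0
```

Sorting pages by this score, highest first, would produce the ranked list of hits.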
As far as the user is concerned, relevancy
ranking is critical, and becomes more so as the sheer
volume of information on the Web grows. Most of us don't
have the time to sift through scores of hits to determine
which hyperlinks we should actually explore. The more
clearly relevant the results are, the more we're likely
to value the search engine.
Some search engines still index web
documents by the meta tags in the documents' HTML (code).
These appear at the beginning of the document in the
"head" tag.
These "tags" give the website
author influence over which keywords are used to index
the document. This is obviously very important if you
are trying to draw people to your website based on your
rank in the engines.
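For readers who have not seen meta tags, here is a minimal sketch of how a robot might read them, using Python's standard html.parser module. The sample page and the class name MetaKeywordParser are invented for illustration; whether and how a given engine uses these tags varies.

```python
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collect the keywords and description meta tags from a page."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name == "keywords":
            # Keywords are conventionally a comma-separated list.
            self.keywords = [k.strip().lower()
                             for k in content.split(",") if k.strip()]
        elif name == "description":
            self.description = content

# Hypothetical page source.
page = """<html><head>
<title>Bean Counter</title>
<meta name="keywords" content="Italian coffee, espresso, gourmet">
<meta name="description" content="A guide to Italian gourmet coffee.">
</head><body>...</body></html>"""

parser = MetaKeywordParser()
parser.feed(page)
print(parser.keywords)     # ['italian coffee', 'espresso', 'gourmet']
print(parser.description)  # A guide to Italian gourmet coffee.
```

Because the author writes these tags by hand, they are easy to read and just as easy to abuse.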
There is much confusion about meta-tagging
in general. We know that different search engines look
at meta-tags in different ways. Some rely heavily on
meta-tags; others don't use them at all. The general
opinion in 2002 is that meta-tags are less useful than
they were a few years ago. This is because of the high
rate of "spamdexing," in which web authors use
false and misleading keywords in the meta tags.
The "keyword" meta tag has
been abused by some webmasters. For example, a recent
tactic has been to put such words as "sex" or
"mp3" into keyword meta tags, in hopes of
luring searchers to one's website by using such popular
keywords. But the engines are aware of such deceptive
tactics, and have devised various methods to circumvent
them. SearchCoach does NOT
use tactics that are likely to create problems of this
kind -- (see "What").
There is no perfect way to ensure that
you'll receive a high ranking. Even if you do get a
good ranking, there's no assurance that you'll keep
it for long without professional assistance. You can
achieve some certainty by using a pay-per-click service
such as Overture - but, obviously,
at a cost.
The best general advice for do-it-yourself
search engine optimizers: study, study, study, and practice,
and measure your results. Then try again until you get
it right.