Components of a Search Engine
All search engines do three things: collect documents from the Internet,
analyze the documents collected, and serve Web users' search requests. Each of those major
tasks is performed by a corresponding software component of the search engine.
- Crawler or Spider A bot that collects documents on the Web.
- Indexer Software that analyzes documents and generates searchable indexes for them.
- Query Server A system that responds to user queries and returns relevant documents.
Crawler Writing a simple crawler is not particularly challenging,
if not trivial. To crawl billions of pages effectively, however, a crawler needs to make two major
decisions:
- What Page to Crawl Each search engine uses different criteria to determine what pages
to crawl. Google will not include a page unless it is linked from at least one indexed page.
- Frequency of Updating Google updates pages with higher PageRank values more
frequently and updates the home page of a site on a daily basis.
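The two decisions above can be illustrated with a minimal sketch. It is not how any real search engine schedules its crawl: the function names, the importance score in [0, 1] (standing in for a PageRank-like value) and the 1-day and 30-day bounds are all assumptions made for illustration.

```python
import heapq

MIN_INTERVAL_DAYS = 1.0   # assumed: the most important pages are revisited daily
MAX_INTERVAL_DAYS = 30.0  # assumed: the least important pages are revisited monthly


def is_eligible(url, inbound_links, indexed):
    """A page qualifies only if at least one already-indexed page links to it."""
    return any(src in indexed for src in inbound_links.get(url, ()))


def recrawl_interval_days(importance, is_home_page=False):
    """Map an importance score in [0, 1] to a revisit interval in days."""
    if is_home_page:
        return MIN_INTERVAL_DAYS  # home pages are revisited daily
    importance = min(max(importance, 0.0), 1.0)
    return MAX_INTERVAL_DAYS - importance * (MAX_INTERVAL_DAYS - MIN_INTERVAL_DAYS)


def build_schedule(pages, today=0.0):
    """Return a heap of (next_crawl_day, url), ordered by due date."""
    heap = []
    for url, importance, is_home in pages:
        due = today + recrawl_interval_days(importance, is_home)
        heapq.heappush(heap, (due, url))
    return heap


# Hypothetical pages: (url, importance, is_home_page)
pages = [
    ("http://example.com/", 0.2, True),
    ("http://example.com/popular", 0.9, False),
    ("http://example.com/obscure", 0.1, False),
]
schedule = build_schedule(pages)
order = [heapq.heappop(schedule)[1] for _ in range(len(pages))]
```

A priority queue keyed on the due date keeps the next page to revisit at the front, so more important pages come up for recrawling more often.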
Compared with other parts of a search engine, the crawler is the least sophisticated piece. What surprised me
is that other search engines are clearly far behind Google in terms of completeness of documents and
timeliness of updates. This is more likely an issue of philosophy than an issue of technology. Other
search engines apparently don't think the completeness and timeliness of crawling are important,
even though this is one of the reasons they lost market share to Google.
Indexer This is the heart of a search engine. The indexer performs
two major tasks. First, it generates a list of page characteristics that summarizes a document. Second,
it produces a weighted, searchable keyword list from the characteristics of the document.
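As a rough illustration of a weighted, searchable keyword list, the sketch below builds a small inverted index. The weighting scheme (term count divided by document length) and the names `keyword_weights` and `build_index` are assumptions chosen for simplicity; real indexers combine many more characteristics.

```python
import re
from collections import Counter, defaultdict


def keyword_weights(text):
    """Return {term: weight}, where weight = term count / total terms."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    if not terms:
        return {}
    counts = Counter(terms)
    return {term: count / len(terms) for term, count in counts.items()}


def build_index(documents):
    """Build an inverted index: term -> {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, weight in keyword_weights(text).items():
            index[term][doc_id] = weight
    return index


# Hypothetical two-document collection.
docs = {
    "a": "search engine crawler",
    "b": "search search index",
}
index = build_index(docs)
```

Looking up `index["search"]` returns both documents, with document "b" carrying the larger weight because the term makes up more of its text.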
There are many startups in the search engine industry besides the titans Google, Yahoo and MSN.
Taking search technology to the next level will require a conceptual breakthrough in indexing documents.
A new way of characterizing documents may eventually take a startup from unknown to major player.
Google's success in the search industry is clearly the result of leveraging the PageRank concept.
Google characterizes a document by over one hundred factors. Not all factors are currently used for ranking search
results. However, a complete set of ranking factors is the key to flexibility in tuning the ranking algorithm
in response to market changes. The characteristics of a document are categorized into four groups:
- Page Content,
- Back Links,
- Page Traffic and
- User Feedback.
The details of ranking factors are discussed in
The Factors that Impact the Search Engine Result Ranking.
Query Server Query Server or Search Engine Software serves users' search requests.
This is the part of the search engine that is visible to Web users.
- First, it tries to understand and interpret the user's search terms. This is likely to be the competitive front
on which search engines improve the quality of search results in the near future. Unless search terms are properly interpreted,
search engines can't find the most relevant documents. How search terms are interpreted will partly
determine how documents are indexed in search engines.
- Second, the Query Server will try to retrieve and rank relevant documents.
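The two steps can be sketched as follows. This is a toy illustration under stated assumptions: the `INDEX` contents are hypothetical, "interpretation" is reduced to lowercasing and tokenizing, and ranking simply sums per-term weights.

```python
import re

# Hypothetical inverted index: term -> {doc_id: weight}.
INDEX = {
    "search": {"a": 0.2, "b": 0.5},
    "engine": {"a": 0.4},
    "crawler": {"c": 0.6},
}


def interpret(query):
    """Step 1: normalize the raw search term into lowercase tokens."""
    return re.findall(r"[a-z0-9]+", query.lower())


def retrieve_and_rank(query, index):
    """Step 2: score documents by summing the weights of matching terms."""
    scores = {}
    for term in interpret(query):
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


results = retrieve_and_rank("Search  ENGINE!", INDEX)
```

Here document "a" ranks ahead of "b" because it matches both query terms, even though "b" has the larger weight for "search" alone.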
The fundamental reason that search engines sometimes fail to return relevant documents is a lack of
understanding of search relevance. The quality of search depends on the quality and relevance of the pages retrieved.
Google has taken the search experience to a new level by using the simple yet elegant PageRank algorithm for
measuring the importance of a Web page. Any search engine (Google, Yahoo, MSN, or a
startup) that wants to take the search experience to the next level will have to come up with a proper algorithm
for computing search relevance based on a new definition of relevance.
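For readers unfamiliar with PageRank, here is a minimal sketch of the textbook power-iteration form of the algorithm. It is not Google's production system: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page Web: "c" is linked from "a", "b" and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this graph "c" ends up with the highest rank because three pages link to it, while "d", which nothing links to, ends up with the lowest.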
Related Topics
- Google PageRank - Basics, Secrets and Common Misunderstandings
- PageRank Calculator
- PageRank and Linking - the Definitive Guide
- Factors Affecting Search Engine Ranking
- Keyword Popularity on Google - an Online SEO Tool
- Website Classification for Search Engine Marketing
- Increase hits - Build Web Site Traffic from Search Engines