THE REGNET PROJECT
A Review of Academic Research on Information Retrieval

By Charles H. Heenan
Engineering Informatics Group
Department of Civil and Environmental Engineering
Stanford University
Stanford, California 94305
Email: heenan@stanford.edu

August 6, 2002

Acknowledgement and Disclaimer

This report is intended to review current academic research into information retrieval for unstructured multimedia content. This review has been performed as part of the Regnet Project, which is funded by the National Science Foundation under Grant No. EIA-0085998. Any opinions, findings, and conclusions or recommendations expressed in this report are those of the author and do not necessarily reflect the views of the National Science Foundation.

Contents

Introduction
Methodology and Organization
Section 1
  An Approach to the Synonymy Problem: Query Expansion
    Reference-Based Query Expansion
    Multilingual Query Expansion
Section 2
  An Approach to the Polysemy Problem: Context Vectors and Context Distance
Section 3
  Search Interfaces: The Categorization of Search Results
  Search Interfaces: The Incorporation of Subjective "Expert" Opinion
  Search Interfaces: The Deep Web
Conclusion
References
  Information Visualization
  Categorization of Search Results
  Categorization of an Information Space for Browsing
  Approaches to Information Retrieval

Introduction

The purpose of this document is to give an overview of academic research into information retrieval (IR) of unstructured content. Unstructured content typically includes text, speech, music, video, or still images. This evaluation focuses on the retrieval of unstructured text. The scope of the evaluation includes papers published no earlier than 1990 for conferences sponsored by the Association for Computing Machinery (ACM). The research for this document was conducted as part of the Regnet Project at Stanford University.
The Regnet Project is funded by the National Science Foundation and is focused on the application of information technology to regulation management and regulatory compliance.

Methodology and Organization

The research for this document was limited to academic papers available through the Association for Computing Machinery's online digital library. The initial queries used to gather documents for review were:

· text mining
· data retrieval
· text discovery
· information management
· text retrieval
· knowledge management
· information mining
· text classification
· information discovery
· information classification
· information retrieval
· text categorization
· data mining
· information categorization
· data discovery

Of the thousands of papers that matched one or more of these queries, more than 500 were selected for an initial review. Of those, 60 papers were selected for the more detailed reviews that form the basis of this document. These 60 papers were chosen either because they represent areas of active research or because they are particularly creative or cutting-edge. In the former case, there are often other similar papers in the ACM portal, any one of which reasonably could have been chosen. In the latter case, there may be no other papers that approach the given research question in the same way.

No doubt, there are relevant papers in the ACM portal that did not match any of the starting queries. Likewise, there are relevant papers that were not reviewed in detail. This review is not meant to be exhaustive but is intended to focus on the issues and approaches that may be of relevance to the Regnet Project. Those wishing to pursue further research in this area are encouraged to visit the ACM digital library at http://www.acm.org.

The field of text-based information retrieval is hardly new. In the ACM archive, there exists a mountain of published technical papers on various aspects of the text IR problem. A major topic addressed by information retrieval research is the dual problem of synonymy and polysemy. This problem stems from the fact that, in response to a given query, any retrieval engine must strike a balance between the conflicting demands of precision and recall. (Precision is the ratio of relevant retrieved documents to all retrieved documents; recall is the ratio of relevant retrieved documents to all relevant documents in the collection. For example, if a query returns 50 documents of which 40 are relevant, and the collection contains 100 relevant documents, precision is 0.8 while recall is only 0.4.) With available techniques, an increase in the precision of a retrieval engine tends to result in a concomitant decrease in the recall performance of that engine. That is to say, if you tweak an IR system so that a very high percentage of the result set is relevant to the query, you increase the risk that many other relevant documents will be excluded from the result set. Conversely, if you optimize an IR system so that a very high percentage of all documents that are relevant to the query are included in the result set, you increase the risk that many irrelevant documents will also be included.

The first section of this paper addresses the use of query expansion in solving the problem of synonymy. The second section addresses the use of context vectors in solving the polysemy problem. The third section discusses new developments in search interfaces.

Section 1

An Approach to the Synonymy Problem: Query Expansion

One recurring problem in text IR is how to deal with multiple terms that refer to the same concept.
If a query interface does not take synonyms into account when processing search terms, its search results will be incomplete. Although this is an important problem, it is a relatively simple one to address, and developers of text IR systems have tended to solve it with query expansion enabled by controlled vocabularies containing synonym lists or classification hierarchies. A query expansion-enabled interface will take as input a given search term, look for synonyms in the controlled vocabulary, and return documents that match either the search term or any of its synonyms.

More sophisticated query expansion-enabled interfaces use controlled vocabularies that incorporate classification hierarchies in addition to synonym lists. This type of interface uses a hierarchy of superordinate and subordinate relationships to conduct more thorough query expansion operations. For example, if a user enters a search on the term "dog," such an interface might return not only documents that match the term "dog" but also documents that match terms subordinate to "dog" in a classification hierarchy, such as "golden retriever" or "border collie." If the text collection contains documents in multiple languages, the controlled vocabulary can give query expansion an international dimension by allowing for multilingual synonym lists and classification hierarchies. Overall, the impact of simple as well as more sophisticated query expansion-enabled search tends to be more complete search results and a better search experience for the user.

Despite the relative theoretical ease with which one can use query expansion to address the problem of multiple terms referring to the same concept, the fact remains that constructing synonym lists and classification hierarchies is an onerous, manual task. However, recent work out of Northwestern University in Illinois and out of Monash University in Australia reveals creative ways to conduct query expansion without first having to construct controlled vocabularies.

Reference-Based Query Expansion

Bradshaw, Scheinkman, and Hammond of Northwestern University's Intelligent Information Lab point out that people do not always submit unambiguous search queries to information retrieval systems. Citing studies on the searching behavior of digital library users, Bradshaw et al. note that "people rarely use features of [a] query interface such as the Boolean operator 'and' or phrase delimiters such as quotation marks to indicate how they intend query words to be grouped together." Moreover, they note that people "rarely form queries of longer than three words" even though more detailed queries are often necessary to get highly specific search results. Consequently, in their view it is short-sighted for existing indexing systems to assume that searchers will submit accurate, unambiguous queries when the evidence indicates that they will not.

In response to this problem, Bradshaw et al. have come up with a creative way to index documents so that a query will yield high-quality search results even if the query terminology is imprecise: research documents are indexed according to how they have been referenced in other articles. This approach is based on the observation that, in research papers, the text "surrounding a citation (the reference) is usually a concise description of the information the cited document provides." A minimal sketch of this indexing idea appears below.
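In outline, such an index maps the words that appear near citations to the documents being cited, and queries are answered against those citation descriptions rather than against document content. The following Python sketch illustrates the general idea under simplifying assumptions (numeric [n]-style citation markers, a fixed context window); all names and data structures here are invented for illustration and are not those of Bradshaw et al.'s actual system.

```python
import re
from collections import defaultdict

# Hedged sketch of reference-based indexing: a document is indexed under the
# words OTHER papers use when citing it, not under its own content.
WINDOW = 15  # words of citing text kept on each side of a citation marker (assumed)

def citation_contexts(citing_text):
    """Yield (marker, context_words) for each [n]-style citation marker."""
    words = citing_text.split()
    for i, w in enumerate(words):
        m = re.search(r"\[(\d+)\]", w)
        if m:
            ctx = words[max(0, i - WINDOW): i + WINDOW + 1]
            yield m.group(1), [re.sub(r"\W+", "", t).lower() for t in ctx]

def build_reference_index(corpus):
    """corpus: citing-paper id -> (text, {marker -> cited document id})."""
    index = defaultdict(set)  # term -> ids of documents described by that term
    for text, refs in corpus.values():
        for marker, ctx in citation_contexts(text):
            for term in ctx:
                if term and marker in refs:
                    index[term].add(refs[marker])
    return index

def search(index, query):
    """Return documents whose citation descriptions contain every query term."""
    hits = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*hits) if hits else set()

# Toy example: paper P1 cites document D7 with a concise description.
corpus = {"P1": ("anchor text selection for web search is studied in [3]",
                 {"3": "D7"})}
print(search(build_reference_index(corpus), "anchor text"))  # {'D7'}
```

Note that even the two-word query in the toy example suffices to retrieve the cited document, which is the behavior the authors observe: a few well-chosen description words go a long way.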
Using references in this way is a powerful approach to indexing documents because "references pair concise, on-point descriptions of information with the documents that contain that information." Consequently, an information system that enables query expansion by incorporating document-reference information "is much better equipped to deal with the brief and often incomplete way people typically describe an information need." Such a system can provide more accurate, relevant search results even for short queries "because a few words is often enough to eliminate from consideration many irrelevant documents that would be retrieved by standard retrieval techniques based on content." Such a system also has the advantage that generating the reference-based indexes it requires for query expansion is a process that can be automated.

Despite the virtues of reference-based query expansion, there remain a number of limitations to the idea. First, it seems that a system like the one Bradshaw et al. propose will be limited to conceptually homogeneous text archives. A system whose approach to indexing depends upon the way authors of documents cite other documents requires as much; otherwise, there is likely to be insufficient cross-referencing of documents for the citation index to be of benefit. In this light, it makes sense that Bradshaw and colleagues chose an archive containing only computer science research articles as the underlying text for their system.

Second, even if a reference-based approach to query expansion could work for conceptually heterogeneous document collections, the fact remains that more recent articles will tend to be under-indexed as compared to older articles. For a period of time, any newly published article will not have been cited by any other authors, although the article itself will contain citations to earlier work. Presumably, if a newly published article addresses a topic on which others have published before, then the existing index of cross-references may succeed in returning the new article in response to queries for which it is relevant. However, authors of newly published articles that also break new conceptual ground may have to wait until their papers are cited by others before their work is fully incorporated into a cross-reference index. Nevertheless, reference-based query expansion represents an important research contribution to the field of text-based information retrieval.

Multilingual Query Expansion

Chau and Yeh of Monash University's School of Business Systems also look at the information retrieval problem that is created when more than one term refers to the same (or similar) conceptual content. In this case, Chau and Yeh are interested in the problem as it applies to multilingual, heterogeneous document collections as opposed to highly specific text archives like the one used in the Bradshaw study. Yet, like Bradshaw et al., Chau and Yeh explore query expansion as a possible solution.

The application of query expansion to a multilingual corpus is appropriate due to the problems searchers tend to face when looking for resources that are not in their own language. Unless a searcher is bilingual, it can be difficult "to formulate [a] query specifying an information need by producing appropriate keywords" in another language.
Chau and Yeh add that native users of Asian languages face additional difficulties even when they can specify their information need conceptually, because "most Asian characters, such as Chinese, cannot be composed easily and directly from the computer keyboard." To deal with both the difficulty of choosing search terms and the difficulty of entering Eastern ideograms on Western keyboards, Chau and Yeh propose an explorative approach to searching in which an information seeker browses through a map, directory, or hierarchy of concepts that are normalized to the information seeker's native language. The user of such an interface submits a query by clicking on a concept of interest, and the system returns results by showing the documents that populate that concept category. Of course, the documents that are returned may be in any number of other languages besides the searcher's native tongue.

While the formulation and submission of a query in this system occurs at the moment a user clicks into a concept category, the groundwork for multilingual expansion of that query occurs well in advance. Chau and Yeh's approach to multilingual query expansion requires preprocessing the document collection so that multilingual content can be grouped into appropriate concept categories ahead of time. This preprocessing uses "the co-occurrence statistics of a set of multilingual keywords extracted from a parallel corpus." (A parallel corpus is a collection of documents containing the same text written in multiple languages.) The reason for calculating these co-occurrence statistics is that "semantically related multilingual keywords representing similar concepts tend to co-occur in similar patterns (i.e. similar inter- and intra-document frequency) within a parallel corpus." By analyzing these statistics, "multilingual keywords extracted from a parallel corpus [can be] sorted into keyword clusters (concept classes)." Once these keyword clusters have been identified, "each...cluster is given a concept label in each language involved."

On balance, the approach to multilingual query expansion outlined by Chau and Yeh is compelling. The authors make a point of addressing the fact that there is an "inexact correspondence between keywords across languages" due to cultural or linguistic differences. They acknowledge that, as a result, a "one-to-one mapping of a keyword and its foreign counterparts may not always be possible." So rather than attempting to make perfect matches between terms in one language and those in another, Chau and Yeh focus on clusters of relationships in the expectation that those clusters will be of value to the information seeker. To the extent that this approach makes it easier for speakers of Asian languages to formulate queries and to pose them to the system, Chau and Yeh's expectation appears to be appropriate.

However, the authors have not addressed the problem of how concept clusters can be labeled accurately and efficiently. Even within one language, the question of assigning concepts to categories can be a drawn-out manual process full of subjectivity. If an automated concept-to-category assignment tool is used, then the process for creating assignment rules can itself become drawn-out and subjective. When dealing with multilingual text collections, the difficulty of placing concepts in categories and of labeling those categories becomes even greater. At the same time, the
likelihood that a monolingual end user will be able to distinguish between high- and low-quality concept labels decreases, precisely because a typical user knows only one of the languages being used. One avenue for future research on multilingual text retrieval could be to explore how to develop high-quality concept names for multilingual concept clusters in an efficient manner. Another avenue could be how to compensate for the monolingual end user's relative inability to assess the quality and accuracy of concept labels.

Section 2

An Approach to the Polysemy Problem: Context Vectors and Context Distance

Query expansion is a simple and productive approach to the problem created when multiple terms refer to the same concept. Unfortunately, an equivalently simple approach does not exist for the opposite case, which arises when morphologically identical terms refer to separate concepts. To solve the polysemy problem in information retrieval requires the disambiguation of word meanings when separate ideas are expressed by the same term. A common example of this circumstance is the word "bank," which can refer to a river bank, a bank of public telephones, and a place that stores money. If an information seeker submits a query of "bank," the difficulty is how to enable the search system to determine what type of "bank" is meant. This sort of ambiguity has direct implications for query expansion as well, because in one case the query should expand to include synonyms such as "shore" or "edge" while in another case the synonym list should include "financial institution" or "investment house."

Developers of some early text information retrieval systems chose simply to ignore the polysemy problem. These early systems would return all documents deemed "relevant" to the query, where relevance is based upon strict word similarity. While this may yield many relevant documents, they are likely to be buried among other documents that do contain the search term but that are irrelevant on a semantic level. More importantly, defining relevance according to strict word similarity means some documents that are relevant will not be returned because they do not contain the specific search term. The result of ignoring the polysemy problem in this way is both low precision and low recall. This is problematic on both counts: low precision creates difficulties in separating the wheat from the chaff, so to speak, in the list of returned documents, while low recall is precisely the sort of problem that query expansion is meant to offset. Query expansion alone does not hold the answer: although recall would improve, precision would go through the floor if a system were to expand a query on "bank" to include all synonyms of all the various senses of that word (i.e., synonyms of "bank" as in "river bank," synonyms of "bank" as in "financial institution," and so on).

Recent research on the use of context distance for word sense disambiguation holds great promise. The work of Jing and Tzoukermann, of Columbia University and Bell Labs respectively, suggests one solution to the issue of polysemous terms. Starting from the assumption that a given word or phrase has a dominant meaning in a given document, they then "represent this meaning in the form of a context vector." These context vectors are based on "all occurrences of the same word in [a] document" and are derived from the terms that occur within a window of 10 words surrounding the target word. A small sketch of this construction is given below.
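The construction can be illustrated in a few lines of Python. The sketch pools the context windows of every occurrence of a target word into one frequency-weighted vector and then compares two such vectors by exact term overlap; the stop-word list, the cosine comparison, and all names here are illustrative assumptions rather than Jing and Tzoukermann's actual formulation, which, as described next, also credits related but non-identical terms.

```python
from collections import Counter

WINDOW = 10  # terms kept on each side of the target word, per the paper
STOPWORDS = {"the", "a", "an", "and", "of", "to", "from", "in", "they"}  # assumed

def context_vector(tokens, target):
    """Pool the context windows of every occurrence of `target` in a document
    into one frequency-weighted vector (term -> weight)."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            window = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
            vec.update(t for t in window if t not in STOPWORDS)
    return vec

def cosine(u, v):
    """Exact-match similarity between two context vectors. (Jing and
    Tzoukermann go further: related-but-different terms such as 'money'
    and 'loan' also contribute, via corpus co-occurrence statistics.)"""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm = (sum(w * w for w in u.values()) * sum(w * w for w in v.values())) ** 0.5
    return dot / norm if norm else 0.0

money_doc = "the savings bank approved the loan and the bank recorded the deposit".split()
river_doc = "they walked along the river bank and fished from the muddy bank".split()
v1 = context_vector(money_doc, "bank")
v2 = context_vector(river_doc, "bank")
print(cosine(v1, v2))  # low: the two senses share almost no context terms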
The more frequently a given word or phrase appears within the context window of a given target word, the stronger a signifier that word is when it comes to sense disambiguation. For example, if the term "savings and loan" always occurs within the 10-word context window for "bank," there is a strong likelihood that the bank in question is the financial-institution type and not the shore-of-a-river type. For each term within these context windows, a weight is assigned based on its frequency of occurrence. Figure 1 shows the target word "bank" and its corresponding context vector. Note that the words "savings," "million," "loan," etc., help "to disambiguate the target word 'bank' as the money bank rather than the river bank."

Figure 1: Context vector for the target word "bank." The weight following each term in the context vector indicates the importance of that term in the vector.

On the basis of the context vector in Figure 1, an information retrieval engine that is responding to a query on the term "river bank" can adjust so as to return only exact Boolean matches and to exclude matches on the more general term "bank."

However, context vectors alone may not be sufficient to determine whether two morphologically related (or even identical) target words are semantically related. First, an approach is needed for evaluating how closely related a given pair of context vectors may be. Since context vectors are composed of individual terms, Jing and Tzoukermann achieve this by focusing on the "level of mutual information between words in context vectors." For Jing and Tzoukermann, this mutual information level is signified by term co-occurrence frequency. If the terms in one context vector have strong co-occurrence relationships with the terms in another context vector, then the respective target words (regardless of morphology) are more likely to be semantically related. Figure 2 shows a table of word pairs with varying levels of co-occurrence strength (corpus relevance).

Figure 2: Word pairs and co-occurrence strength for each pair (corpus relevance).

The calculation of word pair co-occurrence strength (corpus relevance) makes it possible to calculate the distance between context vectors even when the terms in those vectors are not the same. For example, "the word 'bank' may occur with the word 'money' in one context, and with the word 'loan' in [another]. If [one] can capture the close relatedness of 'money' and 'loan', [one] can deduce that 'bank' probably has similar meanings in the two occurrences." Jing and Tzoukermann observe that "a model which relies on exact word repetition will fail in this case since it will miss the relations between 'money' and 'loan.'" Figure 3 shows an example of just such a case. Note that the only shared term in the two context vectors is "loan." Despite this, the strong corpus relevance between the terms in each context vector is sufficient to indicate that the two target words concern the same topic.

Figure 3: "Bank" and "banks" - morphologically distinct, but semantically linked.

Figure 4, on the other hand, shows an archetypal polysemy problem: two morphologically identical instances of the word "bank." Are they conceptually identical? Or are they semantically as different as if they were morphologically unrelated? The Jing–Tzoukermann approach allows us to say that, although the two terms are morphologically identical, they are conceptually distinct.
Figure 4: A bank is not always a bank.

Jing and Tzoukermann's work is an important contribution to a central problem in information retrieval: how to find an optimal balance between precision and recall. On the one hand, one could maximize recall for a given query simply by returning the set of all records in the document repository. The fact that this results in abysmal precision makes it nonsensical. On the other hand, attempts at maximizing precision must have some way of dealing with polysemy. Otherwise, either those attempts will fail or recall will suffer. In short, precision and recall are two sides of the same coin. The goal is to find an optimal balance between them. Jing and Tzoukermann show us that query expansion and sense disambiguation can take us a long way towards this goal.

The remainder of this paper discusses new developments in search interfaces, including the categorization of search results and the categorization of databases as opposed to text content.

Section 3

Search Interfaces: The Categorization of Search Results

The synonymy and polysemy problems pertain to the task of query fulfillment in that, to be good, a search engine must respond to a query by returning a list of documents with the maximum number of relevant records and the minimum number of irrelevant records. Yet there exists a separate set of problems that pertain to the user interface for viewing these search results. Typically, search results are presented in the form of a ranked list, broken down so that only 10 or 20 are viewable on a given web page. Even if precision and recall are optimized, a list of search results will contain some documents that are not useful for the searcher and others that are useful. The list of search results is likely to contain subsets of documents that are similar, or that are related to the search query in a similar way. If precision and recall are not optimized (as is more commonly the case), then the list of search results will also contain irrelevant documents scattered among the relevant ones.

Susan Dumais of Microsoft Research and Hao Chen of UC Berkeley have conducted research into alternatives to the traditional ranked-list display of search results. They have found that users are able to find documents more efficiently when search results are organized into topical categories than when they are presented with a standard ranked list. Dumais and Chen tasked the study participants with finding documents via a traditional list interface as seen in Figure 5, and then by means of category-style interfaces, one of which is shown in Figure 6.

Figure 5: A ranked list interface for search results.

Figure 6: A category-based interface for the same search results as shown in Figure 5.

Dumais and Chen used four variations on category interfaces like the one shown in Figure 6 and three variations on list interfaces like the one shown in Figure 5. In every case, users were more efficient at locating information through a category interface than through a list interface. The relative advantage of a category-based interface was even greater for "difficult" searches as opposed to "easy" ones (see Figure 7).

Figure 7: Mean log time to complete tasks for easy and difficult queries for each interface type.
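Mechanically, a category view like the one in Figure 6 takes the same ranked result list and buckets it under topical headings. Dumais and Chen drove their interfaces with trained statistical text classifiers; as a much cruder stand-in, the sketch below uses hand-made keyword profiles. The category names, profiles, and function here are invented for illustration and are not the study's actual classifier.

```python
# Hedged sketch: bucket a ranked result list under topical headings.
CATEGORY_PROFILES = {
    "Finance":   {"loan", "savings", "interest", "deposit", "accounts"},
    "Geography": {"river", "shore", "erosion", "flood", "levee"},
}

def categorize(results, profiles=CATEGORY_PROFILES, default="Other"):
    """Assign each (title, snippet) result to the category whose keyword
    profile overlaps its snippet the most, keeping rank order per bucket."""
    buckets = {name: [] for name in list(profiles) + [default]}
    for title, snippet in results:
        words = set(snippet.lower().split())
        best = max(profiles, key=lambda c: len(profiles[c] & words))
        buckets[best if profiles[best] & words else default].append(title)
    return buckets

results = [
    ("First National", "compare savings accounts and loan interest rates"),
    ("Delta Levees", "river bank erosion and flood control projects"),
    ("Band News", "tour dates announced"),
]
print(categorize(results))
# {'Finance': ['First National'], 'Geography': ['Delta Levees'], 'Other': ['Band News']}
```

The point of the study is about the interface, not the classifier: however the buckets are produced, presenting them as labeled groups rather than one long list is what made users faster.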
Interestingly, as of April 2002, only one high-profile commercial search engine appears to have incorporated categorization of search results into its user interface. Teoma.com is a web search engine that went live to the public in early 2002. Figure 8 shows the Teoma interface after it has completed a search on the term "Knowledge Management." At the bottom left of the screen is the standard ranked list of search results. However, at the top right of the screen is a section labeled "Refine – Suggestions to narrow your search." Although Teoma packages the links in this section as suggestions for query refinement, they function as subcategories within the domain of knowledge management.

Figure 8: Teoma.com user interface for search results.

During Teoma's beta release, the user interface even used the Windows Explorer "folder" iconography to represent these links explicitly as categories and subcategories within the realm of "Knowledge Management." It is unclear why the metaphor was changed, but the fact remains that clicking on links in the "Refine" section of the Teoma interface will yield a subset of documents from the primary search as well as a new list of links (sub-subcategories) for further refinement. Regardless of what metaphor is used to represent the idea of categorization of search results, in the future it is likely that other search websites will follow Teoma's lead and incorporate categorization of search results into the user interface.

Search Interfaces: The Incorporation of Subjective "Expert" Opinion

Besides research on using categorization to make search results more accessible, there is research from Intel Corporation on how the use of "expert" opinion can facilitate interdisciplinary search. John Light of Intel published a paper in 1997 in which he discusses search technology and some of the assumptions underlying the then-state-of-the-art search systems. One of his observations is that "text retrieval is currently very Aristotelian. That is, answers are judged as either right or wrong." This raises problems when there is a high degree of speciation within general fields of inquiry, because the same terms can convey radically different meanings between disciplines. The consequence of this for searchers is greater difficulty in finding material outside of one's own domain of specialization. This, Light adds, is problematic because "some of the most interesting and important searching being done today is across disciplines. Whether it is done by someone who is a novice or expert in his own discipline, these searches are in a space where the searcher doesn't really know or understand the vocabulary. Historically, some of our greatest inventions have resulted from connecting disparate disciplines, so supporting searches in foreign domains is critically important. Our current search methods, which rely heavily on the user's ability to pick individual words, make that hard."

One solution to this problem is for search systems to turn the binary, Aristotelian right-or-wrong approach to search on its head by incorporating the knowledge of subjective domain experts. Light proposes "a largely automated system that uses expert information that is provably and intentionally subjective." He adds that "the application of a human viewpoint is an additional advantage to the system, not a drawback" and that "one way to look at the expert contribution is as that of an editor of a publication."
Light envisions experts as fulfilling a number of roles. Two such roles are topic identification and vocabulary definition. According to Light, experts would need to identify a "large list of narrow topics within [a given] document set." These topics could then be used by non-experts to construct queries themselves. In addition, Light argues that experts would need to be responsible for the creation of "a description of the vocabulary used to discuss each topic," where the topic is described "by a list of words or phrases that are specific to the topic."

One question that arises from Light's idea that expert knowledge could be used to improve search is that of labor: who is going to spend the time necessary to create these lists of topics and domain-specific vocabulary definitions? As it turns out, countless individual weblog developers have been doing just that on a voluntary basis for some time. In May 1999, the online news site Salon.com described weblogs as "personal web sites operated by individuals who compile chronological lists of links to stuff that interests them, interspersed with information, editorializing and personal asides. A good weblog is updated often, in a kind of real-time improvisation, with pointers to interesting events, pages, stories and happenings elsewhere on the Web. New stuff piles on top of the page; older stuff sinks to the bottom." Although there is little standardization from one weblog to the next and there is no guarantee that some set of weblogs has rigorously defined specific vocabularies, weblogs do represent a tremendous amount of quasi-expert information on increasingly narrow topical niches. Since weblogs tend to have a common format, it should be possible for search engines to harvest this information. The result would be that weblog developers will have unknowingly filled in for the role of "editor" that John Light argues can improve the quality of web search.

Again, as with the idea of using categorization for organizing search results, few commercial search engines are taking advantage of weblog information in a way that would fulfill Light's vision. Yet, again, it is Teoma.com that is leading the way. When a user submits a query through Teoma's search interface, Teoma looks for weblogs and other pages that contain lists of links that deal with the user's query. Links to these list-of-links pages are shown at the bottom right of the Teoma search results page, under the heading "Resources – Link collections from experts and enthusiasts" (see Figure 8). Figure 9 shows the page that is listed first under the Resources heading in Figure 8. It is a list of links to pages dealing with the original search term, "Knowledge Management." Although it is impossible to verify the qualifications of any given "expert" or "enthusiast" who has created a list-of-links page, the idea of incorporating such pages into a search interface is a good one and, in at least some cases, it does add value.

Figure 9: A list-of-links page that was included by Teoma.com in response to a search on the term "Knowledge Management."

Search Interfaces: The Deep Web

While the categorization of search results and the incorporation of "expert" information do add value to search interfaces, the fact remains that traditional web content (content directly accessible through links) represents only a fraction of the information on the Internet.
Recent studies indicate that the traditional, static web makes up some two billion pages of the Internet. While that is a sizable figure, it pales in comparison to the 500 billion pages that are estimated to exist on the "hidden," or "deep," web. Deep web pages reside in web-connected databases and are only accessible through the mediation of a query interface. These web-based interfaces to databases dynamically generate a list of links in response to searches entered by users. The problem is that "traditional search engines cannot handle such interfaces...." As a result, they "ignore the content of these resources, since [the search engines only work by taking] advantage of the static link structure of the web to 'crawl' and index web pages."

There do exist a number of sites that are focused on addressing the problem that is presented by the deep web. For example, Invisibleweb.com (Figure 10) and Searchengineguide.com are two manual categorization efforts in which databases are grouped under topical headings. A click into a category such as "education" will yield a list of sites through which one can access database search interfaces. Through these interfaces, one can "Find a Teacher," "Find a College," or even "Find a School District."

Figure 10: Home page for InvisibleWeb.com, "The Search Engine of Search Engines." The main portion of the page contains a manually populated classification of online databases.

Unfortunately, categorizing online databases manually can be just as time-consuming as categorizing online documents, if not more so, particularly because online databases do not offer unmediated access to their content. In response to this difficulty, one research effort out of Columbia University points the way towards a more efficient method of categorizing online databases based upon their content. Ipeirotis, Gravano, and Sahami frame the problem by relating their experiences searching for documents with the keyword "cancer" on the PubMed medical database from the National Library of Medicine. A manual query of the PubMed database for "cancer" yielded "1,301,269 matches, corresponding to high-quality citations to medical articles." However, since these documents are dynamically generated in response to a query, they are not "'crawlable' by traditional search engines." For example, using the same query of "cancer" via the web search engine AltaVista to find pages in the PubMed site "returns only 19,893 matches. This number not only is much lower than the number of PubMed matches reported above, but...the pages returned by AltaVista are links to other pages on the PubMed site, not to articles in the PubMed database." In short, traditional web queries will not work for accessing information in deep web repositories such as the PubMed database.

Ipeirotis et al. have developed a creative, automated approach for approximating what a database is "about" through the use of query probes and the evaluation of the results from each probe. If a database returns many documents in response to a query about "cancer," but returns zero documents in response to a query about "NHL hockey," that information can be used to help decide whether to classify the database as being about healthcare/medicine or about sports. The more query probes that are submitted, the more refined and accurate the ultimate classification of the database will be. The sketch below illustrates the basic idea.
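The following is a minimal illustration of query probing under stated assumptions: the probe lists, category names, and the count_matches stub are invented for this sketch, and the actual Probe, Count, and Classify method derives its probes automatically from a trained document classifier rather than from hand-picked terms.

```python
# Hedged sketch of query probing: estimate what a hidden-web database is
# "about" from the hit counts its search interface reports for probes.
CATEGORY_PROBES = {
    "Health":  ["cancer", "diabetes", "immunology"],
    "Sports":  ["NHL hockey", "home run", "penalty kick"],
    "Finance": ["mutual fund", "interest rate", "dividend"],
}

def count_matches(database, probe):
    """Stub: a real system would submit `probe` through the database's web
    search form and scrape the reported number of matching documents."""
    return database.get(probe, 0)

def classify_database(database, probes=CATEGORY_PROBES):
    """Rank candidate categories by total hits across their probes."""
    scores = {cat: sum(count_matches(database, p) for p in plist)
              for cat, plist in probes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-in for a PubMed-like database (hit counts are illustrative,
# apart from the 1,301,269 "cancer" figure reported in the paper).
pubmed_like = {"cancer": 1301269, "diabetes": 180000, "NHL hockey": 3}
print(classify_database(pubmed_like))  # Health ranks first by a wide margin
```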
On balance, the approach of using query probes seems to be an effective innovation for categorizing databases without manual intervention. It is likely that this approach will be highly effective when applied to narrow, topically focused databases. The main drawback to this approach is that heterogeneous databases (ones that contain roughly equal numbers of documents about a range of topics) may pose a greater categorization challenge because of lower variation in the database's response to different query probes.

Conclusion

Academic research into information retrieval systems is proceeding apace. New user interfaces for effectively conveying search results have moved from the research lab to the "live" web, and this flow of innovation seems unlikely to fade. New approaches to the problems of synonymy and polysemy are pushing the frontiers of retrieval algorithms to new levels of effectiveness. At the same time, however, there exist a number of fundamental questions about how the "effectiveness" of a retrieval algorithm should be defined, and therefore evaluated. The traditional criteria for a retrieval system have been precision and recall. A system that is precise will return a very low percentage of irrelevant documents given a specified query or classification rule. Yet, while the documents that are returned by such a system will tend to be on-topic, there is no guarantee that those documents represent anything more than a small percentage of all the on-topic documents in the search database. On the other side of the coin, a system that has high levels of recall can be expected to return a significant percentage of all the documents in the database that are on-topic for a given search query. Yet this increase in recall almost always comes at the expense of precision.

Ultimately, the optimal relationship between an information retrieval system's precision and recall is likely to vary depending upon the application domain and upon the needs of the system's users. If different search and categorization algorithms set the balance between precision and recall differently, clearly some algorithms will not be appropriate for some information seekers' needs. It is important for information seekers to be aware of the variation that exists among search and categorization approaches, and to understand which approach is right for a given information need. In some cases, an information seeker may need a recall-oriented tool. In others, exhaustiveness is less important and a precision-oriented algorithm may be more appropriate.

In the end, users of search services should keep in mind that what goes on behind the query submission box varies widely from site to site and that this variation has an impact upon search results. One must not be lulled into an Internet-enabled laziness with respect to information retrieval. Information seekers wishing to be thorough should employ a range of search tools rather than one favorite engine. When viewing search results (or categorization results), they should be just as mindful of what is not returned as they are of what is. And in some cases, they should even consider making a trip to the library of a local research university or other institution. After all, not everything is digital or available electronically. Not everything has been indexed by search engines or categorization schemes. At least, not yet.

References

Information Visualization

Information visualization is a broad research area.
In this paper, only some of the visualization research has been discussed, namely the use of categorization for optimizing the usability of search interfaces. That research is cited under a separate heading, below. Nonetheless, the following papers are noteworthy, and readers should consider consulting them to gain a wider context on the field of information visualization. Of particular note is the work of Peter Pirolli et al. on information scent.

Au, Peter; Carey, Matthew; Sewraz, Shalini; Guo, Yike; Ruger, Stefan. "New Paradigms in Information Visualization" Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pages 307–309. July 2000.

Chi, Ed H.; Pitkow, James; Mackinlay, Jock; Pirolli, Peter; Gossweiler, Rich; Card, Stuart. "Visualizing the Evolution of Web Ecologies" Proceedings of the Conference on Human Factors in Computing Systems, Pages 400–407. 1998.

Graham, Martin; Kennedy, Jessie B.; Hand, Chris. "A Comparison of Set-Based and Graph-Based Visualisations of Overlapping Classification Hierarchies" Proceedings of the Working Conference on Advanced Visual Interfaces, Pages 41–50. 2000.

Kreuseler, Matthias; Schumann, Heidrun. "Information Visualization Using a New Focus+Context Technique in Combination with Dynamic Clustering of Information Space" Proceedings of the 1999 Workshop on New Paradigms in Information Visualization and Manipulation in Conjunction with the Eighth ACM International Conference on Information and Knowledge Management, Pages 1–5. 1999.

Light, John. "A Distributed, Graphical, Topic-Oriented Search System" Proceedings of the Sixth International Conference on Information and Knowledge Management, Pages 285–292. 1997.

Miller, Nancy E.; Wong, Pak Chung; Brewster, Mary; Foote, Harlan. "TOPIC ISLANDS - A Wavelet-Based Text Visualization System" Proceedings of the Conference on Visualization, Pages 189–196. 1998.

Pirolli, Peter; Card, Stuart; Van Der Wege, Mija. "The Effect of Information Scent on Searching Information: Visualizations of Large Tree Structures" Proceedings of the Working Conference on Advanced Visual Interfaces, Pages 161–172. 2000.

Shneiderman, Ben; Feldman, David; Rose, Ann; Ferre Grau, Xavier. "Visualizing Digital Library Search Results with Categorical and Hierarchical Axes" Proceedings of the Fifth ACM Conference on Digital Libraries, Pages 57–66. 2000.

Categorization of Search Results

Recent work out of Microsoft Research indicates that categorization of search results, as opposed to simple ranked results, facilitates information retrieval. The following papers discuss this topic in detail.

Borner, Katy. "Extracting and Visualizing Semantic Structures in Retrieval Results for Browsing" Proceedings of the Fifth ACM Conference on Digital Libraries, Pages 234–235. 2000.

Chen, Hao; Dumais, Susan. "Bringing Order to the Web: Automatically Categorizing Search Results" Proceedings of the CHI 2000 Conference on Human Factors in Computing Systems, Pages 145–152. 2000.

Dumais, Susan; Chen, Hao. "Hierarchical Classification of Web Content" Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pages 256–263. July 2000.

Categorization of an Information Space for Browsing

Chaffee, Jason; Gauch, Susan. "Personal Ontologies for Web Navigation" Proceedings of the Ninth International Conference on Information and Knowledge Management, Pages 227–234. 2000.
Geffner, S.; Agrawal, D.; El Abbadi, A.; Smith, T. "Browsing Large Digital Library Collections Using Classification Hierarchies" Proceedings of the Eighth International Conference on Information and Knowledge Management, Pages 195–201. 1999.

Graham, Martin; Kennedy, Jessie B.; Hand, Chris. "A Comparison of Set-Based and Graph-Based Visualisations of Overlapping Classification Hierarchies" Proceedings of the Working Conference on Advanced Visual Interfaces, Pages 41–50. 2000 (cross-referenced above).

Approaches to Information Retrieval

The following papers discuss interesting avenues of research in information retrieval as a whole. Several of these papers are discussed in greater detail in the body of this essay. Among the ones that were not discussed, a primary theme is the use of software agents as retrieval facilitators. The news article "Use the Blog, Luke" is also of particular interest.

Belkin, Nicholas J.; Croft, W. Bruce. "Information Filtering and Information Retrieval: Two Sides of the Same Coin?" Communications of the ACM, Volume 35, Issue 12, Pages 29–38. December 1992.

Chau, Michael; Zeng, Daniel; Chen, Hsinchun. "Personalized Spiders for Web Search and Analysis" Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, Pages 79–87. June 2001.

Dörre, Jochen; Gerstl, Peter; Seiffert, Roland. "Text Mining: Finding Nuggets in Mountains of Textual Data" Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 398–401. 1999.

Ipeirotis, Panagiotis; Gravano, Luis; Sahami, Mehran. "Probe, Count, and Classify: Categorizing Hidden-Web Databases" Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Pages 67–78. May 2001.

Jing, Hongyan; Tzoukermann, Evelyne. "Information Retrieval Based on Context Distance and Morphology" Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pages 90–96. August 1999.

Johnson, Steven. "Use the Blog, Luke" Salon.com (http://www.salon.com/tech/feature/2002/05/10/blogbrain/print.html). May 2002.

Lam, Wai; Lai, Kwok-Yin. "A Meta-Learning Approach for Text Categorization" Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pages 303–309. September 2001.

Menczer, Filippo; Belew, Richard K. "Adaptive Information Agents in Distributed Textual Environments" Proceedings of the Second International Conference on Autonomous Agents, Pages 157–164. 1998.

Plaisant, Catherine; Shneiderman, Ben; Doan, Khoa; Bruns, Tom. "Interface and Data Architecture for Query Preview in Networked Information Systems" ACM Transactions on Information Systems, Volume 17, Issue 3, Pages 320–341. July 1999.

Singh, Lisa; Scheuermann, Peter; Chen, Bin. "Generating Association Rules from Semi-Structured Documents Using an Extended Concept Hierarchy" Proceedings of the Sixth International Conference on Information and Knowledge Management, Pages 193–200. 1997.

Stuckenschmidt, Heiner; van Harmelen, Frank. "Ontology-Based Metadata Generation from Semi-Structured Information" Proceedings of the International Conference on Knowledge Capture, Pages 163–170. 2001.

Tansley, Robert; Bird, Colin; Hall, Wendy; Lewis, Paul; Weal, Mark. "Automating the Linking of Content and Concept" Proceedings of the Eighth ACM International Conference on Multimedia, Pages 445–447. 2000.

Voss, Angi; Nakata, Keiichi; Juhnke, Marcus.
"Concept Indexing" Proceedings of the International ACM SIGGROUP Conference on Supporting Group Work, Pages 1 ­ 10. 1999. Wong, Kam-Fai; Song, Dawei; Bruza, Peter; Cheng, Chun-Hung. "Application of Aboutness to Functional Benchmarking in Information Retrieval" ACM Transactions on Information Systems, Volume 19, Issue 4, Pages 337 ­ 370. October 2001. Draft 0.5 ­ Charles H. Heenan 23