18-22 May 1996
Marriott Mission Valley Hotel, San Diego, CA
T. R. Girill
Lawrence Livermore National Laboratory
P.O. Box 808, L-560
Livermore, CA 94550 USA
trg@llnl.gov
Clement H. Luk
Department of Computer Science
California State University, Chico
Chico, CA 95929-0410 USA
luk@ecst.csuchico.edu
This paper advocates an often-neglected search-support technique, approximate or "fuzzy" matching of user search terms. When properly deployed, fuzzy matching can significantly enhance the benefits of other, more common approaches to end-user answer retrieval from online reference collections. We compare crude with more sophisticated approximation techniques to explain how astute fuzzy-match software can convert many different near-miss situations (such as those involving faulty prefixes or suffixes, character misplacement, nonstandard word stems, or unanticipated redescription of concepts) into more adequate results. We also suggest practical ways to overcome fuzzy matching's own major drawbacks (namely, problems with search speed, search imprecision, and misinterpretation of search results). The resulting analysis clarifies how to deploy fuzzy matching for maximum effectiveness. We conclude that appropriate fuzzy matching enables more frequent, more flexible search success than do ordinary retrieval-improvement techniques used without it.
Online documentation spreads along a spectrum from very focused help passages deeply integrated into a particular application (Duffy, Palmer, and Mehlenbacher, 1992) to large, general, digital libraries of reference material potentially relevant to answering many questions in many contexts (Girill and Tull, 1988). At the latter end of the spectrum, effective search tools are vital if users--both casual end users and busy help-desk consultants--are to reliably yet easily locate the answer passages they need (Girill, Griffin, and Jones, 1991). We have a working prototype search interface for online documentation that incorporates fuzzy matching (syntactic approximation) of search terms. Its behavior convinces us that this feature enhances the searching of large reference databases, even though this application has not been a primary focus of the literature on fuzzy matching techniques in the past.
Recent publications on syntactic approximation or fuzzy matching most often emphasize office-automation situations and the automatic indexing of business papers, legal briefs, or other office memos (e.g., Kent, Sacks-Davis, and Ramamohanarao, 1990; Faloutsos and Christodoulakis, 1987). This is the context of document preparation (or database construction) rather than the context of answer search. And the implied goal of fuzzy matching tools here is creating text collections without attentive human intervention; i.e., replacing human with artificial intelligence to minimize processing time or reduce collection-development costs. Our work suggests, on the other hand, that fuzzy matching can be applied jointly with semantic techniques (such as extensive aliasing) to improve performance during the search process itself, not just during presearch text preparation. We believe this search-time approach is best viewed as amplifying the native intelligence of the human searcher (by promoting more flexible pattern recognition) rather than replacing the human role (Wegner, 1983).
Anthropologist Bonnie Nardi, after a study of how reference librarians add value to the information gathering that occurs among professionals at work, proposed that "electronic search agents may be usefully modeled on reference librarians" (Nardi, 1994; Nardi, 1993). The sections below show how astute inclusion of fuzzy matching in a search interface to an online reference collection pursues this anthropological hint more thoroughly than do interfaces without it. Enriching a literal search by including a tolerance for imprecise input, through the use of appropriate approximations, is one way every seasoned reference librarian intuitively adds value to the search process.
Fuzzy-match software captures that professional intuition in algorithms that let even unwitting end users add (some of) the same value to every search they run. Of course, trivial approximation yields trivial improvement (discussed below), and approximation alone is insufficient for search success. But we believe a closer look at the problems it solves supports our conclusion that fuzzy matching at search time is a key enabling technique that maximizes the value to end users of the reference collection searched and of the usual semantic techniques jointly applied to search that collection.
The next section examines four typical failures of literal-only search systems and shows how fuzzy matching can overcome each one. We then look at the potential problems fuzzy search systems encounter and discuss the practical solutions to them that our prototype has suggested. The resulting analysis reveals the practical value of fuzzy-match searching as a user services tool.
Omitting or misstating prefixes, suffixes, or plurals is the most simple case of a syntactic flaw that prevents the otherwise successful match of a user's search term and a database descriptor. The many common and meaningful variations on the stem COMPIL, for example, such as
[pre]compil|e
|er
|ers
|ing
|ation
show why this pitfall is so prevalent.
Here, the easiest approximate-match response is also well known: stemming or filters ("wildcards"). This most basic fuzzy matching technique appears in two forms.
(A) User Activated: allow the user when constructing the search to overtly specify with special filter characters how to truncate the front, tail, or other parts of the search term to increase the number of hits (mostly by eliminating prefixes or suffixes where unintended variations are greatest). The UNIX metacharacters are one simple example (e.g., Sobell, 1995, pp. 108,435).
(B) System Activated: use automatic stemming rules or a built-in dictionary to detect the stem of search terms and database descriptors before comparing them for matches. This form works whether or not users remember to invoke it, but they consequently have less control over the search. The WAIS stemming algorithms are one simple example (Liu, et al., 1994, pp. 111, 115, 116).
Although stemming and filtering address well the specific problem of mismatches caused just by faulty prefixes or suffixes, they are inherently limited as fuzzy matching techniques. Each can generate only proper substrings of the original search term. Syntactic flaws lurking within such substrings can still prevent otherwise desired matches unless they are overcome by additional, more subtle approximation techniques.
Missing, extra, transposed, or erroneous characters, especially inside the stem of a search term, pose a more challenging approximation problem. Examples faced by our prototype system include
v e k t e r i s a t i o n
F o u r i e r t r n a s f o r n
f u o r e i r t r a n s f o r m
f o r i e a t r a n s f r m
l e g e n d e r p o l i n o m i a l
As these examples show, multiple character misplacements often occur in
the same search term. Simple mistyping, spelling errors, faulty
memory of unfamiliar technical terms, and input from nonnative speakers
all contribute to these syntactic problems, sometimes simultaneously.
These problems are frequent: 80% of user syntactical errors involve
character misplacement (Ho, 1995). But filters and autostemming seldom
suffice to overcome such misplacements because the flaws are usually not
localized in the word parts discarded or ignored by those two
techniques alone.
Getting practical benefit from fuzzy matching thus demands more elaborate, more comprehensive approximation techniques than the two most common ones. First, we need to compare not literal terms but rather abstract representations ("signatures") of search terms and database descriptors. These representations should be specific enough to preserve important semantic similarities yet coarse enough ("fuzzy signatures") to ignore (most) character misplacement flaws sprinkled through the compared terms. For example, the fuzzy signatures (bit strings) for `vectorization' and for `Fourier transform' should allow them to be properly distinguished, but the signatures for `vectorization' and for `vekterisation' should allow their dominant syntactic similarities to yield an approximate match that would otherwise be missed.
Second, good systems must also allow users to adjust at search time the coarseness with which these abstractions (signatures) are themselves compared. User control of the signature fuzziness is impractical (database signatures are generated in advance). But user control of the fuzziness with which signatures are compared is both possible and desirable. Different fuzzy-comparison levels (different percentages of overlapping 1 (one) bits) will be appropriate in different verbal contexts. When there are few near terms, for example, letting `vekterisation' match `factorization' as well as `vectorization' may be acceptable. When the term space is more crowded with interesting near neighbors to `vectorization' this matching may be much too coarse, yielding too many irrelevant hits.
There is more than one way to construct plausible syntactic approximations (fuzzy signatures) of search terms and database descriptors. We have tried several approaches in different prototype systems, sometimes noting unexpected side effects (such as varying sensitivity to term length, or to length differences between compared terms). This is not the place to defend one specific, signature-generating approximation algorithm over others, but rather to stress the value they all add to mere stemming and filters as worthwhile retrieval-enhancing techniques.
Highly relevant descriptors (and hence related answer passages) sometimes occur in families most members of which would be missed by a searcher who only retrieved exact matches. For example, in the Boeing Computer System extended mathematics (subroutine) library are many subroutine families with names such as
hssqft
hdsqft
hcsqft
hzsqft
which solve sparse linear equations using (respectively) single precision,
double precision, complex, and double complex arithmetic.
Theoretically, filtering could find all of these terms if a user happened
to request h*sqft, but all other filter variations would fail because of
how the family names are constructed. Likewise, no natural language
stems are present from which to trim prefixes or suffixes automatically.
So here again practical retrieval enhancement (finding all four hits by
name) results from, but only from, deploying a sophisticated fuzzy
matching technique of the kind described in the previous section (along
with the usual semantic search mechanisms).
Even with extensive aliasing, some serendipity hits present themselves only after aggressive fuzzy matching. For example, using our prototype system a search for "foreign countries" yielded a hit on "sensitive countries" (the descriptor for the relevant list of countries) even at the default level of approximation coarseness. These beneficial syntactic surprises are another, admittedly less predictable, way in which fuzzy matching enhances standard searches.
The primary problem with fuzzy-match search systems is that their very strength--an increased number of matches--causes side-effect weaknesses. Without compensatory measures, for example, they do not scale well; they can run very slowly on large text databases.
Although a hash function is used to generate the signatures that enable fuzzy matching, the comparison of a search-term signature with database-descriptor signatures cannot be accelerated by using a hash table. For exact-match searches, one can use a hash function to compute a (bucket) address for storing each database term, then locate the matching bucket quickly at search time by applying the same hash function to the user's search term. The speed of this approach is independent of the number of terms searched, and varies only with the hash function, bucket size, and overflow handling.
Fuzzy matching, however, requires finding all the terms that "almost" match the search term (over a specified range of approximation). Such syntactically varied terms may not all fall into the same hash bucket, or even into adjacent buckets. So comparison with every (database-descriptor) signature must occur, a process whose time increases along with the number of signatures. Unless this comparison is accelerated, the resulting search may grow too slow for practical use, a problem encountered in our earlier prototypes (Girill, Griffin, and Jones, 1991).
Signature trees provide one possible solution. "A signature tree is formed by grouping signatures and then superimposing (ORing) them to form a new `super'signature at the next higher level in the tree" (Kotamari and Tharp, 1990, p. 80). Matching a search-term signature to a high level in such a tree is a necessary condition for matching it to any lower levels, so irrelevant branches can be quickly pruned from further searching. This scheme for limiting additional signature comparisons to only those branches that have high-level matches depends for its practical value, however, on a low degree of signature saturation (filling with 1s). The hash function, signature length, and signature grouping all affect how quickly a tree level saturates, and a level whose (super)signatures are (almost) all 1s provides (little or) no help in pruning the tree.
A related response to slow fuzzy-match searches that avoids this saturation problem is partitioning of the database signatures (without a hierarchy). One can group the text signatures, for example, by the number of bits turned on (1s bits), and then compute the number of bits turned on in the search-term signature. Exact matches fall only in the group with the same number of bits turned on (n) as the search term; increasingly coarse approximate matches lie in the groups with n+/-1 turned on, n+/-2 turned on, etc. The amount of search acceleration this provides depends on how uniformly the signatures spread among the bits-on groups. Our prototype experimented with variations on this approach to accelerate fuzzy searches (Ho, 1995).
Finally, on machines with more than one CPU, multitasking can help compensate for search slowness. Our database of descriptors drew from several independent sources. So we searched the signature sets for each source in parallel, as separate tasks running simultaneously. The search software then merged the parallel results for sorted presentation to the user (see the interface comments below).
Precision is the percentage of search hits that are relevant to the search term. Irrelevant hits were retrieved in error. A recent trend in user service admits the inevitability of such errors and stresses error recovery instead of error avoidance, with a focus on guides, prompts, and help messages that "aim to avoid the consequences of errors, rather than the errors themselves" (Oatey, 1994, p. 7) This is certainly appropriate for fuzzy-match systems, because the very fuzziness that overcomes syntactical flaws in the input introduces some syntactically close but irrelevant hits into the output set. The more coarse the approximations used to build signatures of the database terms in advance, and the more coarse the comparison of search-term and database signatures at search time, the lower is the precision of the fuzzy searches that result.
One way to compensate for this imprecision is to let users adjust the fuzziness or coarseness of the signature comparisons each time they supply a search term. User-specifiable match fuzziness gives a sense of control that psychologically balances the extra, unwanted terms retrieved by searching with approximations. It also lets users adapt their search precision to the perceived density of the search space. The (dense) math-library case (cited above) probably calls for modest match fuzziness, for instance, while for other terms in a sparser space (e.g., "disk space preallocation") much higher match-fuzziness levels with much lower precision are more appropriate.
As Hildreth notes, for the user interface of any information-retrieval system to be adequate, it must "facilitate the semantically demanding cognitive tasks of user comprehension [of hits] and decision making [among them]" (Hildreth, 1995, p. 7). Interpreting the results of a fuzzy search can be confusing, especially for those unfamiliar with how the approximation(s) to their search term were built. Hence, offering fuzzy matching in the context of an astute user interface that combines it with known, strong interpretation-support features is vital for user acceptance.
Our prototype interface, for instance, signals the waiting user that their search is multitasking, then ranks the merged hits in relevance order from exact matches on aliased terms through exact matches on nonaliased terms and into increasingly distant approximations. The degree of approximation and the source document name appear with each listed hit. And the length of the whole hit list is controllable by the user (from just the best few to many hundreds if found and wanted). These task-related cues and adjustability show again how fuzzy matching, deployed jointly with other techniques, yields beneficial results not possible by using those techniques alone.
Fuzzy matching of user search terms is a very helpful retrieval- enabling technique. It allows other techniques to perform better because if implemented astutely, using an approximate-signature approach, it overcomes the most common syntactical problems that otherwise thwart retrieval. And, if deployed jointly with search-space partitioning, user-adjustable approximation levels, and a supportive interface, fuzzy matching's potential problems can themselves be avoided.
This work was performed in part under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract W-7405-Eng-48. We also acknowledge the thoughtful collaboration of Lai Fong Ho on parts of this project.