ASIS MID-YEAR MEETING          18-22 May 1996           Marriott Mission Valley Hotel, San Diego, CA

Return to program


Fuzzy Matching as a Retrieval-Enabling Technique for Digital Libraries

T. R. Girill
Lawrence Livermore National Laboratory
P.O. Box 808, L-560
Livermore, CA 94550 USA
trg@llnl.gov

Clement H. Luk
Department of Computer Science
California State University, Chico
Chico, CA 95929-0410 USA
luk@ecst.csuchico.edu

Abstract

This paper advocates an often-neglected search-support technique, approximate or "fuzzy" matching of user search terms. When properly deployed, fuzzy matching can significantly enhance the benefits of other, more common approaches to end-user answer retrieval from online reference collections. We compare crude with more sophisticated approximation techniques to explain how astute fuzzy-match software can convert many different near-miss situations (such as those involving faulty prefixes or suffixes, character misplacement, nonstandard word stems, or unanticipated redescription of concepts) into more adequate results. We also suggest practical ways to overcome fuzzy matching's own major drawbacks (namely, problems with search speed, search imprecision, and misinterpretation of search results). The resulting analysis clarifies how to deploy fuzzy matching for maximum effectiveness. We conclude that appropriate fuzzy matching enables more frequent, more flexible search success than do ordinary retrieval-improvement techniques used without it.

Introduction

Online documentation spreads along a spectrum from very focused help passages deeply integrated into a particular application (Duffy, Palmer, and Mehlenbacher, 1992) to large, general, digital libraries of reference material potentially relevant to answering many questions in many contexts (Girill and Tull, 1988). At the latter end of the spectrum, effective search tools are vital if users--both casual end users and busy help-desk consultants--are to reliably yet easily locate the answer passages they need (Girill, Griffin, and Jones, 1991). We have a working prototype search interface for online documentation that incorporates fuzzy matching (syntactic approximation) of search terms. Its behavior convinces us that this feature enhances the searching of large reference databases, even though this application has not been a primary focus of the literature on fuzzy matching techniques in the past.

Recent publications on syntactic approximation or fuzzy matching most often emphasize office-automation situations and the automatic indexing of business papers, legal briefs, or other office memos (e.g., Kent, Sacks-Davis, and Ramamohanarao, 1990; Faloutsos and Christodoulakis, 1987). This is the context of document preparation (or database construction) rather than the context of answer search. And the implied goal of fuzzy matching tools here is creating text collections without attentive human intervention; i.e., replacing human with artificial intelligence to minimize processing time or reduce collection-development costs. Our work suggests, on the other hand, that fuzzy matching can be applied jointly with semantic techniques (such as extensive aliasing) to improve performance during the search process itself, not just during presearch text preparation. We believe this search-time approach is best viewed as amplifying the native intelligence of the human searcher (by promoting more flexible pattern recognition) rather than replacing the human role (Wegner, 1983).

Anthropologist Bonnie Nardi, after a study of how reference librarians add value to the information gathering that occurs among professionals at work, proposed that "electronic search agents may be usefully modeled on reference librarians" (Nardi, 1994; Nardi, 1993). The sections below show how astute inclusion of fuzzy matching in a search interface to an online reference collection pursues this anthropological hint more thoroughly than do interfaces without it. Enriching a literal search by including a tolerance for imprecise input, through the use of appropriate approximations, is one way every seasoned reference librarian intuitively adds value to the search process.

Fuzzy-match software captures that professional intuition in algorithms that let even unwitting end users add (some of) the same value to every search they run. Of course, trivial approximation yields trivial improvement (discussed below), and approximation alone is insufficient for search success. But we believe a closer look at the problems it solves supports our conclusion that fuzzy matching at search time is a key enabling technique that maximizes the value to end users of the reference collection searched and of the usual semantic techniques jointly applied to search that collection.

The next section examines four typical failures of literal-only search systems and shows how fuzzy matching can overcome each one. We then look at the potential problems fuzzy search systems encounter and discuss the practical solutions to them that our prototype has suggested. The resulting analysis reveals the practical value of fuzzy-match searching as a user services tool.

Characteristic Cases

Faulty Prefixes or Suffixes

Omitting or misstating prefixes, suffixes, or plurals is the most simple case of a syntactic flaw that prevents the otherwise successful match of a user's search term and a database descriptor. The many common and meaningful variations on the stem COMPIL, for example, such as


          [pre]compil|e
                     |er
                     |ers
                     |ing
                     |ation

show why this pitfall is so prevalent.

Here, the easiest approximate-match response is also well known: stemming or filters ("wildcards"). This most basic fuzzy matching technique appears in two forms.

(A) User Activated: allow the user when constructing the search to overtly specify with special filter characters how to truncate the front, tail, or other parts of the search term to increase the number of hits (mostly by eliminating prefixes or suffixes where unintended variations are greatest). The UNIX metacharacters are one simple example (e.g., Sobell, 1995, pp. 108,435).

(B) System Activated: use automatic stemming rules or a built-in dictionary to detect the stem of search terms and database descriptors before comparing them for matches. This form works whether or not users remember to invoke it, but they consequently have less control over the search. The WAIS stemming algorithms are one simple example (Liu, et al., 1994, pp. 111, 115, 116).

Although stemming and filtering address well the specific problem of mismatches caused just by faulty prefixes or suffixes, they are inherently limited as fuzzy matching techniques. Each can generate only proper substrings of the original search term. Syntactic flaws lurking within such substrings can still prevent otherwise desired matches unless they are overcome by additional, more subtle approximation techniques.

Character Misplacement

Missing, extra, transposed, or erroneous characters, especially inside the stem of a search term, pose a more challenging approximation problem. Examples faced by our prototype system include


          v e k t e r i s a t i o n
          F o u r i e r  t r n a s f o r n
          f u o r e i r  t r a n s f o r m
          f o r i e a   t r a n s f r m
          l e g e n d e r  p o l i n o m i a l

As these examples show, multiple character misplacements often occur in the same search term. Simple mistyping, spelling errors, faulty memory of unfamiliar technical terms, and input from nonnative speakers all contribute to these syntactic problems, sometimes simultaneously. These problems are frequent: 80% of user syntactical errors involve character misplacement (Ho, 1995). But filters and autostemming seldom suffice to overcome such misplacements because the flaws are usually not localized in the word parts discarded or ignored by those two techniques alone.

Getting practical benefit from fuzzy matching thus demands more elaborate, more comprehensive approximation techniques than the two most common ones. First, we need to compare not literal terms but rather abstract representations ("signatures") of search terms and database descriptors. These representations should be specific enough to preserve important semantic similarities yet coarse enough ("fuzzy signatures") to ignore (most) character misplacement flaws sprinkled through the compared terms. For example, the fuzzy signatures (bit strings) for `vectorization' and for `Fourier transform' should allow them to be properly distinguished, but the signatures for `vectorization' and for `vekterisation' should allow their dominant syntactic similarities to yield an approximate match that would otherwise be missed.

Second, good systems must also allow users to adjust at search time the coarseness with which these abstractions (signatures) are themselves compared. User control of the signature fuzziness is impractical (database signatures are generated in advance). But user control of the fuzziness with which signatures are compared is both possible and desirable. Different fuzzy-comparison levels (different percentages of overlapping 1 (one) bits) will be appropriate in different verbal contexts. When there are few near terms, for example, letting `vekterisation' match `factorization' as well as `vectorization' may be acceptable. When the term space is more crowded with interesting near neighbors to `vectorization' this matching may be much too coarse, yielding too many irrelevant hits.

There is more than one way to construct plausible syntactic approximations (fuzzy signatures) of search terms and database descriptors. We have tried several approaches in different prototype systems, sometimes noting unexpected side effects (such as varying sensitivity to term length, or to length differences between compared terms). This is not the place to defend one specific, signature-generating approximation algorithm over others, but rather to stress the value they all add to mere stemming and filters as worthwhile retrieval-enhancing techniques.

Nonstandard Word Stems

Highly relevant descriptors (and hence related answer passages) sometimes occur in families most members of which would be missed by a searcher who only retrieved exact matches. For example, in the Boeing Computer System extended mathematics (subroutine) library are many subroutine families with names such as


          hssqft
          hdsqft
          hcsqft
          hzsqft

which solve sparse linear equations using (respectively) single precision, double precision, complex, and double complex arithmetic. Theoretically, filtering could find all of these terms if a user happened to request h*sqft, but all other filter variations would fail because of how the family names are constructed. Likewise, no natural language stems are present from which to trim prefixes or suffixes automatically. So here again practical retrieval enhancement (finding all four hits by name) results from, but only from, deploying a sophisticated fuzzy matching technique of the kind described in the previous section (along with the usual semantic search mechanisms).

Unanticipated Redescription

Even with extensive aliasing, some serendipity hits present themselves only after aggressive fuzzy matching. For example, using our prototype system a search for "foreign countries" yielded a hit on "sensitive countries" (the descriptor for the relevant list of countries) even at the default level of approximation coarseness. These beneficial syntactic surprises are another, admittedly less predictable, way in which fuzzy matching enhances standard searches.

Potential Problems

Search Acceleration

The primary problem with fuzzy-match search systems is that their very strength--an increased number of matches--causes side-effect weaknesses. Without compensatory measures, for example, they do not scale well; they can run very slowly on large text databases.

Although a hash function is used to generate the signatures that enable fuzzy matching, the comparison of a search-term signature with database-descriptor signatures cannot be accelerated by using a hash table. For exact-match searches, one can use a hash function to compute a (bucket) address for storing each database term, then locate the matching bucket quickly at search time by applying the same hash function to the user's search term. The speed of this approach is independent of the number of terms searched, and varies only with the hash function, bucket size, and overflow handling.

Fuzzy matching, however, requires finding all the terms that "almost" match the search term (over a specified range of approximation). Such syntactically varied terms may not all fall into the same hash bucket, or even into adjacent buckets. So comparison with every (database-descriptor) signature must occur, a process whose time increases along with the number of signatures. Unless this comparison is accelerated, the resulting search may grow too slow for practical use, a problem encountered in our earlier prototypes (Girill, Griffin, and Jones, 1991).

Signature trees provide one possible solution. "A signature tree is formed by grouping signatures and then superimposing (ORing) them to form a new `super'signature at the next higher level in the tree" (Kotamari and Tharp, 1990, p. 80). Matching a search-term signature to a high level in such a tree is a necessary condition for matching it to any lower levels, so irrelevant branches can be quickly pruned from further searching. This scheme for limiting additional signature comparisons to only those branches that have high-level matches depends for its practical value, however, on a low degree of signature saturation (filling with 1s). The hash function, signature length, and signature grouping all affect how quickly a tree level saturates, and a level whose (super)signatures are (almost) all 1s provides (little or) no help in pruning the tree.

A related response to slow fuzzy-match searches that avoids this saturation problem is partitioning of the database signatures (without a hierarchy). One can group the text signatures, for example, by the number of bits turned on (1s bits), and then compute the number of bits turned on in the search-term signature. Exact matches fall only in the group with the same number of bits turned on (n) as the search term; increasingly coarse approximate matches lie in the groups with n+/-1 turned on, n+/-2 turned on, etc. The amount of search acceleration this provides depends on how uniformly the signatures spread among the bits-on groups. Our prototype experimented with variations on this approach to accelerate fuzzy searches (Ho, 1995).

Finally, on machines with more than one CPU, multitasking can help compensate for search slowness. Our database of descriptors drew from several independent sources. So we searched the signature sets for each source in parallel, as separate tasks running simultaneously. The search software then merged the parallel results for sorted presentation to the user (see the interface comments below).

Search Precision

Precision is the percentage of search hits that are relevant to the search term. Irrelevant hits were retrieved in error. A recent trend in user service admits the inevitability of such errors and stresses error recovery instead of error avoidance, with a focus on guides, prompts, and help messages that "aim to avoid the consequences of errors, rather than the errors themselves" (Oatey, 1994, p. 7) This is certainly appropriate for fuzzy-match systems, because the very fuzziness that overcomes syntactical flaws in the input introduces some syntactically close but irrelevant hits into the output set. The more coarse the approximations used to build signatures of the database terms in advance, and the more coarse the comparison of search-term and database signatures at search time, the lower is the precision of the fuzzy searches that result.

One way to compensate for this imprecision is to let users adjust the fuzziness or coarseness of the signature comparisons each time they supply a search term. User-specifiable match fuzziness gives a sense of control that psychologically balances the extra, unwanted terms retrieved by searching with approximations. It also lets users adapt their search precision to the perceived density of the search space. The (dense) math-library case (cited above) probably calls for modest match fuzziness, for instance, while for other terms in a sparser space (e.g., "disk space preallocation") much higher match-fuzziness levels with much lower precision are more appropriate.

Result Interpretation

As Hildreth notes, for the user interface of any information-retrieval system to be adequate, it must "facilitate the semantically demanding cognitive tasks of user comprehension [of hits] and decision making [among them]" (Hildreth, 1995, p. 7). Interpreting the results of a fuzzy search can be confusing, especially for those unfamiliar with how the approximation(s) to their search term were built. Hence, offering fuzzy matching in the context of an astute user interface that combines it with known, strong interpretation-support features is vital for user acceptance.

Our prototype interface, for instance, signals the waiting user that their search is multitasking, then ranks the merged hits in relevance order from exact matches on aliased terms through exact matches on nonaliased terms and into increasingly distant approximations. The degree of approximation and the source document name appear with each listed hit. And the length of the whole hit list is controllable by the user (from just the best few to many hundreds if found and wanted). These task-related cues and adjustability show again how fuzzy matching, deployed jointly with other techniques, yields beneficial results not possible by using those techniques alone.

Conclusions

Fuzzy matching of user search terms is a very helpful retrieval- enabling technique. It allows other techniques to perform better because if implemented astutely, using an approximate-signature approach, it overcomes the most common syntactical problems that otherwise thwart retrieval. And, if deployed jointly with search-space partitioning, user-adjustable approximation levels, and a supportive interface, fuzzy matching's potential problems can themselves be avoided.


Acknowledgments

This work was performed in part under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract W-7405-Eng-48. We also acknowledge the thoughtful collaboration of Lai Fong Ho on parts of this project.

References

Duffy, T. M, Palmer, J. E., and Mehlenbacher, B. (1992).
Online help: design and evaluation. Norwood, NJ: Ablex Publishing. 260 pp.
Faloutsos, C. and Christodoulakis, S. (1987).
Description and performance analysis of signature file methods for office filing. ACM transactions on office information systems, 5 (3): 237-257.
Girill, T. R. and Tull, C. G. (1988).
Comparative access rates for online documentation and human consulting. In Proceedings of the ASIS 51st Annual Meeting (pp. 48-53). Medford, NJ: Learned Information, Inc.
Girill, T. R., Griffin, T., and Jones, R. B. (1991).
Extended subject access to hypertext online documentation. Journal of the american society for information science, 42 (6); 414-426.
Hildreth, C. (1995).
The GUI OPAC: approach with caution. Public-access computer systems review, 6 (5): 6-18.
Ho, Lai-Fong. (1995).
A fuzzy string matching technique for text retrieval. M.S. thesis, California State University, Chico. 68 pp.
Kent, A., Sacks-Davis, R., and Ramamohanarao, K. (1990).
A signature file scheme based on multiple organizations for indexing very large text databases. Journal of the american society for information science, 41 (7); 508-534.
Kotamarti, U. and Tharp. A. L. (1990).
Accelerating text searching through signature trees. Journal of the american society for information science, 41 (2): 79-86.
Liu, C. Peek, J., Jones, R., Buus, B., and Nye, A. (1994).
Managing internet information services. Sebastopol, CA: O'Reilly and Associates. 630 pp.
Nardi, B. (1993).
A small matter of programming: perspectives on end user computing. Cambridge, MA: MIT Press. 162 pp.
Nardi, B. (1994).
Toward a diverse information ecology. Tenth Annual Apple Library Users Group Meeting (June 28, 1994). Miami, FL.
Oatey, M. (1994).
Trial and error versus instructions. Communicator, 4 (February), 7-8.
Sobell, M. G. (1995).
Hands-on unix: a practical guide. Redwood City, CA: Benjamin/Cummings Publishing. 832 pp.
Wegner, P. (1983).
Paradigms of information engineering. In F. Machlup and U. Mansfield (Eds.), The study of information: interdisciplinary messages (pp. 163-176). New York: John Wiley.