Retrieval Evaluation
•1. Motivation
•2. Precision and recall
•3. Single-value measures
•4. Reference collections
•Most systems are evaluated on the basis of their time and space performance.
•For example, in the case of database management systems:
–Time: How long does it take to answer a query (response time)?
–Space: How much storage is required for index structures, etc.
•In an information retrieval system, where there is no guarantee that answers satisfy the requests as intended by the user, we must also consider:
–Retrieval performance: how good is the answer.




•Precision
–The ability to retrieve top-ranked documents that are mostly relevant.
•Recall
–The ability of the search to find all of the relevant items in the corpus.
•Total number of relevant items is sometimes not available:
–Sample across the database and perform relevance judgment on these items.
–Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total relevant set.
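For concreteness, here is a minimal sketch of the two measures over sets of document IDs; the function and variable names are illustrative, not from any particular IR library:

```python
# Precision and recall for a single query, given the retrieved answer and
# the set of documents judged relevant (illustrative sketch).

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    return len(retrieved & set(relevant)) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0

# Example: 10 documents retrieved, 4 of them relevant, 6 relevant in total.
retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
relevant = {"d1", "d3", "d5", "d9", "d11", "d12"}
print(precision(retrieved, relevant))  # 0.4
print(recall(retrieved, relevant))     # 0.666...
```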
Effect of enlarging/reducing answers on precision and recall:
•Enlarge answer:
–Recall: Can only improve (denominator remains constant).
–Precision: Unpredictable; will improve (worsen) if the precision in the documents added is better (worse) than the current precision.
•Reduce answer:
–Recall: Can only worsen (denominator remains constant).
–Precision: Unpredictable; will improve (worsen) if the precision in the documents removed is worse (better) than the current precision.
•When documents are ranked and are added or removed in rank order, the effect becomes predictable: with an effective ranking, the added documents come from lower ranks, so precision will typically worsen when documents are added and improve when they are removed (a numeric sketch follows below).
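A small numeric sketch of the enlarge case, with hypothetical counts: recall can only rise because its denominator is the fixed number of relevant documents, while precision rises or falls depending on how good the added batch is.

```python
TOTAL_RELEVANT = 20  # hypothetical number of relevant documents in the collection

def precision_recall(relevant_in_answer, answer_size):
    return relevant_in_answer / answer_size, relevant_in_answer / TOTAL_RELEVANT

# Current answer: 10 documents, 4 of them relevant.
print(precision_recall(4, 10))       # (0.4, 0.2)

# Enlarge with 10 documents of which 6 are relevant (batch precision 0.6 > 0.4):
print(precision_recall(4 + 6, 20))   # (0.5, 0.5) -> precision and recall both improve

# Enlarge instead with 10 documents of which 1 is relevant (batch precision 0.1 < 0.4):
print(precision_recall(4 + 1, 20))   # (0.25, 0.25) -> precision worsens, recall still improves
```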
Optimization options:
•Optimize recall => large answers, lower precision.
–Tuning a system to optimize recall would normally result in larger answers with decreased precision.
–Extreme case: Retrieve the entire collection.
•Optimize precision => small answers, lower recall.
–Tuning a system to optimize precision would normally result in smaller answers with decreased recall.
–Extreme case: Retrieve only one or two items.
Estimating the precision and recall of a given answer:
•Precision: Easier to estimate.
–An expert scans the answer and determines which documents are relevant to the query.
•Recall: Harder to estimate.
–In a small collection, an expert may be able to scan the entire collection to determine the complete set of relevant documents.
–In a large collection (e.g., the WWW), the complete set of relevant documents might never be known; it can be estimated by using a variety of systems (e.g., search engines) and deciding that a document is relevant if it is included in a majority of the answers.
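A sketch of that majority-vote estimate, assuming each system's answer is available as a set of document IDs (all names are illustrative):

```python
from collections import Counter

def majority_relevant(answers):
    """answers: one set of retrieved document IDs per system.
    A document is treated as relevant if it appears in a strict majority
    of the answers; the result can serve as the recall denominator."""
    counts = Counter()
    for answer in answers:
        counts.update(answer)
    return {doc for doc, c in counts.items() if c > len(answers) / 2}

# Three hypothetical search engines answering the same query:
answers = [{"d1", "d2", "d3"}, {"d1", "d3", "d4"}, {"d1", "d5"}]
print(majority_relevant(answers))  # {'d1', 'd3'} -- each appears in at least 2 of 3 answers
```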





Interpolating a Recall/Precision Curve
•Interpolate a precision value for each standard recall level:
–rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
–r0 = 0.0, r1 = 0.1, …, r10 = 1.0
•The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and the (j+1)-th level:
–P(rj) = max { P(r) : rj ≤ r ≤ rj+1 }
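In practice this is often computed with the "ceiling" form of the rule, taking the maximum observed precision at any recall at or above each standard level; for typical, roughly monotone runs this coincides with the interval definition above. A sketch over observed (recall, precision) points for one query:

```python
def interpolate_11pt(points):
    """points: observed (recall, precision) pairs for one query.
    Returns the interpolated precision at the 11 standard recall levels,
    using the max precision at any observed recall >= the standard level."""
    levels = [j / 10 for j in range(11)]  # 0.0, 0.1, ..., 1.0
    result = []
    for r_j in levels:
        candidates = [p for r, p in points if r >= r_j]
        result.append(max(candidates) if candidates else 0.0)
    return result

# Hypothetical run: relevant documents found at recall 0.1 through 0.5 only.
points = [(0.1, 1.0), (0.2, 0.67), (0.3, 0.5), (0.4, 0.4), (0.5, 0.33)]
print(interpolate_11pt(points))
# [1.0, 1.0, 0.67, 0.5, 0.4, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0]
```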

•Average Precision at Seen Relevant Documents:
–Measure the precision each time a new relevant document is retrieved.
–Average these precision values.
–In the previous example, 5 relevant documents were retrieved overall, hence 5 precision values are averaged: (1 + 0.67 + 0.3 + 0.36 + 0.33)/5 = 0.53 (see the code sketch after this list).
•R-precision:
–Let r be the total number of relevant documents (i.e., r is the size of the ideal answer).
–Consider the r top-ranked documents.
–The R-precision measure is defined as the precision of this set.
–In the ideal case, all these documents will be relevant and P = R = 1.
–In the previous example, r = 10 and there were 3 relevant documents among the top 10, hence R-precision is 3/10 = 0.3.
•Typically average performance over a large set of queries.
•Compute average precision at each standard recall level across all queries.
•Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
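The sketch below computes the two single-value measures for one ranked answer and averages a measure over several queries. The ranking uses hypothetical document IDs chosen to be consistent with the figures above (10 relevant documents in total, 5 of them retrieved at ranks 1, 3, 10, 11 and 15); all names are illustrative.

```python
def avg_precision_at_seen_relevant(ranking, relevant):
    """Precision measured each time a new relevant document appears in the
    ranked list, averaged over the relevant documents actually retrieved."""
    precisions, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranking, relevant):
    """Precision within the top-r ranked documents, where r = |relevant|."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r if r else 0.0

def average_over_queries(measure, runs):
    """runs: (ranking, relevant) pairs, one per query."""
    return sum(measure(rk, rel) for rk, rel in runs) / len(runs)

# Hypothetical ranking: relevant documents r1..r5 at ranks 1, 3, 10, 11, 15;
# five further relevant documents m1..m5 are never retrieved.
ranking = ["r1", "n1", "r2", "n2", "n3", "n4", "n5", "n6", "n7",
           "r3", "r4", "n8", "n9", "n10", "r5"]
relevant = {"r1", "r2", "r3", "r4", "r5", "m1", "m2", "m3", "m4", "m5"}
print(round(avg_precision_at_seen_relevant(ranking, relevant), 2))  # 0.53
print(r_precision(ranking, relevant))                               # 0.3
```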





Subjective Relevance Measure
•Novelty Ratio: The proportion of items retrieved and judged relevant by the user and of which they were previously unaware.
–Ability to find new information on a topic.
•Coverage Ratio: The proportion of relevant items retrieved out of the total relevant documents known to a user prior to the search.
–Relevant when the user wants to locate documents which they have seen before (e.g., the budget report for Year 2000).
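Both ratios are simple set computations once the user's judgments are known; a minimal sketch with made-up counts:

```python
def novelty_ratio(retrieved_relevant, previously_known):
    """Fraction of the retrieved-and-relevant items that were new to the user."""
    if not retrieved_relevant:
        return 0.0
    return len(retrieved_relevant - previously_known) / len(retrieved_relevant)

def coverage_ratio(retrieved_relevant, previously_known):
    """Fraction of the relevant items already known to the user that the search found."""
    if not previously_known:
        return 0.0
    return len(retrieved_relevant & previously_known) / len(previously_known)

# Hypothetical query: 8 retrieved items judged relevant, 5 of which the user
# already knew; the user knew 6 relevant items in total before searching.
retrieved_relevant = {"a", "b", "c", "d", "e", "f", "g", "h"}
previously_known = {"a", "b", "c", "d", "e", "k"}
print(novelty_ratio(retrieved_relevant, previously_known))   # 0.375
print(coverage_ratio(retrieved_relevant, previously_known))  # 0.833...
```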
Other Factors to Consider
•User effort: Work required from the user in formulating queries, conducting the search, and screening the output.
•Response time: Time interval between receipt of a user query and the presentation of system responses.
•Form of presentation: Influence of search output format on the user’s ability to utilize the retrieved materials.
•Collection coverage: Extent to which any/all relevant items are included in the document corpus.
Experimental Setup for Benchmarking
•Analytical performance evaluation is difficult for document retrieval systems because many characteristics such as relevance, distribution of words, etc., are difficult to describe with mathematical precision.
•Performance is measured by benchmarking. That is, the retrieval effectiveness of a system is evaluated on a given set of documents, queries, and relevance judgments.
•Performance data is valid only for the environment under which the system is evaluated.
[Figure: BENCHMARK, i.e., a fixed set of documents, a set of queries, and the corresponding relevance judgments]
Benchmarking - The Problems
•Performance data is valid only for a particular benchmark.
•Building a benchmark corpus is a difficult task.
•Benchmark web corpora are just starting to be developed.
•Benchmark foreign-language corpora are just starting to be developed.
Early Test Collections
•Previous experiments were based on the SMART collection, which is fairly small. (ftp://ftp.cs.cornell.edu/pub/smart)
Collection Name   Number of Documents   Number of Queries   Raw Size (MB)
CACM                            3,204                  64             1.5
CISI                            1,460                 112             1.3
CRAN                            1,400                 225             1.6
MED                             1,033                  30             1.1
TIME                              425                  83             1.5
•Different researchers used different test collections and evaluation techniques.
The TREC Benchmark
• TREC: Text REtrieval Conference (http://trec.nist.gov/). Originated from the TIPSTER program sponsored by the Defense Advanced Research Projects Agency (DARPA).
• Became an annual conference in 1992, co-sponsored by the National Institute of Standards and Technology (NIST) and DARPA.
• Participants are given parts of a standard set of documents and TOPICS (from which queries have to be derived) in different stages for training and testing.
• Participants submit the P/R values for the final document and query corpus and present their results at the conference.
The TREC Objectives
• Provide a common ground for comparing different IR techniques.
–Same set of documents and queries, and same evaluation method.
• Sharing of resources and experiences in developing the benchmark.
–With major sponsorship from government to develop large benchmark collections.
• Encourage participation from industry and academia.
• Development of new evaluation techniques, particularly for new applications.
–Retrieval, routing/filtering, non-English collection, web-based collection, question answering.
TREC Advantages
•Large scale (compared to a few MB in the SMART Collection).
•Relevance judgments provided.
•Under continuous development with support from the U.S. Government.
•Wide participation:
–TREC 1: 28 papers, 360 pages.
–TREC 4: 37 papers, 560 pages.
–TREC 7: 61 papers, 600 pages.
–TREC 8: 74 papers.
TREC Tasks
•Ad hoc: New questions are being asked on a static set of data.
•Routing: Same questions are being asked, but new information is being searched. (news clipping, library profiling).
•New tasks added after TREC 5 - Interactive, multilingual, natural language, multiple database merging, filtering, very large corpus (20 GB, 7.5 million documents), question answering.
Characteristics of the TREC Collection
•Both long and short documents (from a few hundred to over one thousand unique terms in a document).
•Test documents consist of:
WSJ    Wall Street Journal articles (1986-1992)         550 MB
AP     Associated Press Newswire (1989)                 514 MB
ZIFF   Computer Select Disks (Ziff-Davis Publishing)    493 MB
FR     Federal Register                                 469 MB
DOE    Abstracts from Department of Energy reports      190 MB
More Details on Document Collections
•Volume 1 (Mar 1994) - Wall Street Journal (1987, 1988, 1989), Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and Information from the Computer Select disks (1989, 1990)
•Volume 2 (Mar 1994) - Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988) and Information from the Computer Select disks (1989, 1990)
•Volume 3 (Mar 1994) - San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and Information from the Computer Select disks (1991, 1992)
•Volume 4 (May 1996) - Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
•Volume 5 (Apr 1997) - Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).