Saturday, September 20, 2008

Information Retrieval: Query Language

•Keyword-based querying
–Basic queries
–Boolean queries
–Weighted queries
•Queries in weighted systems
•Pattern matching
•Natural language
•Structured queries
•Query protocols


Keyword-based Querying

•Queries are combinations of words
•The document collection is searched for documents that contain these words
•Word queries are intuitive, easy to express and provide fast ranking
•The concept of a word must be defined:
–A word is a sequence of letters terminated by a separator (period, comma, blank, etc.)
–Definition of letter and separator is flexible; e.g.: hyphen could be defined as a letter or as a separator
–Usually, “trivial words” (such as “a”, “the”, “or”, “of”) are ignored
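The word definition above can be sketched in a few lines of Python. This is only an illustration: the name `tokenize`, the `hyphen_is_letter` switch, the lowercasing step and the four-word `STOP_WORDS` list are assumptions made for the example, not part of any particular retrieval system.

```python
import re

# Assumed stop-word list, taken from the "trivial words" named above.
STOP_WORDS = {"a", "the", "or", "of"}

def tokenize(text, hyphen_is_letter=False):
    """Split text into lowercase words and drop trivial words."""
    # A "letter" is a-z, plus the hyphen when the caller asks for it;
    # everything else acts as a separator.
    pattern = r"[a-z-]+" if hyphen_is_letter else r"[a-z]+"
    words = re.findall(pattern, text.lower())
    # Discard tokens made only of hyphens, and the trivial words.
    return [w for w in words if w.strip("-") and w not in STOP_WORDS]

print(tokenize("The state-of-the-art retrieval."))
print(tokenize("The state-of-the-art retrieval.", hyphen_is_letter=True))
```

Treating the hyphen as a separator splits "state-of-the-art" into parts (and the trivial words among them are dropped), while treating it as a letter keeps the compound as one word.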


Basic Queries

Single-word queries: A query is a single word
•Simplest form of query
•All documents that include this word are retrieved
•Documents may be ranked by the frequency of this word in the document.
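A minimal sketch of a single-word query with frequency ranking, assuming the collection is a small in-memory dictionary of document id to text. The function name `single_word_query` and the sample documents are hypothetical, and words are split on whitespace only to keep the example short.

```python
from collections import Counter

def single_word_query(word, documents):
    """Return (doc_id, frequency) pairs for documents containing word,
    ordered from most to least frequent."""
    hits = []
    for doc_id, text in documents.items():
        count = Counter(text.lower().split())[word.lower()]
        if count > 0:
            hits.append((doc_id, count))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

docs = {
    "d1": "ring ring ring",
    "d2": "the ring of power",
    "d3": "no match here",
}
print(single_word_query("ring", docs))  # [('d1', 3), ('d2', 1)]
```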

Phrase queries: A query is a sequence of words treated as a
single unit. It is also called “literal string” or “exact phrase”
query.
•Phrase is usually surrounded by quotation marks.
•All documents that include this phrase are retrieved
•Usually separators (commas, colons, etc.) and “trivial words” (e.g. “a”, “the” or “of”) in the phrase are ignored
•In effect, this query is for a set of words that must appear in sequence.
•Allows users to specify a context and thus gain precision
•Example: “The Lord of The Rings”
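The phrase-matching idea above (ignore separators and trivial words, then require the remaining words to appear in sequence) might look like the following sketch. The helper names `content_words` and `contains_phrase` and the stop-word list are assumptions for illustration; a real system would normally answer phrase queries from a positional index rather than by scanning text.

```python
import re

STOP_WORDS = {"a", "the", "or", "of"}  # assumed list of trivial words

def content_words(text):
    """Lowercase the text, split on non-letters, drop trivial words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def contains_phrase(document, phrase):
    """True if the phrase's content words occur consecutively in the document."""
    doc_words = content_words(document)
    phrase_words = content_words(phrase)
    n = len(phrase_words)
    if n == 0:
        return False
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))

print(contains_phrase("I read The Lord of the Rings, twice.", "The Lord of The Rings"))  # True
print(contains_phrase("The rings of the lord", "The Lord of The Rings"))  # False
```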

Multiple-word queries: A query is a set of words (or phrases)
•Two interpretations:
–A document is retrieved if it includes any of the query words
–A document is retrieved if it includes each of the query words
•Documents may be ranked by the number of query words they contain:
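–A document containing n query words is ranked higher than a document containing m < n query words


Boolean Queries

•A query combines words with the Boolean operators and, or, and except, which act on the sets of documents that contain each word:
–And -> intersection
–Or -> union
–Except -> difference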

•The use of except prevents creation of very large answers: not B will compute all documents that do not include B (complement), whereas A except B limits the universe to the documents that include A.
•Precedence: except, and, or; use parentheses to override; process left-to-right among operators with the same precedence.
•Example:
–computer or server except mainframe
•Select all documents that discuss computers, or documents that discuss servers but do not discuss mainframes
–(computer or server) except mainframe
•Select all documents that discuss computers or servers, but do not select any documents that discuss mainframes
–computer except (server or mainframe)
•Select all documents that discuss computers, and do not discuss either servers or mainframes.
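The three example queries can be checked with Python sets, where `|` is union (or), `&` is intersection (and) and `-` is difference (except). The index below is a made-up toy collection, assumed only for this illustration.

```python
# Hypothetical index: word -> set of ids of the documents that contain it.
index = {
    "computer": {1, 2, 3},
    "server": {3, 4, 5},
    "mainframe": {2, 5},
}
computer = index["computer"]
server = index["server"]
mainframe = index["mainframe"]

# computer or server except mainframe
# (except binds tighter than or, so this is computer or (server except mainframe))
q1 = computer | (server - mainframe)

# (computer or server) except mainframe
q2 = (computer | server) - mainframe

# computer except (server or mainframe)
q3 = computer - (server | mainframe)

print(q1)  # {1, 2, 3, 4}
print(q2)  # {1, 3, 4}
print(q3)  # {1}
```

Document 2 discusses both computers and mainframes, so it is in the first answer but not in the second. Python's own operator precedence (`-` before `&` before `|`) happens to match the except, and, or order, but the explicit parentheses make the intent visible.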

•Classical Boolean systems do not rank documents: a document either satisfies the query (and is retrieved) or it does not satisfy the query (and is not retrieved).
•The Boolean formalism is not simple for users without training in mathematics


Weighted Queries

Weighted multiple-word queries: Each of the words is
assigned a different weight, expressing the relative
importance of the word within the request.
•A query is then a set of word-weight pairs: (word_1, weight_1), …, (word_n, weight_n)
•The ranking of a document is the sum of the weights for the query words that it satisfies
•Example:
–Query: (A, 0.8), (B, 0.5), (C, 0.3)
–Document 1: (A, B, D)
–Document 2: (A, C, D)
–Ranking of Document 1: 0.8 + 0.5 = 1.3
–Ranking of Document 2: 0.8 + 0.3 = 1.1
–Each document includes two words from the query, but Document 1 is ranked higher because it includes more important words
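The example above can be reproduced with a short sketch. The function name `weighted_score` and the representation of the query as a dictionary and of each document as a set of words are assumptions made for the illustration.

```python
def weighted_score(query, document_words):
    """Sum the weights of the query words that occur in the document."""
    return sum(weight for word, weight in query.items() if word in document_words)

query = {"A": 0.8, "B": 0.5, "C": 0.3}
doc1 = {"A", "B", "D"}
doc2 = {"A", "C", "D"}

# Scores are rounded for display to hide floating-point noise.
print(round(weighted_score(query, doc1), 2))  # 1.3
print(round(weighted_score(query, doc2), 2))  # 1.1
```

Setting every weight to 1 reduces this to ranking by the number of query words a document contains.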

Weighted Boolean queries: Each word in a Boolean query is
associated with a weight.
•Example: A and (B or C), where B is given a higher weight than C
–A document with A and B satisfies this query better than a document with A and C (without such weights, both documents satisfy the query equally)
