Information Retrieval
Given a query, search for the most relevant documents in a knowledge base.
Search Engines are IR systems.
3 Problems:
- How to represent the query?
- How to store a knowledge base?
- How to search efficiently and accurately?
Options
- SQL
- Max-similarity Search (find the doc that's the most similar to the query)
- Represent query: text-based document
- Store: Vectorized documents
- Search: Compute a similarity score between the query and each doc, return the one with the highest score (see the sketch after this list)
- Similarity measure: cosine similarity (equivalently, minimal cosine distance)
- Semantic Doc2Vec
- Also max-similarity search
- Goal: train a document encoder
- Compare the similarity between two encoded documents
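Both options reduce to the same loop: vectorize the query, score it against every document vector, and return the argmax. Below is a minimal sketch in Python, assuming the documents are already vectorized (by tf.idf or a learned encoder); the function and variable names are illustrative, not from any library.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_similarity_search(query_vec: np.ndarray, doc_vectors: np.ndarray) -> int:
    # Score the query against every document and return the best index.
    scores = [cosine_similarity(query_vec, d) for d in doc_vectors]
    return int(np.argmax(scores))

# Toy usage: 3 documents embedded in a 4-dimensional space.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0]])
query = np.array([0.9, 0.1, 0.0, 0.0])
print(max_similarity_search(query, docs))  # -> 0, the closest document
```

Scoring every document is O(N·d) per query; real search engines avoid the linear scan with inverted indexes or approximate nearest-neighbor structures.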
Vectorization: tf.idf
tf.idf is a traditional method for vectorizing documents (a sketch follows the list below).
Weighting words in documents:
- Term frequency $tf_{t,d}$: the number of occurrences of word $t$ in document $d$
    - Usually dampened by a log: $w_{t,d} = 1 + \log(tf_{t,d})$ if $tf_{t,d} > 0$, else $0$
- Document frequency $df_t$: the number of docs in which word $t$ appears
    - Inverse document frequency maximizes specificity: $idf_t = \log(N / df_t)$, where $N$ is the total number of docs
    - This gives full weight to words that occur in 1 doc, zero to words that occur in all docs
- Collection frequency $cf_t$: the total number of occurrences of word $t$ across the whole collection
- The combined weight is $tfidf_{t,d} = (1 + \log tf_{t,d}) \cdot \log(N / df_t)$
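A minimal sketch of these formulas in plain Python, assuming a naive lowercase-and-split tokenizer (real systems stem, filter stopwords, and store sparse vectors); `tfidf_vectorize` is an illustrative name, not a library function.

```python
import math
from collections import Counter

def tfidf_vectorize(docs: list[str]) -> list[dict[str, float]]:
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # df_t: number of documents containing term t (count each doc once)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # idf_t = log(N / df_t): large for rare terms, 0 for terms in every doc
    idf = {t: math.log(N / d) for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # weight = (1 + log tf) * idf, dampening the raw count
        vectors.append({t: (1 + math.log(c)) * idf[t] for t, c in tf.items()})
    return vectors

vecs = tfidf_vectorize(["the cat sat", "the dog sat", "the cat ran home"])
print(vecs[0])  # "the" gets weight 0 because it appears in every doc
```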
Evaluating Retrieval Systems
- Precision
- Among the retrieved documents (predicted positives), what fraction is actually relevant: precision = |retrieved ∩ relevant| / |retrieved|
- Recall
- Among the relevant documents in the ground truth, what fraction is successfully retrieved: recall = |retrieved ∩ relevant| / |relevant|
- F-score
- Measure of retrieval accuracy, the harmonic mean of precision and recall: F1 = 2PR / (P + R)
- Precision @ k
- Metric for ranked retrieval and recommendation systems: the proportion of relevant items among the top k returned items (see the sketch after this list)
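A minimal sketch of these metrics over sets of document ids; the ranking and ground truth below are toy data for illustration.

```python
def precision(retrieved: set, relevant: set) -> float:
    # Fraction of retrieved docs that are actually relevant.
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    # Fraction of relevant docs that were retrieved.
    return len(retrieved & relevant) / len(relevant)

def f1(retrieved: set, relevant: set) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    # Precision restricted to the top-k ranked results.
    return len(set(ranked[:k]) & relevant) / k

ranked = ["d3", "d1", "d7", "d2", "d9"]    # system's ranking (toy data)
relevant = {"d1", "d2", "d4"}              # ground-truth relevant docs
print(precision(set(ranked), relevant))    # 2/5 = 0.4
print(recall(set(ranked), relevant))       # 2/3 ≈ 0.67
print(f1(set(ranked), relevant))           # 0.5
print(precision_at_k(ranked, relevant, 2)) # 1/2 = 0.5
```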
Tradeoffs
There are tradeoffs between these metrics: retrieving more documents generally increases recall but decreases precision, which is why the F-score and precision @ k are used to balance the two.
TODO