Information Retrieval
Given a query, search for the most relevant documents in a knowledge base.
Search Engines are IR systems.
3 Problems:
- How to represent the query?
- How to store a knowledge base?
- How to search efficiently and accurately?
Options
- SQL
- Max-similarity Search (find the doc that's the most similar to the query)
- Represent query: text-based document
- Store: Vectorized documents
- Search: Compute a similarity score between the query and each doc, return the one with the highest score (see the sketch after this list)
- Similarity measure: cosine similarity (equivalently, minimal cosine distance)
- Semantic Doc2Vec
- Also max-similarity search
- Goal: train a document encoder
- Compare the similarity between two encoded documents
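Both options reduce to the same loop: vectorize the query, score it against every document vector, and return the argmax. Below is a minimal sketch in Python, assuming the documents are already vectorized (by tf.idf or a learned encoder); the function and variable names are illustrative, not from any library.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_similarity_search(query_vec: np.ndarray, doc_vectors: np.ndarray) -> int:
    # Score the query against every document and return the best index.
    scores = [cosine_similarity(query_vec, d) for d in doc_vectors]
    return int(np.argmax(scores))

# Toy usage: 3 documents embedded in a 4-dimensional space.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0]])
query = np.array([0.9, 0.1, 0.0, 0.0])
print(max_similarity_search(query, docs))  # -> 0, the closest document
```

Scoring every document is O(N·d) per query; real search engines avoid the linear scan with inverted indexes or approximate nearest-neighbor structures.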
Vectorization: tf.idf
tf.idf is a traditional method for vectorizing documents (a sketch follows the list below).
Weighting words in documents:
- Term frequency $tf_{t,d}$: the number of occurrences of word $t$ in document $d$
    - Usually dampened by a log: $w_{t,d} = 1 + \log(tf_{t,d})$ if $tf_{t,d} > 0$, else $0$
- Document frequency $df_t$: the number of docs in which word $t$ appears
    - Inverse document frequency maximizes specificity: $idf_t = \log(N / df_t)$, where $N$ is the total number of docs
    - This gives full weight to words that occur in 1 doc, zero to words that occur in all docs
- Collection frequency $cf_t$: the total number of occurrences of word $t$ across the whole collection
- The combined weight is $tfidf_{t,d} = (1 + \log tf_{t,d}) \cdot \log(N / df_t)$
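A minimal sketch of these formulas in plain Python, assuming a naive lowercase-and-split tokenizer (real systems stem, filter stopwords, and store sparse vectors); `tfidf_vectorize` is an illustrative name, not a library function.

```python
import math
from collections import Counter

def tfidf_vectorize(docs: list[str]) -> list[dict[str, float]]:
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # df_t: number of documents containing term t (count each doc once)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # idf_t = log(N / df_t): large for rare terms, 0 for terms in every doc
    idf = {t: math.log(N / d) for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # weight = (1 + log tf) * idf, dampening the raw count
        vectors.append({t: (1 + math.log(c)) * idf[t] for t, c in tf.items()})
    return vectors

vecs = tfidf_vectorize(["the cat sat", "the dog sat", "the cat ran home"])
print(vecs[0])  # "the" gets weight 0 because it appears in every doc
```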
Evaluating Retrieval Systems
- Precision
- Among the retrieved documents (predicted positives), what fraction is actually relevant: precision = |retrieved ∩ relevant| / |retrieved|
- Recall
- Among the relevant documents in the ground truth, what fraction is successfully retrieved: recall = |retrieved ∩ relevant| / |relevant|
- F-score
- Measure of retrieval accuracy, the harmonic mean of precision and recall: F1 = 2PR / (P + R)
- Precision @ k
- Metric for ranked retrieval and recommendation systems: the proportion of relevant items among the top k returned items (see the sketch after this list)
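A minimal sketch of these metrics over sets of document ids; the ranking and ground truth below are toy data for illustration.

```python
def precision(retrieved: set, relevant: set) -> float:
    # Fraction of retrieved docs that are actually relevant.
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    # Fraction of relevant docs that were retrieved.
    return len(retrieved & relevant) / len(relevant)

def f1(retrieved: set, relevant: set) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    # Precision restricted to the top-k ranked results.
    return len(set(ranked[:k]) & relevant) / k

ranked = ["d3", "d1", "d7", "d2", "d9"]    # system's ranking (toy data)
relevant = {"d1", "d2", "d4"}              # ground-truth relevant docs
print(precision(set(ranked), relevant))    # 2/5 = 0.4
print(recall(set(ranked), relevant))       # 2/3 ≈ 0.67
print(f1(set(ranked), relevant))           # 0.5
print(precision_at_k(ranked, relevant, 2)) # 1/2 = 0.5
```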
Tradeoffs
There are tradeoffs between these metrics: retrieving more documents generally increases recall but decreases precision, which is why the F-score and precision @ k are used to balance the two.
TODO