
TODO: A lot of these are offline, and some are online, and the Ranking section should actually discuss Ranking + Scoring

Ranking

Ranking queries against items is a fairly large part of a search system; without it, you're just returning a big list of documents when some might be much more relevant to the user than others

Ranking is typically the second stage of a recommender system. It can be online, meaning a new ranking calculation runs per query; offline, where we precompute results and look them up per query; or somewhere in between

A separate Ranking stage is typically useful for experiments, explainable results, or per-Item calculations that are more resource intensive than Candidate Generation methods; other systems combine everything into one single API call and bypass the separate layers


Count Based Heuristics

One of the most common ranking / scoring methodologies is using the uniqueness of a word based on simple counts (heuristics). The word “and” is not very unique, and if it shows up in a document we won’t really care. A word like “aardvark” is used far less often, so its appearance in a document tells us much more

TF-IDF

TF-IDF stands for Term Frequency–Inverse Document Frequency, and it’s a fairly simple scoring mechanism for computing the uniqueness of a Term (word) across Documents. Most of the calculations are done offline, and for a query we use our lookup table to find Documents.

The TF-IDF score is calculated as:

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

$$\text{TF-IDF}(\text{Term}, \text{Doc}) = \text{count}(\text{Term in Doc}) \times \log\left(\frac{\text{count}(\text{Docs})}{\text{count}(\text{Docs containing Term})}\right)$$

| TF | DF* | Meaning | TF-IDF score |
| --- | --- | --- | --- |
| High | High | This word is common among all documents, and this document, so it’s just a generally common word | Fairly normal score, around the mean |
| High | Low | This word is rare throughout the other documents, but comes up a lot in this document, so it must be quite relevant for this document | High score |
| Low | High | This word is a common word, and it’s not even showing up much in this document | Low score |
| Low | Low | This word isn’t common across documents, but it also barely appears in this document, so it’s not very relevant | Low, but closer to the mean |

*The second column is really document frequency (DF), i.e. how many documents contain the word; IDF is its inverse, so high DF means low IDF.
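
As a concrete illustration of the formula above, here’s a minimal from-scratch sketch in Python (the toy corpus and raw-count TF are assumptions for illustration; production systems precompute these tables offline):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of lowercase terms.
docs = [
    ["the", "aardvark", "ate", "the", "ants"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "ants", "marched", "past", "the", "cat"],
]

def tf_idf(term, doc, docs):
    # TF: raw count of the term in this document.
    tf = Counter(doc)[term]
    # IDF: log of (total docs / docs containing the term).
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

# "the" appears in every document, so its IDF (and score) is 0;
# "aardvark" is rare, so it scores highly for doc 0.
print(tf_idf("the", docs[0], docs))       # 0.0
print(tf_idf("aardvark", docs[0], docs))  # ~1.10
```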

General Architecture of TF-IDF

BM25
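
BM25 builds on TF-IDF: it keeps the IDF term but saturates term frequency and normalizes by document length. As a reference, the widely used per-term score (summed over query terms) is:

$$\text{BM25}(t, d) = \text{IDF}(t) \cdot \frac{\text{TF}(t, d) \cdot (k_1 + 1)}{\text{TF}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

where $|d|$ is the document length, $\text{avgdl}$ is the average document length in the corpus, and $k_1 \approx 1.2\text{–}2.0$, $b \approx 0.75$ are tuning parameters.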


Probabilistic Models

Bayesian Probability

Decision Trees

We cover Decision Trees in depth elsewhere. They are useful for taking many of our Document, Query-Document, and User features into consideration when we want to predict some general category, but they can’t really be used to predict specific items (e.g., individual videos)

Decision Trees’ typical roles are to predict relevance scores (pointwise) or to optimize ranking order based on features (listwise); we can incorporate them into Learning To Rank, as in the sketch below
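
A minimal pointwise sketch (the feature values and scikit-learn setup here are illustrative assumptions, not a prescribed feature set): train a regression tree on graded relevance labels, then sort candidates by predicted score at query time.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical pointwise setup: each row is one (query, document) pair.
# Features might be things like BM25 score, PageRank, freshness, historical CTR.
X_train = [
    [12.3, 0.85, 0.2, 0.10],   # highly relevant pair
    [ 1.1, 0.10, 0.9, 0.01],   # irrelevant pair
    [ 8.7, 0.60, 0.5, 0.07],
]
y_train = [3.0, 0.0, 2.0]      # graded relevance labels

model = DecisionTreeRegressor(max_depth=3)
model.fit(X_train, y_train)

# At serving time, score every candidate for the query and sort descending.
candidates = [[9.0, 0.7, 0.3, 0.08], [0.5, 0.2, 0.8, 0.02]]
scores = model.predict(candidates)
ranking = sorted(zip(scores, candidates), reverse=True)
```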

Logistic Regression

TODO: Logistic regression
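
One common pattern worth noting here (a sketch, not necessarily what this section will end up covering): treat ranking as pointwise click prediction, where logistic regression maps query-document features to a click probability that serves as the relevance score. The features and data below are hypothetical.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical features per (query, document) pair, with click labels.
X_train = [[12.3, 0.85], [1.1, 0.10], [8.7, 0.60], [2.0, 0.15]]
y_train = [1, 0, 1, 0]   # 1 = clicked, 0 = not clicked

model = LogisticRegression()
model.fit(X_train, y_train)

# Rank candidates by predicted click probability.
probs = model.predict_proba([[9.0, 0.7], [0.5, 0.2]])[:, 1]
```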

Learning To Rank


Graph

The main graph-based method that comes to mind is PageRank. In PageRank the features are taken from a graph linkage structure, but it doesn’t need to be run on a graph engine. Most of the time we calculate PageRank scores offline, and then for a specific query we use PageRank as a feature alongside other ranking mechanisms.

PageRank simply helps us calculate “page importance”; we still need to compare a query to web page terms and themes, e.g. by doing lookups based on Document and Query Embedding Similarity

PageRank

An algorithm used by Google Search to rank web pages. It measures the importance of a page based on the number and quality of links to it. PageRank can be used anywhere, not just in Ranking, but it makes sense to put it here

It outputs a probability distribution representing the likelihood that a person clicking random links will arrive at any particular page, similar to the “Wikipedia game”, where you try to get from Page A to Page B by following referenced links starting at A

PageRank can be solved iteratively by traversing web links with the power iteration method, but it can also be solved analytically if we have an adjacency matrix of all links between pages $M$, where $M[i][j]$ represents the link from page $j$ to page $i$
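
Concretely, with damping factor $d$, $N$ pages, and a column-stochastic link matrix $M$ (standard notation, assumed here), the PageRank vector $r$ satisfies a linear system with a closed-form solution:

$$r = d\,M r + \frac{1 - d}{N}\mathbf{1} \quad\Longrightarrow\quad r = \left(I - d\,M\right)^{-1} \frac{1 - d}{N}\mathbf{1}$$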

The iterative approach, using a sum over in-nodes, is typically preferred for large graphs because it avoids the computational cost of matrix inversion and reduces the amount of data that needs to be stored
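
A toy power-iteration sketch (a dense NumPy matrix is used for clarity; a real implementation would use the sparse in-link sums described above):

```python
import numpy as np

def pagerank(M, d=0.85, iters=100, tol=1e-9):
    """Power iteration. M must be column-stochastic:
    M[i][j] is the probability of following a link from page j to page i."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)              # start from the uniform distribution
    for _ in range(iters):
        r_next = d * M @ r + (1 - d) / n
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r

# Tiny 3-page web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
M = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
print(pagerank(M))
```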

DNN For Ranking

TODO: Most use cases today use DNN or GBDT for Ranking
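
A rough sketch of what the pointwise flavor might look like (hypothetical features, with scikit-learn’s MLPRegressor standing in for a purpose-built network; real systems typically use embedding inputs and frameworks like TensorFlow or PyTorch):

```python
from sklearn.neural_network import MLPRegressor

# Hypothetical dense features per (query, document) pair.
X_train = [[12.3, 0.85, 0.2], [1.1, 0.10, 0.9], [8.7, 0.60, 0.5], [2.0, 0.15, 0.7]]
y_train = [3.0, 0.0, 2.0, 0.5]   # graded relevance labels

# Small MLP that maps the feature vector to a relevance score.
model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000)
model.fit(X_train, y_train)
scores = model.predict([[9.0, 0.7, 0.3]])
```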