Choosing a vector index

Choosing a vector index for a real-world problem is rarely obvious: there is usually a trade-off among memory, speed, and accuracy, and different indexing techniques are optimized for different concerns. This post gives a high-level overview of the popular indexes for vector search. It doesn't dive deep into the specifics of the indexing algorithms, but it will help you understand the advantages and drawbacks of each technique and how a particular index can be tuned for the speed-accuracy trade-off.

To understand this guide you should already know what similarity search is.

Vector Index


The performance of each technique depends heavily on the structure of the dataset as well as on the configured parameter values. Still, some characteristics can be generalized. AWS has published benchmarks comparing HNSW and IVF-PQ, two popular indexes that are often used in production applications.

| Index | Memory requirement | Requires training | Search time | Recall |
|---|---|---|---|---|
| Flat Index | Linear with dataset size | No | Increases linearly with dataset size | Highest; exact match |
| Inverted File Index (IVF) | Linear with dataset size | Yes | Clustering reduces the search scope; much faster than Flat | Moderate to high, since the search scope is approximated |
| Product Quantization (PQ) | Compression reduces memory requirements significantly | Yes | Has a linear component w.r.t. dataset size, but still faster than Flat due to compression | Low, due to vector compression |
| Hierarchical Navigable Small Worlds (HNSW) | Highest among these algorithms | No | One of the fastest; logarithmic w.r.t. dataset size | Second only to Flat; almost 1.0 in most cases |
| IVF-PQ | PQ reduces memory requirements significantly | Yes | Very fast due to both clustering and compression | Moderate, because of the limited search scope and compression |

Flat Index

A flat index stores the vectors in the dataset as-is in a flat data structure such as a list or an array. It does not modify the original vectors. During search, it calculates the distance between the given (query) vector and every other vector in the index. Since there is no clustering or approximation involved, the search accuracy of a flat index is very high. However, it does not scale well in terms of search speed: performance degrades linearly as more vectors are added to the index.
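As an illustration, here is a minimal brute-force (flat) search in pure Python. The toy 2-D dataset and the function name are hypothetical; this is a sketch of the idea, not a production implementation:

```python
import math

# Hypothetical toy dataset: 2-D vectors for readability.
dataset = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]

def flat_search(query, vectors, k=2):
    """Brute-force (flat) search: compare the query against every vector."""
    scored = [(math.dist(query, v), v) for v in vectors]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]

# The two nearest neighbours of (1.1, 1.1) are (1.0, 1.0) and (2.0, 2.0).
print(flat_search((1.1, 1.1), dataset))
```

Note the search loop touches every stored vector, which is exactly why query time grows linearly with the dataset.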

Inverted File Index (IVF)

Unlike a flat index, which searches the entire dataset, IVF reduces the search scope by clustering the vectors. Each vector in the dataset is assigned to a cluster (also called a bucket, identified by its centroid). During search, the query vector is first matched to a bucket, and the rest of the search is performed within that bucket. The algorithm has two crucial parameters: nlist and nprobe. We can use this formula to roughly estimate the memory requirement of an IVF index:

1.1 * (((4 * d) * num_vectors) + (4 * nlist * d)) bytes

d = vector dimension.

nlist = the parameter that controls the number of buckets created for the index. Increasing nlist speeds up search, but slows down indexing.

nprobe = the parameter that controls how many neighboring buckets are searched for a query. Increasing nprobe improves search quality but reduces search speed.
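The indexing and search steps can be sketched in a few lines of pure Python. The centroids here are hard-coded placeholders (a real IVF index learns them with k-means during training), and the data is a hypothetical toy example:

```python
import math

# nlist = 2 buckets; real centroids would be learned during training.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(0.5, 0.5), (1.0, 0.0), (9.0, 9.5), (10.5, 10.0)]

def nearest_buckets(v, cents, n=1):
    """Indices of the n centroids closest to v."""
    order = sorted(range(len(cents)), key=lambda i: math.dist(v, cents[i]))
    return order[:n]

# Index step: assign every vector to its nearest centroid's bucket.
buckets = {i: [] for i in range(len(centroids))}
for v in vectors:
    buckets[nearest_buckets(v, centroids)[0]].append(v)

def ivf_search(query, k=1, nprobe=1):
    """Search step: only scan the nprobe buckets closest to the query."""
    candidates = []
    for b in nearest_buckets(query, centroids, nprobe):
        candidates.extend(buckets[b])
    return sorted(candidates, key=lambda v: math.dist(query, v))[:k]

print(ivf_search((9.0, 9.0), k=1, nprobe=1))  # -> [(9.0, 9.5)]
```

With nprobe=1 only one of the two buckets is scanned, which is where the speed-up over a flat scan comes from; raising nprobe widens the scope and improves recall.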

Product Quantization

Quantization is a technique for compressing data into a smaller space. It is different from dimensionality reduction: dimensionality reduction maps a high-dimensional vector to fewer dimensions, whereas quantization targets the range of values in the vector rather than its dimension. Say the vector has float values ranging from 0.0 → 110.0. Quantization uses clustering to assign vectors to centroids, and this discrete set of centroids effectively reduces the vector space to a finite set of possible values. Product Quantization is a popular implementation of this technique.

Similar to IVF, Product Quantization is an optimization technique. PQ speeds up distance calculation by compressing the vectors in the index. To compress a vector, PQ first splits it into equally sized sub-vectors. Each sub-vector is assigned to a centroid, and the original vector is replaced with the ids of those centroids. There are two crucial parameters for this algorithm:

m = the number of sub-vectors that the original vector is broken into.
mbits = the number of bits used to encode each sub-vector.

PQ reduces memory requirements significantly, but it also reduces recall (search accuracy) by a large margin compared to the other algorithms. This is the major drawback of PQ. The codebook of a PQ index stores k^{1/m} centroids per sub-space, each of dimension d/m, so across all m sub-spaces its size is:

k^{1/m} * d values

k = total number of centroids
d = vector dimension
m = number of sub-vectors a vector is split into.

On top of the codebook, each compressed vector only needs m * mbits bits for its centroid ids, which is where the memory savings come from.
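The encode/decode round trip can be sketched in pure Python. The tiny codebooks below are hard-coded placeholders (real PQ learns them with k-means during training), and all names are hypothetical:

```python
import math

# Toy PQ sketch: m = 2 sub-vectors, one tiny codebook per sub-space.
m = 2
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],  # centroids for the first sub-vector
    [(0.0, 0.0), (5.0, 5.0)],  # centroids for the second sub-vector
]

def encode(vector):
    """Replace each sub-vector with the id of its nearest centroid."""
    size = len(vector) // m
    code = []
    for i in range(m):
        sub = vector[i * size:(i + 1) * size]
        code.append(min(range(len(codebooks[i])),
                        key=lambda c: math.dist(sub, codebooks[i][c])))
    return code

def decode(code):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for i, c in enumerate(code):
        out.extend(codebooks[i][c])
    return tuple(out)

code = encode((0.9, 1.1, 4.8, 5.2))  # 4-D float vector -> 2 centroid ids
print(code, decode(code))            # -> [1, 1] (1.0, 1.0, 5.0, 5.0)
```

The decoded vector is only an approximation of the original, which is exactly why PQ trades recall for memory.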

Hierarchical Navigable Small World Graphs

Hierarchical Navigable Small World (HNSW) graphs are one of the most popular and best performing vector search techniques at present. An HNSW index is a graph of vectors in which each vertex (vector) is connected to its neighbors by edges. The algorithm builds a hierarchy of such graphs to search for the nearest neighbors of a query vector.

The bottom layer of the graph consists of all the vectors. The layer above it consists of a random subset of vectors in the bottom layer. This process is repeated until the top layer has only one vector. During search, the algorithm starts with the single vector in the top layer and then navigates the graph based on its nearest neighbors in each layer until it reaches the bottom layer.

You can tune three parameters for HNSW:

  • m = The maximum number of edges a vector will get in a graph. The higher this number is, the more memory the graph will consume, but the better the search approximation may be.
  • ef_search = The size of the queue of the candidate nodes to visit during traversal. When a node is visited, its neighbors are added to the queue to be visited in the future. When this queue is empty, the traversal will end. A larger value will increase search latency, but may provide better search approximation.
  • ef_construction = Very similar to ef_search. When a node is to be inserted into the graph, the algorithm will find its m edges by querying the graph with the new node as the query vector. This parameter controls the candidate queue size for this traversal. A larger value will increase index latency, but may provide a better search approximation.
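To make the role of ef_search concrete, here is a best-first traversal of a single layer in pure Python. The graph and all names are hypothetical toys; real HNSW repeats this walk layer by layer, from the top layer down:

```python
import heapq
import math

# Hypothetical single-layer graph: a chain of 1-D-ish points.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0), 4: (4.0, 0.0)}
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def greedy_search(query, entry=0, ef_search=3):
    """Best-first walk, keeping at most ef_search candidate nodes queued."""
    dist = lambda n: math.dist(query, points[n])
    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap of nodes to expand
    best = entry
    while candidates:
        d, node = heapq.heappop(candidates)
        if d < dist(best):
            best = node
        for nb in edges[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (dist(nb), nb))
        # Bound the queue: this truncation is what ef_search controls.
        candidates = heapq.nsmallest(ef_search, candidates)
        heapq.heapify(candidates)
    return best

print(greedy_search((3.2, 0.0)))  # walks 0 -> 1 -> 2 -> 3 and returns node 3
```

A larger ef_search keeps more fallback candidates alive, which costs time but makes the walk less likely to get stuck in a local minimum.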

For datasets with high dimensionality, HNSW is one of the best performing indexes, and it doesn't need any training. However, it consumes a significant amount of memory. We can use this formula to compute the memory requirement of an HNSW graph:

1.1 * (4 * d + 8 * m) * num_vectors bytes

m = the maximum number of edges per vector (defined above)
d = vector dimension
num_vectors = number of vectors in the dataset.
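Plugging the formula into a small helper shows how quickly the numbers add up. The d, m, and dataset-size values below are hypothetical:

```python
def hnsw_memory_gb(d, m, num_vectors):
    """Estimate from the formula above: 1.1 * (4*d + 8*m) * num_vectors bytes."""
    return 1.1 * (4 * d + 8 * m) * num_vectors / (1024 ** 3)

# Hypothetical example: 1M vectors of dimension 768 with m = 16.
print(round(hnsw_memory_gb(768, 16, 1_000_000), 2))  # -> 3.28 (GB)
```

Roughly 3.3 GB for a million 768-dimensional vectors, which illustrates why HNSW is the memory-hungry option in the comparison table.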

Composite Index

Vector indexes are composable. Several indexing techniques can be combined to achieve desired output with acceptable trade-offs in cost, speed or accuracy.


IVF-Flat

IndexIVFFlat partitions the index into multiple buckets. The given (query) vector is first matched to a bucket, and then an IndexFlatL2 search is performed within that bucket. The scope of the search is significantly reduced, which improves query time but lowers the accuracy of the results. Since this method clusters the vectors into multiple buckets, it requires an additional training step.


IVF-PQ

Like IndexIVFFlat, IndexIVFPQ partitions the index into nlist buckets, but it also compresses the vectors using PQ. This results in faster queries and lower memory consumption compared to IndexFlatL2. However, it comes at the cost of reduced search accuracy. This index also requires training. We can use this formula to calculate the memory requirement of an IVF-PQ index:

1.1 * ((((mbits / 8) * m + overhead_per_vector) * num_vectors) + (4 * nlist * d) + (2^mbits * 4 * d)) bytes
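As with HNSW, the formula is easy to turn into a helper. The parameter values and the overhead_per_vector default of 16 bytes below are assumptions for illustration only:

```python
def ivf_pq_memory_gb(d, num_vectors, nlist, m, mbits, overhead_per_vector=16):
    """Estimate from the formula above; overhead_per_vector is an assumed default."""
    code_bytes = (mbits / 8) * m + overhead_per_vector  # per-vector PQ code
    total = 1.1 * ((code_bytes * num_vectors)
                   + (4 * nlist * d)          # IVF centroids
                   + (2 ** mbits) * 4 * d)    # PQ codebook
    return total / (1024 ** 3)

# Hypothetical example: 10M vectors, d=768, nlist=1024, m=64, mbits=8.
print(round(ivf_pq_memory_gb(768, 10_000_000, 1024, 64, 8), 2))  # -> 0.82 (GB)
```

Under a gigabyte for ten million 768-dimensional vectors, which is why IVF-PQ is the usual choice when memory dominates the cost.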


HNSW-Flat

A HNSW graph stored in a flat list (just like the Flat index), with no additional vector compression or quantization.

Choosing an index

Choosing an index is mainly a trade-off among these four factors:

  1. Query speed
  2. Recall or search accuracy
  3. Memory (or in simpler terms, $$)
  4. Index speed

The following guide is based on the guidelines published by Facebook on how to choose an index. The right choice depends on several real-world factors; this is only meant to help narrow down your search.

IF exact results are required (recall = 1.0):
    Flat index.
IF RAM is not a concern AND the dataset is small:
    HNSW; adjust m & ef_search for the speed-accuracy trade-off.
IF RAM is not a concern AND the dataset is big:
    IVFPQ; nlist=1024; adjust m & k_factor for PQ, nprobe for IVF.
    IVFPQ is faster than HNSW, but HNSW has better accuracy and index time.
IF RAM is somewhat a concern:
    IVFFlat; adjust nprobe for the speed-accuracy trade-off.
IF RAM is quite important:
    OPQ -> dimensionality reduction + PQ.
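The decision tree above can be written down as a tiny helper. This is a hypothetical convenience function summarizing the guide, not an API from any library:

```python
def choose_index(exact, ram_concern, big_dataset):
    """Toy decision helper mirroring the guide above.

    ram_concern is one of "none", "some", "high" (hypothetical labels).
    """
    if exact:
        return "Flat"
    if ram_concern == "none":
        return "IVFPQ" if big_dataset else "HNSW"
    if ram_concern == "some":
        return "IVFFlat"
    return "OPQ+PQ"

print(choose_index(exact=False, ram_concern="none", big_dataset=False))  # -> HNSW
```

In practice you would still benchmark the shortlisted index on your own data, since recall and latency depend heavily on the dataset.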

Each approach above can be combined with a clustering algorithm to improve search time
at the cost of increased index time. Pick the clustering method:

IF number of vectors < 1M:
    IVF; K should be between 4*sqrt(N) and 16*sqrt(N), where N is the number
    of vectors and K is the number of clusters. Needs between 30*K and 256*K
    vectors for training.
IF number of vectors is 1M-10M:
    IVF65536_HNSW32; needs between 30*65536 and 256*65536 vectors for training.
IF number of vectors is 10M-100M:
    IVF262144_HNSW32; needs between 30*262144 and 256*262144 vectors for training.
IF number of vectors is 100M-1B:
    IVF1048576_HNSW32; needs between 30*1048576 and 256*1048576 vectors for training.




