Current issues with dynamic updates to PQ quantization
Support in JVector for updating a Product Quantizer (PQ) is quite limited today: updates are only possible through ProductQuantization.refine. This is less than ideal for multiple reasons:
- The current implementation, which applies a single iteration of Lloyd's algorithm, provides no accuracy guarantees. In theory, each iteration of Lloyd's algorithm lowers the quantization error, but this only holds under the assumption that we keep the underlying data constant.
- The ability to update the quantizer is limited because there is no way to remove deleted vectors (the centroids in Lloyd's algorithm carry "memory" of past iterations).
Additionally, there is no real way of tracking the accuracy of the quantization in JVector.
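A natural accuracy measure in this setting is the mean squared reconstruction error over a sample of vectors. A minimal sketch of that computation (the class and method names are hypothetical, not JVector's API; the reconstructions stand in for decode(encode(v)) of the quantizer under test):

```java
import java.util.List;

public class QuantizationError {
    // Mean squared reconstruction error over a sample of vectors.
    // reconstructions.get(i) is the quantize-then-decode image of vectors.get(i).
    public static double meanSquaredError(List<float[]> vectors, List<float[]> reconstructions) {
        double total = 0.0;
        for (int i = 0; i < vectors.size(); i++) {
            float[] v = vectors.get(i);
            float[] r = reconstructions.get(i);
            double d = 0.0;
            for (int j = 0; j < v.length; j++) {
                double diff = v[j] - r[j];
                d += diff * diff;
            }
            total += d;
        }
        return total / vectors.size();
    }
}
```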
Proposal
- Change the k-means implementation from Lloyd's algorithm to the mini-batch variant. This allows updating the PQ codebook efficiently and with a small memory footprint.
- The prior change enables a Dynamic Product Quantization (DPQ) scheme, a more flexible way of updating the PQ codebook that supports insertions and deletions, which are hard to handle with a partial update using Lloyd's algorithm. This would require adding a new deletion capability to the mini-batch algorithm; the code change is straightforward.
- Add mechanisms in JVector to track the accuracy of the quantizer. The only natural solution for this is to use the quantization (or reconstruction) error.
- Using the quantization error, we can implement a two-quantizer approach. The vectors are quantized with a reference PQ. Additionally, we constantly update a DPQ over time (by applying insertions and deletions to it), so the DPQ carries the most up-to-date representation of the data.
- When we detect a meaningful difference between the quantization errors of the reference PQ and the DPQ, the reference PQ has become "stale" and is discarded. The DPQ then becomes the new reference PQ (i.e., we stop updating it), we re-quantize the vectors, and the process starts over.
- To compute the quantization error, we need a sample of vectors. The sampling strategy is TBD. A possible solution is to additionally track the standard deviation of the k-means clusters as part of the mini-batch update and directly sample from the hard mixture of Gaussians.
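The mini-batch update and the proposed deletion capability can be sketched as follows, under the running-mean view of mini-batch k-means (per-centroid learning rates, in the style of Sculley's mini-batch variant). With centroids maintained as running means, a deletion is the exact algebraic inverse of an insertion on the same centroid; class and method names here are hypothetical, not JVector's API:

```java
// Sketch of a mini-batch k-means with add/remove support.
public class MiniBatchKMeans {
    private final float[][] centroids;   // k x dim
    private final long[] counts;         // per-centroid assignment counts

    public MiniBatchKMeans(float[][] initialCentroids) {
        this.centroids = initialCentroids;
        this.counts = new long[initialCentroids.length];
    }

    /** Assign v to its nearest centroid, move that centroid toward v, return the centroid index. */
    public int add(float[] v) {
        int c = nearest(v);
        counts[c]++;
        double eta = 1.0 / counts[c];    // per-centroid learning rate
        for (int j = 0; j < v.length; j++) {
            centroids[c][j] += (float) (eta * (v[j] - centroids[c][j]));
        }
        return c;
    }

    /** Undo the contribution of v to centroid c, removing its "memory" of the deleted vector. */
    public void remove(float[] v, int c) {
        if (counts[c] <= 1) { counts[c] = Math.max(0, counts[c] - 1); return; }
        long n = counts[c]--;
        for (int j = 0; j < v.length; j++) {
            // inverse of the running-mean update: c_new = (n * c_old - v) / (n - 1)
            centroids[c][j] = (float) ((n * (double) centroids[c][j] - v[j]) / (n - 1));
        }
    }

    public int nearest(float[] v) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int j = 0; j < v.length; j++) {
                double diff = v[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    public float[][] centroids() { return centroids; }
}
```

Per-codebook state is just the centroids plus one long per centroid, which is what keeps the memory footprint small.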
Implementation
- Add a MiniBatchKMeans implementation that has functions to add and remove vectors. Except for these new functions, this implementation should be a drop-in replacement for the current k-means implementation.
- Add a DynamicProductQuantizer class that uses the MiniBatchKMeans implementation to update the PQ codebook, and supports dynamic updates.
- Add a function to the VectorCompressor interface to compute the quantization error given a set of vectors.
- Add a function to the VectorCompressor that returns a sample of vectors, following the underlying statistical model (either an actual sample from real data or one generated from the model).
- Add a TwoQuantizersCompressor class that implements the two-quantizers approach described above.
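The reference-vs-DPQ handoff in the two-quantizer approach could look roughly like this (the Quantizer interface, the relative-error threshold, and all names are assumptions for illustration, not JVector's API):

```java
import java.util.List;

// Minimal stand-in for a quantizer that can report its error on a sample.
interface Quantizer {
    double quantizationError(List<float[]> sample);
}

public class TwoQuantizersCompressor {
    private Quantizer reference;
    private Quantizer dynamic;
    private final double staleThreshold;   // e.g. 0.2 = 20% relative improvement

    public TwoQuantizersCompressor(Quantizer reference, Quantizer dynamic, double staleThreshold) {
        this.reference = reference;
        this.dynamic = dynamic;
        this.staleThreshold = staleThreshold;
    }

    /** Returns true when the reference PQ was deemed stale and replaced by the DPQ. */
    public boolean maybePromote(List<float[]> sample) {
        double refErr = reference.quantizationError(sample);
        double dynErr = dynamic.quantizationError(sample);
        // The reference is "stale" when the DPQ is meaningfully more accurate.
        if (refErr > 0 && (refErr - dynErr) / refErr > staleThreshold) {
            reference = dynamic;   // freeze the DPQ as the new reference
            // The caller would re-quantize stored vectors and start a fresh DPQ here.
            return true;
        }
        return false;
    }
}
```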
Notes:
- In the future, this model could serve as a simple way to perform model selection on the quantizer (either by changing the type of quantizer or by altering its parameters). At this point, this is purely speculative and is left for future work.
- These proposed changes enable the implementation of these capabilities (tracking error, updating a quantizer over time, and re-quantizing vectors) in downstream projects. See this issue in the OpenSearch plugin.