Introducing new features to optimize vector index size, including Matryoshka Representation Learning (MRL), scalar quantization, binary quantization and oversampling.
In our continued effort to equip developers and organizations with advanced search tools, we are thrilled to announce the launch of several new features in the latest Preview API for Azure AI Search. These enhancements are designed to optimize vector index size and provide more granular control and understanding of your search index to build Retrieval-Augmented Generation (RAG) applications.
MRL Support for Quantization
Matryoshka Representation Learning (MRL) is a new technique that introduces a different form of vector compression, which complements and works independently of existing quantization methods. MRL enables the flexibility to truncate embeddings without significant semantic loss, offering a balance between vector size and information retention.
This technique works by training embedding models so that information density increases towards the beginning of the vector. As a result, even when using only a prefix of the original vector, much of the key information is preserved, allowing for shorter vector representations without a substantial drop in performance.
OpenAI has integrated MRL into their 'text-embedding-3-small' and 'text-embedding-3-large' models, making them adaptable for use in scenarios where compressed embeddings are needed while maintaining high retrieval accuracy. You can read more about the underlying research in the official paper [1] or learn about the latest OpenAI embedding models in their blog.
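As a quick illustration, the sketch below shows two ways to obtain a truncated embedding from these models with the OpenAI Python SDK: the built-in `dimensions` parameter, or manual prefix truncation followed by re-normalization. The input string and the 1024-dimension target are illustrative choices, not requirements.

```python
# Minimal sketch: obtaining an MRL-truncated embedding from OpenAI.
# Assumes the openai Python package and OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()
text = "Azure AI Search supports vector compression."

# Option 1: let the API truncate for you. The text-embedding-3 models were
# trained with MRL, so they support the dimensions parameter directly.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=text,
    dimensions=1024,   # keep 1/3 of the native 3072 dimensions
)
vec_api = resp.data[0].embedding

# Option 2: truncate a full-length embedding yourself. Because MRL packs
# the most information into the leading dimensions, slicing the prefix and
# re-normalizing preserves most of the semantics.
full = np.array(
    client.embeddings.create(model="text-embedding-3-large", input=text)
    .data[0].embedding
)
prefix = full[:1024]
vec_manual = prefix / np.linalg.norm(prefix)   # re-normalize after slicing
```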
Storage Compression Comparison
Table 1.1 below highlights the different configurations for vector compression, comparing standard uncompressed vectors, Scalar Quantization (SQ), and Binary Quantization (BQ) with and without MRL. The compression ratio demonstrates how efficiently the vector index size can be optimized, yielding significant cost savings. You can find more about our Vector Index Size Limits here: Service limits for tiers and skus - Azure AI Search | Microsoft Learn.
Table 1.1: Vector Index Size Compression Comparison
| Configuration | Compression Ratio* |
| --- | --- |
| Uncompressed | - |
| SQ | 4x |
| BQ | 28x |
| MRL + SQ (1/2 and 1/3 truncation dimension, respectively)** | 8x-12x |
| MRL + BQ (1/2 and 1/3 truncation dimension, respectively)** | 64x-96x |
Note: Compression ratios depend on embedding dimensions and truncation. For instance, using “text-embedding-3-large” with 3072 dimensions truncated to 1024 dimensions can result in 96x compression with Binary Quantization.
*All compression methods listed above may experience slightly lower compression ratios due to overhead introduced by the index data structures. See "Memory overhead from selected algorithm" for more details.
**The compression impact when using MRL depends on the value of the truncation dimension. We recommend truncating to either 1/2 or 1/3 of the original dimensions to preserve embedding quality (see Table 1.2 below).
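To make these ratios concrete, here is the underlying arithmetic for "text-embedding-3-large" (3072 dimensions), ignoring the index overhead mentioned above:

```python
# Back-of-the-envelope compression arithmetic for a single vector.
# Ignores index overhead, which is why raw BQ computes to 32x here
# while Table 1.1 reports ~28x in practice.

dims = 3072                        # text-embedding-3-large
uncompressed_bits = dims * 32      # float32 = 32 bits per dimension

sq_bits = dims * 8                 # scalar quantization: float32 -> int8
print(uncompressed_bits / sq_bits)           # 4.0  -> "4x"

bq_bits = dims * 1                 # binary quantization: 1 bit per dimension
print(uncompressed_bits / bq_bits)           # 32.0 (~28x after overhead)

mrl_bq_bits = (dims // 3) * 1      # MRL truncation to 1/3, then BQ
print(uncompressed_bits / mrl_bq_bits)       # 96.0 -> "96x"
```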
Quality Retention
Table 1.2 provides a detailed view of the quality retained when using MRL with quantization across different models and configurations. The results report the change in Mean NDCG@10 across a subset of MTEB datasets, showing that high levels of compression can still preserve up to 99% of search quality, particularly with BQ and MRL.
Table 1.2: Impact of MRL on Mean NDCG@10 Across MTEB Subset
| Model Name | Original Dimension | MRL Dimension | Quantization Algorithm | No Rerank (% Δ) | Rerank 2x Oversampling (% Δ) |
| --- | --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | 512 | SQ | -2.00% (Δ = 1.155) | -0.0004% (Δ = 0.0002) |
| OpenAI text-embedding-3-small | 1536 | 512 | BQ | -15.00% (Δ = 7.5092) | -0.11% (Δ = 0.0554) |
| OpenAI text-embedding-3-small | 1536 | 768 | SQ | -2.00% (Δ = 0.8128) | -1.60% (Δ = 0.8128) |
| OpenAI text-embedding-3-small | 1536 | 768 | BQ | -10.00% (Δ = 5.0104) | -0.01% (Δ = 0.0044) |
| OpenAI text-embedding-3-large | 3072 | 1024 | SQ | -1.00% (Δ = 0.616) | -0.02% (Δ = 0.0118) |
| OpenAI text-embedding-3-large | 3072 | 1024 | BQ | -7.00% (Δ = 3.9478) | -0.58% (Δ = 0.3184) |
| OpenAI text-embedding-3-large | 3072 | 1536 | SQ | -1.00% (Δ = 0.3184) | -0.08% (Δ = 0.0426) |
| OpenAI text-embedding-3-large | 3072 | 1536 | BQ | -5.00% (Δ = 2.8062) | -0.06% (Δ = 0.0356) |
Table 1.2 reports the relative and absolute differences in Mean NDCG@10, measured against an uncompressed index, when using different MRL dimensions (1/3 and 1/2 of the original dimensions) across OpenAI text-embedding models.
Key Takeaways:
- 99% Search Quality with BQ + MRL + Oversampling: Combining Binary Quantization (BQ) with Oversampling and Matryoshka Representation Learning (MRL) retains 99% of the original search quality in the datasets and embeddings combinations we tested, even with up to 96x compression, making it ideal for reducing storage while maintaining high retrieval performance.
- Flexible Embedding Truncation: MRL enables dynamic embedding truncation with minimal accuracy loss, providing a balance between storage efficiency and search quality.
- No Latency Impact Observed: Our experiments also indicated that using MRL had no noticeable latency impact, supporting efficient performance even at high compression rates.
For more details on how MRL works and how to implement it, visit the MRL documentation.
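To give a feel for how this comes together, below is a minimal sketch of an index definition that combines binary quantization with MRL truncation and oversampled reranking, written against the preview REST API. The service URL, key, index name, and API version are placeholders, and property names such as `truncationDimension`, `rerankWithOriginalVectors`, and `defaultOversampling` follow the preview reference at the time of writing, so check the current preview documentation before reusing this shape.

```python
import requests

# Placeholders throughout; verify property names against the current
# preview REST reference before use.
SERVICE = "https://<your-service>.search.windows.net"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-key>"}

index_definition = {
    "name": "docs-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "embedding",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 3072,                    # text-embedding-3-large
            "vectorSearchProfile": "compressed-profile",
        },
    ],
    "vectorSearch": {
        "algorithms": [{"name": "hnsw-1", "kind": "hnsw"}],
        "profiles": [
            {
                "name": "compressed-profile",
                "algorithm": "hnsw-1",
                "compression": "mrl-bq",
            }
        ],
        "compressions": [
            {
                "name": "mrl-bq",
                "kind": "binaryQuantization",
                # MRL: index only the first 1024 of 3072 dimensions (1/3).
                "truncationDimension": 1024,
                # Rerank an oversampled candidate set with full-precision
                # vectors to recover quality (see Table 1.2).
                "rerankWithOriginalVectors": True,
                "defaultOversampling": 2.0,
            }
        ],
    },
}

resp = requests.put(
    f"{SERVICE}/indexes/docs-index",
    params={"api-version": "2024-11-01-preview"},
    headers=HEADERS,
    json=index_definition,
)
resp.raise_for_status()
```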
Targeted Vector Filtering
Targeted Vector Filtering allows you to apply filters specifically to the vector component of hybrid search queries. This fine-grained control ensures that your filters enhance the relevance of vector search results without inadvertently affecting keyword-based searches.
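As a sketch, the hybrid query below applies a filter only to its vector leg through a per-vector-query override (shown here as `filterOverride`, following the preview REST reference), while the keyword leg runs unfiltered. The index name, field names, and filter expression are illustrative.

```python
import requests

# Placeholders throughout; "filterOverride" follows the preview REST
# reference at the time of writing and applies only to this vector query.
SERVICE = "https://<your-service>.search.windows.net"
HEADERS = {"Content-Type": "application/json", "api-key": "<query-key>"}

query = {
    # Keyword leg of the hybrid query: runs unfiltered.
    "search": "resilient distributed systems",
    "vectorQueries": [
        {
            # The "text" kind assumes the index has a vectorizer configured;
            # otherwise pass a precomputed vector with kind "vector".
            "kind": "text",
            "text": "resilient distributed systems",
            "fields": "embedding",
            "k": 50,
            # Filter applied only to the vector leg.
            "filterOverride": "category eq 'engineering'",
        }
    ],
    "top": 10,
}

resp = requests.post(
    f"{SERVICE}/indexes/docs-index/docs/search",
    params={"api-version": "2024-11-01-preview"},
    headers=HEADERS,
    json=query,
)
print(resp.json())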
Sub-Scores
Sub-Scores provide granular scoring information for each recall set contributing to the final search results. In hybrid search scenarios, where multiple factors like vector similarity and text relevance play a role, Sub-Scores offer transparency into how each component influences the overall ranking.
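For instance, a hybrid query can request this breakdown with the preview debug option. The exact response shape varies by API version, so treat the sketch below as illustrative rather than definitive.

```python
import requests

# Placeholders throughout; the "debug" flag and the debug-info field name
# follow the preview REST reference at the time of writing.
SERVICE = "https://<your-service>.search.windows.net"
HEADERS = {"Content-Type": "application/json", "api-key": "<query-key>"}

query = {
    "search": "resilient distributed systems",
    "vectorQueries": [
        {"kind": "text", "text": "resilient distributed systems",
         "fields": "embedding", "k": 50}
    ],
    "debug": "vector",   # request per-recall-set scoring details (preview)
    "top": 10,
}

resp = requests.post(
    f"{SERVICE}/indexes/docs-index/docs/search",
    params={"api-version": "2024-11-01-preview"},
    headers=HEADERS,
    json=query,
)

for doc in resp.json()["value"]:
    # Alongside the fused @search.score, each hit carries debug info with
    # the sub-scores of the text and vector recall sets that produced it.
    print(doc["@search.score"], doc.get("@search.documentDebugInfo"))
```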
Text Split Skill by Tokens
The Text Split Skill by Tokens feature enhances your ability to process and manage large text data by splitting text based on token counts. This gives you more precise control over passage (chunk) length, leading to more targeted indexing and retrieval, particularly for documents with extensive content.
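For example, a skillset might configure the split skill to count tokens rather than characters. The parameter names below (`unit`, `azureOpenAITokenizerParameters`) follow the preview skillset reference at the time of writing, and the chunk sizes are illustrative; verify against your API version.

```python
# Minimal sketch of a split-skill definition that chunks by token count.
split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode": "pages",
    "maximumPageLength": 512,        # measured in tokens, not characters
    "pageOverlapLength": 64,         # overlap between consecutive chunks
    "unit": "azureOpenAITokens",     # switch from character to token counting
    "azureOpenAITokenizerParameters": {
        "encoderModelName": "cl100k_base"   # tokenizer used for counting
    },
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "textItems", "targetName": "chunks"}],
}
```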
For any questions or to share your feedback, feel free to reach out through our Azure Search Community.
Getting started with Azure AI Search
- Learn more about Azure AI Search and about all the latest features.
- Want to chat with your data? Check out VoiceRAG!
- Start creating a search service in the Azure Portal, Azure CLI, the Management REST API, ARM template, or a Bicep file.
- Learn about Retrieval Augmented Generation in Azure AI Search.
- Explore our preview client libraries in Python, .NET, Java, and JavaScript, offering diverse integration methods to cater to varying user needs.
- Explore how to create end-to-end RAG applications with Azure AI Studio.
References:
[1] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2024). Matryoshka Representation Learning. arXiv preprint arXiv:2205.13147. Retrieved from https://arxiv.org/abs/2205.13147