Azure AI Search October Updates: Nearly 100x Compression with Minimal Quality Loss

fsunavala-msft
Oct 08, 2024

Introducing new features to optimize vector index size, including Matryoshka Representation Learning (MRL), scalar quantization, binary quantization and oversampling.

In our continued effort to equip developers and organizations with advanced search tools, we are thrilled to announce the launch of several new features in the latest Preview API for Azure AI Search. These enhancements are designed to optimize vector index size and provide more granular control and understanding of your search index to build Retrieval-Augmented Generation (RAG) applications.

MRL Support for Quantization

Matryoshka Representation Learning (MRL) is a technique that introduces a different form of vector compression, one that complements and works independently of existing quantization methods. MRL makes it possible to truncate embeddings without significant semantic loss, offering a balance between vector size and information retention.

This technique works by training embedding models so that information density increases towards the beginning of the vector. As a result, even when using only a prefix of the original vector, much of the key information is preserved, allowing for shorter vector representations without a substantial drop in performance.
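To make the truncation step concrete, here is a minimal sketch using NumPy (the random vector stands in for a real MRL-trained embedding, purely for illustration): keep the leading dimensions and re-normalize so cosine similarities remain meaningful.

```python
import numpy as np

# Hypothetical full-size embedding; with an MRL-trained model, the leading
# dimensions carry most of the information (random here for illustration).
full_embedding = np.random.rand(3072).astype(np.float32)

def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep the leading k dimensions and L2-normalize the result."""
    prefix = vec[:k]
    return prefix / np.linalg.norm(prefix)

short_embedding = truncate_embedding(full_embedding, 1024)  # 3x fewer dimensions
```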

OpenAI has integrated MRL into their 'text-embedding-3-small' and 'text-embedding-3-large' models, making them adaptable for use in scenarios where compressed embeddings are needed while maintaining high retrieval accuracy. You can read more about the underlying research in the official paper [1] or learn about the latest OpenAI embedding models in their blog.
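For example, with the official OpenAI Python SDK, the `dimensions` parameter returns an MRL-shortened (and re-normalized) embedding directly from the API. The snippet below assumes an OPENAI_API_KEY environment variable is set:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request a 1024-dimensional embedding from a model whose native
# dimensionality is 3072; the API handles the truncation server-side.
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Azure AI Search supports vector compression.",
    dimensions=1024,
)
embedding = response.data[0].embedding  # len(embedding) == 1024
```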

Storage Compression Comparison

Table 1.1 below highlights the different configurations for vector compression, comparing standard uncompressed vectors, Scalar Quantization (SQ), and Binary Quantization (BQ) with and without MRL. The compression ratio demonstrates how efficiently the vector index size can be optimized, yielding significant cost savings. You can find more about our Vector Index Size Limits here: Service limits for tiers and skus - Azure AI Search | Microsoft Learn.

 

Table 1.1: Vector Index Size Compression Comparison

| Configuration                                     | Compression Ratio* |
|---------------------------------------------------|--------------------|
| Uncompressed                                      | -                  |
| SQ                                                | 4x                 |
| BQ                                                | 28x                |
| MRL + SQ (1/2 and 1/3 truncation, respectively)** | 8x-12x             |
| MRL + BQ (1/2 and 1/3 truncation, respectively)** | 64x-96x            |

Note: Compression ratios depend on embedding dimensions and truncation. For instance, using “text-embedding-3-large” with 3072 dimensions truncated to 1024 dimensions can result in 96x compression with Binary Quantization.

*All compression methods listed above may experience slightly lower effective compression ratios due to overhead introduced by the index data structures. See "Memory overhead from selected algorithm" for more details.

**The compression impact when using MRL depends on the truncation dimension. We recommend truncating to 1/2 or 1/3 of the original dimensions to preserve embedding quality (see Table 1.2 below).
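As a quick sanity check on the 96x figure (ignoring the index overhead mentioned in the first footnote), the arithmetic works out as follows:

```python
# Uncompressed float32 vectors vs. MRL-truncated, binary-quantized vectors.
original_dims = 3072                    # text-embedding-3-large
truncated_dims = 1024                   # 1/3 MRL truncation
uncompressed_bits = original_dims * 32  # float32 = 32 bits per dimension
bq_bits = truncated_dims * 1            # binary quantization = 1 bit per dimension

print(uncompressed_bits / bq_bits)      # 98304 / 1024 = 96.0
```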

Quality Retention

Table 1.2 provides a detailed view of the quality retained when using MRL with quantization across different models and configurations. The results show the impact on Mean NDCG@10 across a subset of MTEB datasets: high levels of compression can still preserve up to 99% of search quality, particularly when BQ and MRL are combined with oversampled reranking.

 

Table 1.2: Impact of MRL on Mean NDCG@10 Across MTEB Subset

| Model Name                    | Original Dimension | MRL Dimension | Quantization Algorithm | No Rerank (% Δ)      | Rerank 2x Oversampling (% Δ) |
|-------------------------------|--------------------|---------------|------------------------|----------------------|------------------------------|
| OpenAI text-embedding-3-small | 1536               | 512           | SQ                     | -2.00% (Δ = 1.155)   | -0.0004% (Δ = 0.0002)        |
| OpenAI text-embedding-3-small | 1536               | 512           | BQ                     | -15.00% (Δ = 7.5092) | -0.11% (Δ = 0.0554)          |
| OpenAI text-embedding-3-small | 1536               | 768           | SQ                     | -2.00% (Δ = 0.8128)  | -1.60% (Δ = 0.8128)          |
| OpenAI text-embedding-3-small | 1536               | 768           | BQ                     | -10.00% (Δ = 5.0104) | -0.01% (Δ = 0.0044)          |
| OpenAI text-embedding-3-large | 3072               | 1024          | SQ                     | -1.00% (Δ = 0.616)   | -0.02% (Δ = 0.0118)          |
| OpenAI text-embedding-3-large | 3072               | 1024          | BQ                     | -7.00% (Δ = 3.9478)  | -0.58% (Δ = 0.3184)          |
| OpenAI text-embedding-3-large | 3072               | 1536          | SQ                     | -1.00% (Δ = 0.3184)  | -0.08% (Δ = 0.0426)          |
| OpenAI text-embedding-3-large | 3072               | 1536          | BQ                     | -5.00% (Δ = 2.8062)  | -0.06% (Δ = 0.0356)          |

Table 1.2 compares the relative point differences in Mean NDCG@10 when using different MRL dimensions (1/2 and 1/3 of the original dimensions) against an uncompressed index, across OpenAI text-embedding models.

 

Key Takeaways:

  • 99% Search Quality with BQ + MRL + Oversampling: Combining Binary Quantization (BQ) with Oversampling and Matryoshka Representation Learning (MRL) retains 99% of the original search quality in the datasets and embeddings combinations we tested, even with up to 96x compression, making it ideal for reducing storage while maintaining high retrieval performance.
  • Flexible Embedding Truncation: MRL enables dynamic embedding truncation with minimal accuracy loss, providing a balance between storage efficiency and search quality.
  • No Latency Impact Observed: Our experiments also indicated that using MRL had no noticeable latency impact, supporting efficient performance even at high compression rates.

For more details on how MRL works and how to implement it, visit the MRL documentation.
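To illustrate how these options fit together, below is a hedged sketch of an index definition combining binary quantization, MRL truncation, and oversampled reranking. Property names (`truncationDimension`, `rerankWithOriginalVectors`, `defaultOversampling`) reflect our reading of the 2024-09-01-preview REST API; the endpoint, key, and index name are placeholders, so verify the exact schema against the current preview reference.

```python
import requests

endpoint = "https://<your-service>.search.windows.net"  # placeholder
index_name = "my-index"                                  # placeholder
api_version = "2024-09-01-preview"

index_definition = {
    "name": index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "embedding",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 3072,                  # text-embedding-3-large
            "vectorSearchProfile": "profile-1",
        },
    ],
    "vectorSearch": {
        "algorithms": [{"name": "hnsw-1", "kind": "hnsw"}],
        "compressions": [
            {
                "name": "bq-mrl",
                "kind": "binaryQuantization",
                "truncationDimension": 1024,        # MRL: keep 1/3 of the dims
                "rerankWithOriginalVectors": True,  # rescore with full-precision vectors
                "defaultOversampling": 2.0,         # fetch 2x candidates before reranking
            }
        ],
        "profiles": [
            {"name": "profile-1", "algorithm": "hnsw-1", "compression": "bq-mrl"}
        ],
    },
}

resp = requests.put(
    f"{endpoint}/indexes/{index_name}?api-version={api_version}",
    headers={"api-key": "<admin-key>", "Content-Type": "application/json"},
    json=index_definition,
)
resp.raise_for_status()
```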

Targeted Vector Filtering

Targeted Vector Filtering allows you to apply filters specifically to the vector component of hybrid search queries. This fine-grained control ensures that your filters enhance the relevance of vector search results without inadvertently affecting keyword-based searches.
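As a sketch of what this looks like in a query (the `filterOverride` property name follows our reading of the preview REST API; the endpoint, index, field names, and query vector are placeholders):

```python
import requests

endpoint = "https://<your-service>.search.windows.net"  # placeholder

query = {
    # Keyword leg of the hybrid query; an optional top-level "filter"
    # would apply here as usual.
    "search": "quarterly revenue growth",
    "vectorQueries": [
        {
            "kind": "vector",
            "fields": "embedding",
            "vector": [0.01] * 1024,                    # placeholder embedding
            "k": 10,
            # Applies only to the vector leg of the hybrid query:
            "filterOverride": "category eq 'finance'",
        }
    ],
}

resp = requests.post(
    f"{endpoint}/indexes/my-index/docs/search?api-version=2024-09-01-preview",
    headers={"api-key": "<query-key>", "Content-Type": "application/json"},
    json=query,
)
```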

Sub-Scores

Sub-Scores provide granular scoring information for each recall set contributing to the final search results. In hybrid search scenarios, where multiple factors like vector similarity and text relevance play a role, Sub-Scores offer transparency into how each component influences the overall ranking.
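Hybrid ranking in Azure AI Search is based on Reciprocal Rank Fusion (RRF), and Sub-Scores expose the per-recall-set contributions that fusion combines. Here is a conceptual sketch of RRF (illustration only, not the service implementation; the constant of 60 follows the commonly documented RRF formula):

```python
K = 60  # RRF smoothing constant from the commonly documented formula

def rrf_fuse(*ranked_lists: list[str]) -> dict[str, float]:
    """Fuse ranked doc-id lists; a higher fused score ranks better."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (K + rank)
    return scores

keyword_hits = ["doc2", "doc1", "doc3"]  # ranked by text relevance
vector_hits = ["doc1", "doc3", "doc2"]   # ranked by vector similarity
print(rrf_fuse(keyword_hits, vector_hits))
```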

Text Split Skill by Tokens

The Text Split Skill by Tokens feature enhances your ability to process and manage large text data by splitting text based on token counts. This gives you more precise control over passage (chunk) length, leading to more targeted indexing and retrieval, particularly for documents with extensive content.
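Conceptually, token-based splitting behaves like the sketch below (this uses the `tiktoken` library with the `cl100k_base` encoding to illustrate the behavior, not the skill's actual implementation; the chunk size and overlap are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models

def split_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, with overlap."""
    tokens = enc.encode(text)
    step = max_tokens - overlap
    return [
        enc.decode(tokens[start : start + max_tokens])
        for start in range(0, len(tokens), step)
    ]

chunks = split_by_tokens("Azure AI Search handles long documents. " * 500)
```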

For any questions or to share your feedback, feel free to reach out through our Azure Search · Community page.

Getting started with Azure AI Search

 

References:
[1] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2024). Matryoshka Representation Learning. arXiv preprint arXiv:2205.13147. Retrieved from https://arxiv.org/abs/2205.13147
