Getting started
The following assumes that Milvus is installed. We provide code examples in Python and Node.js. You can run the code by copying and pasting it, so that you can quickly try vector similarity search with Milvus.
Before using Milvus, you need to know a few Milvus terms:
- collection: A collection in Milvus is equivalent to a table in a relational database management system (RDBMS). In Milvus, collections are used to store and manage entities.
- partition: A partition is a division of a collection. Milvus supports dividing collection data into multiple parts on physical storage. This process is called partitioning, and each partition can contain multiple segments.
- entity: An entity consists of a group of fields that represent a real-world object. Each entity in Milvus is represented by a unique row ID.
- field: Fields are the units that make up entities. Fields can be structured data (e.g., numbers, strings) or vectors.
- vector: A vector represents the features of unstructured data. It is usually produced by an AI or ML model and takes the form of a high-dimensional numeric array. Each vector represents an object.
Each entity can only contain one vector in the current version of Milvus.
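To make the vocabulary concrete, here is a toy model of these terms in plain Python. It is only a mental picture under simplifying assumptions, not how Milvus stores data. The names `ToyEntity`, `ToyPartition`, and `ToyCollection` and the partition names are hypothetical and are not part of any Milvus SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyEntity:
    """One row: a row ID plus its fields (here a single vector field)."""
    row_id: int
    vector: List[float]


@dataclass
class ToyPartition:
    """A named division of a collection that holds entities."""
    name: str
    entities: List[ToyEntity] = field(default_factory=list)


@dataclass
class ToyCollection:
    """Table-like container with a fixed vector dimension and named partitions."""
    name: str
    dim: int
    partitions: Dict[str, ToyPartition] = field(default_factory=dict)

    def insert(self, entity: ToyEntity, partition: str = "main") -> None:
        # Every vector in the collection must have the collection's dimension.
        if len(entity.vector) != self.dim:
            raise ValueError(f"expected dimension {self.dim}, got {len(entity.vector)}")
        target = self.partitions.setdefault(partition, ToyPartition(partition))
        target.entities.append(entity)

    def num_entities(self) -> int:
        return sum(len(p.entities) for p in self.partitions.values())


toy = ToyCollection(name="toy_collection", dim=4)
toy.insert(ToyEntity(row_id=1, vector=[0.1, 0.2, 0.3, 0.4]))
toy.insert(ToyEntity(row_id=2, vector=[0.5, 0.6, 0.7, 0.8]), partition="recent")
print(toy.num_entities())
```

The dimension check mirrors the fact that a vector field is declared with a fixed dimension when the collection is created.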
To run insert and search in Milvus, we need two matrices:
- xb, for the database: the vectors to insert into the Milvus collection and then search. Its size is nb-by-d.
- xq, for the query vectors, for which we need to find the nearest neighbors. Its size is nq-by-d. If we have a single query vector, nq=1.
In the following examples we work with vectors drawn from a uniform distribution in d=128 dimensions.
In Python
import numpy as np
d = 128 # dimension
nb = 100000 # database size
nq = 1000 # number of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32').tolist()
xq = np.random.random((nq, d)).astype('float32').tolist()
In node
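const d = 128; // dimension
const nb = 100000; // database size
const nq = 1000; // number of queries
// nb entities, each holding one d-dimensional vector in the "vector" field
const entities = Array.from({ length: nb }, () => ({
  vector: Array.from({ length: d }, () => Math.random()),
}));
// nq query vectors, each of dimension d
const xq = Array.from({ length: nq }, () =>
  Array.from({ length: d }, () => Math.random())
);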
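A quick sanity check of the shapes can catch mistakes before anything is sent to the server. The sketch below uses only the standard library and deliberately small sizes. `make_vectors` and `shape_of` are hypothetical helper names, not SDK functions.

```python
import random


def make_vectors(n, dim, seed=1234):
    """Return n vectors of dimension dim with values drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def shape_of(matrix):
    """Return (rows, cols) and raise if the rows have different lengths."""
    lengths = {len(row) for row in matrix}
    if len(lengths) > 1:
        raise ValueError(f"ragged matrix, row lengths: {sorted(lengths)}")
    return (len(matrix), lengths.pop() if lengths else 0)


# Smaller than the tutorial sizes, to keep the check fast.
small_xb = make_vectors(1000, 128)
small_xq = make_vectors(10, 128, seed=5678)
print(shape_of(small_xb), shape_of(small_xq))
```

Both shapes should end in the same dimension, because the database vectors and the query vectors must live in the same space.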
To use Milvus, you first need to connect to the Milvus server.
In Python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
connections.connect(host='localhost', port='19530')
In node
import { MilvusClient } from "@zilliz/milvus2-sdk-node";
const milvusClient = new MilvusClient("localhost:19530");
Before inserting data into Milvus, you need to create a collection. The collection's schema lists the fields that every entity carries; in this example that is a primary-key field for the row ID and one vector field.
You can customize row IDs. If you do not configure them manually, Milvus automatically assigns row IDs to entities. If you choose to configure your own row IDs, note that Milvus does not currently support row ID de-duplication, so the same collection can contain duplicate row IDs.
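Because Milvus does not de-duplicate row IDs, a client that supplies its own IDs may want to check them before inserting. A minimal sketch in plain Python; `find_duplicate_ids` is a hypothetical helper, not part of pymilvus.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, c in counts.items() if c > 1)


custom_ids = [10, 11, 12, 11, 13, 10]
duplicates = find_duplicate_ids(custom_ids)
if duplicates:
    print(f"duplicate row IDs, fix before inserting: {duplicates}")
```

This only checks one batch on the client side. It does not detect IDs that already exist in the collection.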
In Python
collection_name = "hello_milvus"
default_fields = [
    # primary key; auto_id=True lets Milvus generate the row IDs
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # vector field; dim must match the dimension d of the vectors
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=d),
]
default_schema = CollectionSchema(fields=default_fields, description="test collection")
print("\nCreate collection...")
collection = Collection(name=collection_name, schema=default_schema)
In node
// DataType is the SDK's enum of field types and must be imported as well;
// the import path depends on the SDK version.
const collection_name = "hello_milvus";
const params = {
  collection_name: collection_name,
  fields: [
    {
      name: "vector",
      description: "vector field",
      data_type: DataType.FloatVector,
      type_params: {
        dim: d,
      },
    },
    {
      name: "id",
      data_type: DataType.Int64,
      autoID: true,
      is_primary_key: true,
      description: "",
    },
  ],
};
await milvusClient.collectionManager.createCollection(params);
After creating a collection in Milvus, we need to insert data into the collection.
In Python
print("\nInsert data")
# column-based insert: one list per field that is not auto-generated (here only "vector")
mr = collection.insert([xb])
# show the number of entities inserted into Milvus
print(collection.num_entities)
# view the IDs that Milvus generated automatically
print(mr.primary_keys)
In node
await milvusClient.dataManager.insert({
  collection_name: collection_name,
  fields_data: entities,
});
// flush data to disk
await milvusClient.dataManager.flush({ collection_names: [collection_name] });
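Inserting all nb vectors in one call is convenient for a demo. For larger datasets it can be easier to insert in batches. The sketch below shows only the batching logic in plain Python: `iter_batches` is a hypothetical helper, the batch size is an arbitrary assumption, and the commented call mirrors the Python insert call from this section.

```python
def iter_batches(vectors, batch_size):
    """Yield consecutive slices of `vectors`, each with at most `batch_size` rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(vectors), batch_size):
        yield vectors[start:start + batch_size]


demo_vectors = [[float(i), float(i) + 0.5] for i in range(10)]
batch_sizes = []
for batch in iter_batches(demo_vectors, batch_size=4):
    # With a live collection this would be: collection.insert([batch])
    batch_sizes.append(len(batch))
print(batch_sizes)  # [4, 4, 2]
```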
Before searching, you need to load the data from disk to memory.
Warning:
In the current release, the data to be loaded must be under 70% of the total memory of all query nodes, so that memory remains available for the execution engine.
In Python
collection.load()
In node
await milvusClient.collectionManager.loadCollection({
collection_name: collection_name,
});
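The 70% limit in the warning can be checked with rough arithmetic before loading. The sketch below counts only the raw float32 vectors (4 bytes per value) and ignores IDs, indexes, and any other overhead, so treat the result as a lower bound. The 8 GiB total is a made-up figure for illustration, and `raw_vector_bytes` and `fits_load_budget` are hypothetical helpers.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Size of the raw vector data, assuming float32 values (4 bytes each)."""
    return num_vectors * dim * bytes_per_value


def fits_load_budget(data_bytes, total_memory_bytes, limit=0.70):
    """True if the data stays under `limit` of the query nodes' total memory."""
    return data_bytes < limit * total_memory_bytes


data_bytes = raw_vector_bytes(100_000, 128)  # the tutorial's nb and d
total_memory = 8 * 1024**3                   # assumed: 8 GiB across all query nodes
print(data_bytes)                            # 51200000
print(fits_load_budget(data_bytes, total_memory))  # True
```

For the tutorial's 100,000 vectors of dimension 128, the raw data is about 51 MB, far below the limit on any realistic deployment.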
The basic search operation that can be performed on a collection is the k-nearest-neighbor search, i.e., for each query vector, find its k nearest neighbors in the database.
The result object can be used as a 2-D array: results[i] (0 <= i < len(results)) holds the top-k results of the i-th query vector, and results[i][j] (0 <= j < len(results[i])) is the j-th result of the i-th query vector.
In Python
top_k = 5
search_params = {"metric_type": "L2"}
results = collection.search(xq, anns_field="vector", param=search_params, limit=top_k)
# results[0] holds the top_k hits for the first query vector
print("id: {}, distance: {}".format(results[0].ids, results[0].distances))
In node
const top_k = 5;
const searchParams = {
anns_field: "vector",
topk: top_k,
metric_type: "L2",
params: JSON.stringify({ nprobe: 10 }),
};
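const results = await milvusClient.dataManager.search({
  collection_name: collection_name,
  // partition_names: [],
  expr: "",
  // one random query vector; its length must equal the collection's dimension d
  vectors: [Array.from({ length: d }, () => Math.random())],
  search_params: searchParams,
  vector_type: DataType.FloatVector, // must match the type of the "vector" field
});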
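To see what a k-nearest-neighbor search computes and how the 2-D result is laid out, here is a brute-force version in plain Python. It is a conceptual sketch, not how Milvus searches: it scans every database vector and ranks by squared Euclidean distance, which orders neighbors the same way as Euclidean distance. `squared_l2` and `knn_search` are hypothetical names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_search(database, queries, top_k):
    """Return results where results[i][j] is (row index, distance) of the
    j-th nearest database vector to the i-th query vector."""
    results = []
    for query in queries:
        scored = [(idx, squared_l2(query, vec)) for idx, vec in enumerate(database)]
        scored.sort(key=lambda pair: pair[1])  # nearest first
        results.append(scored[:top_k])
    return results


database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.9, 0.1], [0.0, 1.8]]
results = knn_search(database, queries, top_k=2)
for i, hits in enumerate(results):
    print(i, [idx for idx, _ in hits])
```

The outer index selects the query and the inner index selects the rank, which is the same layout as the result object described in this section.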
Besides vector search, you can also run a query.
In addition to vectors, Milvus supports data types such as booleans, integers, floating-point numbers, and more.
A query runs over all existing data and returns every entity that meets the requirements you specify. Use a boolean expression to specify the requirements.
In Python
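# The schema uses auto_id=True, so the IDs are generated by Milvus;
# query a few IDs returned by the insert above.
ids = mr.primary_keys[:4]
expr = f"id in {ids}"
output_fields = ["id", "vector"]
res = collection.query(expr, output_fields)
# check the results
sorted_res = sorted(res, key=lambda k: k['id'])
print(sorted_res)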
In node
const queryResults = await milvusClient.dataManager.query({
  collection_name: collection_name,
  // replace these with IDs that exist in your collection
  expr: "id in [2,4,6,8]",
  output_fields: ["id", "vector"],
});
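Boolean expressions are plain strings, so they are often built from data at run time. A minimal sketch that builds an `in` expression from a list of integer IDs; `build_in_expr` is a hypothetical helper, and it rejects non-integer values so that arbitrary text cannot end up inside the expression.

```python
def build_in_expr(field_name, ids):
    """Build an expression such as 'id in [2,4,6,8]' from integer IDs."""
    if not ids:
        raise ValueError("ids must not be empty")
    for value in ids:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"row IDs must be integers, got {value!r}")
    joined = ",".join(str(value) for value in ids)
    return f"{field_name} in [{joined}]"


demo_expr = build_in_expr("id", [2, 4, 6, 8])
print(demo_expr)  # id in [2,4,6,8]
```

The resulting string can be passed as the expression argument of a query call.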