A vector database is a database that stores data as high-dimensional vectors: mathematical representations of the original data that capture its features or attributes. These representations are generated by applying a transformation or embedding function, such as an AI model, to the original data. The embedding function converts data from its original form (text, images, audio, video, etc.) into an array of floating-point values, each expressing the data's position along one dimension. Each vector has a fixed number of dimensions (ranging from tens to thousands), depending on the embedding function and the complexity of the data.
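To make the idea of an embedding function concrete, here is a minimal sketch. It is a toy hashed bag-of-words, not an AI model, and it captures no semantic meaning; the name `toy_embed` and the choice of 8 dimensions are illustrative assumptions. What it does show is the shape of the transformation: arbitrary text goes in, and an array of floating-point values with a fixed number of dimensions comes out.

```python
import hashlib
from typing import List


def toy_embed(text: str, dims: int = 8) -> List[float]:
    """Map text to a fixed-length vector by hashing each word into a bucket.

    Hypothetical stand-in for a real embedding model.
    """
    vector = [0.0] * dims
    for word in text.lower().split():
        # A stable hash ensures the same word always lands in the same dimension.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vector[bucket] += 1.0
    return vector


vec = toy_embed("vector databases store embeddings")
print(len(vec))  # 8: every input maps to the same number of dimensions
```

A real embedding model would place semantically related inputs close together in the vector space, which this word-counting sketch does not attempt.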
The growth of vector embeddings generated by AI models (natural language processing, large language models, computer vision, etc.) has led to the greater use of vector databases. Embeddings have many attributes or features, which makes them challenging to manage. These features are critical to understanding patterns, relationships, and underlying structures present in the data. Vector databases are purpose-built for managing vector embeddings and offer significant advantages over traditional scalar-based databases and standalone vector indexes. Vector databases can handle the complexity and scale of the data, improving performance, scalability, and flexibility, as well as introducing advanced features, such as semantic information retrieval and long-term memory, to AI models.
The main advantage of vector databases is enabling fast similarity search and the retrieval of data based on their vector distance or similarity. Rather than using traditional methods of querying databases based on exact matches or predefined criteria, vector databases allow users to find the most similar or relevant data based on their semantic or contextual meaning. Examples could include the following:
- Images that are similar to a given image based on their visual content and style
- Documents that are similar to a given document based on their topic and sentiment
- Products that are similar to a given product based on their features and ratings
Similarity search and retrieval require a query vector representing the desired information or features. Query vectors can be derived from the same type of data as the stored vectors (e.g., using an image to query an image database) or from different types of data (e.g., using text to query an image database). The database then uses a similarity measure to calculate how close vectors are in the vector space. Typical metrics are cosine similarity, Euclidean distance, Hamming distance, and the Jaccard index. Similarity search results are often returned as a list of vectors ranked by similarity score.
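The following is a minimal sketch of these metrics and of a ranked search, written in plain Python. The function names, the in-memory `store` dictionary, and the three-dimensional example vectors are all illustrative assumptions. The search is a brute-force scan over every stored vector; production vector databases typically rely on specialized indexes instead.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between a and b; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_index(a: Set, b: Set) -> float:
    """Size of the intersection over size of the union (assumes a non-empty union)."""
    return len(a & b) / len(a | b)


def search(
    query: Sequence[float],
    store: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top_k matches."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:top_k]


# Hypothetical stored vectors keyed by a label for the original data.
store = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "car": [0.0, 1.0, 0.0],
}

results = search([0.8, 0.2, 0.0], store)
print([key for key, _ in results])  # ['kitten', 'cat']
```

Note that cosine similarity and the Jaccard index are similarity scores (higher is closer), while Euclidean and Hamming are distances (lower is closer), so the sort direction in a ranked search depends on which metric is used.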