

How Do Vector Databases Work in Domain-Specific Modeling?

November 29, 2023 | Author: Edwin Lisowski, CSO & Co-Founder | Reading time: 14 minutes


The vector database market is predicted to grow from $1.5 billion in 2023 to $4.3 billion by 2028. [1] There has also been a surge in the number of vector database startups seeking to streamline the technology even further.

This comes as no surprise given the unique data management needs of emerging AI technologies such as machine learning models and large language models, which rely heavily on semantic information to understand context and maintain a long-term memory they can draw on when executing complex tasks.

This guide delves into the intricacies of vector databases: what they are, how they work, their key features, and the benefits they provide.

What are vector databases?

A vector, in its simplest form, is a mathematical object that represents direction and magnitude in space. The concept becomes more nuanced, however, when applied to data.

In data science, vectors serve as fundamental building blocks for representing complex information in a machine-understandable form.

A vector database, then, is a repository that stores and manages unstructured data, such as images, text, or audio, as high-dimensional vectors (vector embeddings), making similar items easy to find and retrieve.


This unique approach to data management comes in handy in the development of machine learning, natural language processing, large language models, and generative AI models, all of which need to contextualize and process large amounts of data.

By giving AI models access to vector embeddings, vector databases provide the semantics needed for human-like long-term memory, enabling models to recall and draw on information when executing complex tasks.

NOTE: Vector embeddings are numerical representations of a particular image, word, or any other piece of data. Vector databases determine the similarity between vectors by measuring the distance between each vector embedding.
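To make this concrete, here is a minimal sketch of turning text into embeddings and comparing them. It assumes the sentence-transformers package and the publicly available all-MiniLM-L6-v2 model, neither of which the article prescribes; any embedding model would work the same way.

```python
# A minimal sketch: encode text into vector embeddings and compare them.
# Assumes the sentence-transformers package and the all-MiniLM-L6-v2 model
# (an assumption for illustration, not something the article specifies).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "A cat sits on the mat.",
    "A kitten rests on a rug.",
    "Stock prices fell sharply.",
]
embeddings = model.encode(sentences)  # numpy array of shape (3, 384)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings[0], embeddings[1]))  # high: similar meaning
print(cosine_similarity(embeddings[0], embeddings[2]))  # low: unrelated meaning
```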

How do vector databases work?

Vector databases work by employing algorithms to query and index vector embeddings. These algorithms allow approximate nearest neighbor (ANN) search via quantization, hashing, or graph-based search. [2]

ANN search retrieves information by finding the vectors closest to the query without exhaustively checking every stored vector, which requires far fewer computational resources than an exact (true) k-nearest neighbor (kNN) search. [3]
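For contrast, here is a minimal brute-force (exact) kNN sketch in NumPy. This is the baseline that ANN methods approximate, and its cost grows linearly with the number of stored vectors; the random data is purely illustrative.

```python
# Exact (brute-force) k-nearest-neighbor search: the baseline ANN approximates.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(10_000, 128))  # 10k stored vectors, 128 dimensions
query = rng.normal(size=128)

# Euclidean distance from the query to every stored vector.
distances = np.linalg.norm(database - query, axis=1)

k = 5
nearest_ids = np.argpartition(distances, k)[:k]                # k smallest, unordered
nearest_ids = nearest_ids[np.argsort(distances[nearest_ids])]  # order by distance
print(nearest_ids, distances[nearest_ids])
```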

Here’s a typical database pipeline:


Indexing

A vector database uses various algorithms to index vectors by mapping them onto a given data structure through quantization, hashing, or graph-based techniques, thus enabling faster searches. Here’s how the various indexing techniques work:

  • Hashing: Vector databases use hashing algorithms such as locality-sensitive hashing (LSH), which is well suited to nearest-neighbor search. LSH hashes similar vectors into the same table buckets; at query time, the query is hashed and compared only against the vectors that landed in the same bucket, which keeps lookups fast.
  • Quantization: Quantization techniques like product quantization (PQ) break vectors up into smaller parts, use code to represent the smaller parts, and then put the parts back together, resulting in a code representation of vectors and their components. When assembled, the code representation of vectors is referred to as a codebook. When queried, a vector database using quantization techniques breaks each query into code and then matches it against the codebook to find the most similar code, thereby generating results.
  • Graph-based: Graph-based indexing techniques utilize algorithms like the Hierarchical Navigable Small World (HNSW) algorithm, which uses nodes to represent vectors. [4] The HNSW algorithm clusters the nodes and draws edges between similar nodes, creating a hierarchical graph. When a query arrives, the algorithm navigates the graph hierarchy to find nodes containing vectors similar to the query vector (see the sketch after this list).
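Here is a minimal graph-based indexing sketch. It uses the hnswlib library, one possible HNSW implementation rather than anything the article prescribes, and random vectors stand in for real embeddings.

```python
# Build and query an HNSW (graph-based) index with hnswlib.
# M controls graph connectivity; ef_construction / ef control search depth.
import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
rng = np.random.default_rng(0)
data = rng.normal(size=(num_elements, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)                                    # query-time search depth
labels, distances = index.knn_query(data[:1], k=5)  # approximate 5 nearest neighbors
print(labels, distances)
```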


Querying

Upon receiving a query, vector databases compare the query to indexed vectors to determine the nearest vector neighbors. To achieve this, the vector database relies on mathematical methods called similarity measures.

The main similarity measures include the following (a short numeric sketch follows the list):

  • Cosine similarity: The cosine similarity method establishes similarity on a range of -1 to 1. Essentially, cosine similarity measures the angle between two vectors in a space. This way, it is better able to determine which vectors are identical (represented by 1), diametrically opposed (represented by -1), or orthogonal (represented by 0). [5]
  • Euclidean distance: This method measures similarity on a range of 0 to infinity by taking the straight-line distance between vectors. Identical vectors have a distance of 0, and larger values indicate a greater difference between vectors.
  • Dot product: The dot product method measures similarity on a range of minus infinity to infinity. It achieves this by measuring the product of the magnitude of two vectors and the cosine of the angle between them. Any vectors pointing away from each other are assigned a negative value, those pointing towards each other get a positive value, and orthogonal vectors are assigned 0.
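Written out in NumPy, the three measures above look like this (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a, b):   # range [-1, 1]; 1 = same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):  # range [0, inf); 0 = identical vectors
    return float(np.linalg.norm(a - b))

def dot_product(a, b):         # range (-inf, inf); sign follows the angle
    return float(np.dot(a, b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude
print(cosine_similarity(a, b))   # 1.0: identical direction
print(euclidean_distance(a, b))  # ~3.74: not the same point
print(dot_product(a, b))         # 28.0: positive, vectors point the same way
```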

Post-Processing

Also called post-filtering, post-processing is the final step in a vector database pipeline. Here, the database re-ranks or filters the nearest neighbors identified in the search, typically using their metadata or an additional similarity measure.

That said, some vector databases take a different approach by applying filters before running a vector search. In that case, the process is referred to as preprocessing or pre-filtering.
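A toy illustration of the post-filtering case, with hypothetical IDs, distances, and metadata fields:

```python
# Post-filtering: run the vector search first, then drop nearest neighbors
# whose metadata does not satisfy the query's filter. All names are made up.
def post_filter(neighbors, metadata, allowed_category):
    """neighbors: list of (vector_id, distance) pairs from the ANN search."""
    return [(vid, dist) for vid, dist in neighbors
            if metadata[vid]["category"] == allowed_category]

neighbors = [(12, 0.08), (7, 0.11), (41, 0.15)]  # hypothetical ANN results
metadata = {
    12: {"category": "shoes"},
    7:  {"category": "hats"},
    41: {"category": "shoes"},
}
print(post_filter(neighbors, metadata, "shoes"))  # [(12, 0.08), (41, 0.15)]
```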

Database operations

Vector databases come equipped with a set of core components and capabilities that increase their effectiveness in high-scale production settings.

Some of the core components of such a database include:

Performance and fault tolerance

Performance and fault tolerance go hand in hand. Typically, the more data there is, the higher the number of nodes required, which, unfortunately, increases the possibility of errors and failures. These errors could be due to anything from hardware failures to network failures and other technical bugs.

In a bid to ensure high performance and fault tolerance, vector databases apply sharding and replication techniques. Here’s how they work:

  • Sharding: Sharding entails partitioning the data across multiple nodes, which can be done in several ways. For instance, data can be partitioned by cluster similarity so that similar vectors are stored in the same partition. When the vector database receives a query, the query is sent to all shards, and the results are retrieved and combined in what is known as a 'scatter-gather' pattern (a toy sketch follows this list).
  • Replication: Replication involves creating multiple copies of the data across multiple nodes. The existence of multiple copies of the same data ensures that there’s always a replacement for the data whenever a node fails.
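Here is a bare-bones scatter-gather sketch; the shard layout and brute-force local search are illustrative stand-ins for a real distributed deployment.

```python
# Scatter-gather: send the query to every shard, collect each shard's local
# nearest neighbors, then merge and re-rank them into a single result set.
import numpy as np

def local_search(shard, query, k):
    dists = np.linalg.norm(shard["vectors"] - query, axis=1)
    order = np.argsort(dists)[:k]
    return [(shard["ids"][i], float(dists[i])) for i in order]

def scatter_gather(shards, query, k):
    candidates = []
    for shard in shards:                  # scatter: query every shard
        candidates.extend(local_search(shard, query, k))
    return sorted(candidates, key=lambda x: x[1])[:k]  # gather: merge results

rng = np.random.default_rng(0)
shards = [{"ids": list(range(i * 100, i * 100 + 100)),
           "vectors": rng.normal(size=(100, 16))} for i in range(3)]
print(scatter_gather(shards, rng.normal(size=16), k=5))
```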

The process employs two main consistency models: eventual consistency and strong consistency. The former, eventual consistency, allows for temporary inconsistencies between the various copies, which improves availability and reduces latency. On the downside, the presence of inconsistencies in the data may result in conflicting results and even data loss.

Strong consistency, on the other hand, requires that all copies of the data are updated before a write operation is considered complete, thereby providing greater consistency. However, it may also result in latency.

Monitoring

Effective monitoring is vital for managing and maintaining a vector database. A robust monitoring system tracks important aspects of the database, including its performance, health, and overall status.

By constantly monitoring the database, developers can effectively detect potential problems, optimize performance, and ensure smooth production operations. Some of the most notable aspects of vector database monitoring include:

  • Resource usage: By monitoring resource usage aspects of the vector database, such as memory, CPU, disk space, and network activity, developers can effectively identify potential issues and resource constraints that could negatively impact the database’s performance.
  • Query performance: Key query metrics include throughput, latency, and error rates. Degradation in any of them may point to systemic issues that, if not addressed, will hurt the database's performance (the sketch after this list shows one simple way to track such metrics).
  • System health: Monitoring system health typically includes checking the status of the replication process, individual nodes, and other crucial components.
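A minimal sketch of query-performance tracking; a production setup would export these numbers to a monitoring stack such as Prometheus or Grafana, and search_fn below is a placeholder for whatever search call your database client exposes.

```python
# Track latency and error counts around each vector-search call.
import time

class QueryMonitor:
    def __init__(self):
        self.latencies = []
        self.errors = 0

    def timed_query(self, search_fn, *args, **kwargs):
        # search_fn is a placeholder for the actual database search call.
        start = time.perf_counter()
        try:
            return search_fn(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)

    def report(self):
        n = len(self.latencies)
        avg_ms = 1000 * sum(self.latencies) / n if n else 0.0
        return {"queries": n, "avg_latency_ms": avg_ms, "errors": self.errors}
```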

Access control

Access control is a crucial aspect of data security. It ensures that only authorized personnel can view, interact with, or modify sensitive data stored within the vector database. As such, the process typically entails managing and regulating user access to data and resources.

Some of the most notable benefits of implementing proper access-control measures include:

  • Data protection: According to a CISCO data privacy benchmark report, 95% of business executives recognize privacy as a business necessity. [6] Data protection is even more vital when it comes to AI applications dealing with sensitive and confidential information. As such, vector databases need to implement strict access control mechanisms to help safeguard data from unauthorized access and potential breaches.
  • Accountability and auditing: The primary purpose of accountability and auditing access control mechanisms is to help organizations maintain a record of user activities within the vector database. The information gathered is crucial for auditing processes such that when breaches happen, organizations can trace back any unauthorized access or modifications.
  • Compliance: Any organization dealing with sensitive information, including those in the finance and healthcare sectors, is subject to strict data privacy regulations. By implementing robust access control, organizations can comply with these regulations and protect themselves from legal and financial repercussions.
  • Scalability and flexibility: All organizations seek to grow and expand. As this happens, their access control mechanisms may need to change to allow seamless modification and expansion of user permissions. This ultimately ensures that the organization’s data security remains intact throughout its growth.

Backups and collections

Despite all the measures put in place to ensure the smooth operation of vector databases, there is always a possibility of data loss or critical failure. For this reason, most databases offer the ability to regularly create backups that can be relied upon if something goes wrong.

Depending on the organization’s requirements and infrastructure, these backups can be stored on cloud-based storage services or external storage systems, thus effectively ensuring the safety and recoverability of data.

In case of data loss or corruption, organizations can use these backups to restore their databases to a previous state, thereby minimizing downtime and impact on the overall system.

APIs and SDKs

A vector database wouldn’t be as effective without an easy-to-use API. By implementing a user-friendly interface, developers can simplify the development of high-performance vector search applications.

Additionally, programming language-specific SDKs make it easier for developers to interact with the database in their applications. Ultimately, developers are better able to concentrate on the system’s specific use cases such as semantic text search, hybrid search, generative question answering, product recommendations, or image similarity search, without having to worry about underlying infrastructure complexities.
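As an illustration of what such an SDK looks like in practice, here is a short workflow using the open-source Chroma client; the article does not endorse any particular product, and the documents and IDs are made up.

```python
# One possible client-SDK workflow, sketched with the open-source Chroma
# database. The SDK hides embedding, indexing, and querying behind a few calls.
import chromadb

client = chromadb.Client()  # in-memory instance, convenient for experimentation
collection = client.create_collection(name="products")

collection.add(
    documents=["red running shoes", "blue trail sneakers", "leather office bag"],
    metadatas=[{"category": "shoes"}, {"category": "shoes"}, {"category": "bags"}],
    ids=["p1", "p2", "p3"],
)

# Semantic query: Chroma embeds the text and returns the nearest documents.
results = collection.query(query_texts=["sports footwear"], n_results=2)
print(results["ids"], results["distances"])
```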

Features and advantages of vector databases

Vector databases take a different approach to storing and retrieving data as opposed to traditional relational databases. While the latter stores complex data in tables, vector databases are specially engineered to manage high-dimensional vector data.

These vectors can represent anything from a snippet of a song to a product image on an e-commerce site. The following features make vector databases uniquely suitable for handling multidimensional data.

Flexibility in handling different types of data

Data such as images, text, sounds, and other complex, multidimensional data types can’t be represented in rows and columns. Vector databases are able to overcome these constraints by:

  • Representing complex data as vectors: By transforming different data types into vectors, such databases can effectively handle anything from a symphony snippet to a picture of a cat.
  • Customizable distance metrics: Vector databases let you choose the distance metric, such as Euclidean distance or cosine similarity, that best fits the nature of the data, ensuring accurate and relevant results.
  • Multidimensional searching: vector databases can handle the complexities of multidimensional searching regardless of the number of dimensions, thus providing versatile solutions for various applications.

Speed and efficiency in searching

Vector databases offer a very fast way to store, search, and retrieve complex, multidimensional data. Here's how they do it:

  • Indexing techniques: By using techniques like quantization and clustering, vector databases are able to make operations much faster.
  • Scalability: Vector databases leverage distribution systems that enable them to grow with organizational data requirements, thus ensuring consistent performance even as demand increases.
  • Algorithms designed for vectors: With algorithms like nearest neighbor search, vector databases can quickly and effectively sift through tons of data points to find the most relevant matches.

Integration with machine learning and AI applications

Vector databases don’t function as standalone systems. They are part of broader ecosystems that typically include machine learning and AI. Here’s how their integration with these systems forms a symbiotic relationship:

  • Feeding AI models: Because vector databases store and manage vast amounts of data, they are an invaluable data source for AI models. This comes in handy in AI operations such as advanced analytics and decision-making.
  • Real-time insights: When combined with AI-based technologies like machine learning, vector databases can provide real-time insights, enabling organizations to respond swiftly to demands, trends, and potential issues.

Cost of ownership

One of the greatest hindrances to developing and maintaining LLMs is the cost of training foundational models from scratch and fine-tuning them for specific applications. Using vector databases significantly reduces the cost of working with foundational models and speeds up inference, making them a cost-effective alternative to traditional methods.

Long term memory of LLMs

With vector databases, organizations don’t necessarily have to develop foundational models. Instead, they can start with general-purpose models, such as Google’s Flan or Meta’s Llama 2, and feed the model their own data through a vector database to enhance its output. This approach also works well for enhancing the output of other AI applications.
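A bare-bones sketch of that pattern, often called retrieval-augmented generation (RAG), looks roughly like this; embed, vector_store, and llm_generate are placeholders for whatever embedding model, database client, and LLM you actually use.

```python
# Retrieval-augmented generation: embed the question, pull the most relevant
# passages from the vector store, and prepend them to the LLM prompt.
def answer_with_context(question, vector_store, embed, llm_generate, k=3):
    query_vector = embed(question)                      # placeholder embedder
    passages = vector_store.search(query_vector, k=k)   # nearest-neighbor lookup
    context = "\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(prompt)                          # placeholder LLM call
```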

Boosting other AI capabilities

One of the greatest benefits of vector databases for AI applications is that they let existing models work across large datasets by providing efficient access to and retrieval of data for real-time operations.

As with organic (human) brains, vector databases provide the foundation for memory recall. In this analogy, an AI system breaks down into memory recall (vector databases), cognitive functions (large language models), neurological pathways (data pipelines), and specialized memory engrams and encodings (vector embeddings).

By working synergistically, these processes allow AI to learn, grow, and access information seamlessly. Therefore, just like human memory, vector databases hold all the memory engrams and provide the model’s ‘cognitive functions’ with the ability to recall information.

Similarly, generative AI processes can access large datasets, correlate the data efficiently, and use the data to make contextual decisions on what comes next.


Final thoughts on vector databases

Vector databases are poised to revolutionize the development of AI and ML models. As technology continues to evolve with the creation of better, more powerful embeddings, we expect to see the development of new techniques and algorithms to manage these embeddings.

The decade might also see the development of hybrid databases intended to combine the power of vector databases with traditional relational databases as an answer to the growing need for scalable, efficient databases.

References

[1] Marketsandmarkets.com. Vector Database Market. URL: https://t.ly/G6ufT. Accessed on November 24, 2023.
[2] Towardsdatascience.com. Comprehensive Guide to Approximate Nearest Neighbors Algorithms. URL: https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6. Accessed on November 24, 2023.
[3] IBM.com. What Is the k-Nearest Neighbors Algorithm? URL: https://t.ly/rxGLY. Accessed on November 24, 2023.
[4] Towardsdatascience.com. Similarity Search, Part 4: Hierarchical Navigable Small World (HNSW). URL: https://towardsdatascience.com/similarity-search-part-4-hierarchical-navigable-small-world-hnsw-2aad4fe87d37. Accessed on November 24, 2023.
[5] Cosine Similarity. URL: https://t.ly/qe8xG. Accessed on November 24, 2023.
[6] Cisco.com. Privacy’s Growing Importance and Impact. URL: https://t.ly/90_j-. Accessed on November 24, 2023.


