We’re all aboard the hype train with AI, GPT models, and the whirlwind of change they're bringing to our tech landscape. If you’re anything like me—a tech enthusiast with a knack for diving headfirst into new tools—you’ve probably found yourself knee-deep in projects that sounded simple at first but turned out to be gateways to entirely new realms of knowledge.
My latest adventure? It started with a seemingly little quest (I’ll ban “little” from my vocabulary moving forward 😶) to build an “Ask questions to your PDF” tool, which led me down the rabbit hole and into the world of “Vector Databases”.
This article won’t be a technical deep dive, but rather an introduction to the main concepts behind vector databases.
Alright, let's get down to brass tacks. If you've dabbled in backend development, you've likely played around with relational databases (hello 👋, MySQL and PostgreSQL) or even flirted with NoSQL databases (MongoDB for documents, Redis for key-value pairs, we’re looking at you).
But guess what? That was so 2023.
The cool kids are now jamming with the new quarterback of the database school: Vector Databases.
Back to the basics: in the realm of vector databases, we've got "vector" and "database" (mind-blowing revelation, right?). Let's decode these terms with the help of our trusty sidekick, ChatGPT, so you don't have to open your Latin-Greek dictionary.
A vector is a sequence of numbers that represents data in a high-dimensional space. Each number in the sequence corresponds to a dimension and its value represents the magnitude or position along that dimension.
So, a vector is just a sequence of numbers representing data in a multi-dimensional space. Pretty clear.
Note that in the illustration below we’re talking about Embedding vectors, which are just vectors generated by an embedding model (we’ll see this later).
Source: https://milvus.io/blog/2021-10-10-milvus-helps-analyze-videos.md
Another way to phrase it: a vector is a representation of complex data (words, sentences, PDFs, images, audio, video…) in numerical form, often referred to as an embedding.
These vectors are typically high-dimensional and are used to encode semantic information about the items they represent.
Well, you know what a database is, you’re using one daily, and it will be the same in the context of a Vector Database.
A database is an organized collection of structured data, typically stored on a server/computer. Databases are designed to efficiently store, retrieve, and manipulate data, and they are used in a wide variety of applications, from simple record-keeping to complex business intelligence systems.
Imagine a world where vectors and databases come together in perfect harmony.
That’s what vector databases are all about.
They store and manage high-dimensional vector embeddings (we'll circle back to this term, promise) to perform efficient retrieval and similarity searches.
Unlike their relational or NoSQL cousins, vector databases are all about dealing with embeddings, giving them the upper hand in similarity querying, recommendation systems, and semantic searches.
Note that, unlike in a relational database, the data you save does not follow a schema you defined in advance. A vector database stores vector embeddings, generated by embedding models from unstructured data.
Vector databases are not just about storing data; they're about unlocking possibilities. They shine in scenarios where traditional databases struggle, offering a unique ability to understand and query data like images, text, videos, and sounds.
Here are a few standout use cases:
Before diving deeper into the contrasts, let's briefly recap traditional databases.
These databases, whether relational (SQL) or non-relational (NoSQL), excel at handling structured data. They organize data into predefined formats, like tables or documents, making it easy to perform precise, condition-based queries.
They're great for when your data fits nicely into rows and columns and when your queries are straightforward.
But here's the kicker: What if your data isn't just numbers and strings? What if it's more complex, like images, videos, or audio clips? That's where traditional databases start to sweat.
Source: https://pynomial.com/2021/10/open-source-vector-databases-overview/
Let’s take one example that fits the relational data model perfectly.
If you need to store the list of products an e-commerce website sells, a very simplistic approach is a “products” table in which you store each product with a name, description, photo, price and available stock.
You’ll also have a table for your customers and tables for your orders and order lines.
All these tables will be connected using foreign keys and you’ll be able to query things like:
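To make this concrete, here is a minimal sketch of that kind of schema and two precise, condition-based queries, using Python's built-in sqlite3 module. The table names, column names and sample rows are all made up for illustration (and trimmed down from the description above); a real shop would have many more columns and constraints.

```python
import sqlite3

# Hypothetical, simplified e-commerce schema held in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER
    );
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
    CREATE TABLE order_lines (
        order_id INTEGER REFERENCES orders(id),
        product_id INTEGER REFERENCES products(id),
        quantity INTEGER
    );
    INSERT INTO products VALUES (1, 'Red mug', 9.5, 12), (2, 'Blue mug', 11.0, 0);
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (1, 1);
    INSERT INTO order_lines VALUES (1, 1, 2), (1, 2, 1);
""")


def products_ordered_by(customer_name):
    """Follow the foreign keys from a customer to the products they ordered."""
    return conn.execute(
        """
        SELECT p.name, ol.quantity
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        JOIN order_lines ol ON ol.order_id = o.id
        JOIN products p ON p.id = ol.product_id
        WHERE c.name = ?
        ORDER BY p.name
        """,
        (customer_name,),
    ).fetchall()


# An exact, condition-based filter: in stock and cheaper than 10.
in_stock = conn.execute(
    "SELECT name FROM products WHERE stock > 0 AND price < 10"
).fetchall()
```

Both queries ask for exact matches on well-defined columns, which is precisely what this data model is built for.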
But how would you go about:
The answer is … meeeeeeeh.
This is where vector databases strut onto the stage. With the ability to handle complex, unstructured data like images, audio, and text, vector databases allow you to ask questions that were previously unthinkable.
You could do things that are impossible with traditional databases like querying:
Here’s where the sorcery happens. You’ve got complex data—let’s say, images. The first step is to transform these images into something the database can work with: vectors. This process involves an embedding model, a kind of alchemy that takes your image and distills it into a high-dimensional vector representing its essence.
Complex Data to Model to Vector Embedding
Your first step will be to convert these images into vectors.
A vector embedding is a vector representation of your complex data (an image in this case), capturing features describing the image in a numerical format.
This transformation involves an embedding model that takes input data and converts it into a high-dimensional vector, capturing essential features and nuances.
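To get a feel for the shape of that operation (data in, fixed-length vector out), here is a toy stand-in written in plain Python. It works on text rather than images to keep things short, and it is not a real embedding model: the `toy_embed` function and its 8 dimensions are assumptions made up for this sketch. It simply hashes words into buckets, so it captures none of the meaning a trained model would.

```python
import hashlib

DIMENSIONS = 8  # illustrative only; real models output far more dimensions


def toy_embed(text, dimensions=DIMENSIONS):
    """Hypothetical stand-in for an embedding model.

    Hashes each word into one of `dimensions` buckets and counts the hits,
    so any input text becomes a vector of the same fixed length.
    """
    vector = [0.0] * dimensions
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = digest[0] % dimensions  # first byte of the hash picks the bucket
        vector[bucket] += 1.0
    return vector


vector = toy_embed("a photo of a red mug")
```

Whatever the length of the input, the output always has the same number of dimensions, which is what lets a database compare any two items.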
Source: https://www.pinecone.io/learn/vector-database/
When querying a vector database you’ll want to determine the similarity between the image you input and the ones stored in your vector database.
Where in traditional databases you’d usually query something very specific like "Hey, do you have this?", here you're asking, "Can you find me data that's similar to this?"
Querying involves finding vectors in the database that are close to the query vector.
This involves mathematical concepts like cosine similarity or Euclidean distance—fancy ways of figuring out how close or far apart data points are in that high-dimensional space.
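If you are curious what those two measures look like, here is a minimal plain-Python sketch. The function names are mine, and the code assumes both vectors have the same length and are not all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction.

    Assumes non-zero vectors of equal length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note the opposite directions: with cosine similarity a higher score means "closer", while with Euclidean distance a lower value does.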
This illustration explains it quite well. To add some context, let’s take the example below and follow the steps that lead to a query result.
Let’s say you want to query a vector database of images:
1️⃣ Convert your search query (the image in this case) using the embedding model
2️⃣ The embedding model will return an embedding vector representing your input
3️⃣ This vector will be used as the parameter to your database query
4️⃣ The database engine will use mathematical techniques (cosine similarity…) to find vectors that are “close” to your input.
Most of these layers are ready-made: you don’t have to worry about writing a cosine similarity algorithm when performing a query, or about figuring out how to convert an image into an embedding vector.
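Still, it can help to see the four steps end to end once. The sketch below is a toy, not how a real vector database is built: the file names and 3-dimensional vectors are invented, `embed` is a hypothetical stand-in that looks vectors up instead of running a model, and the search compares the query against every stored vector, whereas real engines generally rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math

# Made-up embeddings; a real embedding model would compute these.
FAKE_EMBEDDINGS = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "kitten.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
    "query_cat_photo.jpg": [0.85, 0.15, 0.05],
}


def embed(image_name):
    """Steps 1 and 2: hypothetical stand-in for calling an embedding model."""
    return FAKE_EMBEDDINGS[image_name]


def cosine_similarity(a, b):
    """Assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query_vector, top_k=2):
    """Steps 3 and 4: score every stored vector, return the closest ones."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# The "database": stored images and their embedding vectors.
index = {name: embed(name) for name in ("cat.jpg", "kitten.jpg", "car.jpg")}

results = search(index, embed("query_cat_photo.jpg"))
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With these made-up numbers, the two cat pictures rank ahead of the car, even though the query never mentioned the word "cat" anywhere.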
As explained above, Embedding models are at the heart of converting complex data into vectors.
These models, trained on large datasets, learn to capture the essence of the data in a numerical form. Popular models include:
These models can be thought of as translators, converting various data types into a language (vectors) that vector databases can understand and manipulate.
Have a look at https://huggingface.co/, which is the go-to models directory, to get a sense of the size of this new universe.
Now that you understand the core principle, and want to develop your next-gen AI-powered application, you’ll be looking for a vector database provider!
While I haven’t tested them all and the list is not exhaustive, I’ve compiled a list of the major players, knowing that this market moves fast and new players pop up frequently.
Traditional databases with vector search support:
Dedicated vector database providers:
I hope this article helped you understand the core concept behind Vector Databases!