Vector Databases: Core Principles


Laurent Lemaire

Co-founder, SimpleBackups

February 28, 2024

We’re all aboard the hype train with AI, GPT models, and the whirlwind of change they're bringing to our tech landscape. If you’re anything like me—a tech enthusiast with a knack for diving headfirst into new tools—you’ve probably found yourself knee-deep in projects that sounded simple at first but turned out to be gateways to entirely new realms of knowledge.

My latest adventure? It started with a seemingly little quest (I’ll ban “little” from my vocabulary moving forward 😶) to build an "Ask questions to your PDF" tool, which led me down the rabbit hole to discover the world of "Vector Databases".

This article won’t be a technical deep dive, but rather an introduction to the main concepts behind vector databases.


What is a vector database?

Vector what?

Alright, let's get down to brass tacks. If you've dabbled in backend development, you've likely played around with Relational Databases (hello 👋, MySQL and PostgreSQL) or even flirted with NoSQL databases (MongoDB, Redis, we’re looking at you).
But guess what? That was so 2023.
The cool kids are now jamming with the new quarterback of the database school: Vector Databases.

Back to the basics: in the realm of vector databases, we've got "vector" and "database" (mind-blowing revelation, right?). Let's decode these terms with the help of our trusty sidekick, ChatGPT, without having to open your Latin-Greek dictionary.

Vector AKA My Data structure:

A vector is a sequence of numbers that represents data in a high-dimensional space. Each number in the sequence corresponds to a dimension and its value represents the magnitude or position along that dimension.

So, a vector is just a sequence of numbers representing data in a multi-dimensional space, pretty clear.
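To make that definition concrete, here is a minimal sketch in Python. The values are made up for illustration, and the helper names `dimensions` and `magnitude` are my own, not part of any library.

```python
import math

# A vector is an ordered sequence of numbers; each position is one dimension.
# All values below are invented for illustration.
point_2d = [3.0, 4.0]            # 2 dimensions
color_rgb = [0.0, 0.0, 1.0]      # 3 dimensions (red, green, blue)
toy_embedding = [0.12, -0.58, 0.33, 0.91, -0.07, 0.44, 0.02, -0.76]  # 8 dimensions

def dimensions(vector):
    """The number of dimensions is simply the length of the sequence."""
    return len(vector)

def magnitude(vector):
    """Length of the vector (its Euclidean norm)."""
    return math.sqrt(sum(x * x for x in vector))

print(dimensions(point_2d), magnitude(point_2d))  # 2 5.0
print(dimensions(toy_embedding))                  # 8
```

Real embedding vectors look like `toy_embedding`, only much longer: hundreds or thousands of dimensions instead of eight.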

Note that in the illustration below we’re talking about Embedding vectors, which are just vectors generated by an embedding model (we’ll see this later).

Vector Embeddings

Source: https://milvus.io/blog/2021-10-10-milvus-helps-analyze-videos.md

Another way to phrase it would be to say that a vector is a representation of complex data (words, sentences, PDFs, images, audio, video…) in a numerical form, often referred to as an embedding.
These vectors are typically high-dimensional and are used to encode semantic information about the items they represent.

“Database” AKA where my data sleeps

Well, you know what a database is, you’re using one daily, and it will be the same in the context of a Vector Database.

A database is an organized collection of structured data, typically stored on a server/computer. Databases are designed to efficiently store, retrieve, and manipulate data, and they are used in a wide variety of applications, from simple record-keeping to complex business intelligence systems.

The Power Couple: Vector and Database

Imagine a world where vectors and databases come together in perfect harmony.
That’s what vector databases are all about.

They store and manage high-dimensional vector embeddings (we'll circle back to this term, promise) to perform efficient retrieval and similarity searches.

Unlike their relational or NoSQL cousins, vector databases are all about dealing with embeddings, giving them the upper hand in similarity querying, recommendation systems, and semantic searches.

Note that, contrary to a Relational Database, the data you save does not follow a schema you defined up front. A vector database stores vector embeddings, generated by embedding models out of unstructured data.
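As a rough mental model, a vector database can be pictured as a store of (id, vector) pairs with a "find the closest ones" operation. The sketch below is an assumption-laden toy: `TinyVectorStore`, `upsert` and `search` are hypothetical names, the vectors are made up, and the search compares the query against every stored vector, whereas real products rely on specialized indexes to avoid that.

```python
import math

def cosine_similarity(a, b):
    """Similarity of direction between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

class TinyVectorStore:
    """Hypothetical in-memory store with brute-force similarity search."""

    def __init__(self):
        self._items = {}  # item_id -> vector

    def upsert(self, item_id, vector):
        """Insert a vector, or overwrite the one stored under the same id."""
        self._items[item_id] = list(vector)

    def search(self, query_vector, top_k=3):
        """Return the top_k (item_id, score) pairs, most similar first."""
        scored = [(cosine_similarity(query_vector, vec), item_id)
                  for item_id, vec in self._items.items()]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:top_k]]

store = TinyVectorStore()
store.upsert("cat", [1.0, 0.0])      # made-up 2-dimensional embeddings
store.upsert("kitten", [0.9, 0.1])
store.upsert("car", [0.0, 1.0])

results = store.search([1.0, 0.05], top_k=2)
print([item_id for item_id, _ in results])  # ['cat', 'kitten']
```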

Use cases and applications for vector databases

Vector databases are not just about storing data; they're about unlocking possibilities. They shine in scenarios where traditional databases struggle, offering a unique ability to understand and query data like images, text, videos, and sounds.

Here are a few standout use cases:

  1. Semantic Search: Unlike keyword-based searches, semantic search understands the context and meaning behind queries, delivering more relevant results. For instance, in e-commerce, customers can find products through image search or by describing the item in natural language.
  2. Recommendation Systems: By analyzing user behavior and preferences through vector embeddings, these databases can power sophisticated recommendation engines, suggesting content, products, or services with uncanny relevance.
  3. Fraud Detection: In financial services, vector databases can analyze transaction patterns to identify anomalies that may indicate fraud, leveraging the subtle similarities between fraudulent transactions.
  4. Natural Language Processing (NLP): From chatbots to sentiment analysis, vector databases enable applications to process and understand human language, facilitating more natural and effective interactions.
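To give one of these use cases some shape, here is a minimal sketch of a recommendation step. It assumes one simple approach among many: average the vectors of the items a user liked into a "profile" vector, then suggest the closest item they have not seen yet. The item names, the vectors and the `recommend` function are all hypothetical.

```python
import math

# Hypothetical item embeddings (made-up 2-dimensional vectors).
items = {
    "sci-fi novel":      [0.9, 0.1],
    "space documentary": [0.8, 0.2],
    "cookbook":          [0.1, 0.9],
    "baking show":       [0.2, 0.8],
}

def recommend(liked, top_k=1):
    """Suggest the unseen items closest to the average of the liked items."""
    liked_vectors = [items[name] for name in liked]
    profile = [sum(dim) / len(liked_vectors) for dim in zip(*liked_vectors)]
    candidates = [name for name in items if name not in liked]
    candidates.sort(key=lambda name: math.dist(items[name], profile))
    return candidates[:top_k]

print(recommend(["sci-fi novel"]))  # ['space documentary']
```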


The Leap from Traditional Databases

Traditional Databases: A Quick Recap

Before diving deeper into the contrasts, let's briefly recap traditional databases.
These databases, whether relational (SQL) or non-relational (NoSQL), excel at handling structured data. They organize data into predefined formats, like tables or documents, making it easy to perform precise, condition-based queries.

They're great for when your data fits nicely into rows and columns and when your queries are straightforward.

But here's the kicker: What if your data isn't just numbers and strings? What if it's more complex, like images, videos, or audio clips? That's where traditional databases start to sweat.

Differences between Databases and Vector Databases

  1. Data Representation: Traditional databases handle structured data well, but vector databases excel with unstructured, complex data by converting it into a numerical format that captures its essence.
  2. Search Capability: Vector databases can perform similarity searches, finding items that are "close" to a query in the high-dimensional space. Traditional databases can't natively understand or search based on the "similarity" of content.
  3. Flexibility: Traditional databases require a predefined schema, which can limit flexibility. Vector databases, dealing with numerical vectors, inherently support a more dynamic range of data types and structures.
  4. Scalability for Complex Queries: As data complexity grows, traditional databases might struggle with performance. Vector databases, with their specialized indexing and search algorithms, can efficiently scale to handle complex, high-dimensional searches.
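The difference in search capability can be sketched in a few lines. In this toy example (product names and vectors are invented, and `exact_match` and `nearest` are hypothetical helpers), an exact lookup finds nothing for a name that isn't stored, while a nearest-neighbour lookup always returns the closest item.

```python
import math

# Hypothetical catalogue: product name -> made-up 3-number vector.
products = {
    "navy t-shirt":   [0.1, 0.2, 0.9],
    "sky-blue shirt": [0.3, 0.5, 0.9],
    "red scarf":      [0.9, 0.1, 0.1],
}

def exact_match(name):
    """Traditional-style lookup: a hit only if the key matches exactly."""
    return name if name in products else None

def nearest(query_vector):
    """Vector-style lookup: the product whose vector is closest to the query."""
    return min(products, key=lambda name: math.dist(products[name], query_vector))

print(exact_match("blue shirt"))   # None
print(nearest([0.2, 0.4, 1.0]))    # sky-blue shirt
```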

"SimpleBackups Database Types"

Source: https://pynomial.com/2021/10/open-source-vector-databases-overview/

Let’s take one example that perfectly fits the relational data model.

If you need to store the list of products an e-commerce website is selling, a very simplistic approach is a “products” table in which you store each product with a name, description, photo, price and available stock.
You’ll also have a table with your customers and tables with your orders and order lines.

All these tables will be interconnected using foreign keys and you’ll be able to query things like:

  • ✅ List the orders of customer Acme
  • ✅ Show the available stock for product B
  • ✅ Show the top 5 best-selling products of this year
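For reference, the three queries above could look like this with Python's built-in `sqlite3` module. The table layout and sample rows are assumptions made for the sketch, and the "this year" filter is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER);
CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(id));
CREATE TABLE order_lines (order_id   INTEGER REFERENCES orders(id),
                          product_id INTEGER REFERENCES products(id),
                          quantity   INTEGER);
INSERT INTO customers   VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO products    VALUES (1, 'Product A', 10.0, 5), (2, 'Product B', 20.0, 12);
INSERT INTO orders      VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO order_lines VALUES (1, 1, 2), (2, 2, 1), (3, 2, 4);
""")

# List the orders of customer Acme
acme_orders = [row[0] for row in conn.execute(
    "SELECT o.id FROM orders o JOIN customers c ON c.id = o.customer_id "
    "WHERE c.name = ? ORDER BY o.id", ("Acme",))]

# Show the available stock for product B
stock_b = conn.execute(
    "SELECT stock FROM products WHERE name = ?", ("Product B",)).fetchone()[0]

# Show the top 5 best-selling products
best_sellers = conn.execute(
    "SELECT p.name, SUM(l.quantity) AS sold FROM order_lines l "
    "JOIN products p ON p.id = l.product_id "
    "GROUP BY p.id ORDER BY sold DESC LIMIT 5").fetchall()

print(acme_orders)   # [1, 2]
print(stock_b)       # 12
print(best_sellers)  # [('Product B', 5), ('Product A', 2)]
```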

But how would you go about:

  • ❓Listing all blue products (without a “color” column existing in your products table)
  • ❓Listing the products that look similar to the last product you purchased
  • ❓Recommending a similar article based on product description

The answer is … meeeeeeeh.

This is where vector databases strut onto the stage. With the ability to handle complex, unstructured data like images, audio, and text, vector databases allow you to ask questions that were previously unthinkable.

You could do things that are impossible with traditional databases like querying:

  • ✅ List me the podcasts in which people talk about science
  • ✅ Show me pictures of black dogs
  • ✅ Create a summary of this PDF file


How do Vector Databases work?

Transforming Data into Vectors

Here’s where the sorcery happens. You’ve got complex data—let’s say, images. The first step is to transform these images into something the database can work with: vectors. This process involves an embedding model, a kind of alchemy that takes your image and distills it into a high-dimensional vector representing its essence.

Complex Data to Model to Vector Embedding

Your first step will be to convert these images into vectors.

A vector embedding is a vector representation of your complex data (an image in this case), capturing features describing the image in a numerical format.

This transformation involves an embedding model that takes input data and converts it into a high-dimensional vector, capturing essential features and nuances.
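To get a feel for what "image in, vector out" means, here is a deliberately naive stand-in for an embedding model. It is not how real models work (those are trained neural networks producing far longer vectors); it simply averages the colors of an image into three numbers. The function name `toy_image_embedding` and the pixel data are assumptions for this sketch.

```python
def toy_image_embedding(pixels):
    """Average the red, green and blue channels (0-255) of an image
    into a 3-number vector with values between 0 and 1."""
    count = len(pixels)
    return [sum(pixel[channel] for pixel in pixels) / (255 * count)
            for channel in range(3)]

# A hypothetical 4-pixel "image", given as (red, green, blue) tuples.
mostly_blue = [(0, 0, 255), (0, 0, 255), (0, 51, 255), (0, 51, 255)]
mostly_red = [(255, 0, 0), (255, 0, 0), (255, 51, 0), (255, 51, 0)]

blue_vector = toy_image_embedding(mostly_blue)
red_vector = toy_image_embedding(mostly_red)
print(blue_vector)  # red is 0.0, green about 0.1, blue 1.0
print(red_vector)   # red is 1.0, green about 0.1, blue 0.0
```

Even this crude vector is enough to tell a "blue product" from a red one without a “color” column; a real embedding model captures far richer features such as shapes, textures and objects.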

Querying Vector database

Source: https://www.pinecone.io/learn/vector-database/

How do you query a Vector database?

When querying a vector database you’ll want to determine the similarity between the image you input and the ones stored in your vector database.

Where in traditional databases you’d usually query something very specific like "Hey, do you have this?", here you're asking, "Can you find me data that's similar to this?"

Querying involves finding vectors in the database that are close to the query vector.
This involves mathematical concepts like cosine similarity or Euclidean distance—fancy ways of figuring out how close or far apart data points are in that high-dimensional space.
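Both measures fit in a few lines of plain Python. This is a minimal sketch for intuition, using made-up 2-dimensional vectors; it assumes non-zero vectors of equal length, and in practice the database engine computes these for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction,
    0.0 means perpendicular. Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0]  # perpendicular to a

print(cosine_similarity(a, b))   # 1.0
print(cosine_similarity(a, c))   # 0.0
print(euclidean_distance(a, b))  # 1.0
```

Note how the two disagree on `a` and `b`: cosine similarity only looks at direction and calls them identical, while Euclidean distance also accounts for length and puts them 1.0 apart.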

This illustration explains it quite well. To add context, let’s take the example below and follow the steps that lead us to a query result.

Let’s say you want to query a vector database of images:

1️⃣ Convert your search query (the image in this case) using the embedding model
2️⃣ The embedding model will return an embedding vector representing your input
3️⃣ This vector will be used as the parameter to your database query
4️⃣ The database engine will use mathematical techniques (cosine similarity…) to find vectors that are “close” to your input.
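The four steps above can be sketched end to end. To keep the example self-contained it uses short text captions instead of images, and `embed` is a hypothetical stand-in for an embedding model that merely counts words from a tiny made-up vocabulary.

```python
import math

VOCAB = ["dog", "cat", "black", "white", "beach", "car"]  # hypothetical toy vocabulary

def embed(text):
    """Stand-in for an embedding model: counts vocabulary words in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Index time: embed every item once and keep the vectors.
captions = ["black dog on the beach", "white cat in a car", "black car"]
index = [(caption, embed(caption)) for caption in captions]

def query(text, top_k=2):
    query_vector = embed(text)  # steps 1 and 2: input -> embedding vector
    # Steps 3 and 4: use the vector as the query and rank by similarity.
    scored = [(cosine_similarity(query_vector, vector), caption)
              for caption, vector in index]
    scored.sort(reverse=True)
    return [caption for _, caption in scored[:top_k]]

print(query("black dog"))  # ['black dog on the beach', 'black car']
```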

Most of these layers are ready to use: you don’t have to worry about writing a cosine similarity algorithm when performing a query, or about figuring out how to convert an image into an embedding vector.

What is an Embedding Model and how does it work?

As explained above, Embedding models are at the heart of converting complex data into vectors.
These models, trained on large datasets, learn to capture the essence of the data in a numerical form. Popular models include:

  • Word2Vec and GloVe for text, focusing on capturing semantic meanings of words in vector form.
  • BERT and OpenAI's models offer more advanced text understanding, considering context and subtleties in language.
  • For images, models like CNNs (Convolutional Neural Networks) are used to extract features and encode them as vectors.

These models can be thought of as translators, converting various data types into a language (vectors) that vector databases can understand and manipulate.
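The gap between the first two families can be hinted at with a toy lookup table. In the sketch below the word vectors are hand-written and hypothetical (real Word2Vec or GloVe vectors are learned from large corpora and are much longer). A static lookup returns the same vector for "bank" whatever the sentence, which is the limitation contextual models such as BERT address. Averaging word vectors, as `sentence_embedding` does, is one simple baseline for getting a single vector per sentence.

```python
# Hypothetical, hand-written 3-dimensional word vectors.
WORD_VECTORS = {
    "river": [0.9, 0.1, 0.0],
    "money": [0.0, 0.1, 0.9],
    "bank":  [0.4, 0.2, 0.4],
}

def static_word_embedding(word, sentence):
    """Word2Vec/GloVe-style lookup: one fixed vector per word,
    so the surrounding sentence is ignored."""
    return WORD_VECTORS[word]

def sentence_embedding(sentence):
    """Average the vectors of the known words into one sentence vector."""
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# Same vector for "bank" in both sentences: context is not taken into account.
same = (static_word_embedding("bank", "river bank")
        == static_word_embedding("bank", "money in the bank"))
print(same)  # True

# The averaged sentence vectors do differ, pulled toward "river" or "money".
print(sentence_embedding("river bank"))
print(sentence_embedding("money bank"))
```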

Have a look at https://huggingface.co/, which is the go-to models directory, to get a sense of the size of this new universe.

Picking the right Vector Database Provider

Now that you understand the core principle, and want to develop your next-gen AI-powered application, you’ll be looking for a vector database provider!

While I haven’t tested them all and the list is not exhaustive, I’ve compiled a list of the major players knowing this market is moving fast and new players are popping up frequently.

Traditional databases with support for Vector Search

Dedicated vector database providers:

  • Milvus: An open-source vector database designed for scalability and performance, supporting both real-time and batch processing.
    Self-hosted - Open Source
  • Weaviate: Another open-source option, known for its semantic search capabilities and ease of integration.
    Managed - Self-hosted - Open Source
  • Pinecone: A managed service that focuses on simplicity and performance, with a straightforward pricing model.
    Managed - Closed Source
  • Chroma: Open Source
  • Qdrant: Self-hosted - Open Source

I hope this article helped you understand the core concept behind Vector Databases!


