Over 80 percent of the data in the world is ‘unstructured’, such as social media posts, images, videos, and audio.
Unstructured data does not fit easily into ‘relational databases’, because relational databases are designed for storing structured data. Let's take an image as an example.
If you store an image in a relational database, you will have to manually assign keywords or tags to the image if you want to search for it or find similar images, because from the pixel values alone we cannot really search for similar images. The same holds true for unstructured text ‘blobs’ and for audio and video data. (A ‘blob’ in a relational database stands for Binary Large OBject. It is a data type used to store large amounts of binary data directly in the database, for data that doesn't fit well into standard data types, such as images, audio files, videos, or large documents.)
So we either have to assign tags or attributes to the data, or we can find a different representation to store it, and this brings us to vector embeddings and vector databases.
In short, a vector database indexes and stores vector embeddings for fast retrieval and similarity search.
Let's look at those two important components. First, the vector embeddings: these are calculated by machine learning models using clever algorithms.
A vector embedding is just a list of numbers that represents the data in a different way. For example, you can calculate an embedding for a single word, a whole sentence, or an image. The result is numerical data that the computer can understand.
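To make this concrete, here is a minimal sketch of computing a sentence embedding in Python. It assumes the sentence-transformers package is installed; the model name all-MiniLM-L6-v2 is just one commonly used choice, not the only option.

```python
# Hedged sketch: turn a sentence into a vector embedding.
# Assumes the sentence-transformers package is installed;
# 'all-MiniLM-L6-v2' is one commonly used small model, not the only option.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Vector databases enable similarity search.")

print(embedding.shape)  # (384,) -- just a list of 384 numbers
```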
Once we have vectors, we can find similar vectors by calculating the distances between them and doing a nearest-neighbor search, so we can easily find similar items.
For simplicity, let's look at an example in the 2D case (in reality, of course, vectors can have hundreds of dimensions).
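As an illustration of this 2D case, here is a small sketch of a brute-force nearest-neighbor search with NumPy; the item names and coordinates are made up for the example.

```python
import numpy as np

# Toy 2-D embeddings (real embeddings have hundreds of dimensions).
items = {
    "cat": np.array([0.9, 0.8]),
    "dog": np.array([0.7, 0.9]),
    "car": np.array([0.1, 0.2]),
}
query = np.array([0.88, 0.82])  # embedding of the item we are searching with

# Brute-force nearest neighbor: the smallest Euclidean distance wins.
distances = {name: float(np.linalg.norm(vec - query)) for name, vec in items.items()}
print(min(distances, key=distances.get))  # 'cat' -- the closest item
```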
But just storing the data as vector embeddings is not enough. Querying thousands of vectors by computing the distance metric against every one of them is extremely slow, and this is why vectors also need to be indexed.
The indexing process is the second key element of a vector database. An index is a data structure that facilitates the search process: the indexing step maps the vectors into a new data structure that enables faster searching.
(Indexing is an entire research field of its own, and different ways to calculate indexes exist.)
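As one concrete illustration, here is a hedged sketch using FAISS, a popular open-source indexing library (assumed installed as faiss-cpu; the data is random just for demonstration). A flat index performs exact brute-force search, while an HNSW graph index gives much faster approximate search.

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d = 128                                        # embedding dimensionality
vectors = np.random.random((10_000, d)).astype("float32")

# Exact (brute-force) index: accurate, but slow for large collections.
flat_index = faiss.IndexFlatL2(d)
flat_index.add(vectors)

# Approximate index (HNSW graph): much faster lookups at a small accuracy cost.
hnsw_index = faiss.IndexHNSWFlat(d, 32)        # 32 = neighbors per graph node
hnsw_index.add(vectors)

query = np.random.random((1, d)).astype("float32")
distances, indices = hnsw_index.search(query, 5)  # 5 nearest neighbors
print(indices)
```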
We can use vector databases to give Large Language Models (LLMs) long-term memory; this is, for example, what you can easily implement with LangChain. We can also use them for semantic search, when we need to search not for exact string matches but based on the meaning or context of our question.
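As a hedged sketch of semantic search, here is what a minimal example might look like with Chroma, one of the vector databases listed below; the collection name and documents are made up for illustration.

```python
import chromadb  # assumes the chromadb package is installed

client = chromadb.Client()  # in-memory client, nothing is persisted
collection = client.create_collection("articles")

# Chroma embeds these documents with its default embedding model.
collection.add(
    documents=[
        "How to water houseplants in winter",
        "A beginner's guide to training neural networks",
    ],
    ids=["doc1", "doc2"],
)

# Query by meaning, not by exact string match.
results = collection.query(query_texts=["caring for indoor plants"], n_results=1)
print(results["documents"])  # the houseplant article is returned first
```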
We can also use them for similarity search over image, audio, or video data. For example, ‘find an image similar to this one’, without needing any keywords or text to describe the image.
Finally, we can use a vector database as a ranking and recommendation engine. For online retailers, for example, it can suggest items similar to a customer's past purchases, since we can simply identify the nearest neighbors of those items' embeddings in our database.
There are a number of vector databases available, for example:
Pinecone
Weaviate
Chroma
Redis
Qdrant
Milvus
Vespa
Vector embeddings for an image are numerical representations of the image's content and features in a high-dimensional vector space. These embeddings are typically generated using machine learning models, often deep neural networks, that have been trained on large datasets of images.
Here's a brief overview of image vector embeddings:
1. Purpose of Vector Embeddings: They allow computers to "understand" and process images mathematically, enabling various tasks like image classification, similarity search, and content-based image retrieval.
2. Process: An image is passed through a pre-trained neural network, which extracts relevant features and condenses them into a fixed-length vector (see the sketch after this list).
3. Dimensionality: The resulting vector usually has hundreds or thousands of dimensions, each representing some aspect of the image.
4. Properties:
- Similar images tend to have similar vector representations (close together in the vector space).
- The embeddings capture semantic information, not just pixel-level details.
5. Applications:
- Image search and retrieval
- Face recognition
- Visual similarity comparison
- Image clustering
- Transfer learning for other computer vision tasks
6. Common models:
- Convolutional Neural Networks (CNNs) like ResNet, VGG, or Inception are often used as base models.
- More specialized models like CLIP (Contrastive Language-Image Pre-training) can create embeddings that align with both visual and textual information.
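To make steps 2 and 3 concrete, here is a hedged sketch of extracting an image embedding with a pre-trained CNN, using PyTorch and torchvision (one common choice, not the only one); the file name cat.jpg is just a placeholder.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load a pre-trained ResNet-18 and drop its classification head,
# so the model outputs the 512-dimensional feature vector instead.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

# Standard ImageNet preprocessing for the input image.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg").convert("RGB")  # placeholder file name
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0)).squeeze(0)

print(embedding.shape)  # torch.Size([512]) -- a fixed-length vector
```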
Vector embeddings are powerful because they transform complex visual information into a format that's easy for machines to process and compare. This enables a wide range of applications in computer vision and beyond.
When we talk about "features" in the context of image vector embeddings, we're referring to something more abstract and high-level than individual pixels. Let’s break this down further:
1. Beyond Pixels:
While the input to the neural network is indeed composed of pixels, the "features" we're talking about are not raw pixel values. Instead, they are more complex patterns and characteristics learned by the network.
2. Hierarchical feature extraction (illustrated in the sketch after this list):
- Lower levels: The initial layers of a neural network might detect simple features like edges, corners, or color gradients.
- Middle levels: As we go deeper, the network combines these simple features to recognize more complex patterns like textures or simple shapes.
- Higher levels: The deepest layers can identify very abstract features that might correspond to entire objects or scenes.
3. Learned features: These features are not hand-designed but learned by the network during training. The network figures out which features are most useful for its task (e.g., classification).
4. Dimensionality reduction: The process of creating an embedding often involves reducing the dimensionality of the data. For example, an image might start as millions of pixel values (e.g., 1000x1000x3 for an RGB image), but end up as a vector of just 512 or 2048 numbers.
5. Information compression: The "condensing" process is about keeping the most important information while discarding less relevant details. It's similar to how a human might describe an image: we don't list every pixel, but instead mention key elements.
6. Non-linear transformations: The conversion from pixel data to high-level features involves many non-linear transformations, allowing the network to capture complex relationships in the data.
7. Spatial invariance: Unlike raw pixels, these features often have some level of spatial invariance, meaning they can recognize patterns regardless of exact location in the image.
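The hierarchy described in item 2 can be made visible by printing the activation shapes at different depths of a CNN. Here is a hedged sketch with torchvision's ResNet-18, using a random tensor in place of a real image.

```python
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Register hooks that print the output shape of each major stage.
def make_hook(name):
    def hook(module, inputs, output):
        print(f"{name}: {tuple(output.shape)}")
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(make_hook(name))

dummy_image = torch.randn(1, 3, 224, 224)  # stands in for a real RGB image
with torch.no_grad():
    model(dummy_image)

# Expected output -- spatial resolution shrinks while channels (features) grow:
# layer1: (1, 64, 56, 56)    lower-level features (edges, color gradients)
# layer2: (1, 128, 28, 28)
# layer3: (1, 256, 14, 14)
# layer4: (1, 512, 7, 7)     higher-level, more abstract features
```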
So when we say “the network extracts relevant features and condenses them into a fixed-length vector,” we mean it's taking the raw pixel data, processing it through multiple layers to recognize increasingly complex and abstract patterns, and then summarizing all of this information into a compact, fixed-size representation that captures the essence of the image's content.
This process allows the embedding to represent semantic content rather than just visual appearance, which is why these embeddings are so powerful for tasks like image recognition and similarity search.