Chroma DB Filtering

Filtering in Chroma DB fundamentally differs from traditional SQL-based filtering due to its emphasis on vector similarity and flexible metadata querying. While SQL databases rely on structured schemas and declarative logic to retrieve exact matches, Chroma DB is designed for unstructured data and semantic search, making it well-suited for AI-driven applications.

Chroma DB supports two primary types of filtering:

Metadata Filtering - filters based on document metadata, such as "topic": "history" or "date": "2023-01-15". Similar to SQL WHERE clauses, but more flexible and combinable with vector search.

Document Filtering - filters based on document content using keyword presence (e.g., $contains, $not_contains). Comparable to SQL's CONTAINS or LIKE operators, but more powerful when integrated with vector search.

This dual-filtering approach allows Chroma DB to support complex, context-aware queries that go beyond the capabilities of traditional relational databases.

💡 Document filtering in Chroma DB is also referred to as full text search.

🏷️ Metadata Filtering in Chroma DB

Metadata filtering in Chroma DB can be performed by using the where parameter inside the .query(), .get(), or .delete() methods.

For basic metadata filtering, where you only want exact matches, use the following syntax:

where={"key": "value"}

For instance, the following code will get only those documents within collection where the metadata key is exactly equal to value:

collection.get(where={"key": "value"})

Similar syntax can be used within .query() or .delete() methods.

For defining even more complex filters, Chroma DB supports the following metadata filtering operators:

$eq - equal to (string, int, float)
$ne - not equal to (string, int, float)
$gt - greater than (int, float)
$gte - greater than or equal to (int, float)
$lt - less than (int, float)
$lte - less than or equal to (int, float)

These operators can be applied using the following syntax, using $eq as an example:

where={"key": {"$eq": "value"}}

Note that the above is identical to the following:

where={"key": "value"}

In other words, not providing an operator is equivalent to using the $eq operator.

Combining different filters can be achieved using the $and and $or logical operators as follows:

collection.get(
    where={
        "$and": [{"key": {"$eq": "value1"}}, {"key": {"$ne": "value2"}}]
    }
)

The above would get, from collection, only the documents where key is equal to value1 and not equal to value2. Similar syntax can be used for $or, as well as within the .query() and .delete() methods.

Finally, to use lists in filters, use the $in and $nin ("not in") operators:

where={"key": {"$nin":["value1", "value2"]}}

The above code will find all documents where key is not equal to either value1 or value2.
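To see how these operators and their combinations behave, here is a small, self-contained sketch in plain Python. This is not Chroma's implementation, only an illustration of the matching semantics described above, applied to ordinary dictionaries:

```python
# Illustrative only: a tiny evaluator for Chroma-style where filters,
# mimicking the matching semantics on plain metadata dictionaries.
def matches(where, metadata):
    for key, cond in where.items():
        if key == "$and":
            if not all(matches(c, metadata) for c in cond):
                return False
        elif key == "$or":
            if not any(matches(c, metadata) for c in cond):
                return False
        elif isinstance(cond, dict):
            op, value = next(iter(cond.items()))
            actual = metadata.get(key)
            if op == "$eq" and not actual == value: return False
            if op == "$ne" and not actual != value: return False
            if op == "$gt" and not actual > value: return False
            if op == "$gte" and not actual >= value: return False
            if op == "$lt" and not actual < value: return False
            if op == "$lte" and not actual <= value: return False
            if op == "$in" and actual not in value: return False
            if op == "$nin" and actual in value: return False
        else:
            # a bare value is shorthand for $eq
            if metadata.get(key) != cond:
                return False
    return True

meta = {"source": "langchain.com", "version": 0.1}
print(matches({"source": "langchain.com"}, meta))                         # True
print(matches({"$and": [{"version": {"$lt": 0.3}},
                        {"source": {"$in": ["langchain.com"]}}]}, meta))  # True
print(matches({"version": {"$gte": 0.3}}, meta))                          # False
```

Notice that the bare-value form and the $eq form take the same branch in spirit, which is exactly the equivalence noted earlier.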

📝 Document Filtering in Chroma DB

Document filtering in Chroma DB can be performed by supplying $contains or $not_contains to the where_document parameter inside the .query(), .get(), or .delete() methods using the following syntax:

where_document={"$contains":"value"}

The above would find all documents that contain value in the text of the document.

Moreover, note that you can combine multiple document filters using the $and and $or document operators in an analogous way to the metadata filters.
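Analogously, the semantics of the document operators can be illustrated with plain, case-sensitive substring checks. Again, this is a sketch of the behavior, not Chroma's actual implementation:

```python
# Illustrative only: Chroma-style document filters as case-sensitive
# substring checks on the document text.
def matches_document(where_document, text):
    for op, cond in where_document.items():
        if op == "$contains" and cond not in text:
            return False
        if op == "$not_contains" and cond in text:
            return False
        if op == "$and" and not all(matches_document(c, text) for c in cond):
            return False
        if op == "$or" and not any(matches_document(c, text) for c in cond):
            return False
    return True

doc = "This is a document about pandas"
print(matches_document({"$contains": "pandas"}, doc))            # True
print(matches_document({"$contains": "Pandas"}, doc))            # False (case-sensitive)
print(matches_document({"$or": [{"$contains": "Python"},
                                {"$contains": "pandas"}]}, doc))  # True
```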

A Full Example of Metadata and Document Filtering

The following presents a full example of metadata and document filtering in Chroma DB.

🔧 Setup

The following commands import chromadb and its embedding utilities, then create an embedding function object used to generate vector embeddings:

import chromadb
from chromadb.utils import embedding_functions
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

📂 Creating Collections

Collections help organize your data in Chroma DB.

🏗️ Create a Collection

client = chromadb.Client()
collection = client.create_collection(
    name="filter_demo",
    metadata={"description": "Used to demo filtering in ChromaDB"},
    configuration={
        "embedding_function": ef
    }
)
print(f"Collection created: {collection.name}")
# Collection created: filter_demo

➕ Adding Documents to Collections

Use add to insert documents with optional metadata.

collection.add(
    documents=[
        "This is a document about LangChain",
        "This is a reading about LlamaIndex",
        "This is a book about Python",
        "This is a document about pandas",
        "This is another document about LangChain"
    ],
    metadatas=[
        {"source": "langchain.com", "version": 0.1},
        {"source": "llamaindex.ai", "version": 0.2},
        {"source": "python.org", "version": 0.3},
        {"source": "pandas.pydata.org", "version": 0.4},
        {"source": "langchain.com", "version": 0.5},
    ],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

🏷️ Filter using Metadata

The following finds all documents where the source is "langchain.com":

collection.get(
    where={"source": {"$eq": "langchain.com"}}
)

Output:

{'ids': ['id1', 'id5'],
 'embeddings': None,
 'documents': ['This is a document about LangChain',
  'This is another document about LangChain'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'source': 'langchain.com', 'version': 0.1},
  {'version': 0.5, 'source': 'langchain.com'}]}

The above produced correct output, but suppose we were only interested in LangChain documents with versions less than 0.3. The following finds all documents where the source is "langchain.com" with versions less than 0.3:

collection.get(
    where={
        "$and": [
            {"source": {"$eq": "langchain.com"}},
            {"version": {"$lt": 0.3}}
        ]
    }
)

Output:

{'ids': ['id1'],
 'embeddings': None,
 'documents': ['This is a document about LangChain'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'source': 'langchain.com', 'version': 0.1}]}

Now, let's make an even more complicated filtering rule, one that combines logical operators with lists in filters. The following retrieves all documents about LangChain and LlamaIndex with a version less than 0.3:

collection.get(
    where={
        "$and": [
            {"source": {"$in": ["langchain.com", "llamaindex.ai"]}},
            {"version": {"$lt": 0.3}}
        ]
    }
)

Output:

{'ids': ['id1', 'id2'],
 'embeddings': None,
 'documents': ['This is a document about LangChain',
  'This is a reading about LlamaIndex'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'source': 'langchain.com', 'version': 0.1},
  {'source': 'llamaindex.ai', 'version': 0.2}]}

📝 Filter using Document Content

Suppose we wanted to find documents that include the word "pandas" in the text. The following performs a full text search for such documents:

collection.get(
    where_document={"$contains":"pandas"}
)

Output:

{'ids': ['id4'],
 'embeddings': None,
 'documents': ['This is a document about pandas'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'source': 'pandas.pydata.org', 'version': 0.4}]}

💡 Document filtering is case-sensitive in Chroma DB. Therefore, searching for "Pandas" will not find any documents.
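One common workaround for this case-sensitivity (a pattern we are suggesting here, not a built-in Chroma feature) is to lowercase document text on ingest and lowercase search terms before querying, at the cost of losing the original casing unless it is stored separately:

```python
# Hypothetical workaround: normalize text to lowercase on ingest and
# lowercase the search term, so "Pandas" and "pandas" match the same documents.
documents = ["This is a document about Pandas"]
normalized = [d.lower() for d in documents]

term = "Pandas"
hits = [d for d in normalized if term.lower() in d]
print(hits)  # ['this is a document about pandas']
```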

🏷️ + 📝 Combine Metadata and Document Content Filters

Of course, we can combine metadata and document filters. The following looks for all documents containing "LangChain" or "Python" with version numbers greater than 0.1:

collection.get(
    where={"version": {"$gt": 0.1}},
    where_document={
        "$or": [
            {"$contains": "LangChain"},
            {"$contains": "Python"}
        ]
    }
)

Output:

{'ids': ['id3', 'id5'],
 'embeddings': None,
 'documents': ['This is a book about Python',
  'This is another document about LangChain'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'version': 0.3, 'source': 'python.org'},
  {'source': 'langchain.com', 'version': 0.5}]}

Setting up HNSW in ChromaDB

# Setup
import chromadb
from chromadb.utils import embedding_functions
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
# Collection creation
client = chromadb.Client()
collection = client.create_collection(
    name="my_collection_name",
    metadata={"topic": "query testing"},
    configuration={
        "hnsw": {
            "space": "cosine",
            "ef_search": 100,
            "ef_construction": 100,
            "max_neighbors": 16
        },
        "embedding_function": ef
    }
)

The key configuration parameters are:

space - the distance metric used to compare embeddings ("cosine", "l2", or "ip")
ef_construction - the size of the candidate list used when building the index
ef_search - the size of the candidate list used when searching the index
max_neighbors - the maximum number of connections per node in the graph

We can categorize the performance-based parameters into two types: build-time parameters (ef_construction and max_neighbors), which trade indexing speed for index quality, and query-time parameters (ef_search), which trade query speed for recall.
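As a rough illustration of that trade-off (the specific numbers below are arbitrary, chosen only for the example, and should be tuned for your data):

```python
# Arbitrary example values to illustrate the speed/recall trade-off.
fast_hnsw = {
    "space": "cosine",
    "ef_search": 10,       # small candidate list: faster queries, lower recall
    "ef_construction": 100,
    "max_neighbors": 16,
}
accurate_hnsw = {
    "space": "cosine",
    "ef_search": 200,      # large candidate list: slower queries, higher recall
    "ef_construction": 200,
    "max_neighbors": 32,
}

# Either dict can be passed as the "hnsw" entry of a collection's configuration.
print(fast_hnsw["ef_search"] < accurate_hnsw["ef_search"])  # True
```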

# add data
collection.add(
    documents=[
        "Giant pandas are a bear species that lives in mountainous areas.",
        "A pandas DataFrame stores two-dimensional, tabular data",
        "I think everyone agrees that pandas are some of the cutest animals on the planet",
        "A direct comparison between pandas and polars indicates that polars is a more efficient library than pandas.",
    ],
    metadatas=[
        {"topic": "animals"},
        {"topic": "data analysis"},
        {"topic": "animals"},
        {"topic": "data analysis"},
    ],
    ids=["id1", "id2", "id3", "id4"]
)
# Querying in Chroma DB
collection.query(
    query_texts=["cats"],
    n_results=10,
)
# output
{'ids': [['id3', 'id1', 'id2', 'id4']],
 'embeddings': None,
 'documents': [['I think everyone agrees that pandas are some of the cutest animals on the planet',
   'Giant pandas are a bear species that lives in mountainous areas.',
   'A pandas DataFrame stores two-dimensional, tabular data',
   'A direct comparison between pandas and polars indicates that polars is a more efficient library than pandas.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'topic': 'animals'},
   {'topic': 'animals'},
   {'topic': 'data analysis'},
   {'topic': 'data analysis'}]],
 'distances': [[0.7380143404006958,
   0.8351750373840332,
   0.8634340167045593,
   0.9299634695053101]]}
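Because the collection was created with space: "cosine", the distances reported above are cosine distances, i.e. 1 minus the cosine similarity between the query embedding and each document embedding (smaller means more similar). A minimal sketch of the computation, on toy vectors rather than actual embeddings:

```python
import math

# Cosine distance = 1 - cosine similarity; Chroma computes this over embeddings.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```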

# query with filter
collection.query(
    query_texts=["polar bear"],
    n_results=1,
    where={'topic': 'animals'}
)
# output
{'ids': [['id1']],
 'embeddings': None,
 'documents': [['Giant pandas are a bear species that lives in mountainous areas.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'topic': 'animals'}]],
 'distances': [[0.7096824645996094]]}

# query with filter
collection.query(
    query_texts=["polar bear"],
    n_results=1,
    where_document={'$not_contains': 'library'}
)
# output
{'ids': [['id1']],
 'embeddings': None,
 'documents': [['Giant pandas are a bear species that lives in mountainous areas.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'topic': 'animals'}]],
 'distances': [[0.7096824645996094]]}

# query with filter
collection.query(
    query_texts=["polar bear"],
    n_results=1,
    where={'topic': 'animals'},
    where_document={'$not_contains': 'library'}
)
# output
{'ids': [['id1']],
 'embeddings': None,
 'documents': [['Giant pandas are a bear species that lives in mountainous areas.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'topic': 'animals'}]],
 'distances': [[0.7096824645996094]]}
# setup environment
pip install chromadb==1.0.12
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install sentence-transformers==4.1.0
# Importing the necessary modules from the chromadb package:
# chromadb is used to interact with the Chroma DB database,
# embedding_functions is used to define the embedding model
import chromadb
from chromadb.utils import embedding_functions

# Define the embedding function using SentenceTransformers
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a new instance of ChromaClient to interact with the Chroma DB
client = chromadb.Client()

# Define the name for the collection to be created or retrieved
collection_name = "my_grocery_collection"

# Define the main function to interact with the Chroma DB
def main():
    try:
        # Create a collection in the Chroma database with a specified name,
        # distance metric, and embedding function. In this case, we are using
        # cosine distance
        collection = client.create_collection(
            name=collection_name,
            metadata={"description": "A collection for storing grocery data"},
            configuration={
                "hnsw": {"space": "cosine"},
                "embedding_function": ef
            }
        )
        print(f"Collection created: {collection.name}")

        # Array of grocery-related text items
        texts = [
            'fresh red apples',
            'organic bananas',
            'ripe mangoes',
            'whole wheat bread',
            'farm-fresh eggs',
            'natural yogurt',
            'frozen vegetables',
            'grass-fed beef',
            'free-range chicken',
            'fresh salmon fillet',
            'aromatic coffee beans',
            'pure honey',
            'golden apple',
            'red fruit'
        ]

        # Create a list of unique IDs for each text item in the 'texts' array
        # Each ID follows the format 'food_<index>', where <index> starts from 1
        ids = [f"food_{index + 1}" for index, _ in enumerate(texts)]

        # Add documents and their corresponding IDs to the collection
        # The `add` method inserts the data into the collection
        # The documents are the actual text items, and the IDs are unique identifiers
        # ChromaDB will automatically generate embeddings using the configured embedding function
        collection.add(
            documents=texts,
            metadatas=[{"source": "grocery_store", "category": "food"} for _ in texts],
            ids=ids
        )

        # Retrieve all the items (documents) stored in the collection
        # The `get` method fetches all data from the collection
        all_items = collection.get()
        # Log the retrieved items to the console for inspection
        # This will print out all the documents, IDs, and metadata stored in the collection
        print("Collection contents:")
        print(f"Number of documents: {len(all_items['documents'])}")


        # Function to perform a similarity search in the collection
        def perform_similarity_search(collection, all_items):
            try:
                # Define the query term you want to search for in the collection
                query_term = "apple"

                # Perform a query to search for the most similar documents to the 'query_term'
                results = collection.query(
                    query_texts=[query_term],
                    n_results=3  # Retrieve top 3 results
                )
                print(f"Query results for '{query_term}':")
                print(results)

                # Check if no results are returned or if the results array is empty
                if not results or not results['ids'] or len(results['ids'][0]) == 0:
                    # Log a message indicating that no similar documents were found for the query term
                    print(f'No documents found similar to "{query_term}"')
                    return

                print(f'Top 3 similar documents to "{query_term}":')
                # Access the nested arrays in 'results["ids"]' and 'results["distances"]'
                for i in range(min(3, len(results['ids'][0]))):
                    doc_id = results['ids'][0][i]  # Get ID from 'ids' array
                    score = results['distances'][0][i]  # Get score from 'distances' array
                    # Retrieve text data from the results
                    text = results['documents'][0][i]
                    if not text:
                        print(f' - ID: {doc_id}, Text: "Text not available", Score: {score:.4f}')
                    else:
                        print(f' - ID: {doc_id}, Text: "{text}", Score: {score:.4f}')
            except Exception as error:
                print(f"Error in similarity search: {error}")

        perform_similarity_search(collection, all_items)
    except Exception as error:  # Catch any errors and log them to the console
        print(f"Error: {error}")

if __name__ == "__main__":
    main()
# run app
python3.11 similarity_search.py
Collection created: my_grocery_collection
Collection contents:
Number of documents: 14
Query results for 'apple':
{'ids': [['food_13', 'food_1', 'food_14']], 'embeddings': None, 'documents': [['golden apple', 'fresh red apples', 'red fruit']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'source': 'grocery_store', 'category': 'food'}, {'source': 'grocery_store', 'category': 'food'}, {'category': 'food', 'source': 'grocery_store'}]], 'distances': [[0.3824650049209595, 0.480892539024353, 0.5965152978897095]]}
Top 3 similar documents to "apple":
 - ID: food_13, Text: "golden apple", Score: 0.3825
 - ID: food_1, Text: "fresh red apples", Score: 0.4809
 - ID: food_14, Text: "red fruit", Score: 0.5965
# Importing necessary modules from the chromadb package:
# chromadb is used to interact with the Chroma DB database,
# embedding_functions is used to define the embedding model
import chromadb
from chromadb.utils import embedding_functions

# Define the embedding function using SentenceTransformers
# This function will be used to generate embeddings (vector representations) for the data
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Creating an instance of ChromaClient to establish a connection with the Chroma database
client = chromadb.Client()

# Defining a name for the collection where data will be stored or accessed
# This collection is likely used to group related records, such as employee data
collection_name = "employee_collection"

# Defining a function named 'main'
# This function is used to encapsulate the main operations for creating collections,
# generating embeddings, and performing similarity search
def main():
    try:
        # Creating a collection using the ChromaClient instance
        # The 'create_collection' method creates a new collection with the specified configuration
        collection = client.create_collection(
            # Specifying the name of the collection to be created
            name=collection_name,
            # Adding metadata to describe the collection
            metadata={"description": "A collection for storing employee data"},
            # Configuring the collection with cosine distance and embedding function
            configuration={
                "hnsw": {"space": "cosine"},
                "embedding_function": ef
            }
        )
        print(f"Collection created: {collection.name}")

        # Defining a list of employee dictionaries
        # Each dictionary represents an individual employee with comprehensive information
        employees = [
            {
                "id": "employee_1",
                "name": "John Doe",
                "experience": 5,
                "department": "Engineering",
                "role": "Software Engineer",
                "skills": "Python, JavaScript, React, Node.js, databases",
                "location": "New York",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_2",
                "name": "Jane Smith",
                "experience": 8,
                "department": "Marketing",
                "role": "Marketing Manager",
                "skills": "Digital marketing, SEO, content strategy, analytics, social media",
                "location": "Los Angeles",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_3",
                "name": "Alice Johnson",
                "experience": 3,
                "department": "HR",
                "role": "HR Coordinator",
                "skills": "Recruitment, employee relations, HR policies, training programs",
                "location": "Chicago",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_4",
                "name": "Michael Brown",
                "experience": 12,
                "department": "Engineering",
                "role": "Senior Software Engineer",
                "skills": "Java, Spring Boot, microservices, cloud architecture, DevOps",
                "location": "San Francisco",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_5",
                "name": "Emily Wilson",
                "experience": 2,
                "department": "Marketing",
                "role": "Marketing Assistant",
                "skills": "Content creation, email marketing, market research, social media management",
                "location": "Austin",
                "employment_type": "Part-time"
            },
            {
                "id": "employee_6",
                "name": "David Lee",
                "experience": 15,
                "department": "Engineering",
                "role": "Engineering Manager",
                "skills": "Team leadership, project management, software architecture, mentoring",
                "location": "Seattle",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_7",
                "name": "Sarah Clark",
                "experience": 8,
                "department": "HR",
                "role": "HR Manager",
                "skills": "Performance management, compensation planning, policy development, conflict resolution",
                "location": "Boston",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_8",
                "name": "Chris Evans",
                "experience": 20,
                "department": "Engineering",
                "role": "Senior Architect",
                "skills": "System design, distributed systems, cloud platforms, technical strategy",
                "location": "New York",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_9",
                "name": "Jessica Taylor",
                "experience": 4,
                "department": "Marketing",
                "role": "Marketing Specialist",
                "skills": "Brand management, advertising campaigns, customer analytics, creative strategy",
                "location": "Miami",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_10",
                "name": "Alex Rodriguez",
                "experience": 18,
                "department": "Engineering",
                "role": "Lead Software Engineer",
                "skills": "Full-stack development, React, Python, machine learning, data science",
                "location": "Denver",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_11",
                "name": "Hannah White",
                "experience": 6,
                "department": "HR",
                "role": "HR Business Partner",
                "skills": "Strategic HR, organizational development, change management, employee engagement",
                "location": "Portland",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_12",
                "name": "Kevin Martinez",
                "experience": 10,
                "department": "Engineering",
                "role": "DevOps Engineer",
                "skills": "Docker, Kubernetes, AWS, CI/CD pipelines, infrastructure automation",
                "location": "Phoenix",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_13",
                "name": "Rachel Brown",
                "experience": 7,
                "department": "Marketing",
                "role": "Marketing Director",
                "skills": "Strategic marketing, team leadership, budget management, campaign optimization",
                "location": "Atlanta",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_14",
                "name": "Matthew Garcia",
                "experience": 3,
                "department": "Engineering",
                "role": "Junior Software Engineer",
                "skills": "JavaScript, HTML/CSS, basic backend development, learning frameworks",
                "location": "Dallas",
                "employment_type": "Full-time"
            },
            {
                "id": "employee_15",
                "name": "Olivia Moore",
                "experience": 12,
                "department": "Engineering",
                "role": "Principal Engineer",
                "skills": "Technical leadership, system architecture, performance optimization, mentoring",
                "location": "San Francisco",
                "employment_type": "Full-time"
            },
        ]

        # Create comprehensive text documents for each employee
        # These documents will be used for similarity search based on skills, roles, and experience
        employee_documents = []
        for employee in employees:
            document = f"{employee['role']} with {employee['experience']} years of experience in {employee['department']}. "
            document += f"Skills: {employee['skills']}. Located in {employee['location']}. "
            document += f"Employment type: {employee['employment_type']}."
            employee_documents.append(document)

        # Adding data to the collection in the Chroma database
        # The 'add' method inserts or updates data into the specified collection
        collection.add(
            # Extracting employee IDs to be used as unique identifiers for each record
            ids=[employee["id"] for employee in employees],
            # Using the comprehensive text documents we created
            documents=employee_documents,
            # Adding comprehensive metadata for filtering and search
            metadatas=[{
                "name": employee["name"],
                "department": employee["department"],
                "role": employee["role"],
                "experience": employee["experience"],
                "location": employee["location"],
                "employment_type": employee["employment_type"]
            } for employee in employees]
        )

        # Retrieving all items from the specified collection
        # The 'get' method fetches all records stored in the collection
        all_items = collection.get()
        # Logging the retrieved items to the console for inspection or debugging
        print("Collection contents:")
        print(f"Number of documents: {len(all_items['documents'])}")

        # Function to perform various types of searches within the collection
        def perform_advanced_search(collection, all_items):
            try:
                print("=== Similarity Search Examples ===")

                # Example 1: Search for Python developers
                print("\n1. Searching for Python developers:")
                query_text = "Python developer with web development experience"
                results = collection.query(
                    query_texts=[query_text],
                    n_results=3
                )
                print(f"Query: '{query_text}'")
                for i, (doc_id, document, distance) in enumerate(zip(
                    results['ids'][0], results['documents'][0], results['distances'][0]
                )):
                    metadata = results['metadatas'][0][i]
                    print(f"  {i+1}. {metadata['name']} ({doc_id}) - Distance: {distance:.4f}")
                    print(f"     Role: {metadata['role']}, Department: {metadata['department']}")
                    print(f"     Document: {document[:100]}...")

                # Example 2: Search for leadership roles
                print("\n2. Searching for leadership and management roles:")
                query_text = "team leader manager with experience"
                results = collection.query(
                    query_texts=[query_text],
                    n_results=3
                )
                print(f"Query: '{query_text}'")
                for i, (doc_id, document, distance) in enumerate(zip(
                    results['ids'][0], results['documents'][0], results['distances'][0]
                )):
                    metadata = results['metadatas'][0][i]
                    print(f"  {i+1}. {metadata['name']} ({doc_id}) - Distance: {distance:.4f}")
                    print(f"     Role: {metadata['role']}, Experience: {metadata['experience']} years")

                print("\n=== Metadata Filtering Examples ===")

                # Example 1: Filter by department
                print("\n3. Finding all Engineering employees:")
                results = collection.get(
                    where={"department": "Engineering"}
                )
                print(f"Found {len(results['ids'])} Engineering employees:")
                for i, doc_id in enumerate(results['ids']):
                    metadata = results['metadatas'][i]
                    print(f"  - {metadata['name']}: {metadata['role']} ({metadata['experience']} years)")

                # Example 2: Filter by experience range
                print("\n4. Finding employees with 10+ years experience:")
                results = collection.get(
                    where={"experience": {"$gte": 10}}
                )
                print(f"Found {len(results['ids'])} senior employees:")
                for i, doc_id in enumerate(results['ids']):
                    metadata = results['metadatas'][i]
                    print(f"  - {metadata['name']}: {metadata['role']} ({metadata['experience']} years)")

                # Example 3: Filter by location
                print("\n5. Finding employees in California:")
                results = collection.get(
                    where={"location": {"$in": ["San Francisco", "Los Angeles"]}}
                )
                print(f"Found {len(results['ids'])} employees in California:")
                for i, doc_id in enumerate(results['ids']):
                    metadata = results['metadatas'][i]
                    print(f"  - {metadata['name']}: {metadata['location']}")

                print("\n=== Combined Search: Similarity + Metadata Filtering ===")

                # Example: Find experienced Python developers in specific locations
                print("\n6. Finding senior Python developers in major tech cities:")
                query_text = "senior Python developer full-stack"
                results = collection.query(
                    query_texts=[query_text],
                    n_results=5,
                    where={
                        "$and": [
                            {"experience": {"$gte": 8}},
                            {"location": {"$in": ["San Francisco", "New York", "Seattle"]}}
                        ]
                    }
                )
                print(f"Query: '{query_text}' with filters (8+ years, major tech cities)")
                print(f"Found {len(results['ids'][0])} matching employees:")
                for i, (doc_id, document, distance) in enumerate(zip(
                    results['ids'][0], results['documents'][0], results['distances'][0]
                )):
                    metadata = results['metadatas'][0][i]
                    print(f"  {i+1}. {metadata['name']} ({doc_id}) - Distance: {distance:.4f}")
                    print(f"     {metadata['role']} in {metadata['location']} ({metadata['experience']} years)")
                    print(f"     Document snippet: {document[:80]}...")

                # Check whether the results are empty
                if not results or not results['ids'] or len(results['ids'][0]) == 0:
                    # Print a message if no similar documents are found for the query term
                    print(f'No documents found similar to "{query_text}"')
                    return

                # Print the header for the top 3 similar documents based on the query term
                print(f'Top 3 similar documents to "{query_text}":')
                # Loop through the top 3 results and print the document details
                for i in range(min(3, len(results['ids'][0]))):
                    # Extract the document ID and distance score from the results
                    doc_id = results['ids'][0][i]
                    score = results['distances'][0][i]
                    # Retrieve the document text corresponding to the current ID from the results
                    text = results['documents'][0][i]
                    # Check if the text is available; if not, print 'Text not available'
                    if not text:
                        print(f' - ID: {doc_id}, Text: "Text not available", Score: {score:.4f}')
                    else:
                        print(f' - ID: {doc_id}, Text: "{text}", Score: {score:.4f}')
            except Exception as error:
                print(f"Error in advanced search: {error}")

        # Call the perform_advanced_search function with the collection and all_items as arguments
        perform_advanced_search(collection, all_items)

    except Exception as error:
        # Catching and handling any errors that occur within the 'try' block
        # Logs the error message to the console for debugging purposes
        print(f"Error: {error}")

if __name__ == "__main__":
    main()
python3.11 similarity_employeedata.py
# output
Collection created: employee_collection
Collection contents:
Number of documents: 15
=== Similarity Search Examples ===

1. Searching for Python developers:
Query: 'Python developer with web development experience'
  1. John Doe (employee_1) - Distance: 0.5156
     Role: Software Engineer, Department: Engineering
     Document: Software Engineer with 5 years of experience in Engineering. Skills: Python, JavaScript, React, Node...
  2. Matthew Garcia (employee_14) - Distance: 0.5724
     Role: Junior Software Engineer, Department: Engineering
     Document: Junior Software Engineer with 3 years of experience in Engineering. Skills: JavaScript, HTML/CSS, ba...
  3. Alex Rodriguez (employee_10) - Distance: 0.5967
     Role: Lead Software Engineer, Department: Engineering
     Document: Lead Software Engineer with 18 years of experience in Engineering. Skills: Full-stack development, R...

2. Searching for leadership and management roles:
Query: 'team leader manager with experience'
  1. Jane Smith (employee_2) - Distance: 0.5382
     Role: Marketing Manager, Experience: 8 years
  2. Sarah Clark (employee_7) - Distance: 0.5467
     Role: HR Manager, Experience: 8 years
  3. David Lee (employee_6) - Distance: 0.5497
     Role: Engineering Manager, Experience: 15 years

=== Metadata Filtering Examples ===

3. Finding all Engineering employees:
Found 8 Engineering employees:
  - John Doe: Software Engineer (5 years)
  - Michael Brown: Senior Software Engineer (12 years)
  - David Lee: Engineering Manager (15 years)
  - Chris Evans: Senior Architect (20 years)
  - Alex Rodriguez: Lead Software Engineer (18 years)
  - Kevin Martinez: DevOps Engineer (10 years)
  - Matthew Garcia: Junior Software Engineer (3 years)
  - Olivia Moore: Principal Engineer (12 years)

4. Finding employees with 10+ years experience:
Found 6 senior employees:
  - Michael Brown: Senior Software Engineer (12 years)
  - David Lee: Engineering Manager (15 years)
  - Chris Evans: Senior Architect (20 years)
  - Alex Rodriguez: Lead Software Engineer (18 years)
  - Kevin Martinez: DevOps Engineer (10 years)
  - Olivia Moore: Principal Engineer (12 years)

5. Finding employees in California:
Found 3 employees in California:
  - Jane Smith: Los Angeles
  - Michael Brown: San Francisco
  - Olivia Moore: San Francisco

=== Combined Search: Similarity + Metadata Filtering ===

6. Finding senior Python developers in major tech cities:
Query: 'senior Python developer full-stack' with filters (8+ years, major tech cities)
Found 4 matching employees:
  1. Michael Brown (employee_4) - Distance: 0.6726
     Senior Software Engineer in San Francisco (12 years)
     Document snippet: Senior Software Engineer with 12 years of experience in Engineering. Skills: Jav...
  2. Chris Evans (employee_8) - Distance: 0.7537
     Senior Architect in New York (20 years)
     Document snippet: Senior Architect with 20 years of experience in Engineering. Skills: System desi...
  3. David Lee (employee_6) - Distance: 0.8344
     Engineering Manager in Seattle (15 years)
     Document snippet: Engineering Manager with 15 years of experience in Engineering. Skills: Team lea...
  4. Olivia Moore (employee_15) - Distance: 0.8761
     Principal Engineer in San Francisco (12 years)
     Document snippet: Principal Engineer with 12 years of experience in Engineering. Skills: Technical...
Top 3 similar documents to "senior Python developer full-stack":
 - ID: employee_4, Text: "Senior Software Engineer with 12 years of experience in Engineering. Skills: Java, Spring Boot, microservices, cloud architecture, DevOps. Located in San Francisco. Employment type: Full-time.", Score: 0.6726
 - ID: employee_8, Text: "Senior Architect with 20 years of experience in Engineering. Skills: System design, distributed systems, cloud platforms, technical strategy. Located in New York. Employment type: Full-time.", Score: 0.7537
 - ID: employee_6, Text: "Engineering Manager with 15 years of experience in Engineering. Skills: Team leadership, project management, software architecture, mentoring. Located in Seattle. Employment type: Full-time.", Score: 0.8344
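The `where` clauses used above compose a small operator grammar: shorthand equality (`{"department": "Engineering"}`), comparison operators such as `$gte`, list membership via `$in`, and the logical combinators `$and`/`$or`. As a mental model only — this is a plain-Python sketch of the matching semantics, not Chroma's internal implementation — the decision of whether a metadata record satisfies a filter can be written as:

```python
# Illustrative sketch of Chroma-style metadata filter semantics.
# NOT Chroma's internals -- just a mental model of how a `where`
# clause decides whether a metadata dict matches.
import operator

OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, choices: value in choices,
    "$nin": lambda value, choices: value not in choices,
}

def matches(metadata, where):
    """Return True if `metadata` satisfies the `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter must match
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            # At least one sub-filter must match
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"experience": {"$gte": 8}}
            for op, target in condition.items():
                if not OPS[op](metadata.get(key), target):
                    return False
        else:
            # Shorthand equality, e.g. {"department": "Engineering"}
            if metadata.get(key) != condition:
                return False
    return True

employee = {"department": "Engineering", "experience": 12,
            "location": "San Francisco"}
print(matches(employee, {"department": "Engineering"}))  # True
print(matches(employee, {"$and": [
    {"experience": {"$gte": 8}},
    {"location": {"$in": ["San Francisco", "New York", "Seattle"]}},
]}))                                                     # True
```

This is why the combined filter in example 6 returned only senior employees in the three listed cities: both sub-filters inside `$and` had to hold for a record to pass.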
# Importing the necessary modules from the chromadb package:
# chromadb is used to interact with the Chroma DB database,
# embedding_functions is used to define the embedding model
import chromadb
from chromadb.utils import embedding_functions

# Define the embedding function using SentenceTransformers
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a new instance of ChromaClient to interact with the Chroma DB
client = chromadb.Client()

# Define the name for the collection to be created or retrieved
collection_name = "book_collection"

# Define the main function to interact with the Chroma DB
def main():
    try:
        # Create a collection in the Chroma database with a specified name, 
        # distance metric, and embedding function. In this case, we are using 
        # cosine distance
        collection = client.create_collection(
            name=collection_name,
            metadata={"description": "A collection for storing book data"},
            configuration={
                "hnsw": {"space": "cosine"},
                "embedding_function": ef
            }
        )
        print(f"Collection created: {collection.name}")

        # List of book dictionaries with comprehensive details for advanced search
        books = [
            {
                "id": "book_1",
                "title": "The Great Gatsby",
                "author": "F. Scott Fitzgerald",
                "genre": "Classic",
                "year": 1925,
                "rating": 4.1,
                "pages": 180,
                "description": "A tragic tale of wealth, love, and the American Dream in the Jazz Age",
                "themes": "wealth, corruption, American Dream, social class",
                "setting": "New York, 1920s"
            },
            {
                "id": "book_2",
                "title": "To Kill a Mockingbird",
                "author": "Harper Lee",
                "genre": "Classic",
                "year": 1960,
                "rating": 4.3,
                "pages": 376,
                "description": "A powerful story of racial injustice and moral growth in the American South",
                "themes": "racism, justice, moral courage, childhood innocence",
                "setting": "Alabama, 1930s"
            },
            {
                "id": "book_3",
                "title": "1984",
                "author": "George Orwell",
                "genre": "Dystopian",
                "year": 1949,
                "rating": 4.4,
                "pages": 328,
                "description": "A chilling vision of totalitarian control and surveillance society",
                "themes": "totalitarianism, surveillance, freedom, truth",
                "setting": "Oceania, dystopian future"
            },
            {
                "id": "book_4",
                "title": "Harry Potter and the Philosopher's Stone",
                "author": "J.K. Rowling",
                "genre": "Fantasy",
                "year": 1997,
                "rating": 4.5,
                "pages": 223,
                "description": "A young wizard discovers his magical heritage and begins his education at Hogwarts",
                "themes": "friendship, courage, good vs evil, coming of age",
                "setting": "England, magical world"
            },
            {
                "id": "book_5",
                "title": "The Lord of the Rings",
                "author": "J.R.R. Tolkien",
                "genre": "Fantasy",
                "year": 1954,
                "rating": 4.5,
                "pages": 1216,
                "description": "An epic fantasy quest to destroy a powerful ring and save Middle-earth",
                "themes": "heroism, friendship, good vs evil, power corruption",
                "setting": "Middle-earth, fantasy realm"
            },
            {
                "id": "book_6",
                "title": "The Hitchhiker's Guide to the Galaxy",
                "author": "Douglas Adams",
                "genre": "Science Fiction",
                "year": 1979,
                "rating": 4.2,
                "pages": 224,
                "description": "A humorous space adventure following Arthur Dent across the galaxy",
                "themes": "absurdity, technology, existence, humor",
                "setting": "Space, various planets"
            },
            {
                "id": "book_7",
                "title": "Dune",
                "author": "Frank Herbert",
                "genre": "Science Fiction",
                "year": 1965,
                "rating": 4.3,
                "pages": 688,
                "description": "A complex tale of politics, religion, and ecology on a desert planet",
                "themes": "power, ecology, religion, politics",
                "setting": "Arrakis, distant future"
            },
            {
                "id": "book_8",
                "title": "The Hunger Games",
                "author": "Suzanne Collins",
                "genre": "Dystopian",
                "year": 2008,
                "rating": 4.2,
                "pages": 374,
                "description": "A teenage girl fights for survival in a brutal televised competition",
                "themes": "survival, oppression, sacrifice, rebellion",
                "setting": "Panem, dystopian future"
            },
        ]

        # Create comprehensive text documents for each book
        book_documents = []
        for book in books:
            document = f"{book['title']} by {book['author']}. {book['description']} "
            document += f"Themes: {book['themes']}. Setting: {book['setting']}. "
            document += f"Genre: {book['genre']} published in {book['year']}."
            book_documents.append(document)

        # Adding book data to the collection with comprehensive metadata
        collection.add(
            ids=[book["id"] for book in books],
            documents=book_documents,
            metadatas=[{
                "title": book["title"],
                "author": book["author"],
                "genre": book["genre"],
                "year": book["year"],
                "rating": book["rating"],
                "pages": book["pages"]
            } for book in books]
        )

        # Retrieve all the items (documents) stored in the collection
        all_items = collection.get()
        print("Collection contents:")
        print(f"Number of documents: {len(all_items['documents'])}")

        # Function to perform advanced book search
        def perform_book_search(collection):
            print("=== Book Similarity Search ===")

            # Similarity search for magical adventures
            print("\n1. Finding magical fantasy adventures:")
            results = collection.query(
                query_texts=["magical fantasy adventure with friendship and courage"],
                n_results=3
            )
            for i, (doc_id, document, distance) in enumerate(zip(
                results['ids'][0], results['documents'][0], results['distances'][0]
            )):
                metadata = results['metadatas'][0][i]
                print(f"  {i+1}. {metadata['title']} by {metadata['author']} - Distance: {distance:.4f}")

            print("\n=== Metadata Filtering ===")

            # Filter by genre
            print("\n2. Finding Fantasy and Science Fiction books:")
            results = collection.get(
                where={"genre": {"$in": ["Fantasy", "Science Fiction"]}}
            )
            for i, doc_id in enumerate(results['ids']):
                metadata = results['metadatas'][i]
                print(f"  - {metadata['title']}: {metadata['genre']} ({metadata['rating']}★)")

            # Filter by rating
            print("\n3. Finding highly-rated books (4.3+):")
            results = collection.get(
                where={"rating": {"$gte": 4.3}}
            )
            for i, doc_id in enumerate(results['ids']):
                metadata = results['metadatas'][i]
                print(f"  - {metadata['title']}: {metadata['rating']}★")

            print("\n=== Combined Search ===")

            # Combined search: dystopian themes with high ratings
            print("\n4. Finding highly-rated dystopian books:")
            results = collection.query(
                query_texts=["dystopian society control oppression future"],
                n_results=3,
                where={"rating": {"$gte": 4.0}}
            )
            for i, (doc_id, document, distance) in enumerate(zip(
                results['ids'][0], results['documents'][0], results['distances'][0]
            )):
                metadata = results['metadatas'][0][i]
                print(f"  {i+1}. {metadata['title']} ({metadata['year']}) - {metadata['rating']}★")
                print(f"     Distance: {distance:.4f}")

        perform_book_search(collection)
    except Exception as error:
        print(f"Error: {error}")

if __name__ == "__main__":
    main()
python3.11 books_advanced_search.py
# output
Collection contents:
Number of documents: 8
=== Book Similarity Search ===

1. Finding magical fantasy adventures:
  1. Harry Potter and the Philosopher's Stone by J.K. Rowling - Distance: 0.5385
  2. The Lord of the Rings by J.R.R. Tolkien - Distance: 0.6017
  3. The Hunger Games by Suzanne Collins - Distance: 0.6631

=== Metadata Filtering ===

2. Finding Fantasy and Science Fiction books:
  - Harry Potter and the Philosopher's Stone: Fantasy (4.5★)
  - The Lord of the Rings: Fantasy (4.5★)
  - The Hitchhiker's Guide to the Galaxy: Science Fiction (4.2★)
  - Dune: Science Fiction (4.3★)

3. Finding highly-rated books (4.3+):
  - To Kill a Mockingbird: 4.3★
  - 1984: 4.4★
  - Harry Potter and the Philosopher's Stone: 4.5★
  - The Lord of the Rings: 4.5★
  - Dune: 4.3★

=== Combined Search ===

4. Finding highly-rated dystopian books:
  1. 1984 (1949) - 4.4★
     Distance: 0.4764
  2. The Hunger Games (2008) - 4.2★
     Distance: 0.6794
  3. To Kill a Mockingbird (1960) - 4.3★
     Distance: 0.7307
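The Distance values in both outputs are cosine distances, because each collection was created with `{"hnsw": {"space": "cosine"}}`; a lower distance means a closer semantic match. Cosine distance is simply 1 minus the cosine similarity of the two embedding vectors. A quick illustration with toy vectors (not real embeddings):

```python
# Cosine distance as used by a collection configured with
# {"hnsw": {"space": "cosine"}}: distance = 1 - cosine similarity.
# Toy vectors only -- real embeddings have hundreds of dimensions.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]   # identical direction -> distance 0.0
v3 = [0.0, 1.0, 0.0]   # orthogonal         -> distance 1.0

print(round(cosine_distance(v1, v2), 4))  # 0.0
print(round(cosine_distance(v1, v3), 4))  # 1.0
```

This is why "1984" (distance 0.4764) ranks above "To Kill a Mockingbird" (distance 0.7307) for the dystopian query: its embedding points in a direction much closer to the query's.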
# setup environment
pip install numpy==2.3.1
pip install scipy==1.16.0
pip install chromadb==1.0.12
pip install sentence-transformers==4.1.0
pip install ibm-watsonx-ai==1.3.24
# download dataset
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/sN1PIR8qp1SJ6K7syv72qQ/FoodDataSet.json
# example data structure
{
    "food_id": 1,
    "food_name": "Apple Pie",
    "food_description": "A classic dessert made with a buttery, flaky crust filled with tender, spiced apples.",
    "food_calories_per_serving": 320,
    "food_nutritional_factors": {
        "carbohydrates": "42g",
        "protein": "2g", 
        "fat": "16g"
    },
    "food_ingredients": ["Apples", "Flour", "Butter", "Sugar", "Cinnamon", "Nutmeg"],
    "food_health_benefits": "Rich in antioxidants and dietary fiber",
    "cooking_method": "Baking",
    "cuisine_type": "American",
    "food_features": {
        "taste": "sweet",
        "texture": "crisp and tender",
        "appearance": "golden brown",
        "preparation": "baked",
        "serving_type": "hot"
    }
}
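One thing to note before loading this dataset: Chroma metadata values must be scalars (strings, numbers, booleans), so nested fields such as `food_nutritional_factors` and `food_ingredients` cannot be stored as metadata directly — they need to be flattened first. A minimal sketch of one way to do that (the field names come from the record above; the flattening scheme itself is an illustrative choice, not part of the dataset):

```python
# Sketch: flatten one food record into Chroma-compatible scalar metadata.
# Chroma metadata values must be strings, numbers, or booleans, so the
# nested nutrition dict and the ingredient list are collapsed into
# flat strings. The key names chosen here are illustrative.
def to_metadata(food):
    metadata = {
        "food_name": food["food_name"],
        "calories": food["food_calories_per_serving"],
        "cooking_method": food["cooking_method"],
        "cuisine_type": food["cuisine_type"],
        # Join the list into one searchable string
        "ingredients": ", ".join(food["food_ingredients"]),
    }
    # Flatten the nested dict with a prefix, e.g. "nutrition_protein": "2g"
    for key, value in food["food_nutritional_factors"].items():
        metadata[f"nutrition_{key}"] = value
    return metadata

record = {
    "food_name": "Apple Pie",
    "food_calories_per_serving": 320,
    "food_nutritional_factors": {"carbohydrates": "42g", "protein": "2g", "fat": "16g"},
    "food_ingredients": ["Apples", "Flour", "Butter", "Sugar", "Cinnamon", "Nutmeg"],
    "cooking_method": "Baking",
    "cuisine_type": "American",
}
print(to_metadata(record)["ingredients"])        # Apples, Flour, Butter, Sugar, Cinnamon, Nutmeg
print(to_metadata(record)["nutrition_protein"])  # 2g
```

Once flattened this way, fields like `calories` can be filtered with the numeric operators shown earlier (e.g. `where={"calories": {"$lte": 400}}`).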
