Chroma and Node.js

Text topic classification using a vector database

In recent years, vector databases have become popular among developers and researchers in machine learning and artificial intelligence. These databases allow efficient storage and querying of high-dimensional vector representations of data, opening up new possibilities in tasks such as semantic similarity search, clustering, and building recommendation systems.

For example, suppose we need to classify texts by topic: determining, from a single sentence, which category from a predefined list it best matches. Given the sentence "These components provide stability and durability to the entire structure, which is especially important under high load conditions", does it belong to "Construction" or "Software Architecture"? An ordinary keyword search cannot handle this task: without interpreting the text's semantics, it is hard to tell construction terms apart from software-development terms.

Many vendors offer vector databases, but one of the most popular at the moment is Chroma. You can compare its functionality with other vector databases here: https://superlinked.com/vector-db-comparison

Chroma provides a convenient API for storing and searching vectors, which can be used for natural language processing, computer vision, and other applications.

In this article, I will show you how to use Chroma with Node.js to create a simple topic classification tool.

What We Need

  • Node.js and npm
  • Docker – to run the Chroma backend
  • OpenAI API key – to create embeddings

Installing Chroma

Let's create a new project and install the Chroma client:

npm install --save chromadb chromadb-default-embed

Let's run the Chroma backend:


docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma

Generating Embeddings

An embedding is a representation of objects (such as words, sentences, or images) in the form of vectors in a multidimensional space. These vectors retain semantic information, allowing machine learning models to work effectively with data by finding similar objects through computing the distances between their vector representations.

In the context of our article, embeddings are used to transform texts into numerical vectors, which can then be processed using the Chroma vector database for topic classification tasks.
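To make "distance between vector representations" concrete, here is a minimal sketch of cosine similarity, one of the distance measures commonly used by vector databases. It uses toy 3-dimensional vectors instead of real embeddings, which have hundreds or thousands of dimensions:

```javascript
// Cosine similarity: 1 means the vectors point in the same direction
// (semantically close), 0 means they are orthogonal (unrelated)
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors: scaling a vector does not change its direction
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // 1 (same direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0 (orthogonal)
```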

To generate them, we will need an OpenAI API key. I will use the text-embedding-ada-002 model, but you can use text-embedding-3-small or text-embedding-3-large instead. Embeddings can also be created with Cohere, Google Gemini, HF Server, Hugging Face, Instructor, JinaAI, Roboflow, or Ollama Embeddings, but for some of them a JavaScript API is not available.


const { ChromaClient, OpenAIEmbeddingFunction } = require('chromadb');

// Connect to the Chroma backend started above (http://localhost:8000 by default)
const client = new ChromaClient();

const embeddingFunction = new OpenAIEmbeddingFunction({
    openai_api_key: "yourApiKey",
    model: "text-embedding-ada-002"
});

// The collection will use this embedding function for add() and query()
const collection = await client.createCollection({
    name: "name",
    embeddingFunction: embeddingFunction
});


const entries = [
    "These components provide stability and durability to the entire structure, which is especially important under high load conditions.",
    "Adjusting parameters helps optimize performance and avoid possible failures in the application's operation under high loads."
];

// generate() is asynchronous and returns one vector per input text
const embeddings = await embeddingFunction.generate(entries);


Text Topic Classification

A simple approach to topic classification is to store one embedding per topic, built from a representative text example, and then find the nearest stored vector for any given text. For example, let's use "These components provide stability and durability to the entire structure, which is especially important under high load conditions" as a representative of the "Construction" category, and "Adjusting parameters helps optimize performance and avoid possible failures in the application's operation under high loads" as a representative of the "Software Architecture" category. The following code generates embeddings for these two sentences and inserts them into Chroma with the associated category as metadata.


// The order of categories corresponds to the order of entries
const categories = [{ category: 'Construction' }, { category: 'Software Architecture' }];

await collection.add({
    ids: ['id1', 'id2'],
    documents: entries,
    embeddings: embeddings,
    metadatas: categories
});

Then, to classify a text, we pass it to the query() function, which automatically generates an embedding for it using OpenAI. The following code classifies four sentences in which the phrase "high loads" is used in different contexts.


const queryTexts = [
    "The introduction of new architectural solutions significantly improved the building's resistance to high seismic loads.",
    "The use of modern building materials allowed for the reduction of construction time for infrastructure objects that can withstand high loads.",
    "Optimizing algorithms reduced task execution time and improved the overall performance of the program under high loads of over 1000000 RPS.",
    "The implementation of a modular approach in development simplified the testing and scaling process of the application under high loads."
];

const results = await collection.query({
    nResults: 1,
    queryTexts: queryTexts
});

// results.metadatas holds one array of matches per query text;
// with nResults: 1 each array contains the single nearest neighbor
console.log('Results', results.metadatas.map(res => res[0].category));
					


Results:

  • ✅'Construction'
  • ✅'Construction'
  • ✅'Software Architecture'
  • ✅'Software Architecture'

The result is excellent: based on just two stored vectors, we correctly classified all four sentences, even though they share almost no common words apart from the phrase "high loads".
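The nearest-vector lookup that Chroma performs for us can be sketched in plain JavaScript. Note that the vectors below are hypothetical toy values standing in for real OpenAI embeddings, and `classifyByNearest` is an illustrative helper, not part of the chromadb API:

```javascript
// Each reference pairs a (toy) embedding with its category,
// mirroring the metadata we stored in the collection
const references = [
    { vector: [1, 0.2, 0.1], category: 'Construction' },
    { vector: [0.1, 1, 0.9], category: 'Software Architecture' }
];

function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the category of the reference vector most similar to the query
function classifyByNearest(queryVector) {
    let best = references[0];
    for (const ref of references) {
        if (cosineSimilarity(queryVector, ref.vector) >
            cosineSimilarity(queryVector, best.vector)) {
            best = ref;
        }
    }
    return best.category;
}

console.log(classifyByNearest([0.9, 0.3, 0.2])); // 'Construction'
console.log(classifyByNearest([0.2, 0.8, 1]));   // 'Software Architecture'
```

In a real setup the query vector would come from the same embedding model as the stored vectors; Chroma additionally uses approximate nearest-neighbor indexing so the lookup stays fast at scale.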

As a result, using Node.js and the Chroma vector database, you can create an effective tool for topic classification based on semantic text analysis. This approach significantly improves the accuracy of category determination, especially when the same word can have different meanings in different contexts.