With Llama Stack being released earlier this year, we decided to look at how to implement key aspects of an AI application with Node.js and Llama Stack. This article is the second in a series exploring how to use large language models with Node.js and Llama Stack. This post covers retrieval-augmented generation (RAG).
For an introduction to Llama Stack, read A practical guide to Llama Stack for Node.js developers.
How retrieval-augmented generation works
Retrieval-augmented generation is one of the ways that you can provide context that helps a model respond with the most appropriate answer. The basic concept is as follows:
- Data that provides additional context—in our case, the Markdown files from the Node.js Reference Architecture—is transformed into a format suitable for model augmentation. This often includes breaking the data up into chunks of a maximum size that will later be provided as additional context to the model.
- An embedding model is used to convert the source data into a set of vectors that represents the words in the data. These are stored in a database so the data can be retrieved through a query against the matching vectors. Most commonly the data is stored in a vector database like Chroma.
- The application is enhanced so that before passing on a query to the model it first uses the question to query the database for matching document chunks. The most relevant document chunks are then added to the context and sent along with the question to the model as part of the prompt.
- The model returns an answer which in part is based on the context provided.
So why not just pass the content of all of the documents to the model? There are a number of reasons:
- The size of the context that can be passed to a model might be limited.
- Passing a large context might cost you more money.
- Providing a smaller, more closely related set of information can result in better answers.
With that in mind, we need to identify the most relevant document chunks and pass only that subset.
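Before we get into the Llama Stack specifics, the overall pattern can be sketched in a few lines of JavaScript. This is only a conceptual sketch: splitIntoChunks, embed, vectorDb, and askModel are hypothetical placeholders for whatever chunking logic, embedding model, vector database, and inference API you use, not real APIs.
// Conceptual sketch of the RAG flow; splitIntoChunks(), embed(), vectorDb,
// and askModel() are hypothetical placeholders, not real APIs.

// 1. Ingestion: split the source documents into chunks and store their embeddings
async function ingest(documents, vectorDb) {
  for (const document of documents) {
    for (const chunk of splitIntoChunks(document, 500)) {
      const vector = await embed(chunk); // embedding model
      await vectorDb.store({ vector, chunk }); // vector database
    }
  }
}

// 2. Query time: retrieve the most relevant chunks and add them to the prompt
async function answer(question, vectorDb) {
  const queryVector = await embed(question);
  const relevantChunks = await vectorDb.search(queryVector, { topK: 5 });
  const prompt = `Answer the question based only on the context provided
<question>${question}</question>
<context>${relevantChunks.join('\n')}</context>`;
  return askModel(prompt); // large language model
}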
Now let's move on to looking at how we implemented retrieval-augmented generation with Llama Stack.
Setting up Llama Stack
The first step was to get a running Llama Stack instance that we could experiment with. Llama Stack is a bit different from other frameworks in a few ways.
The first difference from other frameworks we've looked at is that instead of providing a single implementation with a set of defined APIs, Llama Stack aims to standardize a set of APIs and drive a number of distributions. In other words, the goal is to have many implementations of the same API, with each implementation being shipped by a different organization as a distribution.
As is common when this approach is followed, a "reference distribution" is provided, but there are already a number of alternative distributions available. You can see the list of available distributions in the GitHub README.
The other difference is a heavy focus on plug-in APIs that allow you to add implementations for specific components behind the API implementation itself. For example, you could plug in an implementation (maybe one that is custom tailored for your organization) for a specific feature like Telemetry while using an existing distribution. We won't go into the details of these APIs in this post, but hope to look at them later on.
With that said, the first question we had was which distribution to use in order to get started. The Llama Stack quick start shows how to spin up a container running Llama Stack, which uses Ollama to serve the large language model. Because we already had a working Ollama install, we decided that was the path of least resistance.
Getting the Llama Stack instance running
We followed the Llama Stack quick start using a container to run the stack with it pointing to an existing Ollama server. Following the instructions, we put together this short script that allowed us to easily start and stop the Llama Stack instance:
export INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct"
export LLAMA_STACK_PORT=8321
export OLLAMA_HOST=10.1.2.38
podman run -it \
--user 1000 \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-ollama:0.2.2 \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://$OLLAMA_HOST:11434
Note that it is different from what we used in our earlier post in the series, in that the Llama Stack version has been updated to 0.2.2. If you look in the package.json
for the example, you will also see that we updated the Llama Stack client package to 0.2.2 as well. With the rapid pace of change in Llama Stack these days, it seems important to keep these in sync.
Our existing Ollama server was running on a machine with the IP 10.1.2.38, which is what we set OLLAMA_HOST
to.
We followed the instructions for starting the container from the quick start so Llama Stack was running on the default port. We ran the container on a Fedora virtual machine with IP 10.1.2.128, so you will see us using http://10.1.2.128:8321
as the endpoint for the Llama Stack instance in our code examples.
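For reference, here is a minimal sketch of how a client can be created to talk to that endpoint, assuming the llama-stack-client npm package (the same client package used in our examples):
// minimal sketch: create a client pointing at the running Llama Stack instance
import LlamaStackClient from 'llama-stack-client';

const client = new LlamaStackClient({
  baseURL: 'http://10.1.2.128:8321',
});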
At this point, we had a running Llama Stack instance that we could start to experiment with.
Llama Stack now supports more models
At the time of our earlier exploration of Llama Stack, only a limited set of models were supported when using Ollama. As noted in the previous section, this time we used a later version of Llama Stack, and it is now possible to register and use different models.
In our explorations before looking at Llama Stack, we'd often used more quantized models because they are smaller and easier to run on smaller GPUs. Given the ability to run new models, we tried running with llama3.1:8b-instruct-q4_K_M
instead of the larger llama3.1:8b-instruct-fp16.
To use a model not already known to Llama Stack, you use the models API to register the new model. The code to do that was as follows:
const model_id = 'meta-llama/Llama-3.1-8B-instruct-q4_K_M';
await client.models.register({
model_id,
provider_id: 'ollama',
provider_model_id: 'llama3.1:8b-instruct-q4_K_M',
model_type: 'llm',
});
After having done that, we could then request the model using the model ID meta-llama/Llama-3.1-8B-instruct-q4_K_M
. It's great to have that additional flexibility. In our case, it let us run the first RAG example much faster.
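If you want to double-check that the registration worked, one option is to list the models Llama Stack knows about and look for the new ID. This is just a sketch based on the client version we used; the identifier field is an assumption about the shape of the returned model objects.
// sketch: list the registered models and confirm the new one shows up
const models = await client.models.list();
console.log(models.map(model => model.identifier));
// the output should include 'meta-llama/Llama-3.1-8B-instruct-q4_K_M'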
Retrieval-augmented generation with the completion API
As outlined earlier, the first step to using RAG is to ingest the documents and store them in a vector database. Llama Stack provides a set of APIs to make this easy and consistent across different vector databases. The full code for the example is in ai-tool-experimentation/llama-stack-rag/llama-stack-chat-rag.mjs.
Creating the database
To get started, you'll need to create a database in which the documents will be stored. Llama Stack provides an API to manage vector databases and supports an in-memory implementation by default. In addition, you can register additional vector database implementations so that you can use the one best suited to your organization. In our example, we are using the in-memory vector database that is available by default.
The code we used to create the database was as follows:
// use the first available provider
const provider = (await client.providers.list()).filter(
provider => provider.api === 'vector_io',
)[0];
// register a vector database
const vector_db_id = `test-vector-db-${randomUUID()}`;
await client.vectorDBs.register({
vector_db_id,
provider_id: provider.provider_id,
embedding_model: 'all-MiniLM-L6-v2',
embedding_dimension: 384,
});
The in-memory database was the first provider in the default config, so we just used that. When creating the database with the register call, we created a random vector_db_id
and used the embedding model that comes with Llama Stack by default, all-MiniLM-L6-v2.
The database created will persist as long as the container you are running Llama Stack in continues to run, so at the end of our experiment we deleted the database in order to clean up. In a production deployment, you would more likely separate the creation and ingestion process so that you ingest the documents on a less frequent basis and use the database for multiple requests.
The clean-up code was as follows:
////////////////////////
// REMOVE DATABASE
await client.vectorDBs.unregister(vector_db_id);
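If you do want to keep the database around between runs, as suggested above for a production deployment, a sketch of what that might look like is to check for an existing database before registering a new one. The database name nodejs-reference-architecture-db is made up for illustration, and the identifier field is an assumption about the shape of the objects returned by vectorDBs.list:
// sketch: reuse an existing vector database instead of creating a new one each run
const vector_db_id = 'nodejs-reference-architecture-db'; // hypothetical fixed name
const existing = (await client.vectorDBs.list()).find(
  db => db.identifier === vector_db_id,
);

if (!existing) {
  await client.vectorDBs.register({
    vector_db_id,
    provider_id: provider.provider_id,
    embedding_model: 'all-MiniLM-L6-v2',
    embedding_dimension: 384,
  });
  // ...only ingest the documents when the database was just created
}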
Ingesting the documents
As with our earlier explorations, we wanted to use the documents in the Node.js reference architecture as our knowledge base. We first cloned the GitHub repo to a local directory (in this case, /home/midawson/newpull/nodejs-reference-architecture
).
We then wrote some code that would read the documents into Objects we could pass to the Llama Stack APIs:
// read in all of the files to be used with RAG
const RAGDocuments = [];
const files = await fs.promises.readdir(
'/home/midawson/newpull/nodejs-reference-architecture/docs',
{ withFileTypes: true, recursive: true },
);
let i = 0;
for (const file of files) {
i++;
if (file.name.endsWith('.md')) {
const contents = fs.readFileSync(path.join(file.path, file.name), 'utf8');
RAGDocuments.push({
document_id: `doc-${i}`,
content: mtt.markdownToTxt(contents),
mime_type: 'text/plain',
metadata: {},
});
}
}
Because the reference architecture documents were in Markdown, we used a package to convert them to text and then set the mime_type
to text/plain in the objects we planned to pass to Llama Stack.
Next, we used the Llama Stack ragTool insert API to ingest the documents:
await client.toolRuntime.ragTool.insert({
documents: RAGDocuments,
vector_db_id,
chunk_size_in_tokens: 125,
});
The Llama Stack API handled breaking up the documents into chunks based on the value we passed for chunk_size_in_tokens
. Because this value is in tokens, we needed to make it smaller than the equivalent value we had passed to other frameworks, where chunk size is specified in characters. What we could not find was a way to tune the overlap between chunks; hopefully we'll see that in later versions of the API.
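If chunk overlap matters for your documents, one possible workaround (a sketch only, not something we tested) is to split the text yourself with an overlap before calling the insert API, pass each overlapping chunk as its own document, and set chunk_size_in_tokens high enough that Llama Stack does not split the chunks any further:
// sketch: character-based chunking with overlap, done before calling ragTool.insert
function chunkWithOverlap(text, chunkSize = 500, overlap = 100) {
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// each overlapping chunk becomes its own document
const overlappedDocuments = RAGDocuments.flatMap(doc =>
  chunkWithOverlap(doc.content).map((chunk, chunkIndex) => ({
    document_id: `${doc.document_id}-chunk-${chunkIndex}`,
    content: chunk,
    mime_type: 'text/plain',
    metadata: {},
  })),
);
You would then pass overlappedDocuments to ragTool.insert with a chunk_size_in_tokens larger than any single chunk.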
Querying the documents
Now that we have the documents in the vector database, we can use the Llama Stack APIs to find the most relevant chunks for a given question from the user. You can do this with the ragTool query method:
const rawRagResults = await client.toolRuntime.ragTool.query({
content: questions[i],
vector_db_ids: [vector_db_id],
});
This returns the document chunks that are most relevant to the question being asked by the user. You then provide those chunks to the model as part of the prompt, as follows:
const rawRagResults = await client.toolRuntime.ragTool.query({
content: questions[i],
vector_db_ids: [vector_db_id],
});
const ragResults = [];
for (let j = 0; j < rawRagResults.content.length; j++) {
ragResults.push(rawRagResults.content[j].text);
}
if (SHOW_RAG_DOCUMENTS) {
console.log(ragResults.join());
}
const prompt = `Answer the question based only on the context provided
<question>${questions[i]}</question>
<context>${ragResults.join()}</context>`;
messages.push({
role: 'user',
content: prompt,
});
Asking questions
Having created the prompt with the context retrieved using the ragTool, you can now send the user's question to the model to get the answer based on the RAG results:
const response = await client.inference.chatCompletion({
messages: messages,
model_id: model_id,
});
console.log(' RESPONSE:' + response.completion_message.content);
The result
We used the same question we used in previous RAG experiments based on the Node.js reference architecture: Should I use npm to start an application?
The result was as expected:
node llama-stack-chat-rag.mjs
Iteration 0 ------------------------------------------------------------
QUESTION: Should I use npm to start an application
RESPONSE:No, it's recommended to avoid using npm to start an application as it introduces additional components and processes that can lead to security vulnerabilities and issues with signals and child processes. Instead, use a command like `CMD ["node","index.js"]`.
Based on the answer, we could see that the information from the Node.js reference architecture was used to answer the question. Yay!
If you want to see which document chunks were used, change the value of const SHOW_RAG_DOCUMENTS = false;
to true
and the example will print them out. Here, we can see that the first document chunk used in our run was:
Result 1:
Document_id:doc-3
Content: Node.js applications there are a number
of good reasons to avoid this:
One less component. You generally don't need npm to start
your application. If you avoid using it in the container
then you will not be exposed to any security vulnerabilities
that might exist in that component or its dependencies.
One less process. Instead of running 2 process (npm and node)
you will only run 1.
There can be issues with signals and child processes. You
can read more about that in the Node.js docker best practices
CMD.
Instead use a command like CMD ["node","index.js"],
tooling
This chunk was clearly used in formulating the response returned by the model.
Agents, agents, agents
Agents are often regarded as the best way to leverage the capabilities of large language models. Llama Stack provides an easier way to use RAG through agents, with a built-in tool that can search a vector database and return the relevant information. This tool is in the builtin::rag/knowledge_search
tool group.
Compared to the previous example using the completions API, the required code is simplified, as we don't need to query the vector database and build the results into the prompt. Instead, we simply provide the agent with the tools from the builtin::rag/knowledge_search
tool group, and the agent figures out that it needs to use the tool to get the related results from the vector database.
The full code for the agent example is in ai-tool-experimentation/llama-stack-rag/llama-stack-agent-rag.mjs. The code to read and ingest the documents into the vector database is the same as when using the completions API, so we won't cover that again.
To use RAG with the agent, we started by creating the agent, specifying the builtin::rag/knowledge_search
tool group and the ID of the vector database that should be used:
const agentic_system_create_response = await client.agents.create({
agent_config: {
model: model_id,
instructions:
'You are a helpful assistant, answer questions only based on information in the documents provided',
toolgroups: [
{
name: 'builtin::rag/knowledge_search',
args: { vector_db_ids: [vector_db_id] },
},
],
tool_choice: 'auto',
input_shields: [],
output_shields: [],
max_infer_iters: 10,
},
});
const agent_id = agentic_system_create_response.agent_id;
We also created a session that will be used to maintain state across questions:
// Create a session that will be used to ask the agent a sequence of questions
const sessionCreateResponse = await client.agents.session.create(agent_id, {
session_name: 'agent1',
});
const session_id = sessionCreateResponse.session_id;
We could then use the agent to ask the user's question. Most of the code is for handling the response, as the only option with the agent was to use a stream:
for (let i = 0; i < questions.length; i++) {
console.log('QUESTION: ' + questions[i]);
const responseStream = await client.agents.turn.create(
agent_id,
session_id,
{
stream: true,
messages: [{ role: 'user', content: questions[i] }],
},
);
// as of March 2025 only streaming was supported
let response = '';
for await (const chunk of responseStream) {
if (chunk.event.payload.event_type === 'turn_complete') {
response = response + chunk.event.payload.turn.output_message.content;
} else if (
chunk.event.payload.event_type === 'step_complete' &&
chunk.event.payload.step_type === 'tool_execution' &&
SHOW_RAG_DOCUMENTS
) {
console.log(inspect(chunk.event.payload.step_details, { depth: 10 }));
}
}
console.log(' RESPONSE:' + response);
}
Overall, we can see that it takes less code using the agent. We simply give the agent a tool it can use to get the information needed to answer the question, and it figures out when it needs to call it.
We did find, however, that using the more quantized model did not work with the agent example. When we used the llama3.1:8b-instruct-q4_K_M
model, the agent failed to figure out that it should call the RAG tool and answered without using the additional information. We therefore had to fall back to using the llama3.1:8b-instruct-fp16
model.
Having done that, we got a similar answer:
Iteration 0 ------------------------------------------------------------
QUESTION: Should I use npm to start a node.js application
RESPONSE:Based on the search results, it appears that you don't necessarily need to use npm to start a Node.js application. In fact, one of the search results suggests avoiding using npm in a container and instead running your application directly with `node index.js`. This can help reduce the number of processes running and potential security vulnerabilities.
However, if you do choose to use npm, it's worth noting that there are some benefits to using yarn workspaces or npm workspaces for managing multiple packages in a monorepo.
Just like with the earlier example, we can change const SHOW_RAG_DOCUMENTS = false;
to true to get more info on which document chunks are being used. From that, we can see that the first chunk was the same as before:
{
type: 'text',
text: 'Result 1:\n' +
'Document_id:doc-3\n' +
'Content: Node.js applications there are a number\n' +
'of good reasons to avoid this:\n' +
'\n' +
"One less component. You generally don't need npm to start\n" +
'your application. If you avoid using it in the container\n' +
'then you will not be exposed to any security vulnerabilities\n' +
'that might exist in that component or its dependencies.\n' +
'One less process. Instead of running 2 process (npm and node)\n' +
'you will only run 1.\n' +
'There can be issues with signals and child processes. You\n' +
'can read more about that in the Node.js docker best practices\n' +
'CMD.\n' +
'\n' +
'Instead use a command like CMD ["node","index.js"],\n' +
'\n' +
'tooling\n'
},
Despite getting similar results from the RAG tool call, looking at the Llama Stack logs we can see that the tool call was made with a different query than our question:
20:15:54.585 [INFO] executing tool call: knowledge_search with args: {'query': 'npm vs yarn for node.js applications'}
We found this interesting, as using this query instead of the user's question for the RAG tool call could easily have returned fewer related document chunks, leading to a poorer answer. This illustrates that when you use lower-level APIs like the completion API, you might have more control than when you are using agents. In the agent case, you don't have any direct control over the query used for RAG, as the agent decides when and how to call the tools.
In addition, the fact that we had to use a less quantized model to get a similar result illustrates that agents typically need more capable models, and therefore more resources, so they might not always be the first choice for some deployments. Using simpler options like the completion API might give you more control and allow you to achieve similar results at lower cost. On the flip side, agents often require less work and less knowledge of low-level details to get started quickly.
Wrapping up
This post looked at implementing retrieval-augmented generation (RAG) using Node.js with large language models and Llama Stack. We explored how to ingest documents, query the vector database and ask questions that leverage RAG. We also showed how Llama Stack makes this even easier by integrating RAG into its agent support. We hope it has given you, as a JavaScript/TypeScript/Node.js developer, a good start on using large language models with Llama Stack.
To learn more about developing with large language models and Node.js, JavaScript, and TypeScript, see the post Essential AI tutorials for Node.js developers.
Explore more Node.js content from Red Hat:
- Visit our topic pages on Node.js and AI for Node.js developers.
- Download the e-book A Developer's Guide to the Node.js Reference Architecture.
- Explore the Node.js Reference Architecture on GitHub.