Use a GPU to speed up your LLM
In the previous lesson, we interacted with a model running locally. While this worked, it was likely too slow to be practical in the inner development loop. In this lesson, we will look at how to speed up a locally running model by leveraging a graphics processing unit (GPU).
If you are on a newer ARM-based macOS machine, good news—GPU acceleration is already enabled and you can skip to Lesson 3.
In order to get full benefit from taking this lesson, you need to:
- Have an NVIDIA-based GPU.
- Install the NVIDIA SDK and C/C++ compiler for your system.
In this lesson, you will:
- Install the NVIDIA SDK.
- Install the C/C++ compiler for your platform.
- Recompile `node-llama-cpp` to enable GPU acceleration.
- Run the example to send questions to the running model and get the responses, and see that it executes much faster now.
Set up the environment
- First, install the CUDA toolkit (version 12.x or higher).
- Next, install the C/C++ compiler for your platform, including support for CMake and CMake.js. For Windows, that would be the Microsoft C++ Build Tools, and for Linux, Clang or GCC, along with Ninja and Make. More detailed instructions are available in the requirements section of the cmake-js README.
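Before recompiling anything, it can help to confirm that the CUDA toolkit and driver are visible from your shell. The two commands below are standard NVIDIA utilities rather than anything specific to this lesson; the versions they report will depend on what you installed.

```
# Confirm the CUDA compiler from the toolkit is on your PATH
nvcc --version

# Confirm the NVIDIA driver can see your GPU
nvidia-smi
```

If either command is not found, revisit the toolkit install (and your PATH) before moving on.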
Recompile node-llama-cpp
To recompile `node-llama-cpp`, run the following command in the `lesson-1-2` directory from the last lesson:

```
npx --no node-llama-cpp download --cuda
```
This will rebuild `node-llama-cpp` with CUDA enabled. It might take a few minutes, and you should see the output of the compilation taking place. This compilation was avoided by the default install because `node-llama-cpp` includes pre-built binaries (using `node-addon-api`) for Linux, macOS, and Windows. Our compilation ended up with:

```
√ Compiled llama.cpp
Repo: ggerganov/llama.cpp
Release: b2249
Done
```
- If the compilation fails, you should double-check your compiler and CUDA toolkit install. If the compilation has trouble finding the NVIDIA toolkit, we recommend that you restart your machine; this resolved the problem for us.
Profit from GPU acceleration
Look at the `langchainjs-basic-gpu.mjs` example, in which the only modification from our first example is the addition of the `gpuLayers` option when creating the model.
```js
const model = await new LlamaCpp({ modelPath: modelPath, gpuLayers: 64 });
```
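For reference, a minimal version of the full script could look like the following sketch. It assumes the `@langchain/community` LlamaCpp integration and uses a placeholder model path and question; adjust both to match the model file and prompt you used in the previous lesson.

```js
// langchainjs-basic-gpu.mjs -- minimal sketch, not the verbatim example file.
// The model path below is a placeholder; point it at the GGUF file you downloaded earlier.
import { LlamaCpp } from "@langchain/community/llms/llama_cpp";

const modelPath = "./models/your-model.Q4_K_M.gguf";

// gpuLayers controls how many of the model's layers are offloaded to the GPU.
const model = await new LlamaCpp({ modelPath: modelPath, gpuLayers: 64 });

// Ask a question and print the model's answer.
const response = await model.invoke("Why is the sky blue?");
console.log(response);
```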
Now that we have a CUDA-accelerated version of `node-llama-cpp`, run the same example as before with:

```
node langchainjs-basic-gpu.mjs
```
In our case, the time needed to answer the question dropped from about 25 seconds to about 3 seconds. That’s much easier to experiment with! Depending on your GPU, you might need to experiment with how many GPU layers you can offload.
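If you want to see how much of the model actually landed on the GPU, or how much video memory a given `gpuLayers` value consumes, you can watch the GPU from a second terminal while the example runs. `nvidia-smi` ships with the NVIDIA driver, and its `-l 1` flag simply refreshes the readout every second.

```
# Refresh GPU utilization and memory usage once per second while the example runs
nvidia-smi -l 1
```

If the process runs out of video memory, lower `gpuLayers` until the model fits.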
Conclusion
Now that we can experiment locally at a faster pace, we will build on the earlier example by:
- Building a more complex example that supports retrieval-augmented generation.
- Showing how LangChain.js lets you develop, experiment, and test in one environment, and then deploy to another with minimal changes to your application.