Over the last few months, our team has been working on a new AI project called RamaLama (Figure 1). Yes, another name that contains lama.
What does RamaLama do?
RamaLama facilitates local management and serving of AI models.
RamaLama's goal is to make it easy for developers and administrators to run and serve AI models. RamaLama merges the world of AI inferencing with the world of containers as designed by Podman and Docker, and eventually, Kubernetes.
When you first launch RamaLama, it inspects your system for GPU support, falling back to CPU support if no GPUs are present. It then uses a container engine like Podman or Docker to download a container image from quay.io/ramalama that contains all the software necessary to run an AI model for your system's setup.
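If you are curious what RamaLama detected on your machine, a quick check looks like the following (this assumes the info subcommand available in current releases, which reports the detected container engine, accelerator, and default image):
ramalama info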
Once the container image is in place, RamaLama pulls the specified AI model from any of these types of model registries: Ollama, Hugging Face, or an OCI registry.
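For example, a pull might look like the following. The ollama://, huggingface://, and oci:// prefixes are the transports RamaLama documents; the model names and registry paths below are placeholders, not recommendations:
# pull from the Ollama registry (the default transport)
ramalama pull ollama://tinyllama
# pull a GGUF file from Hugging Face (substitute a real repo and file)
ramalama pull huggingface://&lt;repo&gt;/&lt;model-file&gt;.gguf
# pull a model stored as an OCI artifact in a container registry
ramalama pull oci://quay.io/&lt;org&gt;/&lt;model&gt;:latest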
At this point, once RamaLama has pulled the AI model, it’s showtime, baby! Time to run our inferencing runtime. RamaLama offers switchable inferencing runtimes, namely llama.cpp and vLLM, for running containerized models.
RamaLama launches a container with the AI model volume-mounted into it, starting a chatbot or a REST API service from a single, simple command. Models are treated similarly to how Podman and Docker treat container images. RamaLama works with Podman Desktop and Docker Desktop on macOS and Windows.
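With the image and model in place, a single command starts either mode (the model name here is only an example):
# interactive chatbot against a local model
ramalama run granite
# serve the same model behind a REST API instead
ramalama serve granite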
Running AI workloads in containers eliminates the need for users to configure the host system for AI.
8 reasons to use RamaLama
RamaLama thinks differently about LLMs, connecting your use cases with the rest of the Linux and container world. You should use RamaLama if:
- You want a simple and easy way to test out AI models.
- You don’t want to mess with installing specialized software to support your specific GPU.
- You want to find and pull models from any catalog including Hugging Face, Ollama, and even container registries.
- You want to use whichever runtime works best for your model and hardware combination: llama.cpp, vLLM, whisper.cpp, etc. (see the example after this list).
- You value running AI models in containers for the simplicity, collaborative properties, and existing infrastructure you have (container registries, CI/CD workflows, etc.).
- You want an easy path to run AI models on Podman, Docker, and Kubernetes.
- You love the power of running models at system boot using containers with Quadlets.
- You believe in the power of collaborative open source to enable the fastest progress and the most creativity when tackling new problems in a fast-moving space.
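As a rough sketch of the runtime switching mentioned above (this assumes the --runtime option with llama.cpp and vllm values; check ramalama --help on your install):
# default runtime (llama.cpp)
ramalama serve granite
# same model served with vLLM instead
ramalama --runtime vllm serve granite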
Why not just use Ollama?
Realizing that lots of people currently use Ollama, we looked into working with it. While we loved its ease of use, we did not think it fit our needs. We decided to build an alternative tool that allows developers to run and serve AI models from a simple interface, while making it easy to take those models, put them in containers, and enable all of the local, collaborative, and production benefits that they offer.
Differences between Ollama and RamaLama
Table 1 compares Ollama and RamaLama capabilities.
| Feature | Ollama | RamaLama |
| --- | --- | --- |
| Running models on the host OS | Defaults to running AI models locally on the host system. | Defaults to running AI models in containers on the host system, but can also run them directly on the host via the --nocontainer option. |
| Running models in containers | Not supported. | Default. RamaLama wraps Podman or Docker and launches them first, downloading a container image with all of the AI tools ready to execute. It also downloads the AI model to the host, then launches the container with the AI model mounted into it and runs the serving app. |
| Support for alternative AI runtimes | Supports llama.cpp. | Currently RamaLama supports llama.cpp and vLLM. |
| Optimization and installation of AI software | Statically linked with llama.cpp. | RamaLama downloads different container images with all of the software, optimized for your specific GPU configuration. Benefit: users get started faster with software already optimized for the GPU they have, similar to how Flatpak pulls the display stack once and reuses it everywhere. The same optimized containers are used for every model you pull. |
| AI model registry support | Defaults to pulling models from Ollama; some support for Hugging Face, and no support for OCI content. | Supports pulling from OCI, Ollama, and Hugging Face. Benefit: sometimes the latest model is only available in one or two places. RamaLama lets you pull it from almost anywhere; if you can find what you want, you can pull it. |
| Podman Quadlet generation | None. | RamaLama can generate a Podman Quadlet file suitable for launching the AI model and container under systemd as a service on an edge device. The Quadlet is based on the locally running AI model, making it easy for a developer to go from experimenting to using it in production (see the example after this table). |
| Kubernetes YAML generation | None. | RamaLama can generate a Kubernetes YAML file so users can easily move from a locally running AI model to running the same AI model in a Kubernetes cluster (see the example after this table). |
| Switchable inference runtimes | Supports only llama.cpp. | Can be switched between llama.cpp and vLLM for running models. |
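Here is a hedged sketch of the generation workflow from the table. The --generate values quadlet and kube are assumptions based on current RamaLama documentation; verify the exact syntax with ramalama serve --help on your version:
# emit a Podman Quadlet unit for serving the model under systemd
ramalama serve --generate quadlet granite
# emit Kubernetes YAML for running the same model in a cluster
ramalama serve --generate kube granite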
Bottom line
We want to iterate quickly on RamaLama and experiment with how we can help developers run and package AI workloads with different patterns, such as retrieval-augmented generation (RAG), Whisper support, summarizers, and more.
Install RamaLama
You can install RamaLama via PyPI, an install script, or your distribution's package manager.
PyPI
RamaLama is available on PyPI:
pipx install ramalama
Install by script
Install RamaLama by running one of the following one-liners.
Linux:
curl -fsSL https://raw.githubusercontent.com/containers/ramalama/s/install.sh | sudo bash
macOS (run without sudo):
curl -fsSL https://raw.githubusercontent.com/containers/ramalama/s/install.sh | bash
Distro install
Fedora:
$ sudo dnf -y install ramalama
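Whichever method you use, a quick sanity check afterward (the version subcommand is assumed here; ramalama --help also confirms the install):
$ ramalama version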
We need your help!
We want you to install the tool, try it out, and then tell us what you think.
Looking for a project to contribute to? RamaLama welcomes you. It is written in simple Python and wraps other tools, so the barrier to contributing is low. We would love help with documentation and, potentially, web design. This is definitely a community project where we can use varied talents.
We are looking for help packaging RamaLama for other Linux distributions, macOS (Homebrew?), and Windows. We have it packaged for Fedora and plan on getting it into CentOS Stream and hopefully RHEL. But we really want to see it available everywhere you can run Podman and/or Docker.