From Podman AI Lab to OpenShift AI

Learn how to rapidly prototype AI applications from your local environment with Podman AI Lab, add knowledge and capabilities to a large language model (LLM) using retrieval augmented generation (RAG), and use the open source technologies on Red Hat OpenShift AI to deploy, serve, and integrate generative AI into your application.


In this final lesson, we'll connect all of the pieces together to create a RAG chatbot. LangChain will help us connect the LLM we deploy on OpenShift AI to the Elasticsearch vector database we ingested data into in the previous lesson.

In order to get the full benefit from taking this lesson, you need to:

  • Install the OpenShift CLI.
  • Have the appropriate access on your OpenShift cluster to deploy Minio and create projects.
  • Have admin access on OpenShift to create projects and install operators.
  • Have admin access on OpenShift AI to create custom serving runtime images.

In this lesson, you will:

  • Configure an OpenShift AI project and an LLM model serving platform, and upload a custom model serving runtime.
  • Update and deploy the chatbot recipe code from Podman AI Lab on OpenShift with LangChain to connect to the Elasticsearch vector database and OpenShift AI model serving endpoint.

Deploy s3 Storage (Minio)

Optional: If you already have s3-compatible storage, you can skip to step 2 to create the bucket.

OpenShift AI model serving has a dependency on s3 storage. We'll deploy Minio for this tutorial, but any s3-compatible storage should work. For an enterprise s3 storage solution, consider OpenShift Data Foundation.

Follow the Minio Installation if you don't have s3-compatible storage:

  1. Log in to the Minio UI. You can find the route either in the web console or with the oc CLI in your terminal. Log in with minio/minio123. Minio has two routes, an API route and a UI route; make sure you use the UI route (Figure 1).

    Minio UI login screen with user, password, and Login highlighted.
    Figure 1: Minio login.
  2. Create a bucket named "models" and click the Create Bucket button (Figure 2).

    Minio web console -> Object Browser -> Bucket Name highlighted.
    Figure 2: Minio Create Bucket.
  3. Go to Object Browser, select the models bucket you just created, and click the Create new path button. Name the folder path "mistral7b" and select Create (Figure 3).

    Minio web console -> Object Browser -> Create new path highlighted -> New folder path highlighted -> Create highlighted.
    Figure 3: Minio path.
  4. Upload the Mistral-7B model to the folder path you just created. To find where the model was downloaded, go back to Podman AI Lab and click the Open Model Folder icon. See Figure 4.

    Podman AI Lab -> Catalog -> TheBloke/Mistral-7B-Instruct-v0.2-GGUF displayed.
    Figure 4: Podman AI Lab model folder.
  5. In Minio, click the Upload File button and select the model file under the hf.TheBloke.mistral-7b-instruct-v0.2.Q4_K_M directory (Figure 5). If you prefer to upload from the command line, see the boto3 sketch after this list.

    Minio web console -> Object Browser -> models/mistral7b path -> Upload -> Upload File highlighted.
    Figure 5: Minio model upload.

    If the model is uploaded successfully, you should see the screen depicted in Figure 6.

    Minio web console -> Object Browser -> models/mistral7b path -> Downloads/Uploads successful displayed.
    Figure 6: Minio model upload success.
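
If you prefer to upload the model from the command line instead of the Minio UI, here is a minimal, illustrative sketch using the boto3 library (the Minio API route and local model path are placeholders; the credentials, bucket, and path names are the ones used in this lesson):

    import boto3

    # Minio connection details; replace the endpoint with your Minio API route
    s3 = boto3.client(
        "s3",
        endpoint_url="https://YOUR-MINIO-API-ROUTE",
        aws_access_key_id="minio",
        aws_secret_access_key="minio123",
    )

    # Upload the GGUF model file into the models bucket under the mistral7b path
    s3.upload_file(
        "mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # local file path (placeholder)
        "models",
        "mistral7b/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    )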

Create your custom model serving runtime

Follow the product documentation to install the single-model serving platform OR follow the instructions below.

Single-model serving platform automated install

To install the single-model serving platform and add our custom serving runtime, you first need to clone the podman-ai-lab-to-rhoai repository. Then do the following:

  1. Run the following oc command to deploy the Service Mesh operator:

    oc apply -k ./components/openshift-servicemesh/operator/overlays/stable
  2. Run the following oc command to deploy the Serverless operator:

    oc apply -k ./components/openshift-serverless/operator/overlays/stable
  3. Run the following command to find out when the Service Mesh and Serverless operators have installed successfully:

    oc get csv -n openshift-operators --watch
    NAME                          PHASE
    rhods-operator.2.9.1          Succeeded
    serverless-operator.v1.32.1   Succeeded
    servicemeshoperator.v2.5.2    Succeeded

    We'll be using single-stack serving in OpenShift AI, so we'll want to use a trusted certificate instead of a self-signed one. This allows our chatbot to access the model inference endpoint.

  4. Get the name of the ingress cert by selecting a secret that has cert in the name:

    oc get secrets -n openshift-ingress | grep cert
  5. Copy the full name of the secret you chose and use it to replace <CERT_SECRET_FROM_ABOVE> in the command below. Make sure you're in the top-level directory of this project when you run the command:

    oc extract secret/<CERT_SECRET_FROM_ABOVE> -n openshift-ingress --to=ingress-certs --confirm

    You should now have an ingress-certs directory with a tls.crt and tls.key file (Figure 7).

    VS Code -> ingress-certs directory -> tls.crt and tls.key files are highlighted.
    Figure 7: Ingress cert directory.
  6. To update the secret that will be used in our OpenShift AI data science cluster, run the following commands in sequence:

    cd ingress-certs
    
    oc create secret generic knative-serving-cert -n istio-system --from-file=. --dry-run=client -o yaml | oc apply -f -
    
    cd ..

Note: You can delete the ingress-certs folder after you have created the knative-serving-cert secret.

  7. Run the following oc command to enable the single-model serving runtime for OpenShift AI:

    oc apply -k ./components/model-server/components-serving

    It will take around 5 to 10 minutes for the changes to be applied. 

    To see when single-model serving is ready, open the OpenShift web console and go to Operators -> Installed Operators. When you see the Service Mesh and Serverless instances shown in Figures 8 and 9, single-model serving should be available.

    OpenShift web console -> Operators -> Installed Operators -> Red Hat OpenShift Serverless -> showing knative-serving instance.
    Figure 8: Successful RHOAI knative-serving.
    OpenShift web console -> Operators -> Installed Operators -> Red Hat OpenShift Service Mesh -> showing the service mesh control plane instance.
    Figure 9: Successful RHOAI Service Mesh control plane.
  8. Go to the OpenShift AI dashboard, expand Settings, and select Serving Runtimes. You should now see "Single-model serving enabled" at the top of the page (Figure 10).

Note: You might need to refresh the page and it could take a few minutes for the changes to be applied.

    OpenShift AI web console -> Settings -> Serving runtimes -> Single-model serving enabled highlighted.
    Figure 10: Single-model serving enabled.

Note: Make sure your single-model serving platform is using a trusted certificate. If it is not or you're unsure, make sure to apply steps 4-7 in the Single-model serving platform automated install above.

Add a custom serving runtime

We'll now add a custom serving runtime so we can deploy the GGUF version of the model.

Note: We will continue to use the GGUF version of the model so that it can be deployed without a hardware accelerator (e.g., a GPU). OpenShift AI includes a scalable model serving platform that can accommodate deploying multiple full-sized LLMs.

  1. Click the Add serving runtime button (Figure 11).

    OpenShift AI web console -> Settings -> Serving runtimes -> Add serving runtime highlighted.
    Figure 11: Add serving runtime button.
  2. Select Single-model serving platform for the runtime and select REST for the API protocol. Upload the ./components/custom-model-serving-runtime/llamacpp-runtime-custom.yaml file as the serving runtime. Select Create (Figure 12).

    OpenShift AI web console -> Settings -> Serving runtimes -> Add serving runtime screen -> Single-model serving platform, REST, and Quay image location highlighted.
    Figure 12: Add serving runtime.

Note: I've included a pre-built image that is public. If you would rather pull from your own repository, you can build your own image with the Containerfile under ./components/custom-model-serving-runtime.

  3. If the serving runtime was added successfully, you should now see it in the list of available serving runtimes (Figure 13).

    OpenShift AI web console -> Settings -> Serving runtimes -> LlamaCPP runtime displayed.
    Figure 13: Serving runtime list.

Deploy model

Follow the steps below to deploy the model you uploaded to Minio.

  1. Go to your "podman-ai-lab-rag-project" and select Models. You should see two model serving type options. Select Deploy model under the Single-model serving platform (Figure 14).

    OpenShift AI web console -> Data Science Projects -> podman-ai-lab-rag-project -> Single-model serving platform Deploy model button highlighted.
    Figure 14: Deploy model on single-model serving platform.
  2. Fill in the following values and click the Deploy button at the bottom of the form (Figures 15-17):
    • Model name = mistral7b
    • Serving runtime = LlamaCPP
    • Model framework = any
    • Model server size = Medium
    • Select New data connection
    • Name = models
    • Access key = minio
    • Secret key = minio123
    • Endpoint = Your Minio API URL
    • Region = us-east-1
    • Bucket = models
    • Path = mistral7b
OpenShift AI web console -> Model deployment configuration -> Model name, Serving runtime, and Model framework highlighted.
Figure 15: Model deployment configuration 1.
OpenShift AI web console -> Model deployment configuration continued -> Model server size, New data connection, Name, Access key, and Secret key highlighted.
Figure 16: Model deployment configuration 2.
OpenShift AI web console -> Model deployment configuration continued -> Endpoint, Region, Bucket, Path, and Deploy button highlighted.
Figure 17: Model deployment configuration 3.
  3. If your model deploys successfully, you should see the following page (Figure 18).

    OpenShift AI web console -> Data Science Projects -> podman-ai-lab-rag-project -> Models -> mistral7b model deployed successfully.
    Figure 18: Successful model deployment.
  4. Test your model to make sure you can send in a request and get a response. You can use the client code that is provided by the model service in Podman AI Lab. Make sure to update the URL in the cURL command to the Inference endpoint on OpenShift AI:

    curl --location 'https://YOUR-OPENSHIFT-AI-INFERENCE-ENDPOINT/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "messages": [
        {
          "content": "You are a helpful assistant.",
          "role": "system"
        },
        {
          "content": "How large is the capital of France?",
          "role": "user"
        }
      ]
    }'
  5. Your response should be similar to the following:

    {"id":"chatcmpl-c76974b1-4709-41a5-87cf-1951e10886fe","object":"chat.completion","created":1717616440,"model":"/mnt/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf","choices":[{"index":0,
    "message":{"content":" The size of a city's area, including its metropolitan area, can vary greatly, and when referring to the 
    \"capital\" of a country like France, people usually mean the city itself rather than its total metropolitan area. Paris, the capital 
    city of France, covers an urban area of approximately 105 square 
    kilometers (40.5 square miles) within its administrative limits.
    \n\nHowever, if you are asking about the total area of the Paris 
    Metropolitana region, which includes suburban areas and their 
    combined population, it is much larger at around 13,022 square 
    kilometers (5,028 square miles). This encompasses more than just the city of Paris.",
    "role":"assistant"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":32,"completion
  6. If you face any SSL errors when running the previous command, try setting max_tokens to 100, which keeps the response safely under the 60-second timeout of the Knative queue-proxy service (the use of jq is optional; a Python version of this request is also sketched after the output below):

    curl -k --location 'https://YOUR-OPENSHIFT-AI-INFERENCE-ENDPOINT/v1/chat/completions' --header 'Content-Type: application/json' --data '{
        "messages": [
          {
            "content": "You are a helpful assistant.",
            "role": "system"
          },
          {
            "content": "How large is the capital of France?",
            "role": "user"
          }
        ],
        "max_tokens": 100
      }' | jq .
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
      100  1011  100   793  100   218     14      3  0:01:12  0:00:55  0:00:17   174
      {
        "id": "chatcmpl-687c22c8-d0ba-4ea4-a012-d4b64069d7a2",
        "object": "chat.completion",
        "created": 1727727459,
        "model": "/mnt/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
        "choices": [
          {
            "index": 0,
            "message": {
              "content": " The size of a city's area, including its urban and rural parts, is typically measured in square kilometers or square miles. However, when referring to the size of a city's capital, people usually mean the size of its urban core or central business district rather than the entire metropolitan area. In this context, Paris, the capital city of France, has an urban area of approximately 105 square kilometers (40.5 square miles). However, if you meant",
              "role": "assistant"
            },
            "logprobs": null,
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 32,
          "completion_tokens": 100,
          "total_tokens": 132
        }
      }
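
    If you prefer to test the endpoint from Python rather than curl, here is a minimal, illustrative sketch using the requests library (replace the placeholder URL with your own inference endpoint):

    import requests

    # Replace with your OpenShift AI inference endpoint (placeholder)
    url = "https://YOUR-OPENSHIFT-AI-INFERENCE-ENDPOINT/v1/chat/completions"

    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "How large is the capital of France?"},
        ],
        "max_tokens": 100,
    }

    # verify=False mirrors curl -k; drop it once a trusted certificate is in place
    resp = requests.post(url, json=payload, timeout=120, verify=False)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])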

Update the chat recipe application

We'll now take a look at the updates to the chat recipe application we created from Podman AI Lab. The application uses LangChain to connect to the model we just deployed on OpenShift AI and to the Elasticsearch vector database.

  1. We'll start from the default chatbot recipe code accessible from Podman AI Lab (Figure 19).

    Podman AI Lab -> Recipes Catalog -> ChatBot -> Model and Open in VSCode highlighted.
    Figure 19: Podman AI Lab chat application.
  2. In Podman AI Lab, after clicking the Open in VSCode button, you should see the following (Figure 20):

    VS Code -> app directory expanded highlighting chatbot_ui.py, Containerfile, and requirements.txt files.
    Figure 20: Chatbot in VS Code.

    The only code that is modified is under the app directory.

  3. Open the ./components/app/chatbot_ui.py file.

    Note: All of the code changes to the chatbot_ui.py file have been made for you. The next step is intended to walk you through the code changes. 

  4. We will first get some environment variables:
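
    A minimal sketch of what this could look like (illustrative only; MODEL_ENDPOINT, ELASTIC_URL, and ELASTIC_PASS match the environment variables set in deployment.yaml later in this lesson, and the "elastic" username is an assumption):

    import os
    from elasticsearch import Elasticsearch

    # Inference endpoint of the model deployed on OpenShift AI
    model_service = os.getenv("MODEL_ENDPOINT")

    # Elasticsearch vector database connection details
    elastic_url = os.getenv("ELASTIC_URL")
    elastic_pass = os.getenv("ELASTIC_PASS")

    # Connection passed to LangChain as es_connection below (assumes the default "elastic" user)
    es = Elasticsearch(elastic_url,
                       basic_auth=("elastic", elastic_pass),
                       verify_certs=False)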

    Then we will add the LangChain code that gives us our RAG functionality. Note where the model_service (your OpenShift AI inference endpoint URL) is configured and where the Elasticsearch connection is set up. Finally, take note of how both of these are passed to LangChain (the chain variable):

    llm = ChatOpenAI(
        api_key="sk-no-key-required",
        openai_api_base=model_service,
        streaming=True,
        callbacks=[StreamlitCallbackHandler(st.empty(),
                                            expand_new_thoughts=True,
                                            collapse_completed_thoughts=True)])

    db = ElasticsearchStore.from_documents(
        [],
        Embeddings,
        index_name="rhoai-docs",
        es_connection=es,
    )

    chain = RetrievalQA.from_chain_type(llm,
        retriever=db.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"k": 4, "score_threshold": 0.2}),
        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
        return_source_documents=True)

    The last updates to the code format the response so that the relevant documents are included with the answer. Extra packages were also added to the ./components/app/requirements.txt file.
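
    For reference, a RetrievalQA chain created with return_source_documents=True returns both the generated answer and the retrieved documents. A minimal, illustrative sketch of invoking the chain and formatting that output (the variable names and exact formatting in chatbot_ui.py may differ):

    # Ask a question through the RAG chain (illustrative sketch)
    response = chain.invoke({"query": user_question})

    # The answer generated by the model
    answer = response["result"]

    # The documents retrieved from the Elasticsearch vector database
    sources = "\n".join(
        doc.metadata.get("source", "unknown")
        for doc in response["source_documents"]
    )

    st.markdown(f"{answer}\n\n**Relevant documents:**\n{sources}")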

  5. You can build the image from the Containerfile and push it to your own repository, or you can use the one at quay.io/jhurlocker/elastic-vectordb-chat.

  6. Update the ./components/app/deployment.yaml file with your values for the MODEL_ENDPOINT, ELASTIC_URL, and ELASTIC_PASS environment variables:

    env:
    - name: MODEL_ENDPOINT
      value: '<OPENSHIFT_AI_MODEL_INFERENCE_ENDPOINT>'
    - name: ELASTIC_URL
      value: 'https://<YOUR_ELASTICSEARCH_SERVICE_URL>:9200'
    - name: ELASTIC_PASS
      value: '<YOUR_ELASTICSEARCH_PASSWORD>'

Note: Make sure you include https:// and the port :9200 in the ELASTIC_URL environment variable.

  7. Create the project:

    oc new-project elastic-vectordb-chat
  8. Apply the deployment.yaml you just updated to deploy the chatbot application:

    oc apply -f ./components/app/deployment.yaml
  9. Get the route to the chatbot application:

    oc get route -n elastic-vectordb-chat
  10. Open the application in your browser (Figure 21).

    RAG chatbot UI that is deployed on OpenShift running in a web browser.
    Figure 21: Chatbot deployed on OpenShift.
  11. Type in a message and press Enter. It might take a while for the model to respond if it is deployed on a CPU (Figure 22).

    RAG chatbot running in a web browser with the prompt "What is RHOAI?" and the response.
    Figure 22: RAG chatbot in action.
  12. In the OpenShift web console, you can check the model server logs under the podman-ai-lab-rag-project -> Workloads -> Pods (mistral7b-*) -> Logs. Note the log statements when a message is sent to the model inference endpoint (Figure 23).

    OpenShift web console -> Workloads -> Pods highlighted -> Project: podman-ai-lab-rag-project selected -> Logs -> log output showing a 200 OK response.
    Figure 23: The model pod log in OpenShift.

Congratulations! You've successfully taken a model and application from Podman AI Lab and created a RAG chatbot deployed on OpenShift and OpenShift AI.

Special thanks to the maintainers of the repositories below:

  • LLM On OpenShift: the notebook used to ingest data into Elasticsearch and the LangChain code added to the chatbot app.
  • AI Accelerator: the code used to deploy the various components on OpenShift and OpenShift AI.

I recently did a live YouTube demonstration of this learning path. Click on the link to view it. 

Want to learn more about OpenShift AI? Explore these OpenShift AI learning paths.
