Make it easy to iterate on models and system prompts

Thanesh Pannirselvam
Recently, I've been building an MCP client for interacting with Google Cloud using natural language queries.
The quality (or "correctness") of responses matters a lot in this context, because each user command is translated directly into a single tool call and then executed against their Google Cloud environment.
Improving response quality requires iterating on different models and system prompts. But the question becomes: how do we know whether we're moving the needle?
This blog lays out ideas for designing AI applications so that models and system prompts are decoupled from the underlying implementation, which in turn makes them easier to iterate on.
Models
Say you want to integrate a mix of local (e.g. Ollama) and hosted (e.g. OpenAI, Anthropic, etc.) models in your application.
The issue is that each model may have a unique interface, making it difficult to update the model call each time you want to try out a different model.
To abstract out the model, I came across the adapter design pattern, commonly used in object-oriented programming.
The core premise is this: define a unified interface that allows your application to interact with each model consistently.
Here's how I implemented it in my application: I created two model adapters, each exposing a query() method that performs the underlying model call. If I need to add a new model, I can do so by creating a new model adapter and implementing its query() method.
from openai import OpenAI


class OllamaAdapter:
    def __init__(self, host: str, model_name: str):
        self.model_name = model_name
        self.host = host

    async def query(self, messages):
        # query implementation goes here
        ...


class OpenAIAdapter:
    def __init__(self, api_key: str, model_name: str = "gpt-3.5-turbo"):
        self.model_name = model_name
        self.api_key = api_key
        self.client = OpenAI(api_key=api_key)

    async def query(self, messages):
        # query implementation goes here
        ...


class ModelAdapter:
    def __init__(self, model_adapter):
        self.model_adapter = model_adapter

    async def query(self, messages):
        # delegate to whichever concrete adapter was injected
        return await self.model_adapter.query(messages)
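As a usage illustration (this snippet is not from the original code), swapping models then comes down to changing which concrete adapter you construct; the chat-style message format below is an assumption:

import asyncio

# Wrap whichever concrete adapter you want behind the common interface.
adapter = ModelAdapter(
    OllamaAdapter(host="http://localhost:11434", model_name="gemma3:4b")
)
# adapter = ModelAdapter(OpenAIAdapter(api_key="sk-...", model_name="gpt-4-turbo"))

response = asyncio.run(
    adapter.query([{"role": "user", "content": "List my Cloud Storage buckets"}])
)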
Defining models in this way brings a few advantages:
- Easy to add/remove models without affecting underlying code
- Caters for both local and hosted models
- Straightforward to understand what models are used/available
However, the main limitation of this approach is that some effort is still required to update the logic that determines which model gets called; i.e. deciding whether to call the Ollama or OpenAI model based on what has been specified in the .env file.
In any case, it does simplify the feedback loop for model iteration, which is the main purpose of this abstraction.
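To give a sense of what that selection logic might look like, here is a minimal sketch of a factory that reads the choice from the environment; the MODEL_PROVIDER, MODEL_NAME and OLLAMA_HOST keys are my own placeholders, not variables from the actual project:

import os


def build_adapter() -> ModelAdapter:
    # MODEL_PROVIDER, MODEL_NAME and OLLAMA_HOST are hypothetical .env keys for illustration.
    provider = os.environ.get("MODEL_PROVIDER", "ollama")
    if provider == "openai":
        inner = OpenAIAdapter(
            api_key=os.environ["OPENAI_API_KEY"],
            model_name=os.environ.get("MODEL_NAME", "gpt-3.5-turbo"),
        )
    else:
        inner = OllamaAdapter(
            host=os.environ.get("OLLAMA_HOST", "http://localhost:11434"),
            model_name=os.environ.get("MODEL_NAME", "gemma3:4b"),
        )
    return ModelAdapter(inner)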
System Prompts
The first step to decoupling system prompts is to treat them like any other piece of code: maintainable, version controlled and adaptable.
This was definitely a mindset shift for me, but it made sense once I saw the impact that different words had on the model's output.
More practically, I decoupled the system prompts by templatising them with Jinja2 and moving them into their own directory. Each prompt can then be loaded as needed via a helper function.
System Prompts Directory Structure
System prompts should live alongside the codebase, in a directory structure that is decoupled from the code itself.
.
└── prompts/
    └── system/
        ├── 01.jinja
        └── 02.jinja
Utility Function to Load in System Prompts
A utility function loads the system prompts from this directory so they can be consumed by the application.
from typing import Any

from jinja2 import Environment, FileSystemLoader, select_autoescape


def process_template(template_file: str, data: dict[str, Any]) -> str:
    jinja_env = Environment(
        loader=FileSystemLoader(searchpath="prompts/system"),
        autoescape=select_autoescape(),
    )
    template = jinja_env.get_template(template_file)
    return template.render(**data)
New system prompts can be added to the prompts/system directory and referenced from the .env file so the application picks them up.
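For example, rendering the prompt named in the environment might look like the sketch below; SYSTEM_PROMPT_FILE and the project_id template variable are assumed names for illustration:

import os

# SYSTEM_PROMPT_FILE is a hypothetical .env key, e.g. "02.jinja".
prompt_file = os.environ.get("SYSTEM_PROMPT_FILE", "01.jinja")

# Any variables the template expects are passed through the data dict.
system_prompt = process_template(prompt_file, {"project_id": "my-gcp-project"})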
Evaluation
Having abstracted both the model and system prompt implementations, we can now iterate on them far more easily using automated testing.
The framework I used for this was DeepEval, a tool for unit-testing and evaluating AI applications.
For my use case, I wanted to test gemma3:4b against gpt-4-turbo, and evaluate the differences across two separate system prompts.
With the abstractions above in place, I can specify a config file like the one below to run the tests against.
import os

from adapter import OllamaAdapter, OpenAIAdapter

SYSTEM_PROMPTS = ["01.jinja", "02.jinja"]

MODEL_ADAPTERS = [
    (OllamaAdapter, {"host": "http://localhost:11434", "model_name": "gemma3:4b"}),
    (
        OpenAIAdapter,
        {"api_key": os.environ.get("OPENAI_API_KEY"), "model_name": "gpt-4-turbo"},
    ),
]
Keep in mind, you will still need to implement your deepeval test code to iterate over these lists and generate the required test cases. However, it becomes far easier to extend your testing suite once this is set up.
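As a rough sketch of what that glue code could look like (the golden dataset, the module names config and utils, and the GEval criteria below are placeholders rather than the actual test suite):

import asyncio

from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from config import MODEL_ADAPTERS, SYSTEM_PROMPTS  # assumed module layout
from utils import process_template  # assumed module layout

# Placeholder dataset: each entry pairs a user query with the expected tool call.
GOLDENS = [
    {"input": "List my Cloud Storage buckets", "expected_output": "gcloud storage buckets list"},
]

correctness = GEval(
    name="Correctness",
    criteria="Does the actual output describe the same tool call as the expected output?",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

for adapter_cls, kwargs in MODEL_ADAPTERS:
    adapter = adapter_cls(**kwargs)
    for prompt_file in SYSTEM_PROMPTS:
        system_prompt = process_template(prompt_file, {})
        test_cases = []
        for golden in GOLDENS:
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": golden["input"]},
            ]
            actual = asyncio.run(adapter.query(messages))
            test_cases.append(
                LLMTestCase(
                    input=golden["input"],
                    actual_output=actual,
                    expected_output=golden["expected_output"],
                )
            )
        # Run the Correctness metric for this model / prompt combination.
        evaluate(test_cases=test_cases, metrics=[correctness])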
You can check out how I've set this up here.
Final Thoughts
There are many ways to achieve a similar outcome, but the key idea I want to get across is that abstracting out the model and system prompt enables us to iterate on them easily using automation.
This is especially important right now, where it feels like new models are coming out virtually every second.
I ran DeepEval against the combinations I mentioned above, and the results were as follows:
| Model | System Prompt | Correctness (GEval) pass rate |
| --- | --- | --- |
| gemma3:4b | 01.jinja | 61.54% |
| gemma3:4b | 02.jinja | 38.46% |
| gpt-4-turbo | 01.jinja | 84.62% |
| gpt-4-turbo | 02.jinja | 92.31% |
I don't think it's that surprising that gpt-4-turbo outperforms gemma3:4b, but I was surprised to see that the 02.jinja prompt, which performed better with the OpenAI model, didn't perform as well with gemma3:4b; meaning we can't optimize models or system prompts in isolation.
My take is this: the 02 prompt contains more granularity, which a larger model (like gpt-4-turbo) is better able to interpret and make sense of than a smaller model (like gemma3:4b). Nevertheless, it all comes at a cost...