Principle II: Load Prompts Safely (If You Really Have to)

Would you want your chatbot to start discussing Taylor Swift lyrics instead of providing tech support? That’s what our chatbot did when we violated the principle above. If you want to de-Swift your application and make your AI architecture safer, keep reading. (Sorry Taylor fans!)

Where to Store Prompts

Do you store your prompts with the rest of the code? Or load them from another source? Perhaps a combination of both? Below is the framework for thinking about this decision. 

Option A - Store Prompts in Git

The first question you should ask is: Is there an immediate reason for storing the prompts separately from your code? If not, leave the prompts in Git with the rest of the codebase, where they belong. This is by far the easiest and safest setup to maintain, and it is the default option.

Going back to Principle #1 (Prompts Are Code): storing parts of your codebase outside Git is possible and sometimes necessary, but it is not trivial. Do not take the decision to move prompts out of Git lightly.

Option B - Load Prompts from a Version-Controlled Platform

What if some of your prompts need to be edited by non-engineers? This can happen when a prompt requires deep domain expertise, or when it needs to be modified so frequently that you can't wait on the engineering department.

In this case, you'll need to load the prompt at runtime from a version-controlled source. I've seen Confluence and Google Docs used successfully for this purpose. Many other version-controlled, API-accessible platforms are also available.

When planning the prompt-loading logic, do not underestimate the effort involved in adding this integration. You'll need to handle a variety of error conditions and scenarios to have confidence in your application. Access permissions need to be configured and maintained, and automated testing and monitoring should be extended to catch errors as early as possible.

Here are some of the scenarios you need to plan for:

  • The application is unable to load prompts at runtime. Do you kill the deployment? Switch to a backup version of the prompt?

  • Prompt syntax becomes invalid after a change and returns unusable data structures. Automated tests fail to detect the issue because prompts aren't loaded during test execution. What kind of additional testing infrastructure and monitoring needs to be added to detect this and minimize customer impact?

  • Prompt needs to be urgently rolled back. Does this require a new code deployment? Or do you build a separate UI for prompt deployment?

  • Syntax added to the document by platforms like Confluence can infiltrate the runtime prompt and degrade its performance. Make sure you filter that markup out with tools such as Beautiful Soup.

All of these issues are 100% solvable. But it's easy to fall into the pattern of thinking that loading a prompt from a Google Doc is a trivial operation that won’t affect the application architecture in a deep way. As I’ve shown above, loading an external prompt is serious business to be approached with care for high-reliability applications.
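
To make these failure modes concrete, below is a minimal sketch of a defensive prompt loader. It assumes a hypothetical fetch_confluence_page helper, an in-repo fallback prompt, and a {user_input} placeholder that the prompt is expected to contain; none of these names come from a specific platform API, so treat this as an illustration rather than a prescription.

```python
from bs4 import BeautifulSoup  # used to strip platform markup from the fetched page


def fetch_confluence_page(page_id: str) -> str:
    """Hypothetical helper wrapping your Confluence (or Google Docs) API client."""
    raise NotImplementedError("replace with your platform's API client")


# Fallback prompt kept in the codebase, used when the external source is unavailable.
FALLBACK_SUPPORT_PROMPT = (
    "You are a technical support assistant. Answer only questions about our product.\n"
    "User question: {user_input}"
)


def load_prompt(page_id: str) -> str:
    """Load a prompt from an external page, falling back to the in-repo version."""
    try:
        raw = fetch_confluence_page(page_id)
    except Exception:
        # The external source is unreachable: fall back rather than running
        # with an empty prompt (or kill the deployment, if you prefer fail-fast).
        return FALLBACK_SUPPORT_PROMPT

    # Strip platform markup so it doesn't leak into the runtime prompt.
    text = BeautifulSoup(raw, "html.parser").get_text(separator="\n").strip()

    # Basic sanity checks so a broken edit doesn't silently reach production.
    if not text or "{user_input}" not in text:
        return FALLBACK_SUPPORT_PROMPT

    return text
```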

Option C - Load Prompts from a Non-Version-Controlled Platform

This is a bad idea, and you will regret it. The source of truth for your prompts needs to be version-controlled and have proper API and access controls. This is not an area to cut corners.

Option D - Hybrid Approach

The hybrid approach combines storing some prompts directly within your codebase and loading others from external, version-controlled sources. While maintaining a unified location for all prompts is often simpler and more reliable, there are scenarios where a hybrid strategy can offer advantages.

Consider adopting a hybrid approach under conditions such as:

  • Mixed Usage: Certain prompts require frequent updates by non-coding domain experts, making external loading practical, while others are only changed by engineers.

  • Risk Management: Critical prompts (e.g., guardrails) should reside in the main repository for maximum reliability. Less critical prompts, particularly those undergoing frequent adjustments, can safely live externally.

  • Evaluation Flexibility: Prompts intended for ML-style evaluation can be managed externally to simplify their integration with an evaluation framework.
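
One way to keep a hybrid setup manageable is a single registry that records where each prompt comes from, so the rest of the application asks for a prompt by name and never cares about the source. This is only a sketch: the prompt names are invented, and it reuses the load_prompt helper from the earlier example.

```python
from typing import Callable, Dict

# Critical prompt kept in the repo, next to the code that depends on it.
GUARDRAIL_PROMPT = (
    "Review the draft answer. Reject anything off-topic, unsafe, or non-compliant."
)

# Single lookup point: in-repo prompts and externally loaded prompts side by side.
PROMPT_REGISTRY: Dict[str, Callable[[], str]] = {
    "guardrail": lambda: GUARDRAIL_PROMPT,                       # in Git, maximum reliability
    "support_intro": lambda: load_prompt("SUPPORT_INTRO_PAGE"),  # edited by domain experts
}


def get_prompt(name: str) -> str:
    """Resolve a prompt by name, regardless of where it is stored."""
    return PROMPT_REGISTRY[name]()
```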

Guardrail Prompts

Guardrail prompts (also known as censor prompts) are specialized prompts that screen responses before they reach users, ensuring outputs are appropriate, safe, and compliant. Guardrails serve as a protective mechanism, particularly in applications where user interactions carry significant legal or ethical risks. They provide a second line of defense, catching inappropriate outputs that slip through.

Do not load guardrail prompts from an external doc - this adds a significant unnecessary risk. Either keep them in Git with your code or use a dedicated third party tool, such as Fiddle Guardrails. Guardrail logic doesn’t change very often, so this approach won’t slow you down all that much.
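
As a rough illustration of the pattern (more detail in a later post), a guardrail prompt kept in Git can be applied as a final check before a draft answer reaches the user. The call_llm function below is a generic placeholder, not a specific provider's API.

```python
# Guardrail prompt stored in Git with the code, never fetched at runtime.
GUARDRAIL_PROMPT = (
    "You are a safety reviewer for a tech support bot. Reply APPROVE if the answer "
    "is on-topic, safe, and compliant; otherwise reply REJECT."
)

SAFE_FALLBACK = "Sorry, I can't help with that. Please contact our support team."


def apply_guardrail(call_llm, draft_answer: str) -> str:
    """Second line of defense: screen the draft answer before returning it.

    call_llm is a placeholder for whatever client your application uses to send
    a prompt to the model and receive text back.
    """
    verdict = call_llm(f"{GUARDRAIL_PROMPT}\n\nAnswer to review:\n{draft_answer}")
    return draft_answer if verdict.strip().upper().startswith("APPROVE") else SAFE_FALLBACK
```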

Using guardrails is a principle of its own, to be discussed in much more detail in a future post. It's a great pattern that improves the safety of your application and helps you sleep better at night. Just don’t load them from Google Docs.

Loading Prompts for Easier Evaluation

Teams often load prompts externally to integrate them with evaluation engines such as MLflow. The underlying assumption behind this practice is that prompts are similar to ML models and need a detached, statistical assessment. You plug in a prompt, measure the F1 score on the output (or whatever metric you prefer), and iterate.

This approach is sometimes valid, for instance for classification prompts designed to behave as ML models. But most prompts are fundamentally different. As outlined in Principle #1 (LLM Prompts Are Code), typical prompts are more similar to application logic than to ML models, and they are better suited to pass-fail evaluation together with the surrounding code than to a statistical evaluation approach.

External evaluation engines will not help you with most prompts. Instead, you should use automated AI-driven tests, similar to traditional unit tests. These are going to be the focus of subsequent posts.

Consider the following practices:

  • Only prompts whose functionality explicitly mimics machine learning models (e.g., classification or scoring tasks) should be externally evaluated.

  • Maintain the majority of business logic prompts within the main codebase, employing traditional automated testing approaches similar to unit testing rather than ML validation techniques.

  • Where external evaluation is warranted, isolate only those prompts when possible.
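
To illustrate the pass-fail style, a prompt can be exercised much like ordinary application code. The sketch below assumes a pytest-style test and a placeholder call_llm client; the extraction prompt itself is invented for the example.

```python
import json


def call_llm(prompt: str) -> str:
    """Placeholder model client; in CI this could call the API or a recorded response."""
    raise NotImplementedError("wire up your LLM client here")


# Hypothetical business-logic prompt that extracts structured fields from a ticket.
EXTRACTION_PROMPT = (
    "Extract the product name and a one-sentence summary from the ticket below. "
    'Respond with JSON: {{"product": ..., "summary": ...}}.\n\nTicket: {ticket}'
)


def test_extraction_prompt_returns_expected_structure():
    # Pass-fail check, like a unit test: the output must be valid JSON with the
    # expected keys, otherwise downstream code breaks.
    response = call_llm(EXTRACTION_PROMPT.format(ticket="Exporter v2 crashes on large files."))
    payload = json.loads(response)
    assert set(payload) == {"product", "summary"}
    assert payload["product"] and payload["summary"]
```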

Case Study

The central issue with loading prompts is availability: what should you do if a prompt doesn't load when you expect it to?

This is what happened to us in the Taylor Swift example. Due to a Confluence credentials issue, none of the prompts for a tech support app loaded, including the guardrail prompt. This somehow didn't trigger any runtime errors, and the bot began responding without any instructions or input (since the input formatting string was part of the prompt). And what does OpenAI's LLM want to talk about in the absence of input? Turns out: the lyrics to 'I Want to Break Free' by Queen and various Taylor Swift songs. Fortunately, this was caught and fixed almost immediately, and users enjoyed the music discussion (at least, that's what I tell myself).

Why did this incident occur? Two mistakes were made:

  • No checks were performed to verify that the prompts had loaded successfully. An error should have been thrown at prompt load time, since the app could not function without its prompts.

  • The guardrail prompt was loaded externally with the rest of the prompts. That’s one prompt that should not be loaded in this way. It should have been kept in Git as the last line of defense.

After the incident, the guardrail prompt was migrated back to Git, and exception logic was added to prevent deployment if a prompt failed to load or was invalid. You can save yourself a postmortem by following these recommendations proactively.
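
Here is a rough sketch of that post-incident fix: validate every required prompt at startup and abort rather than serve traffic without instructions. The prompt names are invented, and get_prompt stands in for whatever lookup your application uses (such as the registry sketched earlier).

```python
REQUIRED_PROMPTS = ["support_intro", "troubleshooting", "guardrail"]


class PromptLoadError(RuntimeError):
    """Raised at startup when a prompt is missing or invalid."""


def validate_prompts(get_prompt) -> None:
    """Fail fast: block startup/deployment instead of running without prompts."""
    for name in REQUIRED_PROMPTS:
        try:
            text = get_prompt(name)
        except Exception as exc:
            raise PromptLoadError(f"could not load prompt '{name}'") from exc
        if not text or not text.strip():
            raise PromptLoadError(f"prompt '{name}' is empty or invalid")


# Called once during application startup, before any traffic is served:
# validate_prompts(get_prompt)
```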

Conclusion

In this post, I examined key considerations around prompt storage and loading within AI applications. The default practice is to store your prompts alongside your code in version-controlled repositories. Only deviate from this when there's a compelling reason, such as frequent editing by non-engineers or specific evaluation requirements.

When prompts must be loaded externally, choose reliable, strictly version-controlled sources and add testing and monitoring for resilience. Guardrail prompts, given their critical role in application safety, should remain in your codebase to avoid severe reliability risks.

Most prompts are closer in nature to code than to ML models, so only use ML-style tools where you need them. Don't store all of your prompts externally just to simplify integration with an evaluation tool for a few of them.

If you enjoyed this post, follow the series for more insights.

