Principle III: Modularize The Prompts
A hideous monstrosity. Every experienced engineer has seen one: code so vast, high-risk, and difficult to understand that no one dares to touch it. There are no unit tests, and every change is cause for a minor heart attack. The only ones who venture near it are the old-timers - those who were around when the monster was built - and even they come close only when there is no alternative. It's stale and unmodularized, and its dependencies are out of date. The component is too dangerous to seriously alter.
I remember the first monstrosity I encountered: a 5,000-line function central to the operations of a business worth hundreds of millions of dollars. Barely anyone had the confidence to touch it. When it broke, whole teams were woken up in the middle of the night. All development in the company slowed down because of a dependency on this key component. Millions of dollars were spent trying to manage the monster.
What does all of this have to do with LLM prompts? They can become monstrosities too - so scary to change that no one touches them. Or conversely, teams try to fix them and cause an avalanche of incidents.
What Customers Need
Customers don’t want to pay for software that works correctly only on Tuesdays and Thursdays; they demand constant reliability and a steady stream of new features. When building long-term, high-reliability systems, it's essential to enable the application to evolve while constantly keeping the lights on. This applies to Gen AI powered applications as much as to traditional software.
So how do you get a healthy AI powered application and not a monstrosity? There are over a dozen approaches, all covered in this series. They all start with one principle: instead of one ginormous prompt, you want multiple smaller, focused prompts that each aim to solve a single problem.
What is Modularization
Modularization is the practice of breaking down a complex system into smaller, self-contained, and reusable components. In traditional software engineering, this means writing functions, classes, and services that each handle a specific task. In the context of prompt engineering for LLMs, modularization means splitting a large, monolithic prompt into smaller, focused prompts—each designed to perform a single, well-defined job.
Benefits of Modularization
Modularization allows you to safely introduce changes to your system over time. Its importance grows when:
The length of time the application will be maintained increases.
The number and complexity of features expected to be added increases.
The reliability requirements on the system get stricter.
All of these dimensions need to be understood when planning out the system.
But how specifically does modularization help maintain the system? The main benefits are described below.
Risk Reduction
LLM prompt performance is inherently unstable: any change to a prompt can affect its output in unpredictable ways. You can manage this risk by breaking big prompts into components, so that a change can only affect the performance of one part of the system. Even if one prompt is broken, the rest of the system will operate as it did before the change.
But what if prompts operate as a chain? Wouldn’t breaking one component still break the chain? Yes, it would, but the damage is still reduced in this scenario. An erroneous output in a prompt chain can supply the downstream prompts with faulty inputs, but each component would still operate as before the change on the set of valid inputs. Contrast this with altering a giant prompt - the change can (and will!) affect every bit of logic encoded in that prompt. You didn’t break one aspect of the system - you potentially broke every part of it.
(Operating chains of prompts safely is the subject of a future chapter in the series - you need to plan for various types of failures and have contingency plans - but this is beyond the scope here.)
Improved Testability
Anyone who has written unit tests knows that a simple function that does a single thing is WAY easier to test than a complex function that tries to do many different things. The same applies to prompts - a small, focused prompt can be tested much more thoroughly both manually and in a fully automated manner.
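To make this concrete, here is a minimal sketch of what testing a focused prompt can look like. The `call_llm` callable and the prompt wording are assumptions, not a real API - the point is that a single-purpose prompt has a narrow output contract you can assert against, with a stub in unit tests and the real model in a slower eval suite.

```python
# Hypothetical focused prompt: detect the language of a user message.
# call_llm is an assumed wrapper around your model API: (prompt: str) -> str.

LANGUAGE_PROMPT = (
    "Identify the language of the user message. "
    "Answer with a single ISO 639-1 code, e.g. 'en' or 'es'.\n"
    "Message: {message}"
)

def detect_language(message: str, call_llm) -> str:
    """Run the focused language-detection prompt and normalize the output."""
    raw = call_llm(LANGUAGE_PROMPT.format(message=message))
    return raw.strip().lower()

# In unit tests, replace the model with a stub to exercise the parsing and
# normalization logic; run the same cases against the real model nightly.
def fake_llm(prompt: str) -> str:
    return " EN \n"  # models often add whitespace/casing noise
```

Because the prompt does exactly one thing, a handful of assertions on the normalized output covers it; a six-output monolith would need a combinatorial test matrix instead.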
Better Performance
A wide body of evidence shows that shorter prompts tend to outperform longer ones: 1, 2, 3.
Research on the effects of multitasking on prompt performance is more mixed: 4, 5. A perfectly optimized prompt can, under the right circumstances, multitask. In practice, though, it is much easier to optimize focused prompts, where you can track performance along a single main dimension. You should aim for more focused prompts wherever possible.
Ease of Knowledge Share
Explaining the intricacies of a 3,000-word super prompt to a new team member is a journey. And no matter how much you explain, the only ones who will have a feel for this beast are its contributing authors.
A system of prompts, each part relatively simple, is much faster to onboard onto; engineers will start being productive sooner.
Cost Optimization
By using different models in different parts of the system, you can achieve significant cost and latency savings without affecting response quality.
For example, a prompt that determines the input language doesn’t have to be particularly smart - it doesn’t require your latest and most expensive model. On the other hand, the prompt that generates the reply based on documentation could benefit from the chain-of-thought reasoning built into high-end models.
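A minimal sketch of this idea, assuming each prompt is identified by a task name; the model identifiers are placeholders, not real API model names:

```python
# Route each task to the cheapest model that can handle it.
# Model names here are illustrative placeholders.

MODEL_FOR_TASK = {
    "detect_language": "small-cheap-model",      # trivial classification
    "classify_label":  "small-cheap-model",
    "generate_reply":  "large-reasoning-model",  # needs the expensive model
}

def model_for(task: str) -> str:
    # Default to the cheap model; only listed tasks get an override.
    return MODEL_FOR_TASK.get(task, "small-cheap-model")
```

With a monolithic prompt, every call pays for the most capable model; with a routing table like this, only the steps that need it do.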
When to NOT Modularize
Most software-powered applications require safely adding features over extended periods of time. There is, however, an exception. Prototype applications are not intended to be maintained for long; they won’t get new features, and are not meant for high reliability. So don’t waste time with modularization when building prototypes. In fact, most of the patterns in this series do not apply to prototype applications. When building a prototype - go quick, verify the critical unknowns and then throw the code away.
Another consideration is knowing when to stop modularizing. There is overhead to managing extra prompts, and if the benefits of further modularization are low, you should stop breaking the system up.
Infrastructure for Modularization
If modularizing prompts were trivial, everybody would be doing it. To manage many prompts in a system, you need to invest in infrastructure - without it, you will get chaos. Here are the minimal requirements for LLM prompt infrastructure:
Ability to add prompts quickly and painlessly in a standardized way. This is particularly important when prompts are loaded from outside the codebase. See Principle II: Load Prompts Safely (If You Really Have to).
Ability to deploy prompts in an automated way.
Ability to log and monitor inputs/outputs of individual prompts.
Ability to add automated tests that cover prompts.
A way to easily track token/$ spend on various prompts.
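As an illustration, here is a toy in-process registry covering a few of the requirements above: standardized registration, per-prompt logging of inputs and outputs, and crude per-prompt spend tracking. All names are hypothetical, and the word-count "tokens" are a stand-in for a real tokenizer:

```python
import time
from dataclasses import dataclass, field

@dataclass
class PromptCall:
    """One logged invocation of a named prompt."""
    name: str
    inputs: dict
    output: str
    tokens: int
    ts: float

@dataclass
class PromptRegistry:
    templates: dict = field(default_factory=dict)
    calls: list = field(default_factory=list)

    def register(self, name: str, template: str) -> None:
        # Standardized way to add a new prompt to the system.
        self.templates[name] = template

    def run(self, name: str, call_llm, **inputs) -> str:
        prompt = self.templates[name].format(**inputs)
        output = call_llm(prompt)
        # Word count is a crude token proxy; swap in your tokenizer
        # and per-model pricing for real $ tracking.
        tokens = len(prompt.split()) + len(output.split())
        self.calls.append(PromptCall(name, inputs, output, tokens, time.time()))
        return output

    def spend_by_prompt(self) -> dict:
        totals: dict = {}
        for c in self.calls:
            totals[c.name] = totals.get(c.name, 0) + c.tokens
        return totals

# Illustrative usage with a stubbed model call:
reg = PromptRegistry()
reg.register("summarize", "Summarize this conversation: {text}")
reply = reg.run("summarize", lambda p: "stub summary", text="long ticket")
```

A production version would add deployment, monitoring, and automated tests on top, but even this skeleton makes per-prompt cost and behavior visible in one place.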
Case Study
Let’s see how building a Gen AI powered system plays out in practice with and without modularization.
No Modularization
You are building a tech support app and are determined to implement it with a single prompt. In the simplest version, you can imagine a monolith prompt that generates responses while loading relevant documentation through RAG.
Looks nice and easy, right? But as you add features, problems with this architecture emerge:
You want to respond to messages in a fixed list of languages, but not handle others. To achieve this, you add prompt instructions to respond only in certain languages and have the LLM return a “language” field for reporting purposes.
You want all conversations classified. Add a “label” field to the prompt output.
When the user is unhappy, escalate the case to human support. Add an “escalate_to_human” output variable along with instructions in the prompt.
Need a translation of all messages for internal audit? Return a “translated” field with the message in English.
Need protection to make sure the app never asks users about their location or who they voted for in the last election? Add prompt instructions and test them manually.
Need a summary for every conversation? Add a “summary” field to every output.
Perhaps you are beginning to see the problem - this prompt now has six outputs. Testing it will be a nightmare. You add support for another language, and suddenly your app begins to return the summary in Spanish instead of English. Why? Who knows - LLM outputs are unstable, so changing the prompt has unpredictable results.
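Sketched out, the output contract the single prompt must now reliably produce on every call looks something like this (field values are placeholders):

```python
# The monolith's output shape after the six features above.
# Every call must get ALL of these right at once, and any prompt
# change can silently perturb any of them.
MONOLITH_OUTPUT = {
    "response": "...",           # the actual reply to the user
    "language": "en",            # language reporting
    "label": "billing",          # conversation classification
    "escalate_to_human": False,  # unhappy-user escalation flag
    "translated": "...",         # English copy for internal audit
    "summary": "...",            # per-conversation summary
}
```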
Congratulations - you’ve created a monster! Over time it will grow and cause even more pain.
With Modularization
Both a prompt chain and an entirely separate classification prompt are used. The original large prompt is modularized as much as is practical.
One prompt detects the language, one provides translation, one determines whether the user is upset and escalates to humans, a response prompt generates the reply, and a guardrail prompt verifies the compliance of the response. Outputs of one prompt are chained as inputs to the next; traditional code can operate between these prompts to, for example, check language eligibility without involving LLMs.
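A rough sketch of that chain, with each prompt as a separate callable (all of them stand-ins for real model calls) and plain code making the routing decisions in between:

```python
# Hypothetical modular chain for the tech-support app.
# detect/translate/is_upset/respond/guardrail each wrap one focused prompt.

SUPPORTED = {"en", "es"}  # fixed list of handled languages

def handle_message(msg, detect, translate, is_upset, respond, guardrail):
    lang = detect(msg)                 # prompt 1: language detection
    if lang not in SUPPORTED:          # plain code between prompts
        return {"response": None, "reason": "unsupported_language"}
    english = translate(msg)           # prompt 2: English copy for audit
    if is_upset(msg):                  # prompt 3: escalation check
        return {"response": None, "reason": "escalated_to_human"}
    reply = respond(msg)               # prompt 4: generate the reply
    if not guardrail(reply):           # prompt 5: compliance check
        return {"response": None, "reason": "guardrail_blocked"}
    return {"response": reply, "translated": english, "reason": "ok"}
```

Each step can now be tested, monitored, and re-prompted on its own, and the language and escalation checks never even reach the expensive response model when they short-circuit.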
A change can still break a given prompt, but risks are greatly reduced because:
A change to one part doesn’t risk breaking every part of the application logic.
Testing is much easier and the odds of catching failure early are high.
Each prompt is relatively simple, so it's easier to understand and you are less likely to do damage with a change.
Changes are easier to review.
You get all the benefits of Gen AI, but the risks are greatly reduced. Plus, you can use cheaper models for some components to save money.
Conclusion
Modularization allows you to isolate errors, improve maintainability, and build a more reliable system. Even moderately sized applications will have dozens, if not hundreds, of component prompts. Break up prompts until each performs a single task, or until the benefits of further modularization are outweighed by the added operational complexity. Modularizing your prompts is a necessity if your AI-driven applications are to remain reliable and continue to add features over the long run. There are plenty of “monster” systems around already - take care not to create new ones!
If you’ve enjoyed this series - subscribe for more posts.