I've seen a worrying trend lately. Teams get so wrapped up in the magic of building with LLMs that they forget the unglamorous, essential craft of software engineering. Specifically, they forget how to test.
There's a quiet assumption that if the model is powerful enough, the rest of the system just... works. You throw a prompt at GPT-4, get some JSON back, and call it a day. Ship it.
This is a huge mistake. And it's going to bite a lot of products in the ass.
AI development doesn't make test cases less important. It makes them more important than ever before.
The Old World: Predictable and Tidy
In traditional software, testing is a solved problem. You write a function, it takes an input, and it produces a predictable output.
def add(a, b):
    return a + b

# Test case
assert add(2, 2) == 4
It's clean. It's binary. It either works or it doesn't. We built entire ecosystems around this certainty: unit tests, integration tests, end-to-end tests. All based on the idea that for a given input, you get a known output.
The New World: Messy and Unpredictable
AI, especially LLMs, throws that certainty out the window. Here’s why testing in the age of AI is a completely different beast:
1. Non-Determinism is the Default
Ask an LLM the same question twice, and you might get two different answers. Model updates, fine-tuning, or even just the inherent randomness (temperature settings) mean you can't test for exact matches anymore. Your tests need to validate the behavior and structure of the output, not the exact content.
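To make that concrete, here is a minimal sketch of "test the structure, not the content." The prompt, the expected JSON shape, and the get_ai_response helper are illustrative assumptions, not a fixed recipe:

import json

def test_summary_has_expected_structure():
    # Assume get_ai_response is the function that calls the LLM
    raw = get_ai_response("Summarize this ticket as JSON with 'title' and 'priority'.")

    # Don't assert on exact wording -- the model may phrase it differently on
    # every run. Assert on the structure and bounds of the output instead.
    data = json.loads(raw)                                # must be valid JSON
    assert set(data) >= {"title", "priority"}             # required keys present
    assert data["priority"] in {"low", "medium", "high"}  # value stays in bounds
    assert len(data["title"]) > 0                         # title is non-empty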
2. Failure Modes are Hidden and Weird
A traditional app might crash or return a 500 error. An AI system fails in much stranger ways. It can "hallucinate" facts, exhibit hidden biases, or go completely off the rails on an edge case you never thought to check. These aren't bugs in the code; they're failures in the model's reasoning.
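One cheap tripwire for this class of failure is to check that the model's claims stay anchored to its source material. Here's a minimal sketch, assuming a hypothetical summarize() wrapper around the LLM call; it won't catch every hallucination, but it catches the model confidently inventing numbers:

import re

def numbers_in(text: str) -> set:
    # Pull out numeric tokens so the summary's claims can be compared to the source
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def test_summary_does_not_invent_numbers():
    source = "The service handled 1200 requests with a p95 latency of 340 ms."
    # Assume summarize() prompts the LLM to summarize the source document
    summary = summarize(source)

    # Every number in the summary must also appear in the source --
    # a crude but useful guard against fabricated metrics.
    assert numbers_in(summary) <= numbers_in(source)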
3. The System is in Constant Flux
AI systems are never "done." You're constantly feeding them new data, retraining them, or tweaking prompts. A change that improves performance in one area can cause a regression in another. Without a robust, automated test suite, you're flying blind with every update.
4. The Stakes are Higher
When add(2, 2) returns the wrong answer, it's usually a minor bug. When an AI in a healthcare app gives the wrong medical advice, or a financial bot misinterprets a transaction, the consequences can be disastrous. The high impact of these errors means "good enough" isn't good enough. You need to build trust, and that trust comes from rigorous testing.
5. It's Not Just About "Correctness" Anymore
We've moved beyond testing for functional correctness. Now, we need to test for things that are much harder to quantify:
- Reliability: Does it consistently perform as expected? (A quick consistency check is sketched after this list.)
- Fairness: Is it free from harmful biases?
- Safety: Does it have safeguards to prevent misuse?
- Explainability: Can you understand why it gave a certain output?
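Of these, reliability is the easiest to start probing: run the same prompt several times and require every run to stay within bounds. A rough sketch, again assuming a get_ai_response helper and some domain-specific meets_expectations check you'd write yourself:

def test_response_is_consistent_across_runs():
    # Assume get_ai_response calls the LLM and meets_expectations is whatever
    # domain-specific check you've written (keyword rules, an LLM judge, etc.)
    prompt = "Summarize this support ticket in one sentence."
    runs = [get_ai_response(prompt) for _ in range(5)]

    # With non-zero temperature each answer will differ, so don't compare runs
    # to each other. Require that every run independently passes the check.
    assert all(meets_expectations(r) for r in runs)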
Shifting Your Mindset: From "Value" to "Bounds"
So what does this mean for us, the builders? It means we have to change the way we think about testing.
The question is no longer: "Does this function return the right value?"
The question is now: "Does this system behave within safe and useful bounds across a wide range of situations?"
Let's make this concrete. Imagine you're building a chatbot for software developers. You ask it: "Explain how a pipeline works."
The word "pipeline" is ambiguous.
- In software, it means a CI/CD or data processing system.
- In oil & gas, it means physical infrastructure for moving petroleum.
- In machine learning, it means preprocessing and training stages.
If your app is developer-focused, a good response is about software. A bad response is about oil. An exact string match test is useless here. You need an evaluation.
Your test should check:
- ✅ Does the response stick to the software engineering context?
- ❌ Does it drift into oil & gas or other irrelevant domains?
This requires a new kind of testing. Instead of assert result == "expected string", you write evaluations.
# A simple "eval" to check the AI's response
def is_software_focused(response_text: str) -> bool:
"""
Checks if the response is relevant to software engineering.
This is a basic example. Real-world evals can be much more complex,
even using another LLM to score the response.
"""
positive_keywords = ["ci/cd", "data processing", "build", "deploy", "automation"]
negative_keywords = ["oil", "gas", "petroleum", "drilling"]
text = response_text.lower()
if any(kw in text for kw in negative_keywords):
return False
if any(kw in text for kw in positive_keywords):
return True
# Default to False if no clear signal is found
return False
# Your test case
def test_pipeline_explanation():
# Assume get_ai_response is the function that calls the LLM
response = get_ai_response("Explain how a pipeline works.")
# The test now checks behavior, not a specific string
assert is_software_focused(response)
While the keyword-based approach is a good start, it's brittle. A sophisticated model could talk about "petroleum data processing pipelines" and fool our simple function. This is where modern evaluation frameworks come in.
Going Deeper with Evaluation Frameworks
Tools like DeepEval and LangChain provide more robust, out-of-the-box solutions for this exact problem, often using a more powerful LLM (like GPT-4) to "judge" the output.
Example with DeepEval
DeepEval's AnswerRelevancyMetric is perfect for this. It automatically uses an LLM to check if the output is relevant to the input.
# Example using DeepEval
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# The test case captures the input and the actual model output
test_case = LLMTestCase(
    input="Explain how a pipeline works for a software engineer.",
    actual_output="In the oil and gas industry, a pipeline is used to transport petroleum products over long distances."
)

# The metric will use an LLM to score relevancy
# We expect a low score here because the context is wrong
relevancy_metric = AnswerRelevancyMetric(
    threshold=0.5,  # We set a threshold for the test to pass
    model="gpt-4",
    include_reason=True
)

# Run the evaluation
evaluate([test_case], [relevancy_metric])

# The evaluation will fail because the actual_output is not relevant
# to the software engineering context provided in the input.
# The 'reason' would explain that the output discusses oil and gas, not software.
Example with LangChain
LangChain's CriteriaEvalChain lets you define custom criteria for an LLM judge. This gives you explicit control over the evaluation.
# Example using LangChain
from langchain.evaluation.criteria import CriteriaEvalChain
from langchain_openai import ChatOpenAI

# The LLM that will act as the "judge"
eval_llm = ChatOpenAI(model="gpt-4", temperature=0)

# We define our specific, custom criteria
criteria = {
    "context_relevance": "Does the response exclusively discuss software engineering, CI/CD, or data processing pipelines? The response MUST NOT mention topics like oil, gas, or petroleum."
}

# The evaluation chain
evaluator = CriteriaEvalChain.from_llm(llm=eval_llm, criteria=criteria)

# Our input and the model's (bad) output
input_prompt = "Explain how a pipeline works for a software engineer."
prediction = "A pipeline is a crucial piece of infrastructure for transporting crude oil from drilling sites to refineries."

# Run the evaluation
eval_result = evaluator.evaluate_strings(
    prediction=prediction,
    input=input_prompt
)

# eval_result['value'] would be 'N' because the prediction violates our criteria.
print(eval_result)
This is the shift:
- Property-based testing: The property is "software-focused." The is_software_focused function checks this property.
- Scenario-based testing: You'd build a library of ambiguous terms like "pipeline," "environment," or "container" to test the model's contextual understanding.
- Evaluation-driven development: You run these evals on every model change to ensure you don't introduce regressions, and you track the pass rate as a core metric of quality (a sketch of such a suite follows this list).
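Here is a rough sketch of how those three ideas fit together. The scenario list, the check_in_context judge, and the 0.9 pass-rate bar are illustrative assumptions, not a prescribed setup:

# A small library of ambiguous terms, plus the context each response must stay in
SCENARIOS = [
    ("Explain how a pipeline works.", "software delivery or data processing"),
    ("What is an environment?", "deployment environments, not ecology"),
    ("What does a container do?", "Docker/OCI containers, not shipping"),
]

def run_eval_suite() -> float:
    # Run every scenario and return the pass rate as a single quality metric
    passed = 0
    for prompt, expected_context in SCENARIOS:
        # Assume get_ai_response calls the LLM under test, and check_in_context
        # is your judge (keyword rules, DeepEval, a CriteriaEvalChain, etc.)
        response = get_ai_response(prompt)
        if check_in_context(response, expected_context):
            passed += 1
    return passed / len(SCENARIOS)

def test_no_regression_on_model_change():
    # Run this on every prompt tweak, fine-tune, or model upgrade.
    # The 0.9 bar is an arbitrary example -- what matters is tracking it over time.
    assert run_eval_suite() >= 0.9

Track that pass rate the way you'd track test coverage or an error budget: a drop after a prompt change or model upgrade is your regression signal.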
The Bottom Line
AI is not an excuse to abandon engineering discipline. It's a call to elevate it.
The teams that win in the long run won't be the ones that just wrap an API call in a pretty UI. They'll be the ones that build a robust, testable, and reliable system around the AI.
Stop chasing the magic and start building the scaffolding. Your users, and your future self, will thank you for it.