
Anyone who has tried to build an AI assistant for regulated areas like healthcare knows it is not an easy task. You have to balance making the assistant useful and flexible with ensuring it follows a strict “do no harm” policy. That balance is especially difficult when you’re trying to bake such different and often conflicting behaviors into a single model.
In healthcare, finance, and other specialized domains, you can’t just rely on RAG — despite proponents touting it as a silver bullet — no matter how many bells and whistles (knowledge graphs, re-rankers, etc.) you add on top. The problem is that the context window remains limited, and RAG is, at its core, a workaround for a model’s inability to internalize large amounts of specialized data.
Humans in these fields train for years to develop the necessary expertise, with most of their knowledge deeply ingrained in memory — something most LLMs lack. People achieve this through a combination of theory and practice over extended periods, guided by feedback from mentors and peers. That feedback spans multiple dimensions: safety, usefulness, completeness, politeness, clarity, and more.
For a model to understand all of these dimensions, you need a high-quality, large-scale training dataset that reflects them. Getting such a dataset labeled by medical professionals would be extremely difficult and expensive, not to mention the privacy and regulatory challenges around personal data.
So, while it’s technically possible, it’s undeniably challenging.
In this article, I want to outline how to approach this problem and build something both useful and reliable while working around these limitations.
The obvious idea is to use synthetic data for training — that is, data generated by a model. While an LLM can easily generate data, the real question is how to score it properly against metrics such as safety, usefulness, and completeness. Once again, we could use real doctors to evaluate it, but that would be prohibitively expensive given that hundreds of thousands of data points are needed at minimum.
What makes a good medical answer anyway?
Before jumping into synthetic data generation, we need to define what “good” even means. In healthcare AI, we typically evaluate across six key areas:
[Figure: the six key evaluation areas]

Sounds comprehensive, right? But here’s where it gets tricky. Let’s look at a simple example — a user asks “Is my blood pressure normal?”
Response 1:
Hey Alex. Blood pressure is the force of your blood pushing against your artery walls. It’s measured in millimeters of mercury (mmHg) and recorded as two numbers — systolic and diastolic. Normal blood pressure is typically around 120/80 mmHg.
The response starts with unnecessary explanation instead of answering the question. Plus it’s not personalized at all.
Response 2:
Hey Alex. Well, “normal” can vary depending on age, lifestyle, and health conditions. Based on your last 10 readings, your average is 125/82 mmHg — that’s slightly above ideal but still within a healthy range.
Better personalization, but still doesn’t give a direct answer upfront. Users want answers, not philosophy.
Response 3:
Hey Alex. According to medical guidelines, normal blood pressure is below 120/80 mmHg. Based on your recent readings, your average is 125/82 mmHg — slightly elevated but not yet high blood pressure. It may help to monitor it regularly and reduce salt intake.
Direct, personalized, actionable. This is what we’re aiming for.
Building your evaluation framework
Once you know what good looks like, you need a systematic way to evaluate it. We use a priority-based approach:

Level 1: The non-negotiables
- Topic is in scope — is this even a medical question we should answer?
- Answer rate — did we provide an answer or reject inappropriately?
- Completeness — does the answer address all parts of the question?
- Relevance — is the information actually related to what was asked?
Level 2: The difference makers
- Coherence — does the response make logical sense?
- Personalization — are we using the user’s actual data?
- Instruction fidelity — did we address all actionable parts of the query?
Level 3: The nice-to-haves
- Tone of Voice — professional but approachable
- Fluency — natural language flow
- User-centric language — medical terms explained simply
- Readability — appropriate complexity level
Each metric gives a binary decision for each query-response pair. Simple but effective:
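The three levels above can be encoded as a small priority gate: a Level 1 failure rejects the response outright, while lower-level failures are just tracked as quality debt. A minimal Python sketch — the metric names here are my own shorthand, not the article's exact labels:

```python
# Hypothetical metric names mirroring the three priority levels above.
LEVELS = {
    1: ["in_scope", "answered", "complete", "relevant"],      # non-negotiables
    2: ["coherent", "personalized", "instruction_fidelity"],  # difference makers
    3: ["tone", "fluency", "user_centric", "readable"],       # nice-to-haves
}

def evaluate(scores: dict[str, bool]) -> tuple[bool, list[str]]:
    """Apply the priority order: any Level 1 failure rejects the
    response; failures at other levels are recorded but non-fatal."""
    failures = [m for ms in LEVELS.values() for m in ms if not scores.get(m, False)]
    passed = all(scores.get(m, False) for m in LEVELS[1])
    return passed, failures
```

The design choice worth noting: lower levels never veto a response, so a slightly stiff tone cannot mask a safety or relevance problem in the aggregate numbers.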
[Figure: binary pass/fail decisions per metric for each query-response pair]

The goal is straightforward: minimize bad answers and bad rejections.
The judge system — LLM as evaluator
Here’s where things get interesting. We can’t have doctors evaluate millions of responses, but we can use LLMs as judges. Yes, AI grading AI — I know how it sounds.

Step 1: Collecting the initial human-labeled dataset
The first step was collecting an initial dataset of several thousand datapoints labeled by actual human professionals (doctors). This is expensive but unavoidable: without this human-labeled dataset, we couldn’t train a first version of the judge that we could trust.
These medical professionals evaluated query-response pairs across our defined metrics — safety, usefulness, completeness, relevance, and others. This gave us the ground truth we needed to train the first version of our LLM judge.
Step 2: Training and scaling with LLM judges
Once we had the first version of the judge trained on the human-labeled dataset, we could move to the scalable system:

The evaluation framework consists of:
- Safety evaluation (LLM judge + human oversight for critical cases)
- Usefulness evaluation (LLM judge + periodic human calibration)
Both judges output binary scores that determine if a response passes our quality bar.
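A binary judge can be sketched as a prompt template plus strict verdict parsing. Everything below is illustrative: `call_llm` stands in for whatever model API you use, and the prompt wording is a hypothetical example, not the production prompt.

```python
# Sketch of a binary LLM judge: build a prompt for one metric,
# then map the model's free-text output to a pass/fail score.
JUDGE_PROMPT = """You are a medical-response judge for the metric: {metric}.
Query: {query}
Response: {response}
Answer with exactly PASS or FAIL."""

def parse_verdict(raw: str) -> bool:
    """Default to FAIL on anything ambiguous or empty —
    in a safety-critical domain, the judge should fail closed."""
    token = raw.strip().upper().split()[0] if raw.strip() else ""
    return token == "PASS"

def judge(metric: str, query: str, response: str, call_llm) -> bool:
    prompt = JUDGE_PROMPT.format(metric=metric, query=query, response=response)
    return parse_verdict(call_llm(prompt))
```

Failing closed on unparseable output is deliberate: an over-strict judge sends borderline cases to human review, while a lenient one silently passes them.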
Judge calibration challenges
Getting your judges to agree on something usable is tricky. We use several techniques:
- Error analysis via qualitative coding methods
- Inter-rater reliability metrics (Cohen’s Kappa, Krippendorff’s Alpha)
- Prompt engineering — both manual and automated
- Regular calibration sessions with human experts
You can’t have a thousand specialized judges — it’s too slow and expensive. One way to solve this is to cluster similar problems and create judges for problem categories. Some areas are particularly challenging because they lack objective metrics (like “empathy” or “reassurance”).
The DBRM approach
A Dynamic Behavior Reward Model (DBRM) is an AI model that evaluates and scores the behavior or outputs of another model — usually a large language model — to guide its learning or self-improvement. It acts like a judge that provides reward signals to help the main model align with desired goals such as helpfulness, truthfulness, creativity, safety, or task performance.
The process has three main phases:
Generation Phase: the main LLM generates several possible outputs (responses, actions, or plans) for the same prompt.
Evaluation Phase: the DBRM examines these outputs and gives each a reward score — for example, from 0 to 1 or -1 to +1 — based on how well they align with specific behavioral criteria (factual accuracy, politeness, efficiency).
Learning Phase: the LLM updates its policy (via Reinforcement Learning, Direct Preference Optimization, or self-improvement loops) using the reward signals from the DBRM.
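The three phases can be condensed into a single best-of-n step that also emits a preference pair for the learning phase (e.g. for DPO). This is a toy illustration: `generate` stands in for the main LLM and `reward` for the DBRM's scoring function.

```python
def dbrm_step(prompt, generate, reward, n: int = 4):
    """One generation/evaluation cycle: sample n candidates,
    score each with the reward model, and keep the best and worst
    as a (chosen, rejected) pair for preference-based training."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = sorted(candidates, key=reward, reverse=True)
    chosen, rejected = scored[0], scored[-1]
    return chosen, (chosen, rejected)  # serving answer + training pair
```

In a real pipeline the pairs accumulate into a preference dataset and the policy update happens offline in batches, but the data flow is exactly this shape.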

Why “dynamic” matters
Unlike static reward models (which are trained once on fixed data), a Dynamic BRM can:
- Continuously adapt based on new data or user feedback
- Update its own criteria and weighting of rewards over time
- Learn to generalize across new domains or tasks
This makes it suitable for autonomous systems, self-aligning AI, and multi-agent environments where the standards of “good behavior” can evolve. In healthcare, this is particularly important — medical guidelines update regularly, new research emerges, and what’s considered best practice can shift.
DBRMs are used in several contexts:
RLAIF (Reinforcement Learning from AI Feedback) — replacing human labelers with AI judges. When you can’t afford continuous human feedback, you use AI to generate the feedback instead.
AI self-play and self-improvement loops — where the model learns from its own outputs, iterating toward better performance.
Multi-agent governance systems — one agent judges another’s reasoning or accuracy.
Ethical alignment frameworks — adaptive control of model behavior to maintain safety and usefulness standards.
The healthcare-specific challenge
In healthcare applications, we also need fine-tuning for coherent term usage across all sessions. Medical terms and conditions, symptom names, drug names — these all need to be consistent. A patient’s “high blood pressure” in one response shouldn’t become “hypertension” in another unless there’s a good reason for the terminology change.
The DBRM helps maintain this consistency while adapting to new medical knowledge and guidelines. But remember — the speed vs accuracy vs cost triangle still applies. You can optimize for two, but rarely all three.
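A lightweight way to catch terminology drift is to canonicalize known synonym pairs and flag concepts that appear under multiple surface forms across a session. The synonym map below is an illustrative stub, not a real medical ontology.

```python
# Map surface forms to a canonical, patient-facing term.
# Illustrative stub — a real system would use a medical ontology.
SYNONYMS = {
    "hypertension": "high blood pressure",
    "high blood pressure": "high blood pressure",
    "myocardial infarction": "heart attack",
    "heart attack": "heart attack",
}

def terminology_drift(responses: list[str]) -> dict[str, set[str]]:
    """Return canonical concepts that appear under more than one
    surface form across the given session's responses."""
    seen: dict[str, set[str]] = {}
    for text in responses:
        low = text.lower()
        for surface, canon in SYNONYMS.items():
            if surface in low:
                seen.setdefault(canon, set()).add(surface)
    return {canon: forms for canon, forms in seen.items() if len(forms) > 1}
```

Flagged sessions can then be routed to the judge (or a rewrite pass) rather than blocked outright, since sometimes the terminology switch is intentional.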
So all this sounds reasonable, right? Let me tell you what actually happens when you try to implement this.
Discovery phase
You gather a room of medical professionals to review your AI’s responses and establish what “good” looks like. Simple, right? Wrong.
Doctor A: “This response is too direct. Patients need context before recommendations.”
Doctor B: “Are you kidding? Patients hate when we bury the answer. Lead with the conclusion.”
Doctor C: “No, no. It all depends on the severity level…”
Turns out, even domain experts disagree on fundamental questions. What’s “complete” to one doctor is “overwhelming” to another. What’s “reassuring” versus “dismissive” is surprisingly subjective.
So our job during this phase is to find patterns in the disagreement. Not to resolve it — that’s nearly impossible — but to identify the dimensions of variation that actually matter. Age of the patient? Severity of the condition? Type of question? These become your rubric parameters.
Rubric creation
Once you’ve identified the key dimensions, you need to convert medical professional judgment into something an LLM can understand. This is harder than it sounds.
Take “completeness” — what does that mean for the question “Should I take aspirin daily?” A complete answer needs to cover:
- Current medical guidelines (these vary by age and risk factors)
- Patient’s specific risk profile (requires data access)
- Contraindications (requires medication history)
- When to consult a doctor
Now convert that into a rubric that an LLM judge can evaluate consistently. You end up with detailed criteria like:
A complete response must include:
- general guideline
- personalization based on available data
- at least one relevant contraindication or consideration
- clear escalation path if needed
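To show how such a rubric might be mechanized, here is a sketch where each criterion is a separate check. The keyword lambdas are trivial stand-ins so the structure is runnable; in practice each criterion would be its own judge call.

```python
# Illustration only: each criterion would normally be an LLM judge call;
# the keyword heuristics are placeholders for the criterion logic.
COMPLETENESS_RUBRIC = {
    "general_guideline": lambda r: "guideline" in r.lower(),
    "personalization":   lambda r: "your" in r.lower(),
    "contraindication":  lambda r: any(w in r.lower() for w in ("avoid", "risk", "interact")),
    "escalation_path":   lambda r: "doctor" in r.lower(),
}

def check_completeness(response: str) -> dict[str, bool]:
    """Evaluate every rubric criterion independently, so a failure
    report names exactly which criterion was missed."""
    return {name: check(response) for name, check in COMPLETENESS_RUBRIC.items()}
```

Keeping the criteria separate is what makes iteration possible: when the rubric turns out too strict or too loose, you can see which individual check is causing the rejections.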
Then you test it on 50 examples and realize your rubric is either too strict (rejecting good answers) or too loose (passing mediocre ones), so expect several iterations.
Error analysis and measuring disagreement
Once your rubric is implemented, you need to know whether it’s working. The usual tools are Cohen’s Kappa and Krippendorff’s Alpha, statistical measures of inter-rater reliability.
In plain words: these metrics tell you how much your judges agree beyond random chance. A Kappa of 0.0 means they might as well be flipping coins. A Kappa of 1.0 means perfect agreement (which never happens).
In practice, you’re aiming for 0.6–0.8 for most metrics. Anything below 0.4 means your rubric is too vague or your judges need recalibration.
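For two raters with binary labels, Cohen’s Kappa is simple enough to compute by hand. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's Kappa for two raters: observed agreement,
    corrected for the agreement expected by chance alone."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_chance = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_obs - p_chance) / (1 - p_chance)
```

The chance-correction term is the whole point: if both judges pass 90% of responses, raw agreement of 0.85 is worse than a coin flip over the contested cases, which is exactly what a low Kappa surfaces.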
Here’s the tricky part — your AI judge and human experts will disagree. A lot. You’ll find cases where:
- The AI is technically correct but misses clinical nuance
- Humans disagree with each other more than with the AI
- Edge cases where nobody is quite sure what’s right
- The human doctor is incorrect, which happens more often than you’d think
You document these discrepancies, categorize them, and use them to improve your prompts and rubrics.
The annotation phase
Now you need to annotate thousands of query-response pairs, and even with your calibrated rubric, this is a brutal task.
Your annotators (doctors, nurses, medical professionals) are going through responses one by one, evaluating each against your metrics.
You will likely see several common patterns:
- Inter-rater agreement drops after hour 2 of continuous annotation, so it’s better to split the work across several days and rotate annotators, or to hire a specialized annotation company
- Controversial topics (vaccines, weight loss, mental health and wellbeing) have lower agreement rates
- The edge cases that matter most are the hardest to annotate consistently
Have fun!
