Notes From Trying to Cache a Healthcare Chatbot

Background

The product is a maternal-health assistant built by ARTPARK in collaboration with ARMMAN : a multilingual, human-in-the-loop chatbot — that answers ' questions about high-risk pregnancies (HRPs) and antenatal care, and acts as a channel for continuous learning and on-the-job support. The workers ask in whatever comes naturally — Hindi, English, code-mixed Hinglish, and other Indian languages — by text and by voice.

As of May 2026, it's deployed across three Indian states (Uttar Pradesh, Telangana, and Maharashtra). A few numbers to set the scale:

Avg. queries / month

~7,700

Unique users

3,400+

Repeat users

1,500+

Text : speech queries

10 : 1

Escalation to human

10.3%

End to End latency

8.5 ± 1.5 s

Production traffic for the HRPTM assistant. The large repeat-user base and paraphrased queries are exactly what makes answer-reuse worth chasing.

Two of those numbers are why a cache looked worth building. 1,500+ repeat users out of 3,400 means a large share of traffic is people coming back with the same kinds of questions — exactly the repetition a cache feeds on. And the 10:1 text-to-speech ratio means most turns are typed, so the latency math (next section) is dominated by text.

The TL;DR

The bet: a cache hit is huge (near-instant, skips multiple ), and if it's a cache miss, then it's just additional few hundred milliseconds to a pipeline that takes 5-8 seconds end-to-end, which is almost impercetible when it comes to the user.
The wall: in healthcare, a wrong hit isn't a small cost, it's a confidently incorrect clinical answer handed to a worker in the field. Incorrect or unsafe answers erode - something we highly value.
The evidence: across six embedding models that we tested, a clinically different number scored as similar as a true paraphrase 60–100% of the time. Threshold tuning, numeric rewriting, learned rerankers, and query canonicalization each hit a decisive negative result.
What survived: a reusable offline evaluation harness, a small trivially-safe slice that can be cached, and a sharp characterization of where the real research problem is.

1. The bet

Start with why anyone wants a semantic cache at all.

The chatbot's pipeline is correct but slow — and a single answer isn't one model call, it's a chain of them.

The chatbot pipeline as a chain: a query flows through preprocessing and generation (two LLM calls), with speech-to-text added at the start and text-to-speech at the end on voice turns. A cache hit short-circuits the whole chain. — A single answer isn't one model call but a chain — two LLM calls every turn (preprocess + generate), plus speech-to-text in and text-to-speech out on voice turns. A cache hit skips all of it.

On a cache miss — the normal path — that whole chain runs, and an answer takes roughly eight seconds end to end, with the large majority spent in the LLM generating the response. (Text turns are lighter; a voice turn adds the speech-to-text and text-to-speech models on either end.)

Cache miss — full pipeline≈ 7.9s

6.5s

Cache hit — if it is safe to reuse≈ 0.2–0.3s

near-instant — skips generation entirely

PreprocessingGeneration (LLM)Format + translate

Where the budget goes on a miss: generation dominates. A safe cache hit would skip generation entirely — and on a voice turn, skip text-to-speech too. That asymmetry is the whole reason to want a cache.

Now look at the economics of a cache hit. If we could safely recognize that a new question has already been answered, we'd return the stored answer in a couple hundred milliseconds — skipping both LLM calls (preprocessing and generation), and on a voice turn the speech-to-text and text-to-speech models too, since the cached entry already carries the audio. That's as many as four heavy model calls avoided in a single lookup. A hit is a near-total win.

And a miss costs almost nothing. The cache lookup adds maybe 0.2 seconds before falling through to the normal pipeline. So the payoff looks lopsided in exactly the direction you want:

A hit saves seconds and up to four model calls. A miss costs a fraction of a second. Take that bet every time.

That asymmetry is why semantic caching shows up in nearly every production LLM system, and why we reached for it. It looked like a solved problem. The rest of this piece is about why, in this domain, it isn't.

2. What a semantic cache actually is

Skip this section if it's familiar.

A semantic cache loosens that exact-match requirement to match on meaning instead. An embedding is a list of numbers that can be roughly used as a proxy for the meaning of a piece of text; two texts that mean similar things tend to land near each other in the embedding space. Cosine similarity is a score from 0 to 1 that measures how close two embeddings are. So a semantic cache embeds the incoming question, finds the nearest it has already answered, and if the score clears a threshold, it reuses that answer instead of calling the model.

incoming query

embed→ vector

vector searchk = 1

top-1 neighbor

similarity ≥ threshold

Return the cached answer.

similarity < threshold

Run the full pipeline. Store (query, answer) in the cache.

The appeal is that it catches paraphrases that an exact-string cache never would. All three of these are the same clinical question, and a good embedding puts them close together:

Illustrative 2D projection only: embeddings live in a high-dimensional space, but the rough picture is useful. Queries with similar meaning tend to land near each other even when the wording or language changes.

That clustering is real, and it's why embedding retrieval feels natural at first. The problem is that clustering by broad meaning is not the same as preserving the details that decide whether an answer is safe to reuse. That gap is the whole story.

3. Why this domain flips the bet

Here is the reframing that took us a while to fully absorb. For ordinary semantic search, the question you're answering is:

Are these two queries about the same thing?

For a medical cache, the bar is much higher:

Is the cached answer still correct under this query's specific facts — its numbers, its polarity, its timing, its intent?

Those are different questions, and the distance between them is where the bet inverts. In a low-stakes product, a wrong cache hit is a mildly irrelevant answer; you shrug and move on. Here, a wrong hit means an ANM with a patient in front of her receives a confidently incorrect recommendation that looks vetted. The downside isn't bounded by "the user is mildly annoyed" — it's bounded by "someone is harmed."

So the real payoff matrix isn't the happy one from Section 1:

Outcome	Generic chatbot	Clinical chatbot
Cache hit, correct	Big win	Big win
Cache miss	Small cost (latency)	Small cost (latency)
Cache hit, wrong	Minor annoyance	Unbounded

The instant one cell goes unbounded, the expected value of the whole bet is dominated by how often you land in it — not by your hit rate. "Make the cache fast and cheap" quietly became "make the cache never serve a wrong clinical answer," and that is a far harder target.

4. The trick that made all of this testable: a replay harness

Before any of the results below, one question had to be answered: how do you test a clinical cache at all? You can't just turn it on. The normal playbook — ship behind a flag, give the new path 5% of traffic, watch the metrics, ramp up — is exactly what the safety bar forbids. The whole point is to never serve a single wrong answer, and a 5% live experiment is, by definition, serving real answers to real workers.

So we never put it in the request path. Instead we built an offline cache replay (backtest) harness that asks a counterfactual: if this cache had been live last quarter, what would it have done?

SeedSep–Oct 2025

For each past query, store:

the query
the answer served then
embedding(query + answer)

ReplayNov–Dec 2025 · held out

Each query, run through the real lookup as if it were live:

run the real cache lookup

Hit

compare the cache answer against the one actually served then — no LLM call

Miss

would have gone to the LLM (no answer is generated in the backtest)

The replay harness, end to end: seed the cache from real history, then replay a later, held-out window through the real lookup logic — scoring what would have happened without ever serving a live query.

We seed the cache with real history — each entry carrying the query, the response that was actually generated and served at the time, and the embedding of the query-plus-response concatenation. Then we replay a later window of real queries through the real lookup logic, as if they were arriving live.

That single design choice paid off three times over:

Zero production risk. We measured hit rate, safety, latency, and cost across the whole pipeline without exposing one real worker to one cached answer. No flag, no 5%-and-ramp, no exposure window where something unsafe could leak — which, given the stakes, was the only acceptable way to run the experiment at all.
Free ground truth — no regeneration. Because the replay queries are past queries, we already had the answer that was generated and served for each one. So for every cache hit we could compare the cache's suggested answer against a known baseline directly, instead of paying an LLM to produce a fresh "correct" answer to grade against. Across a bake-off of six embedding models — several re-run over multiple iterations — that avoided an enormous amount of LLM-generation and embedding-API spend.
Fast iteration. LLM generation is also the slow part. Reusing the pre-generated seed answers collapsed each experiment from "regenerate everything, then measure" down to "embed, look up, compare" — so a full model-and-threshold sweep finished in a fraction of the wall-clock time, and we could try far more ideas per day.

Everything in the sections that follow — the threshold curves, the fooling rates, the 593-hit audit — is an output of this harness. It's also, as it turns out, the part of the project most worth keeping.

5. Where "similar" betrays you

A cartoon gatekeeper checks a queue. Lookalike sheep (surface similarity) pass through; an obviously different black dog and orange cat are routed away; but a few cats and dogs drawn to resemble sheep slip past into the output crowd.

The first thing we tested was the obvious thing: embed the query, take the nearest neighbour above a threshold, score the match. To separate "this is a hit" from "this is a safe hit", we used an LLM judge to rate each candidate match from 1 (clearly wrong, do not reuse) to 5 (strong, reusable), calibrated against a batch I'd first reviewed by hand.

Two problems showed up immediately, and they're the heart of everything.

Problem one: the threshold trap. Raising the cosine threshold does reduce bad matches — but it collapses the hit rate faster than it buys you safety. On one replay, a serve-threshold of 0.80 gave about a 24% hit rate but with unsafe matches mixed in; 0.85 dropped to ~3%; by 0.90 there were essentially zero qualifying semantic hits left. There was no setting that was simultaneously safe and useful.

pplx-embed-v1-4B threshold tradeoff. Raising the threshold reduces bad reuse, but the useful cache-hit rate drops quickly.

Problem two — the deeper one: embeddings are blind to magnitude. Dense embeddings are built to capture topic and meaning. They are not built to treat a number as a hard decision boundary. So a question about hemoglobin of 7 and a question about hemoglobin of 12 sit almost on top of each other in vector space — even though one is anemia needing intervention and the other is normal.

Modelpplx-embed-v1-4bScore~0.91VerdictUnsafe reuse

Incoming Query

6 mahine pregnant mahila ka Hb 7 hai kya karna chaiye?

Matched Query

6 mah garvati ka Hb 10 ho to kya salah de?

Hb 7 is severe anemia and an urgent referral; Hb 10 is moderate. The two queries score quite high against each other (0.91) than a genuine reword of either one does (0.87) — so the cache would confidently serve the milder protocol for an urgent case.

To investigate whether this wasn just a one-off, we built a deliberately adversarial dataset: anchor questions, each paired with a true paraphrase plus a set of clinically meaningful edits (change the number, flip the polarity, add a negation, shift the timing). The test: does the embedder rank the true paraphrase above the dangerous edits? For numeric edits specifically, the answer was usually no — across every model we tried.

“Numeric fool rate” = the share of anchor questions where a clinically different number scores at least as high as a true paraphrase. Every one of the six embedding models is above 50%; four are at or above 80%. This is a property of the representation, not a model-selection problem.

It's worse than "sometimes confused." For four of the six models, the average cosine similarity of a numerically-altered query is actually higher than that of a genuine restatement of the same question:

When the red bar matches or beats the teal bar, the embedder rates a clinically opposite question as similar as, or more similar than, a genuine restatement. That happens in four of the six models.

When the red bar beats the teal bar, the embedder literally prefers the clinically wrong answer. And this is specific to magnitude — the same models handle most other distinctions fine. It's not that the embeddings are bad; it's that the one thing they're worst at preserving (numbers) is exactly the thing that's most often clinically decisive.

The lesson that reframed the project: a "hit" only means a close vector match. Whether that match is safe to reuse is a different question — and for the cases that mattered, the answer was usually no.

6. The ten ways a question changes meaning

Numbers were the loudest failure, but not the only one. To make "is the embedding fooled?" measurable, we built the adversarial dataset around a number of categories of perturbation — distinct ways a query can stay lexically close while becoming a different clinical question.

Category	What it changes	Should it match?
`almost_same`	Pure rephrasing — same meaning (the control)	Yes
`numeric`	Only the value changes (Hb, BP, weeks)	No
`polarity`	Direction flips (low ↔ high, rising ↔ falling)	No
`negation`	Adds/removes "not" (give ↔ don't give)	No
`temporal`	Gestational age / timing (6 months ↔ 2 months)	No
`definition vs procedural`	"what is X" vs "what to do for X"	No
`risk vs symptom`	Risk classification vs symptom description	No
`entity confusion`	Swaps to a related but distinct entity (Hb ↔ BP)	No
`severity`	mild ↔ moderate ↔ severe	No

Only the first should ever produce a cache hit. The other nine are traps, and each one is a real way a worker's question differs from a cached one. A cache that can't tell severe from mild, or give iron from don't give iron, will eventually answer one with the other. The whole point of the taxonomy is to turn that risk into something you can measure instead of fear.

7. Trying to make numbers "louder"

If the problem is that numbers get buried, maybe we can make them more salient. One idea: spell numbers out as words — "five" instead of "5" — to force the tokenizer (and the embedder) to attend to them.

It didn't work. Using the anemia anchor, the true paraphrase sits at cosine 0.872. Every clinically distinct change to the hemoglobin value — including the dangerous drop from 8 to 5 — scores higher than that paraphrase, in both digit and word form. Spelling the number out nudged the score down by a hair, never enough to cross below the paraphrase. The magnitude information simply isn't represented in a usable way, and surface tricks don't put it there.

8. A trained safety gate: great in the lab, brittle in the wild

If a frozen embedding can't separate danger from paraphrase, maybe a small model trained on top can. This direction came from a suggestion by my manager, Jigar — to try reranking and classifiers on top of the embeddings rather than trusting cosine similarity alone. We built rerankers over the frozen embeddings with explicit numeric features (exact-match, count and magnitude differences), in two flavours: a gradient-boosted pairwise ranker (XGBRanker) and a small triplet-loss projection head.

In cross-validation, this looked like the answer. The numeric features drove the numeric fooling rate from 0.60 down to 0.02 and lifted top-1 accuracy from 0.34 to 0.88. On those numbers, you'd ship it.

Then we tested it the way the world would: on 25 postnatal topic groups it had never seen in training. The story inverted.

The XGBRanker wins cross-validation (0.88) then collapses to 0.44 on unseen postnatal topics — below raw cosine’s 0.68. It memorized the perturbation set, not a general notion of clinical safety. Only the triplet head held its ground.

The XGBRanker — the cross-validation champion — fell to 0.44 on unseen topics, worse than raw cosine's 0.68. It had memorized the perturbation dataset, not a general notion of clinical safety. Only the triplet head generalized, and even it failed on a meaningful fraction of new groups.

This is the most instructive negative result in the project. A model posting 0.88 in-distribution can be worse than doing nothing on topics outside its training set. For a system that has to handle whatever a worker types next, in-distribution metrics aren't just insufficient — they're actively misleading. Safety here is an out-of-distribution property, and almost nobody evaluates for it.

9. Abandoning similarity altogether

Every approach so far lived or died on a similarity score. So we tried throwing that out. Instead of asking "are these vectors close", decompose each query into a structured representation — who the patient is, the clinical topic, the measurements and their values, the intent (definition, management, referral), the polarity — normalize it deterministically, and match those structures exactly. Two paraphrases that decompose to the same structure collide on the same key, regardless of wording or language.

At request time it runs as a second track alongside the existing pipeline: a small LLM turns the query into structured JSON, which is normalized, hashed, and looked up. Because it runs in parallel with the preprocessor, its ~2 s never lands on the critical path — a hit returns the cached answer, a miss just continues down the normal pipeline.

Query canonicalization flow: a query fans out to two parallel tracks. The canonicalization track (LLM → structured JSON representation → SHA-256 hash → exact hash lookup → cached response, ~2 ± 0.5 s) runs alongside the existing preprocessor, which falls through to the default pipeline on a miss. — The canonicalization track runs in parallel with the existing preprocessor, so its ~2 s never lands on the critical path — a hash hit returns the cached response, a miss just continues down the normal pipeline.

Concretely: these are five real ways an ANM might ask the same question — across English, Hinglish, and Hindi — that should all land on one cache entry.

Pregnant woman HB 8 g/dl, what to do?

Garbhvati ka hemoglobin 8 hai, kya karna chahiye?

Pregnancy me khoon 8 ho to kya karein?

6 month ki pregnancy mein mahila ka Hb level 8 g/dl hai to kya advise dein

छह महीने की गर्भवती का हीमोग्लोबिन 8 है तो क्या करना चाहिए

one canonical key

intent: clinical · recipient: pregnant_woman · topic: anemia · hemoglobin = 8 g/dl · request: action / counseling

Same clinical question, five surface forms across English, Hinglish, and Hindi. A similarity threshold has to guess they match; a structured key makes them provably identical — or provably different.

The trick is the schema. Each query is decomposed into the same shape, then normalized (synonyms folded, units canonicalized, keys sorted) and hashed. Here is what the queries above actually become:

{
  "schema_version": "query_canonical_v3",
  "intent": "clinical",
  "language": "hi-en",
  "care_recipient": "pregnant_woman",
  "subjects": ["pregnant_woman"],
  "request": {
    "kind": "action", "action_type": "counseling", "information_type": "none",
    "classification_type": "none", "actor": "anm",
    "topic": "anemia", "targets": ["hemoglobin"]
  },
  "context": {
    "gestation": { "kind": "exact", "min_weeks": 24, "max_weeks": 24 },
    "gravida": null, "parity": null, "maternal_age_years": null,
    "inter_pregnancy_interval_years": null, "postpartum_days": null
  },
  "triage": { "stated_acuity": "none", "escalation_asked": false },
  "findings": [
    {
      "name": "hemoglobin", "applies_to": "pregnant_woman",
      "category": "lab", "assertion": "present",
      "value": { "kind": "quantity", "op": "eq", "number": 8, "unit": "g/dl" },
      "temporal": { "frequency": "none", "duration_days": null, "event_time": null }
    }
  ]
}

The full schema they target — intent, care recipient, a structured request, gestational context, triage acuity, and a list of clinical findings carrying values, units, and operators:

And the structure is produced by a single versioned LLM prompt. Note how much clinical domain logic is hand-encoded — which becomes the whole problem later, when we ask why "just fix the prompt" doesn't generalize:

Illustrative tradeoff: embedding caches tend to maximize reuse, while decomposition gives up some reuse to preserve meaning-bearing details like numbers, negation, and clinical intent.

The appeal is real: instead of a fragile floating-point threshold, a cache hit becomes an inspectable, explainable equality between two structured objects. A clinician or PM can look at a key and understand why two questions matched. And run back through the replay harness (§4), it produced the most encouraging number in the whole project: a 49% hit rate (14% exact + 35% structural). That's more than double the bar we'd set for "worth the complexity."

A hit rate is a promise, though, not a guarantee. So we read the matches — all 593 of them — one at a time, asking a single question per row: could this matched answer be safely reused for the incoming query?

Every one of the 593 canonical-hit candidates was reviewed by hand for safe reuse. Only 32.5% were clean paraphrases; 44.8% were clinical violations — matches that would serve a wrong or unusable answer.

Only a third were clean paraphrases. Nearly half were clinical violations — matches that would serve a wrong or unusable answer. The 49% was real; it was just mostly made of matches that shouldn't have been made.

And the violations weren't random noise. They clustered into a small set of systematic failures:

The 266 violations cluster into a small set of systematic schema failures (one violation can carry several tags). Topic mismatch and meta/garbage pollution dominate — not numeric edits, because by this stage similarity is gone and the failure has moved into structure extraction.

A bare "Hi" hashing to a complex post-abortion seed answer. "When is breathlessness a danger sign?" (when to worry) colliding with "Why does breathlessness happen?" (a different answer entirely). "Pregnant woman, age 36 — is she high-risk?" canonicalizing to a generic high-risk question with the age 36 dropped. Notice what changed: the dominant failures are no longer numeric — they're topic and intent and structure-extraction errors. We traded "is the embedding similar enough" for "did the LLM extract the right structure," and that extraction failed in exactly the clinically decisive ways.

The tempting fix is to keep tightening the schema and the extraction prompt — split this field, preserve that number, enumerate this special case. But every such fix is one more hand-coded rule, and patching the failure list case by case is just overfitting the backtest — the same trap that sank the reranker, now in prompt form. A canonicalizer that only behaves on cases it has been explicitly shown is not a deployable clinical component.

10. The design we drew up anyway

Running alongside all of this was a design question worth recording, because the thinking holds up even though it never went to production: if you had a retrieval layer you trusted, how should the cache itself be structured for a high-stakes domain?

The answer we converged on was a two-tier cache built around a single distinction:

Status	Meaning	Served to a user?
Verified	A medical reviewer approved this answer for this kind of question	Yes
Provisional	Auto-created when the LLM answered something new; unreviewed	No — used only for grouping and learning

…with two thresholds instead of one: a high bar to serve (only verified entries, only at high similarity) and a lower bar to group unreviewed entries so reviewers can batch them. The system stays conservative about serving and liberal about learning. From there an incoming query resolves into one of four named cases — verified-only, provisional-only, both-with-verified-winning, and the interesting one: a provisional entry out-scoring the served verified answer, which is the signal that the cache has found its own best upgrade candidate.

I still think that's the right shape. But it rests on having reviewers who can promote provisional entries into verified ones — and that, in turn, needs a trusted source of ground truth that the product didn't yet have. The design assumed a foundation that wasn't there. That's part of why the whole effort paused rather than shipped.

11. What it all taught us

Laid out together, the gauntlet looks like this:

Approach	Verdict	The decisive evidence
Raw embedding cache	Dead end	60–100% numeric fooling; unsafe hits at any useful threshold
Numeric-word substitution	Dead end	Altered values still beat true paraphrases
Learned reranker (XGBRanker)	Overfit	0.88 in CV → 0.44 on unseen topics (below baseline 0.68)
Triplet projection head	Best generalizer, still short	Held out-of-distribution, but failed a real fraction of new topics
Query canonicalization	Audit-failed	49% hit rate, ~45% violations; fixes overfit the test
Offline replay + LLM judge	Kept	The reusable asset that made every result above measurable

There were two genuinely useful findings inside the wreckage. First, an exploitable asymmetry: non-numeric questions are materially safer to reuse than numeric ones — at the same threshold, non-numeric matches were judged safe roughly 1.7× as often. A cache that simply refuses to serve numeric/decision queries and routes them to the model is a defensible design. Second, a trivially-safe slice that needs no embeddings at all: a meaningful share of real traffic is exact repeats of greetings, acknowledgements, and stable definitions ("what is HRP?"). Those are safe by construction.

The deeper lesson is a single sentence: similarity is not equivalence, and the things that make two clinical queries different — a number, a polarity, a specific history — are exactly the things current methods are worst at preserving. "Reuse a past answer for a similar question" is nearly solved in general domains. Under a clinical-safety constraint, it's an open research problem, and now we can say precisely why: magnitude-blind representations, safety as an unmeasured out-of-distribution property, and structure extraction that fails on the decisive details.

12. Where we left it

We paused automatic semantic answer-reuse for clinical content. To be clear about the status: none of it shipped to users. What we'd actually deploy, in order, is the boring-but-safe set — an exact-match cache for the trivially-safe slice, and perhaps a short-lived exact-match layer to absorb bursts of identical questions (useful during new-worker onboarding, when trainees fire the same queries at once). Everything clinical and substantive keeps going to the model.

The thing I'm most glad we built is the part that wasn't a cache at all: the replay harness from §4 — it scores matches with a calibrated judge and lets you ask "what would have happened" without ever exposing a real worker to an untrusted answer. That's what turned a string of hunches into decisive results, and it's what any future attempt should start from.

If this is ever worth revisiting, three things would have to become true: a representation that respects magnitude without brittle hand-built features; a way to train and evaluate a safety gate so that out-of-distribution behavior is the headline metric, not a footnote; and a structure extractor trustworthy enough that "did we capture the query correctly" stops being its own failure mode. Until then, the honest answer is the one this whole project arrived at: in a domain where being wrong is unbounded, a cache you can't fully trust is not faster or cheaper — it's just a faster way to be wrong.

A note on method and privacy: "unsafe" wasn't a vibe. I reviewed a batch of matches by hand, derived the failure taxonomy from what I saw, then encoded that taxonomy into an LLM judge calibrated against those manual labels — so the judge detects failure classes already validated by a human, rather than acting as an oracle of clinical correctness. Every example here is illustrative, drawn from curated experiment datasets; no personal or identifying information from production traffic appears anywhere.