Ivan Kairatov is a biopharma and health-tech veteran who has spent years turning clinical logic into usable digital tools. In this conversation, he explains how a new self-triage chatbot grounds every step in established medical flowcharts, why a multi-agent design matters, and where real-world hospital pilots should focus. We discuss transparency, safety guardrails, EHR integration, accessibility, and how to measure impact without hype. He also shares candid views on failure modes, clinician workflows, and what it will take to earn patient trust at scale.
What real-world problem were you trying to solve with self-triage, and how did you define success in terms of fewer unnecessary ER visits, faster care for urgent cases, and patient understanding?
We set out to cut noise in the first mile of care, where panic or confusion drives avoidable ER visits and delays true emergencies. Success means fewer non-urgent arrivals and quicker routing of high-risk symptoms, both backed by clear explanations patients can follow. We used protocol-grounded guidance so people aren’t guessing between contradictory web pages. In simulations across more than 30,000 conversations, we saw strong adherence to decision steps—over 99%—which is the foundation for those real-world outcomes.
Walk me through the multi-agent design: how does each agent pick a flowchart, interpret free-text answers, and translate clinical phrasing, and where do handoffs most often fail?
Three agents work in tandem. The first selects a symptom-specific flowchart from a library of 100, factoring in details like age and sex. The second converts free-text replies into structured decisions and advances the conversation to the next node. The third rephrases clinical language into everyday speech. Most handoff failures occur when the second agent encounters ambiguous phrasing that could map to multiple branches.
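To make that division of labor concrete, here is a minimal sketch of how such a three-agent pipeline could be wired together. All names (Node, Flowchart, select_chart, answer_to_branch, to_plain_language) are illustrative stand-ins rather than the production API, and the matching logic is stubbed.

```python
# Minimal sketch of a three-agent triage pipeline (illustrative names only).
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    question: str                 # clinical phrasing from the source flowchart
    branches: dict[str, str]      # answer label -> next node_id

@dataclass
class Flowchart:
    chart_id: str
    nodes: dict[str, Node]
    root: str

def select_chart(complaint: str, age: int, sex: str,
                 library: dict[str, Flowchart]) -> Flowchart:
    """Agent 1: pick the symptom-specific chart (stubbed as keyword match)."""
    for chart in library.values():
        if chart.chart_id.replace("-", " ") in complaint.lower():
            return chart
    raise LookupError("no matching chart; route to human triage")

def answer_to_branch(node: Node, free_text: str) -> str | None:
    """Agent 2: map a free-text reply onto one of the node's branches.
    Returns None when the reply is ambiguous, forcing a clarifier."""
    hits = [label for label in node.branches if label in free_text.lower()]
    return hits[0] if len(hits) == 1 else None

def to_plain_language(question: str) -> str:
    """Agent 3: rephrase clinical wording (stubbed with a tiny glossary)."""
    glossary = {"acute": "started suddenly", "bilateral": "on both sides"}
    for term, plain in glossary.items():
        question = question.replace(term, plain)
    return question
```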
You trained the system on step-by-step medical flowcharts; how did you select and adapt them, and what guardrails did you add to avoid unsafe shortcuts during conversation?
We started with 100 stepwise protocols precisely because they’re granular and auditable. We preserved node order and branching logic, then added metadata for contraindications and red flags. Guardrails prevent skipping nodes, force clarification when inputs conflict, and lock emergency branches once triggered. We also log every node traversal so deviations are visible and can be corrected.
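A hedged sketch of what those guardrails could look like inside a traversal loop follows; the emergency node IDs and data shapes are placeholders, not the real engine.

```python
# Sketch of traversal guardrails: no skipped nodes, emergency lock-in, and a
# full audit trail of every step. Node IDs here are invented placeholders.
EMERGENCY_NODES = {"abd-017-er", "chest-003-er"}

class GuardedTraversal:
    def __init__(self, nodes: dict[str, dict], root: str):
        self.nodes = nodes          # node_id -> {"branches": {label: next_id}}
        self.current = root
        self.locked_emergency = False
        self.audit_log: list[tuple[str, str]] = []   # (node_id, answer)

    def step(self, answer_label: str) -> str:
        branches = self.nodes[self.current]["branches"]
        if answer_label not in branches:
            # Unknown or conflicting input: force a clarifier, never guess.
            raise ValueError(f"clarify at node {self.current}")
        next_id = branches[answer_label]
        if self.locked_emergency and next_id not in EMERGENCY_NODES:
            # Once a red-flag branch fires, the session cannot de-escalate.
            raise PermissionError("emergency path locked; cannot de-escalate")
        self.audit_log.append((self.current, answer_label))
        if next_id in EMERGENCY_NODES:
            self.locked_emergency = True
        self.current = next_id
        return next_id
```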
In simulated abdominal pain scenarios, how does the chatbot narrow from vague descriptions to a specific pathway, and what language tweaks most improved answer quality and patient confidence?
It begins broad—location, timing, and associated symptoms—then funnels through pattern-defining questions from the chosen chart. Phrasing shifts made a real difference: “How bad is the pain on a scale of 1 to 10?” outperformed “Is the pain severe?” by reducing hedging. We also swapped jargon like “acute” for “started suddenly today or built up over days.” Those small shifts made the 84% flowchart selection rate more durable across varied user language.
You reported high accuracy following decision steps; where do errors still cluster—flowchart selection, ambiguous symptoms, or edge cases—and how are you closing those gaps with data or design changes?
Errors cluster in initial chart selection and in overlapping symptom patterns. Ambiguity in user narratives can point to two or more starting charts. We’re adding disambiguation “micro-questions” and retraining with adversarial phrasing derived from the 30,000 simulated runs. We also created fallback routes that re-check earlier branches when confidence dips.
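One plausible shape for those fallback routes, assuming the engine keeps a per-step confidence score; the threshold value and field names are invented for illustration.

```python
# Sketch: when confidence dips, re-open the most recent low-confidence branch
# and ask a disambiguation micro-question (threshold and fields are invented).
CONFIDENCE_FLOOR = 0.7

def maybe_fall_back(path: list[dict]) -> dict | None:
    """path items look like {'node_id': str, 'answer': str, 'confidence': float}.
    Returns the step to re-check, or None if the whole path is sound."""
    for step in reversed(path):
        if step["confidence"] < CONFIDENCE_FLOOR:
            return step   # re-ask from this node with a micro-question
    return None
```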
Transparency is often elusive in AI; how do you trace each recommendation to a specific protocol node, and how do clinicians audit, override, or annotate those decisions in practice?
Every question, answer, and branch maps to a unique protocol node ID. The final recommendation lists the path, node-by-node, so it’s immediately auditable. Clinicians can override at any step and attach annotations that persist in the audit log. Those annotations feed back into quality reviews and targeted retraining.
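As a concrete picture of that traceability, a per-session record might look like the sketch below, with clinician overrides appended rather than overwriting the path. Every field name and ID is an assumption, not the product schema.

```python
# Sketch of an auditable recommendation record keyed to protocol node IDs.
import datetime
import json

record = {
    "session_id": "s-0001",
    "chart_id": "abdominal-pain-adult",
    "path": [
        {"node_id": "abd-001", "answer": "lower right"},
        {"node_id": "abd-004", "answer": "started suddenly"},
        {"node_id": "abd-009", "answer": "yes, fever"},
    ],
    "recommendation": "urgent-care-within-4h",
    "overrides": [
        {"node_id": "abd-009",
         "clinician": "rn-42",
         "note": "patient afebrile on recheck",
         "at": datetime.datetime.now(datetime.timezone.utc).isoformat()},
    ],
}
print(json.dumps(record, indent=2))
```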
Many health systems use provider-specific protocols; how can teams plug in their own logic, validate updates safely, and version-control changes without breaking existing workflows?
We support modular protocol packs so organizations can swap in local logic without touching the core engine. Changes go through staging with synthetic cases and regression checks to catch unintended shifts. Version tags travel with every conversation transcript, so decisions are always tied to the exact rule set. Rollbacks are one click if validation flags a variance.
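A minimal sketch of how that version pinning could travel with each conversation, assuming semantic-version tags on immutable protocol packs; the registry and names are illustrative.

```python
# Sketch: pin every transcript to the exact protocol pack it ran against, so
# audits and rollbacks always resolve to the same rule set (illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class ProtocolPack:
    name: str
    version: str          # e.g. "2.3.1"
    charts: tuple         # chart definitions, immutable once published

REGISTRY: dict[tuple[str, str], ProtocolPack] = {}

def publish(pack: ProtocolPack) -> None:
    """Published packs are never mutated; fixes ship as a new version."""
    REGISTRY[(pack.name, pack.version)] = pack

def resolve(transcript: dict) -> ProtocolPack:
    """Each transcript carries the (pack_name, pack_version) it used."""
    return REGISTRY[(transcript["pack_name"], transcript["pack_version"])]
```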
Patients rarely answer with “yes” or “no”; what NLP techniques helped map messy narratives to structured decisions, and can you share examples where subtle phrasing flipped the recommended action?
We use intent classification tuned to protocol semantics and entity extraction tied to a symptom ontology. Confidence thresholds drive clarification prompts instead of risky guesses. A small phrasing shift—“the pain woke me up at night” versus “it’s worse at night”—can escalate urgency because nocturnal awakening is a different signal. Similarly, “sudden and worst-ever” crosses into an emergency branch that “gradual and moderate” does not.
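A sketch of that confidence-gated mapping follows, with a toy keyword scorer standing in for the tuned classifier and an invented threshold.

```python
# Sketch: map messy narratives to structured answers, but only act above a
# confidence threshold; below it, emit a clarifier instead of guessing.
ASK_THRESHOLD = 0.8   # invented; real thresholds would be tuned per node

def classify(free_text: str, labels: list[str]) -> tuple[str, float]:
    """Toy stand-in for the intent classifier: returns (label, confidence)."""
    text = free_text.lower()
    scores = {lab: sum(w in text for w in lab.split()) / len(lab.split())
              for lab in labels}
    best = max(scores, key=scores.get)
    return best, scores[best]

def decide_or_clarify(free_text: str, labels: list[str]) -> str:
    label, conf = classify(free_text, labels)
    if conf >= ASK_THRESHOLD:
        return label
    return f"CLARIFY: did you mean '{label}'?"

# "woke me up at night" and "worse at night" are different clinical signals:
print(decide_or_clarify("the pain woke me up at night",
                        ["woke at night", "worse at night", "constant"]))
```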
Reducing avoidable visits is a priority; what outcome metrics will you track in deployment—ED diversion rates, time-to-care, cost savings—and how will you attribute improvements to the chatbot?
We’ll track ED diversion for non-urgent codes, time-to-appropriate-care for high-acuity patterns, and downstream utilization. Attribution uses stepped-wedge rollouts and control cohorts to separate signal from seasonality. We’ll also monitor patient comprehension via brief post-triage checks. Accuracy data from simulations—84% correct chart selection and over 99% step fidelity—provides a baseline for expected impact.
When urgency is underestimated, harm can occur; what fail-safes escalate concerning patterns, and how do you communicate uncertainty or “red flag” warnings without causing alarm fatigue?
Red-flag nodes trigger automatic escalation to urgent or emergency guidance with no ability to de-escalate in the same session. Confidence-aware prompts ask clarifying questions when signals conflict. We state uncertainty plainly—“Based on your answers, we can’t safely rule out X”—and explain why action is advised. Frequency capping and deduped warnings prevent repetitive alerts.
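In code, that one-way escalation plus alert deduplication might be sketched like this; the repeat cap and level names are assumptions.

```python
# Sketch: one-way red-flag escalation and frequency-capped, deduped warnings.
class EscalationState:
    MAX_REPEAT_WARNINGS = 2   # invented cap to limit alarm fatigue

    def __init__(self):
        self.level = "routine"            # routine -> urgent -> emergency
        self.warned: dict[str, int] = {}  # warning key -> times shown

    def escalate(self, new_level: str) -> None:
        order = ["routine", "urgent", "emergency"]
        # Only move up; de-escalation within a session is not allowed.
        if order.index(new_level) > order.index(self.level):
            self.level = new_level

    def warn(self, key: str, message: str) -> str | None:
        shown = self.warned.get(key, 0)
        if shown >= self.MAX_REPEAT_WARNINGS:
            return None                   # deduped: suppress the repeat
        self.warned[key] = shown + 1
        return message
```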
How will you run hospital pilots: recruitment, consent, inclusion criteria, and clinician backup plans, and what sample sizes and endpoints are you targeting for statistical power?
We’ll recruit from digital front doors and nurse advice lines, capturing e-consent with clear limits and emergency disclaimers. Inclusion focuses on adult ambulatory complaints initially, with exclusions for obvious emergencies. Every session routes to a clinician review queue for spot checks and immediate backup when red flags appear. Endpoints mirror operations—diversion, time-to-care, and adherence—with sample sizes powered to detect effects in line with the baselines from our 30,000 simulated runs.
Integration often stalls adoption; how will you connect with electronic health records, map data fields, and ensure that triage summaries are useful, not noisy, for downstream clinicians?
We use standards-based APIs to push structured summaries keyed to protocol node IDs. Field mapping aligns answers to discrete vitals, symptoms, and timelines rather than dumping raw text. The summary is concise: presenting complaint, decision path, and the recommended disposition. Clinicians can expand any node for full context when needed.
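The interview does not name the standard, but assuming a FHIR-style exchange, the pushed summary could be as compact as the sketch below; the resource shape and field names are assumptions.

```python
# Sketch of a compact triage summary pushed to the EHR. The FHIR-style shape
# is an assumption; the source only says "standards-based APIs".
summary = {
    "resourceType": "Observation",        # assumed resource choice
    "code": {"text": "self-triage summary"},
    "note": [{"text": "Presenting complaint: lower-right abdominal pain"}],
    "component": [
        {"code": {"text": "decision_path"},
         "valueString": "abd-001 > abd-004 > abd-009"},
        {"code": {"text": "disposition"},
         "valueString": "urgent-care-within-4h"},
    ],
}
```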
Accessibility matters; how will you support older adults, non-English speakers, and low health literacy users with voice input, multilingual prompts, and image sharing while maintaining clinical accuracy?
We’re building voice input with slow-speech tolerance and confirmation read-backs. Multilingual prompts mirror the same protocol nodes to keep logic identical across languages. Image sharing is limited to cases where a protocol has a defined visual cue, with disclaimers when images are insufficient. Plain-language rewrites—like the 1-to-10 pain scale—boost comprehension without diluting clinical precision.
Can you outline your privacy and security approach—data minimization, on-device versus cloud processing, audit logs—and how you meet HIPAA and other regulatory expectations?
We collect only what the protocol needs and purge non-essential metadata. Sensitive parsing can run on-device, with encrypted cloud calls for protocol logic and storage under strict access controls. Every node decision is logged with tamper-evident records to support audits. Business associate agreements, HIPAA-aligned safeguards, and clear patient notices are standard.
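One common way to make node-level logs tamper-evident is a hash chain, where each entry commits to the one before it; this is a generic sketch of the technique, not a claim about the product's actual scheme.

```python
# Generic tamper-evident log via hash chaining: editing any entry breaks the
# chain on verification (illustrative, not the product's implementation).
import hashlib
import json

def append_entry(log: list[dict], node_id: str, answer: str) -> None:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"node_id": node_id, "answer": answer, "prev": prev_hash}
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify(log: list[dict]) -> bool:
    prev = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("node_id", "answer", "prev")}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True
```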
What are the most common failure modes you’ve seen—contradictory answers, comorbidities, pediatric quirks—and how does the system recover or ask clarifying questions without user frustration?
Contradictions arise when users change their minds midstream or misread a question. The system highlights the inconsistency and asks a short, neutral clarifier rather than scolding. Comorbidities and pediatric nuances are flagged for conservative routing until specialized charts are enabled. Confidence thresholds, not hunches, drive when to loop back or escalate.
Compared with general web searches or generic chatbots, where does protocol-grounded guidance most clearly outperform, and where does it still lag in empathy or nuance?
Protocol grounding prevents the whiplash of conflicting advice and anchors decisions to vetted steps. That’s why we can trace every recommendation to a specific node and hit over 99% step fidelity. Empathy can lag if language is too terse, so the third agent focuses on tone and clarity. We’re still improving reflective prompts that acknowledge fear without overstating risk.
For clinicians, how does this change triage workflows day-to-day, what training or dashboards are needed, and how will roles shift between nurses, physicians, and care coordinators?
Nurses gain a pre-structured intake that front-loads key answers and branch logic. Dashboards surface exception cases, red flags, and low-confidence paths for quick review. Physicians see fewer non-urgent interruptions and more complete symptom narratives. Care coordinators can focus on navigation and follow-up instead of basic questioning.
As you move to mobile apps, what product decisions—offline capability, push follow-ups, symptom monitoring—most affect adherence, and how will you measure long-term engagement?
Lightweight offline mode supports question review and drafts, with final decisions computed when connected. Push follow-ups nudge check-ins at protocol-defined intervals rather than arbitrary times. Symptom monitoring ties back to the same node logic to avoid drift. Engagement is measured by completed triage paths, on-time follow-ups, and adherence to recommended dispositions.
Customization can introduce variability; how will you balance local flexibility with consistency, and what validation steps ensure that modified protocols remain safe and effective?
We separate local edits from the canonical set and require automated regression against a bank of synthetic cases. Any change that alters a disposition triggers review by a clinical committee. Performance is compared to baseline metrics from the original 100 charts. Only then do we promote updates to production cohorts.
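A minimal sketch of that regression gate, assuming each synthetic case stores an expected input and run_case replays it against a given pack; any flagged difference blocks promotion pending committee review. All names are illustrative.

```python
# Sketch: replay synthetic cases against a modified protocol pack and flag
# every disposition change for clinical-committee review (names illustrative).
from typing import Callable

def regression_gate(run_case: Callable[[object, dict], str],
                    baseline_pack: object,
                    candidate_pack: object,
                    synthetic_cases: list[dict]) -> list[dict]:
    """run_case(pack, case) returns a disposition string for that case."""
    flagged = []
    for case in synthetic_cases:
        before = run_case(baseline_pack, case)
        after = run_case(candidate_pack, case)
        if before != after:
            flagged.append({"case_id": case["id"],
                            "baseline": before, "candidate": after})
    return flagged   # a non-empty list blocks promotion to production
```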
Looking ahead, how will you expand to multimorbidity reasoning, pregnancy-specific pathways, and pediatric variants, and what research collaborations would accelerate progress?
We’re adding cross-chart modifiers that consider comorbid risks without exploding complexity. Pregnancy and pediatrics get dedicated charts with age- and stage-specific red flags. We’ll validate with academic partners who can help generate diverse simulated cases beyond the 30,000 we started with. Shared datasets and prospective pilots will sharpen safety boundaries.
What is your forecast for AI-guided self-triage over the next five years?
I expect protocol-grounded triage to move from pilot to standard front door, measurably cutting avoidable visits while speeding true emergencies to care. The leaders will be those who can prove traceability—node by node—and sustain over 99% process fidelity in the wild, not just in 30,000 simulations. Mobile, multilingual, and EHR-connected experiences will become table stakes. Most importantly, success will be defined by trust: patients understanding why a recommendation was made and clinicians seeing exactly how the system got there.
