We are joined by Ivan Kairatov, a biopharma expert whose work sits at the cutting edge of technology and clinical innovation. His latest research venture, Observer, is poised to reshape our understanding of healthcare by moving beyond the static, text-based data of electronic health records. The project introduces a new paradigm of multimodal data capture—using video and audio to record the subtle, human dynamics of patient-clinician interactions. In our conversation, we explore how this rich new dataset captures the previously invisible aspects of care, the sophisticated de-identification technology required to protect patient privacy, and the transformative potential for artificial intelligence to augment, and ultimately re-humanize, the clinical encounter.
The article contrasts Observer with data like clinical notes, calling them “traces left behind.” Can you elaborate on what crucial subtleties are missed in these records and share a specific anecdote from the Observer footage that highlights the importance of capturing body language or vocal tone?
That phrase, “traces left behind,” really gets to the heart of the problem we’re trying to solve. Electronic health records are invaluable, but they are an abstraction of the visit, a summary written after the fact. They miss the entire texture of the human interaction. For instance, a doctor’s note might state, “Patient agrees to the new treatment plan.” On paper, that looks like a success. But when you watch the video, you might see the patient nodding slowly while their brow is furrowed in confusion, their shoulders are tensed, and they keep glancing anxiously at their spouse. The audio might pick up a very hesitant, high-pitched “okay.” The video and audio tell a completely different story: one of uncertainty and a need for more explanation. That discrepancy is where medical errors and non-adherence are born, and it’s a subtlety that has been completely invisible to researchers until now.
Your MedVidDeID tool achieved over 90% automatic de-identification, which is impressive. Could you walk us through the step-by-step process, from audio scrubbing to face blurring, and explain why keeping a “human in the loop” remains a critical final step for ensuring total HIPAA compliance?
Building MedVidDeID was a massive undertaking, and it’s a multi-stage pipeline designed for both efficiency and security. First, the system ingests the raw video and audio. It immediately gets to work extracting a transcript of the entire conversation, which is then programmatically scrubbed of any identifying text like names or locations. In parallel, the audio track is processed to scrub spoken identifiers, and the voices themselves are transformed to be unrecognizable. On the video side, we use state-of-the-art computer-vision models that automatically detect and blur faces, computer screens showing patient data, or even a name tag on a lab coat. This automated process is a workhorse, handling more than 90% of the de-identification and reducing our team’s total review time by over 60%. But the remaining few percent is where the real risk lies. An AI might miss a patient’s name reflected in a glass frame on the wall. That’s why the “human in the loop” is our non-negotiable final step. A trained human reviewer performs a final quality control check on every recording to catch what the machine might have missed, ensuring we can guarantee absolute patient privacy and HIPAA compliance.
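MedVidDeID itself is not public, but as a rough illustration of the staged pipeline Ivan describes, here is a minimal Python sketch: a placeholder regex pass over the transcript and an OpenCV face-blurring pass over the video. Every function name, pattern, and parameter here is an illustrative assumption, not the project’s actual code, and a production system would use clinical NER and far stronger detection models.

```python
# Illustrative sketch of a staged de-identification pass, NOT MedVidDeID itself.
import re
import cv2  # OpenCV, used here only for the face-blurring stage

# Toy pattern for titled names; a real system would use a clinical NER model.
NAME_PATTERN = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+\b")

def scrub_transcript(text: str) -> str:
    """Stage 1: redact simple name patterns in the extracted transcript."""
    return NAME_PATTERN.sub("[REDACTED]", text)

def blur_faces(frame, detector):
    """Stage 2: detect faces in one frame and Gaussian-blur each region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(
            frame[y:y + h, x:x + w], (51, 51), 0)
    return frame

def deidentify_video(in_path: str, out_path: str) -> None:
    """Automated video pass; a human reviewer still checks the output."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    ok, frame = cap.read()
    while ok:
        writer.write(blur_faces(frame, detector))
        ok, frame = cap.read()
    cap.release()
    writer.release()
```

The point of the sketch is the structure, not the models: each modality gets its own automated stage, and the output of every stage is still routed to the human reviewer described above.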
The project deployed multiple cameras, including head-mounted ones for both clinicians and patients. Can you discuss the patient and clinician feedback on this process and explain how capturing these different perspectives allows you to ask more ambitious research questions about the clinical encounter?
We were very deliberate about making participation an empowered choice, and the feedback has been overwhelmingly positive. Both patients and clinicians were excited by the prospect of contributing to research that could fundamentally improve healthcare. By providing different camera options, we gave them agency in how they participated. The real magic, from a research perspective, is in combining those different viewpoints. The fixed room camera provides a neutral, third-person view of the interaction—the layout, the equipment, the overall dynamic. The clinician’s head-mounted camera is fascinating; it shows us exactly what they are looking at, letting us quantify the split in their attention between the patient and the computer screen. And when patients opt in to wear a camera, we get their direct point of view, seeing what they focus on when a doctor is explaining a complex diagnosis. This allows us to move beyond simple questions and ask much more ambitious ones, like “How does the physical layout of the exam room impact the amount of eye contact, and how does that, in turn, correlate with patient satisfaction scores?” You can only begin to answer that when you can see the encounter from every angle.
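As a back-of-the-envelope illustration of the attention-split measure Ivan mentions, the sketch below assumes per-frame gaze-target labels (for example, from annotators or a gaze classifier run on the head-mounted footage) and simply computes the share of the visit spent looking at each target. The label names and numbers are made up for illustration.

```python
from collections import Counter

def attention_split(frame_labels: list[str]) -> dict[str, float]:
    """Fraction of frames the clinician spends on each gaze target."""
    counts = Counter(frame_labels)
    total = sum(counts.values())
    return {target: count / total for target, count in counts.items()}

# Hypothetical 30 fps clip: ~18 s on the patient, 12 s on the screen, 3 s elsewhere.
labels = ["patient"] * 540 + ["screen"] * 360 + ["other"] * 100
print(attention_split(labels))
# {'patient': 0.54, 'screen': 0.36, 'other': 0.1}
```

A measure this simple, aggregated across many visits and joined with room-layout metadata and satisfaction scores, is what makes the more ambitious correlational questions answerable.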
You plan for Observer to become a national resource, much like the MIMIC project. Beyond just adding video and audio, how does this multimodal data fundamentally change the types of AI models that can be developed? Please describe a potential AI tool that could only be built using Observer.
The MIMIC project truly revolutionized health informatics by providing a large, standardized dataset of ICU records. We see Observer as the next evolution of that concept for primary care, adding these incredibly rich sensory layers. This fundamentally changes the game for AI development. Current AI is brilliant at finding patterns in structured data—lab values, diagnostic codes, prescription histories. But it is completely blind to the human process of care. With Observer, we can build models that understand social, emotional, and environmental context. A perfect example of a tool that could only be built with this data would be an “AI communication coach” for medical residents. Imagine a system that analyzes a resident’s interaction with a patient in real time. It wouldn’t just transcribe the words; it would analyze the doctor’s vocal tone, speaking pace, and the patient’s facial expressions and body language. It could then provide discreet, real-time feedback, perhaps flagging that the patient’s non-verbal cues signal confusion, prompting the resident to pause and re-explain a concept. That kind of empathetic, context-aware AI is impossible to build from a text-based health record alone.
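The communication-coach idea is easiest to see as a fusion of simple per-moment cues into a single flag. The toy sketch below makes that fusion concrete; the feature names, weights, and threshold are assumptions standing in for what a model trained on Observer-style data would actually learn.

```python
from dataclasses import dataclass

@dataclass
class MomentCues:
    patient_brow_furrow: float      # 0-1 score from a facial-expression model
    patient_speech_pitch_z: float   # z-scored pitch of the patient's last reply
    clinician_words_per_min: float  # speaking pace from the live transcript

def flag_confusion(cues: MomentCues) -> bool:
    """Return True when fused cues suggest the patient may need a re-explanation.
    Weights and threshold are illustrative placeholders, not learned values."""
    score = (
        0.5 * cues.patient_brow_furrow
        + 0.3 * min(max(cues.patient_speech_pitch_z / 2.0, 0.0), 1.0)
        + 0.2 * min(cues.clinician_words_per_min / 200.0, 1.0)
    )
    return score > 0.6

# Furrowed brow, hesitant high-pitched "okay", fast-talking resident -> flag it.
print(flag_confusion(MomentCues(0.8, 1.6, 190.0)))  # True
```

What a text-only record contributes to this calculation is exactly nothing; every input comes from the video and audio layers Observer adds.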
What is your forecast for how AI tools, developed using rich datasets like Observer, will change the day-to-day experience of a primary care visit for both patients and clinicians in the next decade?
My forecast is that these tools will work largely in the background to re-humanize the clinical encounter. For clinicians, AI will become an intelligent co-pilot, not a replacement. It will listen to the natural conversation and handle the vast majority of clinical documentation automatically, freeing the doctor from the tyranny of the keyboard to focus their full attention on the patient. This will transform the dynamic of the room. For patients, the experience will feel more connected and less transactional. Their doctor will be making more eye contact, listening more intently, and engaging more deeply. Following the visit, the patient might receive an AI-generated summary of the conversation, translated into plain language and tailored to their specific health literacy level. Ultimately, the goal of this technology isn’t to insert more tech into the exam room, but to use it to strip away the administrative burdens and digital distractions, allowing the core of medicine—the human relationship between a patient and their doctor—to take center stage once again.
