002

Optum

Production ML for Healthcare at UnitedHealth Group

Role

Software Engineer, AI/ML

Duration

Feb 2026 – Present

Team

Enterprise AI/ML Team

Status

Current

Private Repo

Overview

Most ML in healthcare dies in a notebook. A model performs well on historical data, gets handed to engineering, and never survives contact with production traffic, schema drift, or a compliance audit. I work on the systems that keep models alive after deployment: pipelines ingesting millions of records daily, monitoring that catches drift before it reaches patients, and infrastructure that makes all of it auditable under HIPAA. The challenge is not building the model. It is building the system around it that runs at 150M-patient scale.

Problem

A care gap detected six months late is a care gap that sent someone to the ER. Clinical records, claims, pharmacy data, and lab results live in separate systems with different schemas, update cadences, and access controls. Models trained on clean historical data encounter missing fields, delayed records, and format changes in production. The failure mode is not a bad prediction — it is a prediction that looked correct at training time and degrades silently in production until a downstream clinician makes a decision on stale confidence scores.
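One way to make the "missing fields and delayed records" failure modes visible at ingestion time is a per-record check that runs before features are computed. The sketch below is illustrative only: the field names in `REQUIRED_FIELDS`, the `RecordCheck` type, the `check_record` helper, and the two-day freshness window are all assumptions, not the production schema or thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical required fields; a real schema would differ per source system.
REQUIRED_FIELDS = ("patient_id", "event_time", "source")


@dataclass
class RecordCheck:
    missing_fields: list
    is_stale: bool

    @property
    def ok(self) -> bool:
        # A record is usable only if it is both complete and fresh.
        return not self.missing_fields and not self.is_stale


def check_record(
    record: dict,
    now: datetime,
    max_age: timedelta = timedelta(days=2),  # assumed freshness window
) -> RecordCheck:
    """Flag absent/empty required fields and records older than max_age."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    event_time = record.get("event_time")
    # Both datetimes are expected to be timezone-aware.
    is_stale = isinstance(event_time, datetime) and (now - event_time) > max_age
    return RecordCheck(missing_fields=missing, is_stale=is_stale)


if __name__ == "__main__":
    now = datetime(2026, 3, 1, tzinfo=timezone.utc)
    late = {
        "patient_id": "p1",
        "event_time": now - timedelta(days=10),
        "source": "claims",
    }
    print(check_record(late, now))
```

Records that fail the check can be routed to a quarantine path and counted, so a rise in missing or late data shows up as a metric rather than as a silently degraded prediction.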

Approach

  • 01 Own the full lifecycle, from data ingestion through feature engineering, training, deployment, and post-deployment monitoring, so there is no handoff gap between research and production
  • 02 Deploy on AWS with Kubernetes orchestration and automated rollback triggered by model performance degradation, not just infrastructure failure
  • 03 Build processing jobs against strict SLAs: healthcare records that arrive late or are processed late compound downstream into missed care windows
  • 04 Instrument every model with drift detection, prediction confidence tracking, and anomaly alerting. If a model's output distribution shifts, we know before a clinician sees the result
  • 05 Translate research models into production services with latency and throughput guarantees. A model that takes 30 seconds per prediction is a model that will not be used
  • 06 Enforce HIPAA compliance as infrastructure, not policy: encryption at rest and in transit, row-level access controls, immutable audit trails. Compliance is an architectural constraint, not a checklist item
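The drift-detection and degradation-triggered rollback steps above can be sketched with a population stability index (PSI) over model output scores. This is a minimal illustration under stated assumptions, not the production implementation: PSI is one common drift statistic among several, the function names are hypothetical, and the 0.2 threshold is a widely used rule of thumb rather than a value taken from this project.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a baseline score sample and a live score sample.

    Bins are equal-width over the baseline range; live values outside
    that range are clamped into the first or last bin.
    """
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread to bin over")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the log below is always defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def should_roll_back(
    baseline: Sequence[float],
    live: Sequence[float],
    threshold: float = 0.2,  # assumed rule-of-thumb cutoff
) -> bool:
    """Return True when output drift exceeds the threshold."""
    return population_stability_index(baseline, live) > threshold


if __name__ == "__main__":
    baseline_scores = [i / 100 for i in range(100)]
    shifted_scores = [0.9 + i / 1000 for i in range(100)]
    print(should_roll_back(baseline_scores, baseline_scores))
    print(should_roll_back(baseline_scores, shifted_scores))
```

In a deployment like the one described, a check of this kind would run on a schedule against recent predictions, and a positive result would feed the same rollback path used for infrastructure failures.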

Design Decisions

Technology Stack

Languages

Python, SQL

ML/AI

PyTorch, Scikit-learn, MLflow, Feature Stores

Infrastructure

AWS, Kubernetes, Docker, Terraform

Data

Spark, Airflow, PostgreSQL, S3

Compliance

HIPAA, SOC 2, Encryption, Audit Logging

Impact

Scale

150M+

Patients served across UnitedHealth Group — every pipeline decision compounds at this scale

Records

Millions/day

Healthcare records processed daily with strict SLA guarantees

Uptime

99.9%+

Production model availability with automated rollback on degradation
