Attentive generates millions of personalized AI messages per day for our clients, each containing references to products their customers are likely to want, along with purchase links. In production AI, the real challenge isn't preventing every anomaly; it's detecting and correcting issues before they reach users. We addressed that challenge by building a multi-layered observability platform that blends the precision of objective, code-based checks with the nuanced judgment of subjective, LLM-based evaluations, effectively creating a QA team that never sleeps.
Traditional software monitoring falls short when it comes to AI systems. Unlike deterministic code paths, an LLM produces open-ended, probabilistic text—so failures show up as hallucinations, policy violations, spelling or grammatical errors, tone drift, or missing data rather than exceptions in a log. In 2025, benchmarks show that even state-of-the-art models still hallucinate: the best model errs 15% of the time, and the average across common setups is about 44%. How do you verify that an AI-generated response is correct, safe, human-sounding, and still contains all the information a user needs to act?
The platform uses over 20 validators, grouped into two complementary approaches: deterministic, code-based checks and subjective, LLM-based evaluations. All of this supports our broader goal: helping brands build deeper customer relationships through timely, relevant, and trusted messaging.
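To make the two approaches concrete, here is a minimal sketch of one validator of each kind. The function names, the `llm_judge` callable, and the prompt wording are hypothetical stand-ins, not the platform's actual implementation.

```python
import re

def objective_link_validator(message: str) -> bool:
    """Code-based check: the message must contain at least one purchase link."""
    return bool(re.search(r"https?://\S+", message))

def subjective_tone_validator(message: str, llm_judge) -> bool:
    """LLM-based check: ask a judge model whether the tone is on-brand.
    `llm_judge` is a hypothetical callable wrapping whatever LLM client is in use."""
    verdict = llm_judge(
        "Does this marketing message sound on-brand and human? "
        f"Answer PASS or FAIL.\n\n{message}"
    )
    return verdict.strip().upper().startswith("PASS")
```

The objective check is cheap and exact; the subjective check is expensive and probabilistic, which is why the two are deployed so differently in the sections below.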
The platform continues to evolve as we personalize and experiment with our messages. We regularly add new evaluation use cases as they’re identified.
We generate n candidate messages per input. Objective validators run on each candidate. The highest-quality message that passes all constraints is selected and sent.
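The selection step described above can be sketched as a small best-of-n filter. The `score` function is a placeholder for whatever quality metric the platform actually uses:

```python
def select_best(candidates, validators, score):
    """Run every objective validator on each candidate; among candidates that
    pass all checks, return the highest-scoring one (or None if all fail)."""
    passing = [c for c in candidates if all(v(c) for v in validators)]
    return max(passing, key=score, default=None)
```

Returning `None` when every candidate fails is a deliberate choice: it is better to send nothing than to send a message that violates a hard constraint.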
Once messages are delivered, we sample outputs for subjective review. Any degradation in message quality triggers alerting and root-cause analysis, and the resulting findings feed back into the platform as fixes and new evaluation cases.
Before we release a new feature, we test it rigorously with the AI Observability platform: we generate thousands of samples and score them against each evaluator. Only features that achieve a clean pass across all evaluators make it to production.
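A release gate of this kind reduces to a simple loop: count failures per evaluator and block the feature if any evaluator exceeds its failure budget (zero, for a clean pass). This is a sketch under that assumption, not the platform's actual gating logic:

```python
def release_gate(samples, evaluators, max_failures=0):
    """Score every sample against every evaluator; return (passed, failing
    evaluator name). A feature ships only when no evaluator exceeds its
    failure budget (zero by default, i.e. a clean pass)."""
    for name, evaluate in evaluators.items():
        failures = sum(1 for s in samples if not evaluate(s))
        if failures > max_failures:
            return False, name
    return True, None
```

Surfacing the first failing evaluator's name makes triage faster than a bare pass/fail flag.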
Our alerting pipeline scores live message traffic against historical baselines. Any spike in error rates, sentiment drift, or guideline violations triggers an automated alert, routing the offending samples—and their metadata—into a triage queue for rapid root‑cause analysis. This keeps escaped defects near zero and lets us remediate issues before they impact users at scale.
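Comparing live traffic to a historical baseline can be as simple as a z-score style threshold. This is a minimal sketch, assuming the platform keeps a window of recent per-period error rates; the real pipeline likely uses richer anomaly detection:

```python
from statistics import mean, pstdev

def spike_alert(live_error_rate, baseline_rates, k=3.0):
    """Flag when the live error rate exceeds the historical mean by more
    than k standard deviations of the baseline window."""
    mu, sigma = mean(baseline_rates), pstdev(baseline_rates)
    return live_error_rate > mu + k * max(sigma, 1e-9)
```

The `max(sigma, 1e-9)` guard keeps a perfectly flat baseline from alerting on harmless floating-point noise.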
Running all the subjective evaluators on all sent messages would be prohibitively expensive. The platform addresses this with an intelligent sampling strategy:
Risk‑Based Sampling
High‑risk interactions—such as high‑priority messages, sensitive topics, or new features—receive full validation coverage. Once monitoring shows they’re stable, lower‑risk interactions are sampled at a reduced rate.
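Risk-based sampling amounts to a per-tier sampling rate, with high-risk traffic always reviewed. The tier names and rates below are illustrative assumptions, not the platform's actual configuration:

```python
import random

# Hypothetical tiers; real risk labels would come from message metadata.
SAMPLE_RATES = {"high": 1.0, "medium": 0.25, "low": 0.05}

def should_sample(risk_tier, rng=random.random):
    """High-risk messages always get subjective review; lower tiers are
    sampled at their configured rate. Unknown tiers default to full coverage."""
    return rng() < SAMPLE_RATES.get(risk_tier, 1.0)
```

Defaulting unknown tiers to full coverage fails safe: a new feature is over-reviewed until someone explicitly classifies it.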
Context‑Aware Validation
Not every validator applies to every message. For each use case, the platform runs only the relevant subset of checks, ensuring full coverage without unnecessary overhead.
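A context-aware setup like this is naturally expressed as a registry mapping use cases to validator subsets. The use-case and validator names here are hypothetical examples:

```python
# Hypothetical mapping of use cases to the validator subset that applies.
VALIDATOR_REGISTRY = {
    "product_recommendation": ["link_check", "price_accuracy", "tone"],
    "cart_reminder": ["link_check", "tone"],
}

# Every known validator, for use cases not yet registered.
ALL_VALIDATORS = sorted({v for vs in VALIDATOR_REGISTRY.values() for v in vs})

def validators_for(use_case):
    """Return only the checks relevant to this use case; unregistered use
    cases get every validator until they are explicitly scoped."""
    return VALIDATOR_REGISTRY.get(use_case, ALL_VALIDATORS)
```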
Adaptive Sampling
As we gain confidence in a use case and its error rate stays consistently low, we further taper its sampling frequency, reallocating capacity to new or higher‑risk scenarios.
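The tapering behavior can be sketched as a feedback rule: halve the sampling rate while the observed error rate stays below target, and snap back to full coverage the moment it doesn't. The halving factor, target, and floor are illustrative assumptions:

```python
def adaptive_rate(current_rate, observed_error_rate, target=0.01, floor=0.01):
    """Taper the sampling rate while errors stay below the target; restore
    full coverage as soon as the observed error rate exceeds it."""
    if observed_error_rate > target:
        return 1.0
    return max(current_rate * 0.5, floor)
```

The asymmetry is intentional: sampling ramps down gradually as confidence grows but ramps back up instantly when quality slips.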
Our approach is constantly evolving: every bug, false positive, or missed edge case is a chance to refine how we think about AI quality. That culture of rapid feedback and iteration is core to how we build, and building this platform reinforced it at every turn.
Since launch, the observability platform has surfaced critical issues before they reach users, boosting click-through rates, earning customer accolades for message quality, and driving manual QA time to near zero. We typically keep end-to-end error rates below 1%. When they edge higher, as in a recent uptick to roughly 2%, our monitoring flags the spike and we deploy fixes immediately to bring the rate back down. Automated sampling and validation have also cut the feature-release cycle from weeks to days, an order-of-magnitude improvement over the old manual workflow.
Our AI observability platform will keep evolving. Next on the roadmap: expanding validators to new channels and refining our anomaly‑detection signals for even faster root‑cause pinpointing. We’ll also tighten cost‑efficiency by pairing each validator with the smallest model that still meets its accuracy target—maintaining full coverage without compromising speed or budget.
Curious about building the future of AI quality at scale? We’re hiring!