
Building Message Quality at Scale


Attentive generates millions of personalized AI messages per day for our clients, each referencing products their customers are likely to want, along with purchase links. In production AI, the real challenge isn't preventing every anomaly; it's detecting and correcting issues before they reach users. We addressed that challenge by building a multi-layered observability platform that pairs the precision of objective, code-based checks with the nuanced judgment of subjective, LLM-based evaluations, effectively creating a QA team that never sleeps.

The Challenge: Monitoring AI In Production

Traditional software monitoring falls short when it comes to AI systems. Unlike deterministic code paths, an LLM produces open-ended, probabilistic text—so failures show up as hallucinations, policy violations, spelling or grammatical errors, tone drift, or missing data rather than exceptions in a log. In 2025, benchmarks show that even state-of-the-art models still hallucinate: the best model errs 15% of the time, and the average across common setups is about 44%. How do you verify that an AI-generated response is correct, safe, human-sounding, and still contains all the information a user needs to act?

A Hybrid Approach: Over 20 Validation Checks Across Two Dimensions

The platform uses over 20 validators, grouped into two strategic approaches. All of this supports our broader goal: helping brands build deeper customer relationships through timely, relevant, and trusted messaging.

  1. Objective Code Checks: Handle the measurable aspects of the messages:
    • Does the message include the valid data it was instructed to use from its inputs?
    • Does the message avoid language on a prohibited-terms checklist?
    • Does the message contain a URL?
    • Is the URL correctly formatted?
    • Does the message adhere to the brand settings selected by the marketer?

  2. Subjective LLM Checks: Evaluate the qualitative dimensions:
    • Does the message contain any harmful content?
    • Does the message sound human?
    • Is the message grammatically correct?
    • Does the message refer to the product or brand correctly?
    • Is the message appropriate for the intended audience?
    • Is the message appropriate for its context?

The platform continues to evolve as we personalize and experiment with our messages. We regularly add new evaluation use cases as they’re identified.
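As a sketch of how an objective, code-based check might look, each validator can be a small pure function that returns a named pass/fail result. The function names, the `ValidationResult` shape, and the URL pattern below are illustrative assumptions, not Attentive's actual implementation:

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationResult:
    name: str
    passed: bool
    detail: str = ""

# Illustrative pattern: a message must carry an HTTPS purchase link.
URL_RE = re.compile(r"https://[\w.-]+/[\w/-]*")

def check_contains_url(message: str) -> ValidationResult:
    """Objective check: is a well-formed purchase link present?"""
    match = URL_RE.search(message)
    return ValidationResult("contains_url", match is not None,
                            "" if match else "no purchase link found")

def check_blocklist(message: str, blocklist: set[str]) -> ValidationResult:
    """Objective check: does the message avoid prohibited terms?"""
    hits = [term for term in blocklist if term in message.lower()]
    return ValidationResult("blocklist", not hits, ", ".join(hits))

def run_validators(message: str,
                   validators: list[Callable[[str], ValidationResult]]) -> list[ValidationResult]:
    return [validate(message) for validate in validators]
```

Subjective checks would plug into the same interface, with the LLM call hidden behind a function of the same signature.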

Where Validation Checks Are Applied

1. Pre-Send (Generation-Time Filtering)

We generate n candidate messages per input. Objective validators run on each candidate. The highest-quality message that passes all constraints is selected and sent.
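The select-from-n-candidates step can be sketched as a filter-then-rank pass. The helper names, the example checks, and the use of message length as a stand-in quality score are all hypothetical:

```python
from typing import Callable, Optional

Validator = Callable[[str], bool]

def select_message(candidates: list[str],
                   validators: list[Validator],
                   score: Callable[[str], float]) -> Optional[str]:
    """Return the highest-scoring candidate that passes every objective check,
    or None if nothing is safe to send."""
    passing = [c for c in candidates if all(v(c) for v in validators)]
    return max(passing, key=score, default=None)

# Hypothetical checks and scorer, for illustration only:
has_link = lambda m: "https://" in m
short_enough = lambda m: len(m) <= 160

best = select_message(
    ["Sale today! https://x.co/a",
     "Sale today!",
     "Huge sale https://x.co/b" + "!" * 200],
    [has_link, short_enough],
    score=len,  # stand-in for a real quality score
)
```

Returning `None` when no candidate passes makes the fallback decision (skip the send, or regenerate) explicit at the call site.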

2. Post-Send (Runtime Monitoring)

Once messages are delivered, we sample outputs for subjective review. Any degradation in message quality triggers alerting and root cause analysis. Issues uncovered this way may lead to:

  • Switching models used for generation.
  • Adjusting prompt templates.
  • Updating validation logic.
  • Fixing data that is sent as context to the LLM.

3. Offline Sample Generation 

Before we release a new feature, we run rigorous tests through the AI Observability platform. We generate thousands of samples and score them against the different evaluators. Only features that achieve a clean pass across all evaluators make it to production.

Real‑Time Alerting & Anomaly Detection

Our alerting pipeline scores live message traffic against historical baselines. Any spike in error rates, sentiment drift, or guideline violations triggers an automated alert, routing the offending samples—and their metadata—into a triage queue for rapid root‑cause analysis. This keeps escaped defects near zero and lets us remediate issues before they impact users at scale.
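One simple way to score live traffic against a historical baseline is a z-score test on a recent window of error rates. This is a minimal sketch of the idea, not Attentive's actual detector; the threshold and window shapes are assumptions:

```python
from statistics import mean, stdev

def is_anomalous(recent_error_rates: list[float],
                 baseline_error_rates: list[float],
                 z_threshold: float = 3.0) -> bool:
    """Flag when the recent window drifts well above the historical baseline."""
    mu = mean(baseline_error_rates)
    sigma = stdev(baseline_error_rates)
    current = mean(recent_error_rates)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold
```

In practice the same comparison would run per validator and per use case, so a spike in, say, guideline violations is attributed to the right signal.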

Smart Sampling: Balancing Coverage and Costs

Running all the subjective evaluators on all sent messages would be prohibitively expensive. The platform addresses this with an intelligent sampling strategy:

Risk‑Based Sampling
High‑risk interactions—such as high‑priority messages, sensitive topics, or new features—receive full validation coverage. Once monitoring shows they’re stable, lower‑risk interactions are sampled at a reduced rate.

Context‑Aware Validation
Not every validator applies to every message. For each use case, the platform runs only the relevant subset of checks, keeping coverage complete without unnecessary overhead.

Adaptive Sampling
As we gain confidence in a use case and its error rate stays consistently low, we further taper its sampling frequency, reallocating capacity to new or higher‑risk scenarios.
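The three sampling strategies above can be combined into a single rate policy. The sketch below is a hypothetical illustration of that policy, not the platform's real logic; the specific rates, the 1% threshold, and the floor are assumed values:

```python
def sampling_rate(risk: str, observed_error_rate: float) -> float:
    """Return the fraction of traffic to run through subjective evaluators.

    High-risk traffic (new features, sensitive topics) gets full coverage;
    stable low-risk traffic tapers toward a floor as confidence grows.
    """
    if risk == "high":
        return 1.0                               # full validation coverage
    if observed_error_rate > 0.01:               # above the ~1% target: ramp back up
        return 0.5
    return max(0.01, observed_error_rate * 10)   # taper, but never below 1%
```

Keeping a nonzero floor matters: even a mature, quiet use case needs enough sampled traffic for the baseline comparison to stay statistically meaningful.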

Key Insights From Implementation

Our approach is constantly evolving—every bug, false positive, or missed edge case is a chance to refine how we think about AI quality. That culture of rapid feedback and iteration is core to how we build. Building this platform revealed several important lessons:

  1. LLM Evaluators Aren’t Perfect: While powerful for subjective evaluation, LLM evaluators can also hallucinate. Using confidence scoring and cross-validation across multiple LLM evaluators ensures that critical checks are conducted safely.
  2. Context Matters Enormously: The more context an LLM evaluator is given, the more accurately it judges a message.
  3. Confidence Scores Matter: Asking the LLM to return a confidence score with its answer lets us decide when to trigger additional review or fallback logic.
  4. Explicit Instructions on How To Use Context: LLMs can impose their own, unstated criteria based on some of the input provided in the context. Adding explicit instructions to the evaluation guidelines prevents the model from over‑reaching beyond what we intend it to judge.
  5. Use the Best Model for Each Check: No single model excels everywhere. Through prompt and model experimentation, we pair each validator with the model that benchmarks highest for that task.
  6. Balancing Cost with Performance: As newer, better-performing models come to market, the evaluation framework must be upgraded. To ensure new models are evaluated correctly, we built a comprehensive suite of golden test cases that must pass before any model switch. The evaluator then runs with the new model in shadow mode for a trial period, during which its error rates are compared against the incumbent's. If the rates are consistent or better, we switch to the newer model.

The Impact: Proactive Quality Assurance

Since launch, the observability platform has surfaced critical issues before they reach users: click-through rates are up, clients have praised message quality, and manual QA time is near zero. We typically keep end-to-end error rates below 1%. When they edge higher, as in a recent uptick to ~2%, our monitoring flags the spike and we deploy fixes immediately to bring the rate back down. Automated sampling and validation have also slashed the feature-release cycle from weeks to days, an order-of-magnitude improvement over the old manual workflow.

Looking Forward

Our AI observability platform will keep evolving. Next on the roadmap: expanding validators to new channels and refining our anomaly‑detection signals for even faster root‑cause pinpointing. We’ll also tighten cost‑efficiency by pairing each validator with the smallest model that still meets its accuracy target—maintaining full coverage without compromising speed or budget.

Curious about building the future of AI quality at scale? We’re hiring!
