Supercharging Attentive’s ML Platform with Ray

Scaling AI-Powered SMS & Email with Ray

At Attentive, we leverage cutting-edge AI and ML frameworks like Ray to drive massive scale for our ML platform, empowering our teams to innovate and shape the future of shopping. 

We’ve made high-scale data and compute accessibility the cornerstone of our ML platform. This investment has expanded our ML team’s ability to enhance personalization and campaign effectiveness for marketers. However, with scale comes an explosion in infrastructure and workflow complexity.

In this post, we’ll break down why scale creates complexity, how we cut through it using Ray, and the game-changing impact we’re seeing in our adoption journey.

Tackling the S-Curve of ML Infrastructure Complexity

Driving automation, hyper-personalization, and individualization is no simple task for any machine learning team. Achieving continuous gains in customer ROI means pushing the boundaries of what our models can do. Machine learning engineers (MLEs) are constantly thinking about how to produce better predictions and how to scale them to hundreds, and then thousands, of customers.

This tuning is usually an iterative process—MLEs will consider things like:

  • Can we enhance predictions by incorporating more data?
  • Should we scale up model complexity with more parameters?
  • Can a new framework or architecture give us an edge?
  • What new marketing insights can this model unlock?
  • Should we serve recommendations in real-time or batch mode?

From an ML Platform perspective, supporting these questions at scale demands massive compute power, which quickly compounds infrastructure complexity. Datasets become larger. The memory, CPU, and GPU resources needed to load the data and train the models increase. The cycle time of model development lengthens. Inferencing and serving those models requires even more compute. Dedicated tooling and processes are needed to support the release, reliability, observability, and efficiency of these workflows.

As ML performance and scale grow, infrastructure complexity increases, requiring thoughtful solutions to maintain progress.

Our teams hit this compute inflection point, where traditional vertical and naive horizontal scaling hit a ceiling. Even the largest nodes suffered from out-of-memory (OOM) errors, and our growing collection of supported and required tools became a bottleneck. We view this as the "S-curve" of complexity and scale shown below, where complexity rises sharply and blocks the next level of performance and scale. We knew we needed to invest in the next generation of tooling to unlock our target scale.

The diagram below shows the state of our Kubernetes-based compute platform when we hit this inflection point. As we ramped up, the ML platform hit scalability and performance ceilings. Without dedicated tooling and tuning, we could not perform asynchronous cross-pod data and model sharding. We saw OOM errors on even the largest vertically scaled nodes, despite distributing load horizontally as much as we naively could at the time. Coordinating deployments and troubleshooting across this pipeline became a massive challenge: errors were unpredictable, and instance availability was unreliable.

We needed a more efficient approach—not just additional compute, but better resource management.

Unlocking Training, Tuning, and Serving at Scale with Ray

Our Machine Learning and ML Platform teams collaborated to design our next-generation, high-scale, future-proof ML engine. Early designs yielded what seemed to be a patchwork of tools:

  • Apache Spark for distributed data processing
  • A home-grown high volume data loading and inferencing tool
  • Kubernetes support for parallel model training frameworks 
  • Workflow orchestrators like Airflow or Kubeflow
  • NVIDIA GPU driver maintenance and optimization
  • Multi-GPU management and networking
  • Heterogeneous cluster management, including AWS capacity reservations for HPC instances with terabytes of memory
  • Low-latency model loading and serving tools like KServe or Seldon

Each tool introduced new operational complexity, turning our ML pipeline into a tangle of integrations. Instead of layering more complexity, we chose to simplify and unify.

Ray: A Key Tool in Scaling Our ML Platform

Ray provided a unified, high-performance framework and interface to manage compute needs across the entire ML lifecycle. It mitigated infrastructure and tool sprawl, allowing MLEs to focus on building and deploying models, and our ML Platform to focus on enabling use cases instead of managing overhead.

Model Training

Moving our model training and tuning over to Ray was the biggest unlock for our ML platform. We needed to bring our compute up to meet the scale of our data, unlocking both vertical and horizontal, distributed scaling capabilities.

Migrating our model training to Ray significantly improved our ability to scale. We unlocked:

  • Seamless distributed training—scaling beyond vertical constraints
  • Simple onboarding—MLEs could wrap training code with Ray in minutes
  • Full hardware optimization—Ray’s observability tools helped maximize GPU utilization and minimize costs

With minimal code changes, our MLEs quickly migrated models and saw immediate value from eliminating vertical scaling constraints. By moving from vertical to horizontal distributed training, previous assumptions and constraints around total training time and model performance were removed. 

Ray’s out-of-the-box hardware observability also let our teams squeeze every ounce of performance (including GPU saturation) and cost optimization out of our models.

Model Serving

To power personalization, our platform needed to meet customers at their point of highest intent, making real-time predictions with zero compromise on speed or reliability. To accomplish this, our ML Platform team focused on model serving in parallel with the model training effort above, selecting Ray Serve as our serving engine.

We focused on three key challenges to address with our serving framework:

  1. Scale - We plug into a system that sends and receives 2.5B+ messages every month
  2. Reliability - Ensuring uptime, even on peak shopping weekends ($1.8B revenue supported in 2023)
  3. Experimentation - ML at its core is about experimentation, so we wanted a framework with rapid iteration built in

Ray Serve offered us a unified layer for model serving within a single ecosystem to solve for all of the above and more:

  1. Scale - provides interfaces to manage our endpoints like a Kubernetes cluster, with replicas and metric-based autoscaling
  2. Reliability - enables new models and endpoints to be introduced with zero-downtime deployments
  3. Experimentation - supports model multiplexing, a technique for selectively serving traffic across multiple related models for experimentation, similar to a Kubernetes service mesh

With the same reduction in infrastructure complexity that we saw with Ray Train, our ML Platform team built a flywheel that brings a new model online at scale in weeks instead of months, and makes it easy to continuously deploy new iterations of that model.

The Power of Consolidating & Simplifying - Driving Performance, Velocity, & Developer Experience

Unifying our ML platform around Ray improved both efficiency and speed of development. We paid down our complexity and increased our velocity by consolidating future investment into a single framework, an ecosystem that’s extensible to all aspects of ML development and deployment under a singular interface and coding framework.

With the introduction of any new tool and coding framework, one of the major risks is a drop in team velocity and poor developer experience until additional investment can be made into improvements and optimizations. However, this is where Ray actually de-risked our adoption journey. By eliminating tool sprawl and standardizing on a single framework, we transformed how our ML teams operate. We observed:

  • Faster development by leveraging shared examples and templates
  • Faster debugging of technical issues, because Ray surfaces the right metrics and data in an accessible, digestible form
  • Improved support from ML Platform with all abstractions and troubleshooting compounding on learnings from Ray and the Ray community
  • Easier onboarding and hiring with a focus on upskilling around Ray

Wrap-up

In about two months, our MLEs self-service migrated two critical use cases: our AI Pro product and our Product Recommendations model, which supports multiple products. This migration yielded:

  • 99% cost reduction from improved AWS hardware utilization 
  • 5x training time reduction with a 12x training data volume increase
  • 50x increase in number of customers supported by models

As we expand our use of Ray, we’re seeing new opportunities to optimize and scale our ML workflows. Our MLEs are advancing our AI capabilities in SMS and email marketing, and Ray is helping us scale these efforts efficiently.

Want to join us on our journey? We’re excited to keep innovating and delivering exceptional results for our clients, and you should come along for the ride. Check out the open roles on our team.
