At Attentive, our mission is to help our 8,000+ e-commerce and retail brands create magical conversations with their 500M+ subscribers. Consumers increasingly expect marketing messages to be engaging, relevant, and personalized to their tastes and preferences.
By leveraging machine learning techniques, Attentive drives higher consumer engagement and boosts conversions for brands through personalized product recommendations. We are integrating deeply with our brands' product catalogs, enabling our clients to deliver tailored product suggestions at scale.
In this post, we’ll dive into the evolution of our recommendation system—from a traditional collaborative filtering model to a cutting-edge deep learning system—showcasing how our team has enhanced infrastructure, optimized training, and balanced relevance with diversity to deliver real-time, high-performance recommendations.
Our existing recommendation system utilizes a Bayesian Personalized Ranking (BPR) model, which has served us well by providing a straightforward yet effective approach to generating personalized product recommendations for consumers.
The simplicity of the BPR model lies in its focus on pairwise comparisons; the model assigns a higher score to an observed interaction (positive feedback) than to a non-observed interaction (negative feedback) for a given user. By learning to rank item pairs based on observed interactions, such as clicks or purchases, it efficiently predicts which items a user might prefer. This approach has proven effective, enabling our clients to deliver relevant recommendations that enhance consumer engagement.
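To make the pairwise idea concrete, here is a minimal sketch of the BPR objective in PyTorch. This is illustrative only (the embeddings would come from learned user and item lookup tables), not our production implementation:

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb: torch.Tensor,
             pos_item_emb: torch.Tensor,
             neg_item_emb: torch.Tensor) -> torch.Tensor:
    """BPR pairwise loss: push the score of an observed (positive) item
    above the score of an unobserved (negative) item for the same user."""
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1)  # score for observed item
    neg_scores = (user_emb * neg_item_emb).sum(dim=-1)  # score for unobserved item
    # Maximizing sigmoid(pos - neg) == minimizing -log sigmoid(pos - neg);
    # logsigmoid is the numerically stable form.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Toy usage: a batch of 4 users with 16-dimensional embeddings.
u, p, n = torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16)
print(bpr_loss(u, p, n))
```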
However, the BPR model has limitations. One significant constraint is its focus solely on interactions, without considering additional user and item attributes. This limits the model's ability to provide more nuanced recommendations that account for factors such as user demographics or item features, and it especially hurts new users and items with little interaction history. Furthermore, the BPR model lacks the flexibility to learn multiple objectives jointly, such as optimizing for both clicks and purchases simultaneously; doing so requires training independent models or introducing hand-crafted weights into the loss function, which is suboptimal and demands additional tuning effort. Lastly, in our experience, the BPR model's recommendations tend to overlap heavily with bestsellers, favoring broadly popular items over genuinely personalized ones.
To better illustrate our current system, here's a simplified architecture diagram of the BPR model pipeline:
As we continued to innovate at Attentive, it became clear that our recommendation system needed to evolve. While our BPR-based model provided a solid foundation, we recognized the need for a more flexible and scalable solution that could harness the rich consumer and product data our clients have. By leveraging these attributes, we aimed to enhance the personalization of the recommendations our clients send to their consumers, ultimately driving better engagement and conversions.
To enable our clients to deliver better product recommendations, we selected the two-tower deep learning model architecture, which effectively utilizes user and item attributes. This model consists of two separate neural networks (or "towers")—one for users and one for items—that learn dense embeddings from their respective features. These embeddings are then used to predict interactions, enabling our clients to deliver more accurate and personalized recommendations.
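In code, the core of the architecture is compact. The sketch below is a toy PyTorch version with made-up layer sizes, not our production model:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One side of the two-tower model: maps raw features to a dense embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

class TwoTower(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, emb_dim: int = 64):
        super().__init__()
        self.user_tower = Tower(user_dim, emb_dim)
        self.item_tower = Tower(item_dim, emb_dim)

    def forward(self, user_features, item_features):
        u = self.user_tower(user_features)   # user embedding
        v = self.item_tower(item_features)   # item embedding
        # Interaction likelihood is modeled as the similarity (dot product)
        # of the two embeddings.
        return (u * v).sum(dim=-1)

model = TwoTower(user_dim=32, item_dim=48)
scores = model(torch.randn(8, 32), torch.randn(8, 48))  # batch of 8 pairs
```

Because each tower runs independently, item embeddings can be precomputed and indexed offline, which is what makes real-time retrieval feasible later in the pipeline.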
Strengths of the two-tower model include:
- Rich attribute usage: user features (demographics, interaction history, engagement patterns) and item features (category, price, description) feed directly into the model, which also mitigates the cold-start problem for new users and items.
- Joint multi-objective learning: clicks and conversions can be optimized together from shared representations instead of training independent models.
- Scalable serving: item embeddings can be precomputed and indexed, so the top items for any user can be retrieved in real time via approximate nearest neighbor search.
Transitioning to the two-tower model required significant infrastructure enhancements and meticulous attention to feature engineering, model development, and overcoming various challenges. Before diving deeper into specific aspects, here’s a detailed architecture diagram of our two-tower model-based recommender system pipeline:
Engineering the right features for users and items is crucial to improving recommendation quality in the two-tower model. We use a feature store to encode signals about users' tastes and preferences, such as demographic information, interaction history, and engagement patterns, alongside product attributes like category, price, and description.
Given the abundance of textual data on the product side, we frequently leverage transformers to extract unique characteristics of products. On the user side, we incorporate features like users’ interaction history and custom attributes, which help encode their preferences explicitly. This approach significantly enhances the personalization of our clients’ recommendations.
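As an illustration, product descriptions can be turned into dense features with an off-the-shelf sentence transformer. The model name below is a placeholder choice for the sketch, not necessarily what we run in production:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical encoder choice; any sentence-level transformer would do.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "Waterproof trail running shoe with a cushioned sole",
    "Slim-fit organic cotton t-shirt",
]
# One dense vector per product description, usable as item-tower input.
text_features = encoder.encode(descriptions)  # shape: (2, 384)
```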
In general, understanding data quality and coverage, conducting thorough data analyses, encoding features correctly, scaling the data processing pipeline, and setting up hypothesis-driven experiments are essential steps we go through to narrow down promising features for the recommendation task.
Modeling is a core part of our efforts, and here are some non-exhaustive highlights from our model development process:
Robust embedding representations: The two-tower model is about learning useful user and item representations. To ensure we learn robust entity embeddings, we applied common techniques such as regularization to prevent overfitting. Importantly, as we scaled up, we incorporated residual connections within the towers to improve optimization.
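A residual connection inside a tower can be as simple as the block below (an illustrative PyTorch sketch, with dropout standing in for the regularization mentioned above; dimensions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Tower layer with a skip connection: output = relu(x + f(x)).
    The identity path keeps gradients flowing as towers get deeper."""
    def __init__(self, dim: int, dropout: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim)
        self.linear2 = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)  # regularization against overfitting

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.dropout(F.relu(self.linear1(x)))
        return F.relu(x + self.linear2(h))

block = ResidualBlock(dim=64)
out = block(torch.randn(8, 64))  # shape preserved: (8, 64)
```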
Mixing in sparse features: Since two-tower models support mixing sparse and dense features, we explored many sparse features, such as consumers' recent product interactions. By leveraging categorical embeddings, we ensured that the model could efficiently handle sparse features without overwhelming the overall model complexity.
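For example, a consumer's recent product interactions form a variable-length list of IDs; a pooled embedding table turns that sparse history into one fixed-size dense vector for the user tower. The catalog size and dimensions below are placeholders:

```python
import torch
import torch.nn as nn

NUM_PRODUCTS = 100_000  # placeholder catalog size
EMB_DIM = 32

# EmbeddingBag averages the embeddings of a variable-length ID list,
# keeping sparse features compact without blowing up model complexity.
history_encoder = nn.EmbeddingBag(NUM_PRODUCTS, EMB_DIM, mode="mean")

# Two users: one interacted with products [3, 17, 42], the other with [7].
ids = torch.tensor([3, 17, 42, 7])
offsets = torch.tensor([0, 3])  # index where each user's history starts
history_vectors = history_encoder(ids, offsets)  # shape: (2, 32)
```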
Multi-objective optimization: Optimizing for both clicks and conversions required careful design decisions. Conversions are higher intent (and drive revenue directly) but much sparser, whereas clicks are more frequent but lower intent. We utilize auxiliary signals and optimize them jointly to provide additional supervision to the model, helping it learn the interactions that drive more revenue for our clients.
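A minimal form of this joint optimization is a shared-representation model with one loss term per objective; the scalar weight below is an illustrative simplification (in practice the balance between tasks can itself be tuned or learned):

```python
import torch
import torch.nn.functional as F

def joint_loss(click_logits, conv_logits, click_labels, conv_labels,
               conv_weight: float = 2.0) -> torch.Tensor:
    """Optimize clicks and conversions jointly from shared embeddings.
    Sparse, high-intent conversions act as auxiliary supervision on top
    of the denser click signal; conv_weight is an assumed hyperparameter."""
    click_loss = F.binary_cross_entropy_with_logits(click_logits, click_labels)
    conv_loss = F.binary_cross_entropy_with_logits(conv_logits, conv_labels)
    return click_loss + conv_weight * conv_loss

# Toy usage for a batch of 8 examples with random logits and labels.
click_logits, conv_logits = torch.randn(8), torch.randn(8)
click_labels = torch.randint(0, 2, (8,)).float()
conv_labels = torch.randint(0, 2, (8,)).float()
print(joint_loss(click_logits, conv_logits, click_labels, conv_labels))
```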
We also focus on building robust processes to make sure the two-tower model is well-equipped to handle the product recommendations task. Here are some non-exhaustive highlights:
Robust evaluation: We ensure that offline model metrics align well with our objective of enabling our clients to serve highly relevant recommendations. With each iteration, we apply cross-validation to judge the model's suitability for deployment. We also built qualitative evaluation pipelines to verify that product rankings are of high quality and to guard against model degradation.
Automated hyperparameter tuning: With deep neural networks, it is crucial to tune hyperparameters regularly. We built a custom pipeline to automatically select the best-performing hyperparameters while keeping the number of candidates we explore in a reasonable range.
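Our pipeline is custom, but the shape of the search resembles a standard Ray Tune loop. The search space, trial budget, and dummy objective below are placeholders, and the exact reporting call varies across Ray versions:

```python
from ray import tune

def train_two_tower(config):
    # Placeholder objective: in practice this trains the model with the
    # sampled hyperparameters and reports a real validation metric.
    val_pr_auc = 1.0 / (1.0 + abs(config["lr"] - 1e-3))
    tune.report({"val_pr_auc": val_pr_auc})  # API name differs by Ray version

tuner = tune.Tuner(
    train_two_tower,
    param_space={
        "lr": tune.loguniform(1e-4, 1e-2),
        "emb_dim": tune.choice([32, 64, 128]),
    },
    # Capping num_samples keeps the candidate set in a reasonable range.
    tune_config=tune.TuneConfig(metric="val_pr_auc", mode="max", num_samples=20),
)
best_config = tuner.fit().get_best_result().config
```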
Intelligent sampling: Given the high volume of events at Attentive, we sample training data intelligently for efficiency while keeping it representative of the true distribution, so that the model remains well-calibrated when serving inference on real-world data.
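One common way to reconcile sampling with calibration: if negatives are downsampled at a known rate, the model's predicted probabilities can be corrected back to the true distribution with a standard prior-correction formula. We show this as an illustration of the idea, not as our exact procedure:

```python
def calibrate(p: float, neg_keep_rate: float) -> float:
    """Map a probability learned on negative-downsampled data back to the
    true distribution. With negatives kept at rate w, the corrected
    probability is q = p / (p + (1 - p) / w)."""
    return p / (p + (1.0 - p) / neg_keep_rate)

# Example: trained on 10% of negatives, the model predicts 0.5;
# the calibrated probability on the real-world distribution is ~0.09.
print(calibrate(0.5, neg_keep_rate=0.1))
```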
Following are some of the highlights that allow us to handle Attentive’s scale of 8,000+ clients and their 500M+ subscribers:
Infrastructure updates for the two-tower model: To support the productionization of the two-tower model, we invested in several key infrastructure updates. At Attentive, we use Ray to support distributed training and MLflow to manage the entire ML model lifecycle, streamlining the offline development process. Implementing data pipelines with the new feature store and launching distributed training jobs with Ray on GPU clusters were key steps in setting up the offline pipeline. Our feature store offered a standardized interface to ingest rich user and item attributes, while Ray's distributed training capabilities on GPUs helped accelerate model training on large datasets. This significantly sped up our development cycle, enabling faster iteration and improvement of our models.
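Schematically, launching such a distributed job looks like the Ray Train sketch below (the worker loop is a stub; the real version builds the model, reads its data shard, and logs runs to MLflow):

```python
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each worker would wrap the two-tower model in DDP and consume its
    # shard of the training data; that setup is elided in this sketch.
    train.report({"loss": 0.0})  # report a dummy metric so the stub runs

trainer = TorchTrainer(
    train_loop_per_worker,
    # Placeholder cluster shape: 8 GPU workers.
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```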
Model sharding and parallelism: To handle larger models and datasets, we explored model sharding and parallelism techniques, distributing different parts of the model across multiple GPUs. By partitioning the model, we could parallelize computations and leverage Ray’s ability to schedule tasks dynamically, further improving training times. This approach was complemented by using heterogeneous clusters, where a mix of high-memory GPUs and general-purpose GPUs ensured that resources were efficiently utilized.
Memory optimization with mixed precision training: To further optimize memory usage, we implemented mixed precision training, which reduces GPU memory consumption by using lower floating-point precision for certain operations without sacrificing model accuracy. This allowed us to train larger models and process more data in memory.
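In PyTorch terms this is standard automatic mixed precision; the model, batch, and loss below are stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 1).cuda()            # stand-in for the two-tower model
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.BCEWithLogitsLoss()
scaler = torch.cuda.amp.GradScaler()

features = torch.randn(256, 128, device="cuda")            # placeholder batch
labels = torch.randint(0, 2, (256, 1), device="cuda").float()

optimizer.zero_grad()
# autocast runs eligible ops in float16, roughly halving activation memory.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = loss_fn(model(features), labels)
# GradScaler rescales the loss so float16 gradients do not underflow.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```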
Real-time personalized recommendations with an ANN vector store: To enable our clients to deliver real-time personalized recommendations to their 500M+ subscribers, we also invested in real-time APIs that interact with an Approximate Nearest Neighbor (ANN) vector store for on-demand ranking. The two-tower model integrates seamlessly with this architecture: we store the final-layer representations of the user and item towers in a vector store, and efficient ANN operations let us query the top items for any user. The APIs are standardized so the system can serve recommendations across various product use cases. This setup lets our clients deliver recommendations instantly while maintaining high performance for end users.
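The mechanics are the same regardless of the specific vector store; here is a sketch with FAISS as an illustrative stand-in (random embeddings and assumed index parameters):

```python
import numpy as np
import faiss

EMB_DIM = 64
# Stand-ins for the final-layer outputs of the item and user towers.
item_embeddings = np.random.rand(100_000, EMB_DIM).astype("float32")
user_embedding = np.random.rand(1, EMB_DIM).astype("float32")

# An IVF index trades a little recall for much faster search; exact
# search (IndexFlatIP alone) also works at smaller scales.
quantizer = faiss.IndexFlatIP(EMB_DIM)
index = faiss.IndexIVFFlat(quantizer, EMB_DIM, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(item_embeddings)
index.add(item_embeddings)
index.nprobe = 16  # how many clusters to scan per query

scores, item_ids = index.search(user_embedding, 10)  # top-10 items for this user
```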
The implementation of our two-tower recommendation system has led to significant improvements in several key performance metrics. Since measuring impact on online business metrics requires longer cycles, we run extensive offline experiments to gauge the model's effectiveness, monitoring lift in metrics such as PR-AUC and Precision@K. These metrics help us ensure that the model is learning effectively and will perform well on the product recommendation task once deployed.
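Both metrics are standard; for reference, here is how they can be computed with scikit-learn and NumPy on toy data (not our evaluation pipeline):

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([1, 0, 0, 1, 0, 1])                  # toy interaction labels
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.2, 0.6])    # toy model scores

# PR-AUC (average precision) summarizes the precision-recall curve,
# which suits the heavy class imbalance of click/conversion data.
pr_auc = average_precision_score(y_true, y_score)

def precision_at_k(y_true, y_score, k: int) -> float:
    """Fraction of the top-k scored items that are true positives."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(y_true[top_k]))

print(pr_auc, precision_at_k(y_true, y_score, k=3))
```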
To measure the impact on online business metrics, we ran A/B experiments comparing the existing and new recommendation systems on metrics such as clicks, conversions, and revenue. These metrics provide a comprehensive view of how well our recommendations drive engagement and revenue.
The benefits of our new recommendation system extend beyond immediate metric improvements. For e-commerce and retail brands, enhanced personalization across various products means end users receive more relevant and timely recommendations from our clients, which not only drives higher engagement but also fosters stronger brand loyalty. Moreover, because the two-tower model can represent users through their attributes even with little interaction history, the new system serves recommendations to more than double the number of users across our brands.
One example of the positive impact is a testimonial from a major retail brand that reported a double-digit percentage increase in revenue per send and conversions after switching to our new recommendation system. They highlighted how the improved recommendations led to more meaningful interactions with their consumers, translating into higher sales and consumer retention.
Overall, the transition to our two-tower recommendation system has been a major success, delivering measurable improvements in performance metrics and providing tangible benefits for our brands. We are excited about the continued potential of this system to drive innovation and growth in SMS and email marketing.
For our clients, Attentive’s advanced personalized recommendation system offers significant value. By leveraging rich user and item attributes, our system enables you to send more relevant and timely recommendations, driving higher engagement and revenue.
Looking ahead, we’re excited about future enhancements and deeper integration of recommendations into more products.
Want to join us on our journey? We’re excited to keep innovating and delivering exceptional results for our clients, and you should come along for the ride. Check out the open roles on our team.