
ML Recommendations Experimentation | Fox Nation
Worked with ML engineering to test and optimize Up Next recommendations, increasing autoplay minutes watched by 10%.
Overview
Fox Nation already had an Up Next recommendations system in place when I got involved, powered by an AWS Item Similarity model. My role was to serve as the product owner for ML recommendations: gathering requirements from Fox Nation stakeholders, relaying them to the ML team, and working closely with a Data Scientist who was building out a new internal embeddings model.
The goal was to decrease drop-off, increase completion rates, and improve average watch time. Recommendations were a known pain point for senior executives at Fox, so I felt strongly about pushing this effort forward.
The Models
The Item Similarity model relied on collaborative filtering. It looked at a user's viewing history and showed them content that other users with similar taste profiles had watched. Essentially, it answered the question: "What would someone like you want to watch?"
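Fox's production model was an AWS service, but the core idea of item-based collaborative filtering can be sketched in a few lines: build an item-to-item cosine-similarity matrix from a user-item watch matrix, then rank candidates by their similarity to what a user has already watched. The matrix below is a made-up toy, not real viewing data:

```python
import numpy as np

def item_similarity(interactions: np.ndarray) -> np.ndarray:
    """Cosine similarity between items from a user-item watch matrix.

    interactions[u, i] = 1 if user u watched item i, else 0.
    """
    norms = np.linalg.norm(interactions, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for unwatched items
    normalized = interactions / norms
    return normalized.T @ normalized  # items x items similarity matrix

# Toy data: 4 users x 3 items (users 0 and 1 co-watched items 0 and 1)
watches = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
], dtype=float)

sim = item_similarity(watches)
scores = sim[0].copy()
scores[0] = -np.inf  # exclude the watched item itself
print(np.argsort(scores)[::-1])  # other items ranked by similarity to item 0
```

Because items 0 and 1 were frequently co-watched, item 1 ranks first for someone who watched item 0, i.e. "what people with similar taste also watched."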
The embeddings model took a different approach. It converted metadata, including transcriptions of show audio, into vectors and ranked content based on similarity to whatever the user was currently watching. It answered a different question: "What's most similar to the thing you're watching right now?"
The key difference is that Item Similarity recommends based on your taste profile, while the embeddings model recommends based on content similarity to what's in front of you.
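The embeddings side is, at its core, nearest-neighbor search over content vectors. A minimal sketch of that ranking step, with made-up titles and hand-written three-dimensional vectors standing in for real metadata/transcript embeddings:

```python
import numpy as np

# Hypothetical pre-computed embeddings: title -> vector. In practice these
# would come from an embedding model run over show metadata and transcripts.
catalog = {
    "Show A": np.array([0.9, 0.1, 0.0]),
    "Show B": np.array([0.8, 0.2, 0.1]),
    "Show C": np.array([0.0, 0.1, 0.9]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(current: str, k: int = 2) -> list:
    """Rank the rest of the catalog by similarity to what's playing now."""
    query = catalog[current]
    candidates = [t for t in catalog if t != current]
    return sorted(candidates, key=lambda t: cosine(catalog[t], query),
                  reverse=True)[:k]

print(most_similar("Show A"))  # Show B first: its vector is closest to Show A's
```

Note the query here is the currently playing title, not the user's history, which is exactly the taste-profile vs. content-similarity distinction above.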
Training the Model
I worked closely with our lead Data Scientist to test and refine the embeddings model. He built an app I could run locally, with a UI that let me test content against the model and see what it recommended. I'd run tests, take notes on my findings, and provide feedback so he could tweak the weights. We did this iteratively until we were satisfied with the model's output.
This was my first time being involved in training an AI/ML model, and it was a great learning experience.
Experimentation
We ran A/B tests to measure the embeddings model against the Item Similarity baseline. The initial test didn't yield statistically significant results, so we decided to push further and try blended approaches.
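For context on what "statistically significant" means here: a two-proportion z-test is one standard way to check whether a binary metric, such as autoplay continuation rate, truly moved between test arms or could be noise. The numbers below are hypothetical, not Fox Nation data:

```python
from math import erf, sqrt

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test: did variant B move a binary metric vs. control A?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # p-value from the standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: users who continued into autoplay, per arm
z, p = two_proportion_z(conv_a=4_200, n_a=10_000, conv_b=4_350, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With results like the initial test's, p would land above the usual 0.05 threshold, which is what pushed us toward iterating rather than declaring a winner.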
We tested hybrid versions that combined both models, pitting them against each model on its own. We ran multiple iterations, tweaking the weight given to each model in the blend until we saw the best results. All tests ran within Up Next recommendations (the autoplay after a show or movie ends) to keep things consistent.
Eventually, we landed on a hybrid approach that outperformed either model alone. The Item Similarity model was better at surfacing content the user might enjoy based on their broader taste profile. The embeddings model was better at finding content most similar to what they were currently watching. Combining them, with the right weights, solved for both use cases.
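A weighted blend like the one described above can be sketched as a convex combination of the two models' scores per candidate, with one tunable weight. Everything below (titles, scores, weights) is illustrative:

```python
def blend(scores_cf: dict, scores_emb: dict, w_cf: float = 0.5) -> list:
    """Rank candidates by a weighted mix of two models' scores.

    Scores are assumed to already be on a comparable [0, 1] scale.
    """
    w_emb = 1.0 - w_cf
    titles = set(scores_cf) | set(scores_emb)
    blended = {
        t: w_cf * scores_cf.get(t, 0.0) + w_emb * scores_emb.get(t, 0.0)
        for t in titles
    }
    return sorted(blended, key=blended.get, reverse=True)

# Hypothetical scores from each model for three candidates
cf = {"Show A": 0.9, "Show B": 0.4, "Show C": 0.2}   # taste-profile fit
emb = {"Show A": 0.1, "Show B": 0.8, "Show C": 0.7}  # similarity to now playing

print(blend(cf, emb, w_cf=0.6))  # leans collaborative: Show A edges out Show B
```

Sweeping `w_cf` is the knob we were effectively turning across test iterations: at one extreme you get pure taste-profile recommendations, at the other pure content similarity.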
Collaboration
I had a weekly standup with the VP of Product and the ML team, where we reviewed the model training work and, later, the A/B test results. These meetings were also where we ideated on new test ideas, including the hybrid approach.
Any time we got data back from tests, I'd bring it to my manager and walk her through our progress. This kept leadership in the loop and helped build momentum for the project internally.
Results
The hybrid approach increased Up Next autoplay minutes watched by 10%. This validated my hypothesis that the embeddings model could meaningfully improve recommendations when combined with what we already had.
Before I left Fox, the ML team was excited about this work and bullish on adding more personalization across the app and other business units. I was also helping test and integrate the embeddings model into content collections on the Home page, so content editors could set up personalized carousels within the Fox Nation UI.
Additional Thoughts
I'm proud that I recognized the potential of the embeddings model and championed the effort to test it. When the initial results weren't statistically significant, I kept pushing. Testing hybrid approaches eventually got us a meaningful improvement and ultimately made the recommendations experience better for users.
The lesson I took away from this: keep pushing the boundaries. If I had stopped after the first inconclusive test, none of this would have come to fruition.