Building an ML Platform Part 1

This is the first of several posts outlining how to create an ML platform on AWS. In this post I'll discuss the goals of the platform and the tools I used to achieve them. In subsequent posts I'll dive deeper into each goal and how it was achieved, and in a final post I'll look at the platform with an eye toward future use cases and scalability.

Mission of the ML Platform

The mission of the ML platform was simple: give data scientists control of their models as close to the point of deployment as possible. In practice this means that any given DS can own and manage as much of the ML pipeline as possible, reducing the hand-offs between data scientists and data engineers.

Fulfilling the Mission: Infrastructure, Infrastructure, Infrastructure

To fulfill the mission I focused on two goals (sketched in code below):

  1. Standardize infrastructure to train, retrain, or update ML models
  2. Standardize infrastructure to deploy models
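
Later posts cover the actual implementation, but to make these two goals concrete, here's a minimal sketch of what standardized train-and-deploy infrastructure can look like on AWS. This is an illustration rather than the platform's real code: it assumes the SageMaker Python SDK, and the image URI, role ARN, bucket, and endpoint names are all placeholders.

```python
# A rough sketch of the standardized pattern using the SageMaker Python SDK.
# All names below (image, role, buckets, endpoint) are hypothetical.
from sagemaker.estimator import Estimator

# Goal 1: standardized training. Every model runs through the same code path;
# only the container image and hyperparameters differ per model.
estimator = Estimator(
    image_uri="<account>.dkr.ecr.us-east-1.amazonaws.com/recommender:latest",
    role="arn:aws:iam::<account>:role/ml-platform-training-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://ml-platform-artifacts/models",
    hyperparameters={"epochs": "10"},
)
estimator.fit({"train": "s3://ml-platform-data/recommender/train"})

# Goal 2: standardized deployment. The trained artifact goes straight to a
# managed endpoint, with no bespoke serving infrastructure per model.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="recommender-v1",
)
```

The value of the standardization is that every model flows through the same two steps; a DS only swaps in a different container image and hyperparameters.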

I put a heavy focus on infrastructure because I believed it was the single biggest obstacle and the best way to dramatically reduce time to deployment. Before this project the team had deployed two recommender models, and in each case the process was lengthy and pulled in people from across the engineering organization as they debated how to structure the project. In both instances it took about a month to deploy a completed model, largely because of the number of people involved. I therefore believed that standardizing the infrastructure would drastically improve time to deployment, and that automating it and handing control to the DS team would shrink that time even further.

My focus on infrastructure paid off: in the end we reduced the time to deployment from about a month to roughly 30 minutes.