In this post, we have the pleasure of hosting José, MBA, Strategic Product Manager at Scale AI and ex-Meta, who shares his insights and expertise on data labeling strategies for AI products. I met José in Boston while he was studying at the Harvard Kennedy School, and he has always been energetic, with a creative way of explaining business ideas in simple terms.
TLDR
As AI weaves its way into an array of products and industries, Product Managers should become familiar with the key dependency of data labeling (aka data annotation). The image above serves as an analogy to help frame the conversation about how raw data undergoes a transformational process (annotation) before it can be used to train a Machine Learning (ML) model or a Large Language Model (LLM).
Introduction
Product Managers (PMs) are adept at creating product roadmaps and understand well how project dependencies interlock to meet a project’s ultimate deadline. Some additional milestones PMs should become acquainted with are those associated with data labeling for training ML models. Data labeling is often the longest process in the ML lifecycle, and in most cases training a model can’t begin without labeled data. It is the process we will deep dive into in this post. The key concepts we will analyze are:
The Focused Factory Concept
Data Annotation Providers as Focused Factories
The Idea in Practice - Notional Product Manager Use Case
Brief Overview of the End-to-End Annotation Process
The Triple Constraints
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41d0438b-dc0d-466f-b110-e01c8f265759_732x610.png)
1. The Focused Factory Concept
One of the most memorable learnings from my MBA training at IE Business School was the principle of the focused factory in Operations Management. This principle is elegantly simple: a focused factory hones its expertise and resources on a finite spectrum of tasks or products to achieve higher efficiency, improved quality, and faster response to market demands.
The story of Shouldice Hospital is a classic case study often cited in MBA classrooms. Based in Canada, this hospital is renowned for its singular focus on hernia surgery operations. This specialization has culminated in a finely tuned process, where surgical expertise is stratified: novice surgeons hone their skills on simple hernia surgery cases, while experienced surgeons only tackle the most intricate and complex ones. This stratification ensures peak efficiency and expert care tailored to each patient’s needs.
The same specialization of labor principle can be applied to data labeling projects and extended to individual data pipelines. For example, Google likely employs highly skilled medical personnel to fine-tune the Med-PaLM 2 LLM, which is reportedly being tested at the Mayo Clinic. However, leveraging a highly skilled medical workforce to conduct general-purpose LLM fine-tuning would not be the best use of resources. We’ll discuss more on data pipeline specialization in a separate post.
2. Data Annotation Providers as Focused Factories
You can think of data annotation providers (e.g. Scale AI, LabelBox, Surge AI) as focused factories, each having its own AI domain specialty. These “focused factories” are not just assembly lines but centers of excellence, where deep domain-specific knowledge in areas like Lidar or LLM use cases is nurtured and applied.
When considering a data annotation partner, it is crucial to assess their specialized expertise and the breadth of their services. A simplified way of segmenting annotation providers is to differentiate between pure-play data annotation providers and Business Process Outsourcing (BPO) firms:
Pure-Play Data Annotation Providers: These are the specialists. For instance, Deepen AI is a data annotation provider focused on autonomous systems. Similarly, Surge AI is a data annotation provider focused on Reinforcement Learning from Human Feedback (RLHF).
Business Process Outsourcing (BPO): These are generalists with a broad “palette”. TaskUs and Teleperformance offer an extensive suite of services, and both also showcase Lidar annotation among their offerings.
Selecting a data annotation provider is akin to choosing a trusted data partner for a journey through the complex terrain of AI product development - one that aligns with your project’s specific needs and goals.
Data Annotation Providers ≠ Business Process Outsourcing Providers
Some technologists think of data annotation providers as akin to BPO providers, but they are not interchangeable entities within the technology ecosystem. The distinction is akin to comparing a team of skilled artisans with a versatile labor force. In short, a BPO can be thought of as a provider that conducts a set of specific tasks with a large managed workforce. For example, during my time at Meta, I managed a large number of BPOs that assisted with classifying data to improve the recommendation system for Facebook Groups.
In contrast, during my time at Scale AI (a pure-play data annotation provider) I have served as a thought partner to a large Original Equipment Manufacturer (OEM) in the autonomous vehicle industry. I partnered with the OEM on a number of technical topics ranging from building custom camera models to designing the best data pipeline structure for their ML use case.
Another example is Deepen AI, which advertises a LiDAR-to-camera calibration product - this is a sophisticated task that typically lies beyond the expertise of a conventional BPO. Unlike a data annotation provider, a BPO would likely not possess the domain-specific technical know-how to conduct sensor calibration for a customer. Essentially, a BPO is a generalist that is not equipped to do specialized labeling.
3. The Idea in Practice - Notional Product Manager Use Case
Let's ground some of the aforementioned topics in a concrete case study. Imagine that you are a PM working on a Computer Vision product where you would like to train a model to distinguish images of hot dogs from images that are not hot dogs. This task parallels a humorous scenario featured in HBO's 'Silicon Valley,' which provides a comedic glimpse at a similar technological endeavor. The relevant video can be viewed here.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a04b7b4-35b2-4ac9-a26c-e2fb66376442_800x533.png)
The project is assigned to you on January 1st, 2024, and you must deliver a Minimum Viable Product (MVP) to your organization’s leadership by April 1st, 2024. The Machine Learning Research Engineer on your team has stated that they require 1,000,000 annotated images at 95% precision and recall quality specifications. What are the next steps you take as a PM to enable your team to deliver the MVP?
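Since the success bar is expressed in precision and recall, it helps to be concrete about what those metrics measure. Below is a minimal sketch of how they could be computed against a gold-labeled audit sample; the counts are purely hypothetical, not from the case study.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision: of all images labeled 'hot dog', the share that truly are.
    Recall: of all true hot dog images, the share that were labeled as such."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts from auditing a gold-labeled sample
tp, fp, fn = 9_500, 400, 300
p, r = precision_recall(tp, fp, fn)
print(f"precision={p:.3f}, recall={r:.3f}")  # both need to clear the 0.95 bar
```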
The diagram below illustrates the end-to-end annotation process that we walk through in the remainder of this post.
The Triple Constraints
Scope, schedule, and cost are referred to as the triple constraints in project management. In our computer vision data labeling use case, we should dig deeper to assess whether we, as the PM, can deliver the MVP under these constraints.
Scope
A computer vision product to classify images as hot dogs or not hot dogs
1M annotated images
Success metrics: 95% precision & recall quality specifications
Schedule
MVP no later than April 1st 2024
Timeline for annotated images - 8 weeks
Timeline required to train a CV model - 4 weeks
Cost
To be aligned with decision makers following receipt of vendor quotes
Double Click - Timeline for annotated images
A key dependency in the PM’s product development journey is the time to annotate all 1M images. Thus, the logical first step is to calculate the average handling time (AHT) to annotate a single image. From my experience, an initial read on AHT can be obtained by holding a “labeling party”: a colloquial term for a time-boxed meeting where a diverse set of stakeholders comes together to annotate data in accordance with the annotation instructions. More on annotation instructions to follow in a separate post. In our notional scenario, the PM and a handful of colleagues labeled a sample of images and found the AHT to be 10 minutes per image, as sketched below.
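As a rough illustration (the timings below are made up for the sketch), the labeling-party estimate is simply the mean of the per-image annotation times recorded during the session.

```python
from statistics import mean

# Hypothetical per-image annotation times (in minutes) recorded during the labeling party
labeling_party_times = [8, 12, 9, 11, 10, 13, 7, 10, 11, 9]

aht_minutes = mean(labeling_party_times)
print(f"Estimated AHT: {aht_minutes:.1f} minutes per image")  # ~10 minutes in our scenario
```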
1M images × 10 minutes AHT per image = 10,000,000 minutes of work
That is roughly 166,667 hours of work; at 32 productive hours per week (about 1,600 hours per year), it would take a single person roughly 104 years to label the 1M images. Thus, the next question is: how many people would we need working in parallel to meet our timeline?
After a discussion with the engineering team, we decided that the team requires at least a month to train and deploy the model for the MVP. Thus, we set a target deadline of February 28th, 2024, to have the 1M images annotated. For planning purposes, we will use eight weeks of image annotation.
166,667 total hours ÷ 8 weeks ≈ 20,833 annotation hours per week
20,833 hours per week ÷ 32 hours per person ≈ 651 annotation personnel required
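The same back-of-the-envelope math as a small script, under the case study’s assumptions (10-minute AHT, an 8-week annotation window, 32 productive hours per annotator per week); the variable names are illustrative only.

```python
# Back-of-the-envelope annotation capacity planning (assumptions from the case study)
TOTAL_IMAGES = 1_000_000
AHT_MINUTES = 10                 # average handling time per image, from the labeling party
WEEKS_AVAILABLE = 8              # annotation window before model training starts
PRODUCTIVE_HOURS_PER_WEEK = 32   # 80% utilization of a 40-hour week

total_hours = TOTAL_IMAGES * AHT_MINUTES / 60
hours_per_week = total_hours / WEEKS_AVAILABLE
annotators_needed = hours_per_week / PRODUCTIVE_HOURS_PER_WEEK
images_per_week = TOTAL_IMAGES / WEEKS_AVAILABLE

print(f"Total annotation effort: {total_hours:,.0f} hours")       # ~166,667
print(f"Weekly effort: {hours_per_week:,.0f} hours")               # ~20,833
print(f"Annotators required: {annotators_needed:,.0f}")            # ~651
print(f"Expected weekly delivery: {images_per_week:,.0f} images")  # 125,000
```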
Let's pause here to recap our findings. In order to deliver 1M annotated images in eight weeks, we would need 651 full-time annotation personnel working 32 hours per week; 32 hours per week equates to 80% utilization at 40 hours per week. As a PM, what do you think about this timeline and the potential risks?
The aggressive eight-week deadline does not pass the sniff test. The plan assumes that (1) all annotations will meet the quality specs on the first attempt and (2) no annotator onboarding or training time is required. For the purposes of this exercise, let's assume that we have a magic wand and that everything works out perfectly. Thus, our data annotation provider will deliver 125k annotated images per week, based on the assumption that we receive the 1M images over eight weeks.
Ensuring Quality Deliveries
The quality demands of machine learning are steep, and bad data can rear its ugly head twice: both in the historical data used to train the predictive model and in the new data used by that model to make future decisions.
If Your Data Is Bad, Your Machine Learning Tools Are Useless
by Thomas C. Redman, Harvard Business Review
The age-old method of random sampling and auditing can be applied to the 125K weekly deliveries. Deciding what percentage of the weekly delivery to audit is the first step; for this exercise, let's set our sampling threshold at 10 percent, or 12,500 images per week. The next step is identifying who will conduct the audits. Assuming that auditing takes 2 minutes per image, this equates to roughly 417 hours of audit time per week, or about 13 full-time personnel (at the same 32 productive hours per week) to audit the data.
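Here is the same audit-capacity estimate as a short sketch, reusing the utilization assumption from the annotation math above; the figures simply mirror the case study.

```python
# Weekly audit capacity estimate for a 10% random sample of deliveries
WEEKLY_DELIVERY = 125_000        # annotated images delivered per week
SAMPLE_RATE = 0.10               # fraction of each weekly delivery that gets audited
AUDIT_MINUTES_PER_IMAGE = 2
PRODUCTIVE_HOURS_PER_WEEK = 32   # same 80% utilization assumption as before

audited_images = WEEKLY_DELIVERY * SAMPLE_RATE
audit_hours = audited_images * AUDIT_MINUTES_PER_IMAGE / 60
auditors_needed = audit_hours / PRODUCTIVE_HOURS_PER_WEEK

print(f"Images audited per week: {audited_images:,.0f}")  # 12,500
print(f"Audit hours per week: {audit_hours:,.0f}")         # ~417
print(f"Auditors required: {auditors_needed:.0f}")         # ~13
```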
Now What!?
At this point in our journey, we finally have usable data to train our model!
In conclusion, data labeling is a critical dependency that Product Managers should become familiar with as AI proliferates across industries. The process of transforming raw data into labeled data (that is ready for model training) is operationally intensive. PMs should partner with data annotation experts early in the product development process to properly scope data annotation projects. Factoring in quality control measures such as sampling and auditing is imperative to ensure the labeled data meets the quality specifications required by machine learning engineers. With the proper planning and execution, data labeling can enable teams to train AI models and deliver innovative products successfully.
Before closing, it is worth calling out a few additional assumptions in this plan that carry risk:
1. Each annotator delivers the minimum quality level specified in the annotation guidelines.
2. Annotation fatigue is captured by the 20% under-utilization buffer, but in practice it can run higher.
3. If the process is fully manual, the eight-week window must also absorb rework of rejected annotations and the final restructuring of the data into the shape the ML team requires.
Stay tuned as we deep dive into more specialized data labeling use cases for images, Lidar, and LLMs in an upcoming post!