Core AI for PMs: Learning from Data
AI Fundamentals explained from the lens of Product Management
In this post, we’ll dive deeper into key concepts product leaders should consider when working on AI products:
Core mechanisms of AI model training
Data requirements & synthetic data
Advanced concepts
Core mechanisms of AI model training
Step 1: Identify AI problem type
Before exploring AI model training mechanisms, we need to understand whether we are dealing with a discriminative (predictive) or a generative problem type.
The key difference between the two is the task that the AI model performs:
discriminative models predict a response based on the data they were trained on, whereas
generative models create new content by synthesizing one or more data modalities (check out our previous post for more information); the short sketch below makes this distinction concrete
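To illustrate the difference, here is a minimal, hypothetical sketch (the features, labels, and function names are made up for illustration, not taken from any specific product): a discriminative model maps an input to a prediction, while a generative model produces new content.

```python
from sklearn.linear_model import LogisticRegression

# Discriminative: map an input to a label (here, a toy churn prediction).
X_train = [[5, 0.2], [1, 0.9], [4, 0.3], [0, 0.95]]  # hypothetical features: [logins/week, ticket ratio]
y_train = [0, 1, 0, 1]                               # labels: 0 = retained, 1 = churned
clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[2, 0.7]]))                       # predicts a label for a new user

# Generative: produce new content rather than a label. In practice this would be an
# LLM or diffusion model call, e.g. (pseudocode):
# draft = generate_text("Write a one-line release note for the churn dashboard")
```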
Step 2: Determine data requirements
Data requirements depend on the AI model training type. There are three general types of AI model training for both generative and predictive AI models (a short code sketch after the figures below shows how their data needs differ):
Supervised learning: training is achieved using human-labeled data
Product use cases: predicting stock market movements, forecasting sales for retail stores or investments
Unsupervised learning: training is achieved without human-labeled data; the AI model identifies patterns in unlabeled data without guidance
Product use cases: identifying disease epicenters, customer segmentation
Reinforcement Learning (RL): training is achieved by interacting with an environment and making decisions through trial and error. RL also includes training models with active human feedback.
Product use cases: personalized recommendations in e-commerce platforms, energy consumption optimization
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0b0bfb8-59a7-4ccd-9813-e317a5c12489_1556x900.png)
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7c4f8f7-cfb8-4bdb-ae88-7d56f9126af1_1692x778.png)
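As a rough illustration of how the three training types differ in the data they consume, here is a minimal sketch using toy, randomly generated datasets (all names and numbers are hypothetical). The reinforcement learning case is reduced to a comment because it requires an interactive environment rather than a fixed dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised: features AND human-provided labels (e.g., historical sales outcomes).
X = rng.normal(size=(100, 3))                        # e.g., price, promo spend, seasonality index
y = X @ np.array([2.0, 1.5, -0.5]) + rng.normal(scale=0.1, size=100)
sales_model = LinearRegression().fit(X, y)

# Unsupervised: features only, no labels; the model finds structure by itself.
customers = rng.normal(size=(100, 2))                # e.g., spend, visit frequency
segments = KMeans(n_clusters=3, n_init=10).fit_predict(customers)

# Reinforcement learning: no fixed dataset; an agent interacts with an environment,
# receives rewards, and improves through trial and error, so the "data" is
# generated by the interaction itself.
```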
Step 3: Model training loss functions
To train an AI model, we need to give it a concrete goal that captures its expected behavior.
How does the model measure “success”?
The loss function determines how far from success our AI model is.
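For intuition, here is a minimal sketch of two common loss functions (not tied to any specific product example in this post): mean squared error for a predictive task and binary cross-entropy for a classification task. Lower values mean the model is closer to its goal.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared gap between predictions and actual outcomes (regression)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Penalizes confident wrong predictions heavily (binary classification)."""
    y_true = np.asarray(y_true)
    p_pred = np.clip(np.asarray(p_pred), eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Toy example: predicted vs. actual watch times, and predicted click probabilities vs. clicks.
print(mean_squared_error([30, 12, 45], [28, 15, 40]))    # regression loss
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))  # classification loss
```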
Examples of success metrics include:
Instagram Stories:
stories watched for more than 30 seconds
watch time
story reactions
Google Ads:
click through rate (CTR)
ad spend / user conversion
Before launch, many other factors influence the final model selection, such as our data strategy and user experience requirements.
After completing these steps, we need to determine our AI model evaluation strategy, a process we will explore in an upcoming post.
Data requirements & synthetic data
OpenAI hasn’t publicly released the exact cost of training GPT-3, but estimates indicate that it was trained on around 45 terabytes of text data (roughly one million feet of bookshelf space, or a quarter of the entire Library of Congress) at an estimated cost of several million dollars. These are not resources that most organizations or startups can access.
Are we running out of human-generated data?
Assuming current data consumption and production rates hold, research from Epoch AI predicts that we will exhaust the stock of low-quality language data between 2030 and 2050, high-quality language data before 2026, and vision data between 2030 and 2060. Notable innovations that could challenge these projections include speech recognition systems such as OpenAI’s Whisper, which could make audio data usable for LLM training, as well as new OCR models such as Meta’s Nougat. It is rumored that plenty of transcribed audio data has already been made available for GPT-4.
Can synthetic data solve this problem?
One idea in the research community for expanding AI model training data is to use AI-generated (synthetic) content for training. However, there is no conclusive evidence yet that synthetic data can alleviate the data scarcity problem. As product leaders, we need to discuss these open questions with our research or data science teams.
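If your team does experiment with synthetic data, one very simple flavor is fitting the statistics of a real tabular dataset and sampling new rows from them. The sketch below is purely illustrative (the columns and numbers are made up); real-world approaches such as LLM-generated text, simulation, or GAN-based generation are considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend this is a small real dataset: [session_length_min, purchases_per_month]
real = np.array([[12.0, 1], [30.0, 3], [8.0, 0], [25.0, 2], [18.0, 1]])

# Fit the column means and covariance, then sample synthetic rows with similar statistics.
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=100)

print("real mean:", mean, "synthetic mean:", synthetic.mean(axis=0))
```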
How does data influence training?
There are many requirements for feeding AI models with good-quality data (e.g., avoiding noisy, sparse, or skewed datasets) and several methods of data collection (user-generated, synthetic, data augmentation), which go beyond the scope of this post. For example, when you rate a ChatGPT response (like or dislike), that user behavior data can later be fed back into model training through Reinforcement Learning from Human Feedback (RLHF).
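To make that feedback loop concrete, here is a hypothetical sketch of how thumbs-up/down events could be logged as preference records that later feed an RLHF-style fine-tuning pipeline. The schema and function names are illustrative assumptions, not ChatGPT’s actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FeedbackEvent:
    prompt: str    # what the user asked
    response: str  # what the model answered
    liked: bool    # thumbs up (True) or thumbs down (False)

@dataclass
class PreferenceDataset:
    events: List[FeedbackEvent] = field(default_factory=list)

    def log(self, prompt: str, response: str, liked: bool) -> None:
        self.events.append(FeedbackEvent(prompt, response, liked))

    def to_training_records(self):
        # Downstream, liked/disliked responses become reward-model training signal.
        return [{"prompt": e.prompt, "response": e.response, "label": int(e.liked)}
                for e in self.events]

dataset = PreferenceDataset()
dataset.log("Summarize this release note", "Here is a short summary...", liked=True)
print(dataset.to_training_records())
```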
As Product Leaders, we need to weigh the cost-versus-quality tradeoff when engaging with data science or research teams.
Advanced concepts
Watermarking
As generative AI models become more mature and capable, the longstanding problem of determining whether a piece of content was produced by a human or by an AI system becomes harder to solve.
Watermarking is a technique for embedding signals in generated content that are detectable only algorithmically.
If you want to learn more, check out this paper.
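As a rough illustration of the idea, here is a heavily simplified sketch of a "green list" scheme (in the spirit of Kirchenbauer et al., 2023, though not the exact algorithm from the paper linked above): generation biases token choices toward a pseudo-random subset of the vocabulary, and a detector that knows the seeding scheme can later check whether a text contains suspiciously many "green" tokens. All names and the toy vocabulary below are hypothetical.

```python
import hashlib

def green_list(prev_token: str, vocab: list, fraction: float = 0.5) -> set:
    """Pseudo-randomly pick a 'green' subset of the vocabulary, seeded by the previous token."""
    ranked = sorted(vocab, key=lambda t: hashlib.sha256((prev_token + t).encode()).hexdigest())
    return set(ranked[: int(len(ranked) * fraction)])

def green_fraction(tokens: list, vocab: list) -> float:
    """Detector: what share of tokens fall in the green list of their predecessor?"""
    hits = sum(tokens[i] in green_list(tokens[i - 1], vocab) for i in range(1, len(tokens)))
    return hits / max(len(tokens) - 1, 1)

vocab = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
text = ["the", "cat", "sat", "on", "a", "mat"]
print(green_fraction(text, vocab))  # watermarked text would score well above the ~0.5 baseline
```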
As a Product Leader developing AI product roadmaps, you need to answer questions such as:
👉 How will watermarking systems ensure the integrity and privacy of generated content (e.g., is detection available behind an API, or do LLM providers run it themselves)?
👉 How will responsibilities be distributed across the stakeholders involved in LLM model development, training, and execution?
👉 How will watermarking be enforced in open-source models such as LLaMA?
LLM leaderboards
There is a plethora of open-source and closed LLMs, and as product leaders we are often asked why specific LLMs were chosen and what the tradeoffs are.
Stanford’s HELM leaderboard and Hugging Face’s Open LLM Leaderboard are the current standards for comparing LLM capabilities. Metric categories include accuracy, robustness, fairness, and bias. A simple scorecard sketch after the screenshot below shows one way to weigh these tradeoffs for your product.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b8bf36c-fcba-4290-ae6e-c2dae644aa54_1209x619.png)
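When the "why this LLM?" question comes up, it helps to make the tradeoff explicit. Below is a hypothetical weighted-scorecard sketch: the model names, scores, and weights are placeholders for illustration and should be replaced with figures from leaderboards such as HELM or from your own evaluations.

```python
# Hypothetical scorecard: scores are placeholders, not real benchmark results.
candidates = {
    "model_a": {"accuracy": 0.82, "robustness": 0.70, "fairness": 0.75, "cost_efficiency": 0.40},
    "model_b": {"accuracy": 0.76, "robustness": 0.74, "fairness": 0.80, "cost_efficiency": 0.85},
}

# Product-driven priorities: weights should reflect your use case, not a universal ranking.
weights = {"accuracy": 0.4, "robustness": 0.2, "fairness": 0.2, "cost_efficiency": 0.2}

def weighted_score(scores: dict) -> float:
    return sum(weights[metric] * value for metric, value in scores.items())

for name, scores in candidates.items():
    print(name, round(weighted_score(scores), 3))
```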
Stay tuned as we simplify complex AI concepts in the next post, where we will provide an AI glossary for Product Leaders!