Can AIs learn to do better than unreliable humans?

We aim to understand whether and why AIs trained on unreliable supervision can generalize beyond the supervisor.

Ye Yaowen

Supervisor: Huang Chao

Project Title: Understanding Weak-to-Strong Generalization

Project Code: FYP24020

Why is this important?


As AIs become more capable, we would like them to solve more complex tasks

E.g., coding large scale projects, solving hard math problems, etc.

When tasks become harder, human supervision becomes noisy and unreliable

E.g., human-written code often contains bugs that can be imitated by models

Hence the question:

Can we train AIs that do not just imitate our errors, but generalize beyond us?

I.e., when and why will there be weak-to-strong generalization (W2SG)?

Method: Three Simulation Approaches

Since we can only study tasks that we can actually solve, we need to simulate what unreliable supervision would be like.

Simulate Unreliable Supervision with small Language Models

We finetune small Pretrained Language Models (PLMs) to generate weak labels, which serve as simulated unreliable supervision.
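The core idea of this simulation can be illustrated with a toy stand-in. The sketch below is a minimal, hypothetical setup (not the project's actual pipeline): the "task" is a 2-D linear classification problem, the weak supervisor is simulated by flipping each gold label with probability 0.3 (mimicking a small PLM's noisy labels), and the "strong student" is a logistic regression trained only on those weak labels. Because the label noise is independent and symmetric, the student can generalize beyond its supervisor's accuracy.

```python
import math
import random

random.seed(0)

# Hypothetical toy task: 2-D points with a linear gold decision boundary.
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(2000)]
y_gold = [1 if x1 + x2 > 0 else 0 for x1, x2 in X]

# Simulated weak supervisor: flips each gold label with probability 0.3.
ERROR_RATE = 0.3
y_weak = [1 - y if random.random() < ERROR_RATE else y for y in y_gold]

weak_acc = sum(w == g for w, g in zip(y_weak, y_gold)) / len(y_gold)

# "Strong student": logistic regression trained only on the weak labels.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1 = w2 = bias = 0.0
lr = 0.1
for _ in range(20):  # SGD epochs over the weakly labelled data
    for (x1, x2), t in zip(X, y_weak):
        p = sigmoid(w1 * x1 + w2 * x2 + bias)
        g = p - t  # gradient of the log-loss w.r.t. the logit
        w1 -= lr * g * x1
        w2 -= lr * g * x2
        bias -= lr * g

preds = [1 if sigmoid(w1 * x1 + w2 * x2 + bias) > 0.5 else 0 for x1, x2 in X]
student_acc = sum(p == g for p, g in zip(preds, y_gold)) / len(y_gold)

print(f"weak supervisor accuracy vs gold: {weak_acc:.2f}")
print(f"student accuracy vs gold:         {student_acc:.2f}")
```

Under these assumptions the student's accuracy against the gold labels exceeds the weak supervisor's roughly 70%, a minimal instance of weak-to-strong generalization; real experiments replace the flip-noise labeler with a finetuned small PLM.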

Simulate Unreliable Supervision with Time-limited Human

We recruit human annotators to label data under time limits; the labels they produce simulate unreliable supervision.

Simulate Unknown Ground Truth with Proxy Gold Model

For tasks without known ground truth (e.g., recommender systems, reward modelling), we take a proxy model’s outputs as ground truth.

Our Project Goals

We aim to provide insights into weak-to-strong generalization with strong empirical support, and to propose new algorithms for learning from unreliable data.

Understand how W2SG behaves under different types of simulated human errors and biases

Investigate what W2SG looks like when the training data contains systematic errors in mathematical reasoning or social biases such as gender bias or racism.

Understand the consequences of optimizing an unreliable proxy for human preferences

Study both reward modelling in language model alignment and recommender systems using simulated gold models to understand overoptimization.
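The overoptimization concern can be sketched with a small best-of-n simulation. The setup below is a hypothetical model, not a result from the project: true rewards are standard normal, the proxy (e.g., a learned reward model or a simulated gold model) adds independent noise, and we always select the candidate with the highest proxy score. As selection pressure n grows, the gap between the proxy score and the true reward of the selected candidate widens.

```python
import random
import statistics

random.seed(0)

def best_of_n(n, trials=2000):
    """Select the candidate maximizing the *proxy* score; report the mean
    proxy score and mean true reward of the selected candidates."""
    proxy_sel, true_sel = [], []
    for _ in range(trials):
        # Hypothetical model: true rewards are N(0, 1); the proxy is the
        # true reward plus independent N(0, 1) noise.
        true_scores = [random.gauss(0, 1) for _ in range(n)]
        proxy_scores = [t + random.gauss(0, 1) for t in true_scores]
        i = max(range(n), key=lambda j: proxy_scores[j])
        proxy_sel.append(proxy_scores[i])
        true_sel.append(true_scores[i])
    return statistics.mean(proxy_sel), statistics.mean(true_sel)

for n in (2, 8, 32, 128):
    proxy, true = best_of_n(n)
    print(f"best-of-{n:<3d} proxy={proxy:.2f} true={true:.2f} "
          f"gap={proxy - true:.2f}")
```

The selected candidates look increasingly good to the proxy while the true reward lags behind, which is the overestimation effect that the project studies in reward modelling and recommender systems.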

Propose new methods to better learn from unreliable human data

Design algorithms that enable stronger W2SG and demonstrate their effectiveness through comprehensive empirical experiments.

Project Plan

Tentative plan for the project
