We aim to understand whether and why AIs trained on unreliable supervision can generalize beyond the supervisor.
Ye Yaowen
Supervisor: Huang Chao
Project Title: Understanding Weak-to-Strong Generalization
Project Code: FYP24020
As AIs become more capable, we would like them to solve more complex tasks.
E.g., coding large-scale projects or solving hard math problems.
When tasks become harder, human supervision becomes noisy and unreliable.
E.g., human-written code often contains bugs that models can imitate.
Hence the question:
Can we train AIs that do not just imitate our errors, but generalize beyond us?
I.e., when and why will there be weak-to-strong generalization (W2SG)?
Since we can only evaluate against ground truth on tasks that we can actually solve, we need to simulate what unreliable supervision would be like.
We finetune small Pretrained Language Models (PLMs) to generate weak labels, which serve as simulated unreliable supervision (see the sketch below).
We recruit human annotators to label data under tight time limits, so the labels they provide simulate unreliable supervision.
For tasks without known ground truth (e.g., recommender systems, reward modelling), we take a proxy model’s outputs as ground truth.
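To make the simulated setup concrete, here is a minimal, self-contained sketch of the weak-to-strong pipeline on a toy classification task. Everything in it is an illustrative assumption rather than the project's actual configuration: scikit-learn stands in for the training stack, a logistic regression plays the weak supervisor, and a gradient-boosted ensemble plays the strong student.

```python
# Toy weak-to-strong pipeline (illustrative assumptions only):
# a small "weak supervisor" is fit on limited data, its noisy predictions
# become the training labels for a more capable "strong student", and both
# are evaluated against ground truth that only the experimenter sees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A task whose ground truth we can still evaluate ourselves.
X, y = make_classification(n_samples=6000, n_features=40,
                           n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=1000,
                                                  random_state=0)
# y_student_gt is deliberately never shown to the student model.
X_student, X_test, y_student_gt, y_test = train_test_split(X_rest, y_rest,
                                                           test_size=0.5,
                                                           random_state=0)

# 1) Weak supervisor: trained on little data, so its labels contain errors.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)
weak_labels = weak.predict(X_student)      # simulated unreliable supervision

# 2) Strong student: trained only on the weak labels, never on ground truth.
strong = GradientBoostingClassifier(random_state=0).fit(X_student, weak_labels)

# 3) W2SG question: does the student recover accuracy beyond its supervisor?
print("weak supervisor test accuracy:", weak.score(X_test, y_test))
print("strong student test accuracy :", strong.score(X_test, y_test))
```

Weak-to-strong generalization corresponds to the student's held-out accuracy landing above its supervisor's; the project instantiates the same pipeline with small finetuned PLMs (or time-limited annotators) as supervisors and larger models as students.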
We aim to provide insights into weak-to-strong generalization with strong empirical support, and to propose new algorithms for learning from unreliable data.
Investigate what W2SG looks like when there are systematic errors in mathematical reasoning training data, or social biases such as gender bias or racism.
Study both reward modelling in language model alignment and recommender systems, using simulated gold models to understand overoptimization (sketched below).
Design algorithms that enable stronger W2SG and demonstrate their effectiveness with comprehensive empirical experiments.
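To illustrate the kind of overoptimization effect mentioned above, the sketch below uses entirely invented reward functions: gold_reward stands in for a simulated gold model and proxy_reward for a learned, mis-specified reward model (both are assumptions made for this example, not part of the project). Candidates are optimized against the proxy via best-of-n selection, and we watch what happens to the gold reward as n grows.

```python
# Toy reward-overoptimization sketch (all functions are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)

def gold_reward(x):
    # "True" quality as judged by the simulated gold model:
    # improves at first, then degrades when x is pushed too far.
    return x * np.exp(-x / 2.0)

def proxy_reward(x):
    # Mis-specified learned reward: keeps rewarding larger x without bound.
    return x

for n in [1, 4, 16, 64, 256]:
    proxy_vals, gold_vals = [], []
    for _ in range(500):                      # average over trials to reduce noise
        candidates = rng.exponential(scale=1.0, size=n)         # n sampled "responses"
        best = candidates[np.argmax(proxy_reward(candidates))]  # best-of-n vs. the proxy
        proxy_vals.append(proxy_reward(best))
        gold_vals.append(gold_reward(best))
    print(f"best-of-{n:>3}: proxy reward {np.mean(proxy_vals):5.2f} | "
          f"gold reward {np.mean(gold_vals):5.2f}")
```

Under these assumed rewards, the proxy score keeps rising with optimization pressure while the gold score peaks and then declines; that divergence is the overoptimization pattern the project aims to study with simulated gold models in reward modelling and recommender systems.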
Tentative plan for the project
The End.
Contact: elwin@connect.hku.hk