Personalization of Large Language Models
for Diverse User Preferences
Student: Liu Meitong
Supervisor: Prof. Luo Ping
2024-25 Computer Science Final Year Project, The University of Hong Kong
Current language models cannot cater to
inhomogeneous and conflicting human preferences.
Current preference datasets aggregate ratings from a group of anonymous annotators. Yet human preferences are usually inhomogeneous: users
with diverse backgrounds, for instance different races, genders, and occupations, may favor different sets of responses. Under such anonymous ratings with missing annotator information,
the minority population's voices are underrepresented. Many endeavors have been made to bridge this gap.
Current training paradigms essentially treat every preference comparison sample equally. Yet such comparisons are often conflicting: some users prefer response A over B, while others prefer B over A. Conflicting samples exert equal and opposite forces on how the model adapts, so existing training paradigms end up treating these conflicts as data noise and eventually fail to learn either preference.
Using recently established individualized preference datasets, we aim to address the second issue above:
exploring frameworks that embed diverse user preferences in a single model
and enable it to personalize adaptively.
Following the rule of thumb that evaluation is usually easier than generation, we will first try to obtain a reward model that can predict the preferences of diverse users, that is, evaluate candidate responses from a given user's perspective. We plan to implement and compare different practices, including roleplay prompting, few-shot dialogues, and conditional reward model training.
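A minimal sketch of the conditional variant, assuming a small encoder backbone and a plain-text user profile prepended to each prompt-response pair (the backbone name, input format, and helper names below are illustrative assumptions, not fixed design choices):

```python
# Sketch: persona-conditioned reward model trained with a Bradley-Terry pairwise loss.
# Backbone, input formatting, and field names are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilroberta-base"  # assumed small backbone for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

def build_input(persona: str, prompt: str, response: str) -> str:
    # Condition the reward on the user by prepending their profile as plain text.
    return f"User profile: {persona}\nPrompt: {prompt}\nResponse: {response}"

def reward(texts: list[str]) -> torch.Tensor:
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**enc).logits.squeeze(-1)  # one scalar score per input

def pairwise_loss(persona: str, prompt: str, chosen: str, rejected: str) -> torch.Tensor:
    # Bradley-Terry objective: for this particular user, the chosen response
    # should receive a higher score than the rejected one.
    r_chosen = reward([build_input(persona, prompt, chosen)])
    r_rejected = reward([build_input(persona, prompt, rejected)])
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Roleplay prompting and few-shot dialogues would instead keep an off-the-shelf LLM judge unchanged and move the user information into its prompt, so only the conditional approach requires training of this kind.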
We will further study how to obtain
a policy model that generates tailored content for different users. We will try two mainstream approaches to policy training: classic Reinforcement Learning from Human Feedback (RLHF), where the reward model obtained in the first stage
guides Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO), which recasts the reinforcement learning objective as a more stable supervised one.
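For the DPO branch, a minimal sketch of the standard DPO loss, computed from summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model (the beta value is an assumed hyperparameter; user profiles would be folded into the prompt just as in the reward-model sketch above):

```python
# Sketch: standard DPO loss from per-example summed log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of log p_theta(chosen | prompt) per example
    policy_rejected_logps: torch.Tensor,  # sum of log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed temperature controlling KL strength
) -> torch.Tensor:
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```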
We will examine different frameworks for analyzing the preference learning process, such as conditional distribution fitting and multi-objective optimization. We will also investigate whether certain steps in the overall training process can be improved by tailored algorithmic design.
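As one concrete reading of the multi-objective view, the sketch below scalarizes per-user-group preference losses with explicit weights, so that minority groups can be upweighted rather than averaged away as noise; the grouping and weights are hypothetical illustrations rather than a committed design.

```python
# Sketch: multi-objective view of preference learning, where each user group
# contributes its own loss and the groups are combined with explicit weights.
import torch

def multi_objective_loss(
    group_losses: dict[str, torch.Tensor],  # e.g. per-group pairwise or DPO losses
    group_weights: dict[str, float],        # assumed weights, e.g. upweighting minorities
) -> torch.Tensor:
    # Simple linear scalarization; other schemes (e.g. worst-case / minimax
    # over groups) fit the same interface.
    total = sum(group_weights[g] * loss for g, loss in group_losses.items())
    return total / sum(group_weights[g] for g in group_losses)
```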
September, 2024
Literature review;
First deliverables: project plan & website
In progress
Done
October, 2024
Stage 1: Reward model training
1. PERSONA dataset preprocessing
2. Set up environment and codebase
3. Establish a reward model training pipeline
Pending
November to December, 2024
Stage 1: Reward model training
1. Experiment with different methods
2. Experiment with different sets of attributes as user information
Pending
January, 2025
Stage 1: Reward model training
1. Result analysis and method refinement
2. Second deliverable: interim report
Pending
February, 2025
Stage 2: Policy model training
Experiment with the two preference alignment approaches
Pending
March, 2025
Stage 2: Policy model training
1. Experiment with other potential methods
2. Result analysis
3. Materials integration
Pending
April to May, 2025
Final Stage
1. Draft final report
2. Prepare poster and presentation
3. Project exhibition
Pending