Personalization of Large Language Models
for Diverse User Preferences
Student: Liu Meitong
Supervisor: Prof. Luo Ping
2024-25 Computer Science Final Year Project, The University of Hong Kong
Current language models cannot cater to
inhomogeneous and conflicting human preferences.
Current preference datasets aggregate ratings from a group of anonymous annotators. Yet human preferences are usually inhomogeneous: users
with diverse backgrounds, for instance different races, genders, and occupations, may favor different sets of responses. Under such anonymous ratings with missing annotator information,
the minority population's voices are underrepresented. Many endeavors have been made to bridge this gap.
Current training paradigms essentially treat every preference comparison sample equally. Yet such comparisons are often conflicting: some users prefer response A over B, while others prefer B over A. Conflicting samples exert equal and opposite forces on how the model adapts, so existing training paradigms end up treating these conflicts as data noise and eventually fail to learn either preference.
Using recently established individualized preference datasets, we aim to address the second issue above:
exploring frameworks that embed diverse user preferences in a single model
and enable it to personalize adaptively.
Following the rule of thumb that evaluation is usually easier than generation, we will first try to obtain a reward model that can predict the preferences of diverse users, that is, evaluate candidate responses from a given user's perspective. We plan to implement and compare different practices, including roleplay prompting, few-shot dialogues, and conditional reward model training.
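A minimal sketch of the conditional variant, assuming a small encoder backbone and a plain-text user profile prepended to each prompt-response pair (the backbone name, input format, and helper names below are illustrative assumptions, not fixed design choices):

```python
# Sketch: persona-conditioned reward model trained with a Bradley-Terry pairwise loss.
# Backbone, input formatting, and field names are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilroberta-base"  # assumed small backbone for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

def build_input(persona: str, prompt: str, response: str) -> str:
    # Condition the reward on the user by prepending their profile as plain text.
    return f"User profile: {persona}\nPrompt: {prompt}\nResponse: {response}"

def reward(texts: list[str]) -> torch.Tensor:
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**enc).logits.squeeze(-1)  # one scalar score per input

def pairwise_loss(persona: str, prompt: str, chosen: str, rejected: str) -> torch.Tensor:
    # Bradley-Terry objective: for this particular user, the chosen response
    # should receive a higher score than the rejected one.
    r_chosen = reward([build_input(persona, prompt, chosen)])
    r_rejected = reward([build_input(persona, prompt, rejected)])
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Roleplay prompting and few-shot dialogues would instead keep an off-the-shelf LLM judge unchanged and move the user information into its prompt, so only the conditional approach requires training of this kind.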
We will further study how to obtain
a policy model that generates tailored content for different users. We will try two mainstream approaches to policy training: classic Reinforcement Learning from Human Feedback (RLHF), where the reward model obtained in the first stage
guides Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO), which recasts the reinforcement learning objective as a more stable supervised one.
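For the DPO branch, a minimal sketch of the standard DPO loss, computed from summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model (the beta value is an assumed hyperparameter; user profiles would be folded into the prompt just as in the reward-model sketch above):

```python
# Sketch: standard DPO loss from per-example summed log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of log p_theta(chosen | prompt) per example
    policy_rejected_logps: torch.Tensor,  # sum of log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed temperature controlling KL strength
) -> torch.Tensor:
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```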
We will examine different frameworks for analyzing the preference learning process, such as conditional distribution fitting and multi-objective optimization. We will also investigate whether certain steps in the overall training process can be improved by tailored algorithmic design.
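As one concrete reading of the multi-objective view, the sketch below scalarizes per-user-group preference losses with explicit weights, so that minority groups can be upweighted rather than averaged away as noise; the grouping and weights are hypothetical illustrations rather than a committed design.

```python
# Sketch: multi-objective view of preference learning, where each user group
# contributes its own loss and the groups are combined with explicit weights.
import torch

def multi_objective_loss(
    group_losses: dict[str, torch.Tensor],  # e.g. per-group pairwise or DPO losses
    group_weights: dict[str, float],        # assumed weights, e.g. upweighting minorities
) -> torch.Tensor:
    # Simple linear scalarization; other schemes (e.g. worst-case / minimax
    # over groups) fit the same interface.
    total = sum(group_weights[g] * loss for g, loss in group_losses.items())
    return total / sum(group_weights[g] for g in group_losses)
```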
September, 2024
Literature review;
First deliverables: project plan & website
In progress
Done
October, 2024
Stage 1: Reward model training
1. PERSONA dataset preprocessing
2. Set up environment and codebase
3. Establish a reward model training pipeline
Pending
November to December, 2024
Stage 1: Reward model training
1. Experiment with different methods
2. Experiment with different sets of attributes as user information
Pending
January, 2025
Stage 1: Reward model training
1. Result analysis and method refinement
2. Second deliverable: interim report
Pending
February, 2025
Stage 2: Policy model training
Experiment with the two preference alignment approaches
Pending
March, 2025
Stage 2: Policy model training
1. Experiment with other potential methods
2. Result analysis
3. Materials integration
Pending
April to May, 2025
Final Stage
1. Draft final report
2. Prepare poster and presentation
3. Project exhibition
Pending