Summary: Deep Reinforcement Learning from Human Preferences

Aashi Dutt
4 min read · Jan 4, 2024


Reinforcement Learning with Human Feedback

A long time back, when technology started taking over some human work, we wondered whether humans and machines could one day work together. Even so, we wanted machines to understand human language and react to problems the way humans do (without the frustration, though 😅 ).

Returning to the current era, the field of natural language processing has evolved at a fast pace into what we now call the LLM (large language model) era. LLMs are widely used in chatbots (ChatGPT), generative models (Diffusion), coding co-pilots, and much more. However, the goal is still the same: make LLMs better and better aligned with human preferences.

The paper we’ll discuss falls along similar lines: “Deep Reinforcement Learning from Human Preferences”, by researchers at OpenAI and DeepMind, covers the fundamental concepts of RLHF through experiments on Atari games and the MuJoCo robotics simulator.

The Big Picture:

Reinforcement learning with human feedback (RLHF) is not a new concept and has been around for some time. It combines the basic ideas of reinforcement learning with human feedback; in the LLM setting, it is used to fine-tune a base model toward human preferences.

If you need a refresher on RL, feel free to check out my previous blog.

This paper aims to solve complex RL tasks without access to a reward function, using human feedback on less than 1% of the agent's interactions with the environment. The approach learns a reward function from human feedback and then optimizes a policy against that learned reward function with a standard RL algorithm.

Fitting a reward function to human feedback while simultaneously training a policy to optimize the current predicted reward function

A Brief Summary:

The Goal:

In traditional RL, the goal is to maximize the discounted sum of rewards. But here, instead of having a reward function, we have a human overseer who expresses preferences between trajectory segments.
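For context, one standard way to write that traditional objective (the expected discounted return, with discount factor γ and per-step reward rₜ; not spelled out in the post itself) is:

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right]
```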

Trajectory Segments:

A trajectory segment is a sequence of observations and actions,

σ = ((o₀, a₀), (o₁, a₁), …, (oₖ₋₁, aₖ₋₁)) ∈ (O × A)ᵏ
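As a small, hypothetical Python sketch (not from the paper), a segment is just an ordered list of observation-action pairs:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class TrajectorySegment:
    """A length-k sequence of (observation, action) pairs: one σ ∈ (O × A)^k."""
    steps: List[Tuple[Any, Any]]  # [(o_0, a_0), (o_1, a_1), ..., (o_{k-1}, a_{k-1})]

    def __len__(self) -> int:
        return len(self.steps)

# Example: a toy 3-step segment
segment = TrajectorySegment(steps=[("o0", "a0"), ("o1", "a1"), ("o2", "a2")])
assert len(segment) == 3
```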

We write σ1 ≻ σ2 to indicate that the human preferred trajectory segment σ1 over trajectory segment σ2. In this paper, σ1 and σ2 are video clips of a few seconds each, so σ1 ≻ σ2 means the human preferred the first clip over the second.

Thus, the goal of the agent is to produce trajectories that are preferred by the human while making as few queries to the human as possible. When the underlying reward function r is known, we can evaluate the agent quantitatively, as if it had been using RL to optimize r.

And what if we have no reward function with which to evaluate our algorithms? Then we can only evaluate qualitatively, by judging how well the agent's behavior satisfies the human's preferences.

The Method:

At each point in time, the method proposed by the paper maintains two things:

  1. A policy π: O → A, and
  2. A reward function estimate r̂: O × A → R, each parameterized by a deep neural network.

The networks are updated by three interconnected processes.

The loop runs from policy-environment interaction, to selecting segment pairs and having a human compare them, to optimizing the parameters of r̂ via supervised learning, and then back to step 1 (in the paper, these three processes actually run asynchronously). The reward function estimate r̂ supplies the rewards, which are fed to a traditional RL algorithm to find an optimal policy.
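Below is a minimal, synchronous Python sketch of that loop. All helper callables (collect_trajectories, select_segment_pairs, ask_human) and the policy/reward-model interfaces are hypothetical placeholders, not the paper's actual code:

```python
# Hypothetical sketch of the three interleaved processes described above.
# Every helper passed in here is a placeholder for illustration only.

def train_from_preferences(policy, reward_model, env,
                           collect_trajectories, select_segment_pairs,
                           ask_human, num_iterations=100):
    preference_db = []  # database D of (segment_1, segment_2, mu) triples

    for _ in range(num_iterations):
        # (1) The policy interacts with the environment and is updated with a
        #     traditional RL algorithm using rewards from r_hat, not the true reward.
        trajectories = collect_trajectories(policy, env)
        policy.update_with_rl(trajectories, reward_fn=reward_model.predict)

        # (2) Pairs of short segments are selected and shown to a human.
        for seg_1, seg_2 in select_segment_pairs(trajectories):
            mu = ask_human(seg_1, seg_2)   # distribution over {1, 2}, or None
            if mu is not None:             # incomparable pairs are discarded
                preference_db.append((seg_1, seg_2, mu))

        # (3) The reward model r_hat is fit to the comparisons via
        #     supervised learning (cross-entropy loss on the preferences).
        reward_model.fit(preference_db)

    return policy, reward_model
```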

Preference Elicitation

Human judgments are stored in a database D of triples (σ1, σ2, μ), where σ1 and σ2 are the two segments and μ is a distribution over {1, 2} indicating which segment the user preferred. If the human selects one segment as preferable, μ puts all of its mass on that choice; if both segments are judged equally preferable, μ is uniform; and if the human finds neither segment preferable (they are incomparable), the comparison is not included in the database.
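As a tiny illustrative helper (hypothetical, not the paper's code), the assignment of μ could look like this:

```python
# Hypothetical representation of the preference database D of triples
# (sigma_1, sigma_2, mu), where mu is a distribution over {1, 2}.
preference_db = []

def record_judgment(segment_1, segment_2, choice):
    """choice: 1, 2, 'equal', or 'incomparable' (from the human labeler)."""
    if choice == 1:
        mu = (1.0, 0.0)          # all mass on the first segment
    elif choice == 2:
        mu = (0.0, 1.0)          # all mass on the second segment
    elif choice == "equal":
        mu = (0.5, 0.5)          # uniform: both judged equally preferable
    else:
        return                   # incomparable pairs are not stored
    preference_db.append((segment_1, segment_2, mu))
```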

Fitting the Reward Function

The probability that the human labeler prefers a segment is assumed to depend exponentially on the value of the latent reward summed over the length of the clip.
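In the paper, this is written as a Bradley-Terry-style preference model:

```latex
\hat{P}\left[\sigma^1 \succ \sigma^2\right] =
  \frac{\exp\sum_t \hat{r}\left(o_t^1, a_t^1\right)}
       {\exp\sum_t \hat{r}\left(o_t^1, a_t^1\right) + \exp\sum_t \hat{r}\left(o_t^2, a_t^2\right)}
```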

We choose r̂ to minimize the cross-entropy loss between these predictions and the actual human labels.
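Summing over all comparisons in the database D, the loss from the paper is:

```latex
\mathrm{loss}(\hat{r}) = -\sum_{(\sigma^1, \sigma^2, \mu) \in D}
    \Big( \mu(1)\,\log\hat{P}\big[\sigma^1 \succ \sigma^2\big]
        + \mu(2)\,\log\hat{P}\big[\sigma^2 \succ \sigma^1\big] \Big)
```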

Selecting Queries

To select which queries to send to the human, the method samples a large number of pairs of trajectory segments of length k, uses each reward predictor in the ensemble to predict which segment of each pair will be preferred, and then selects those pairs for which the predictions have the highest variance across ensemble members.
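A rough Python sketch of that selection rule; the ensemble objects and their predict_preference_prob method are hypothetical placeholders, not the paper's API:

```python
import statistics

def select_queries(candidate_pairs, ensemble, num_queries):
    """Pick the pairs whose predicted preference disagrees most across the ensemble.

    candidate_pairs: list of (segment_1, segment_2) tuples sampled from recent trajectories
    ensemble: two or more reward predictors, each exposing a hypothetical
              predict_preference_prob(seg_1, seg_2) method returning P[seg_1 preferred]
    """
    scored = []
    for seg_1, seg_2 in candidate_pairs:
        # Probability that segment 1 is preferred, according to each ensemble member.
        probs = [member.predict_preference_prob(seg_1, seg_2) for member in ensemble]
        scored.append((statistics.variance(probs), (seg_1, seg_2)))

    # Query the human about the pairs with the highest disagreement (variance).
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:num_queries]]
```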

Conclusion:

This paper shows that we can train deep reinforcement learning agents from human preferences. Moreover, the feedback required is cheap enough to collect from non-experts at a reasonable computational cost, which opens new horizons to explore in this field. It is a step toward practical applications of deep RL to complex real-world tasks rather than low-complexity goals.

Feel free to read through the whole paper to understand the experiments done by the team and the results they achieved.
