Reinforcement Learning with Large Language Models (e.g. GPT-3)

How can you implement reinforcement learning with LLMs?

  • Use case: When training a GPT-3 davinci model on complex and specific structures and content, we have found that 300 examples is the minimum needed to get decent results. As more content is created, we add it to the training dataset and retrain the model, but that is a very slow process. Can we use rewards and penalties to speed up the learning/improving process? We're looking for something more creative than throwing more data at it.
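For context on the workflow being described, here is a minimal sketch of preparing training data for a legacy GPT-3 davinci fine-tune, which expects JSONL records with `prompt` and `completion` keys. The separator (`###`) and stop token (`END`) are common conventions rather than requirements, and the example pair is purely illustrative.

```python
import json

def to_finetune_jsonl(pairs, path="train.jsonl"):
    """Write (prompt, completion) pairs in the JSONL format the
    legacy GPT-3 fine-tuning endpoint expects."""
    with open(path, "w") as f:
        for prompt, completion in pairs:
            f.write(json.dumps({
                # fixed separator marks where the prompt ends
                "prompt": prompt + "\n\n###\n\n",
                # leading space plus a stop sequence at the end
                "completion": " " + completion + " END",
            }) + "\n")

# hypothetical stand-in for the ~300 hand-made examples
pairs = [("Rewrite in formal register: gonna be late", "I will be late.")]
to_finetune_jsonl(pairs)
```

Each retraining round then amounts to appending newly gathered pairs to this file and launching a fresh fine-tune, which is exactly the slow loop the question is trying to escape.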

This is a great question @jrw! While I don’t have an answer for you, I am hosting a session on reinforcement learning tomorrow (Nov 18) with Susan Shu Chang. If you’re not able to make it, I can ask this question on your behalf to kick-start a discussion.


So you fine-tune your base model with 300 examples, then gather more data and fine-tune the resulting model on the newly gathered data?

For the reinforcement learning setup, you’d need some goal the model is aiming for within some “environment”. Do you have these clearly defined for your task?

Usually in RL problems, the action space is relatively small. In language tasks the space can be extremely large, but it may be possible to reduce it to something more manageable. Current research seems focused on doing just that.
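To make the reduced-action-space idea concrete, here is a toy REINFORCE (policy gradient) sketch. Instead of sampling over a full vocabulary, the "policy" picks one of a few canned phrasings, and a hypothetical reward function stands in for whatever scoring @jrw would define; none of the names here come from the thread.

```python
import math
import random

random.seed(0)

# Toy reduced action space: canned phrasings instead of raw tokens.
actions = ["plain", "inverted", "question"]
logits = {a: 0.0 for a in actions}  # one learnable score per action

def softmax(scores):
    exps = {a: math.exp(s) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def reward(action):
    # Hypothetical reward: pretend the task prefers inverted phrasing.
    return 1.0 if action == "inverted" else 0.0

lr = 0.5
for step in range(200):
    probs = softmax(logits)
    a = random.choices(actions, weights=[probs[x] for x in actions])[0]
    r = reward(a)
    # REINFORCE update: for a softmax policy,
    # d log pi(a) / d logit(x) = 1[x == a] - pi(x).
    for x in actions:
        grad = (1.0 if x == a else 0.0) - probs[x]
        logits[x] += lr * r * grad

final_probs = softmax(logits)  # mass should concentrate on "inverted"
```

The point of the sketch is only that with a small, discrete action space the RL machinery is tractable; scaling the same update to full-vocabulary generation is where the research cited below comes in.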

This paper also discusses GPT models and how it may be possible to do what you have suggested (see page 10):
Survey on reinforcement learning for language processing


Reinforcement learning will only improve results under two conditions (I’m going from ab initio principles here):

  1. Your policy/value functions must be close to the original linguistic rules and sentence-writing tendencies. The reinforcement will fail if the reward evaluation is incorrect. And if you try to auto-deduce your reward function (yes, this is possible), you’ll again need what comes under (2).

  2. Your small set of examples must be case-wise comprehensive. That is, if 3 out of 10 sentences read “Long has Theoden King ruled over Rohan”, “Dark are the waters of Kheledzaram”, “Cold are the springs of Kibilnala”, your system is stuck on the hyperbole device in literature (excuse the ultra-Tolkienian phrases). So your small set should be piece-wise balanced and yet adequately diverse.

Then yes, with simple semantic parsing and a good value function, reinforcement learning should produce sane results.
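The balance condition in point (2) can be sketched with a crude structural check. A real check would use a parser; here the "pattern" function is just a hypothetical signature keyed on the fronted words from the Tolkien examples above.

```python
from collections import Counter

FRONTED = {"long", "dark", "cold", "great", "far"}

def pattern(sentence):
    # Crude structural signature: fronted adjective/adverb
    # (the "Long has ... ruled" inversion) vs. a plain subject.
    first = sentence.split()[0].lower()
    return "inverted" if first in FRONTED else "plain"

def balance_report(examples):
    """Fraction of the training set falling into each pattern."""
    counts = Counter(pattern(s) for s in examples)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

examples = [
    "Long has Theoden King ruled over Rohan",
    "Dark are the waters of Kheledzaram",
    "Cold are the springs of Kibilnala",
    "The hobbits left the Shire at dawn",
]
report = balance_report(examples)
```

If one pattern dominates the report, the reward signal will reinforce that device regardless of how well the RL loop itself is tuned, which is the failure mode described above.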


Great input @kbaheti and @lu.riera!

@jrw hope these insights, along with the response from Susan at the AMA, help out.

If you make progress on this and want to share what you’ve done, be sure to let us know in the How I Solved It section!


@jrw I think this tweet thread might be of interest to you:


@jrw This question was asked pre-ChatGPT; I wonder what you’ve learned since then and applied to your application?
