How can you implement reinforcement learning with LLMs?
Use case: When fine-tuning a GPT-3 davinci model on complex and specific structures and content, we have found that 300 examples is the minimum needed to get decent results. As more content is created, we add it to the training dataset and retrain the model, but that is a very slow process. Can we use rewards and penalties to speed up the learning/improvement process, i.e. something more creative than throwing more data at it?
So you fine-tune your base model with 300 examples, then gather more data and fine-tune the fine-tuned model on the newly gathered data?
For the reinforcement learning setup, you’d need some goal the model is aiming for within some “environment”. Do you have these clearly defined for your task?
Usually for RL problems the action space is relatively small. In language tasks the space can be extremely large, but it can often be reduced to something more manageable, and a lot of current research is focused on doing exactly that.
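To make the environment/action/reward framing concrete, here is a rough sketch of the kind of loop you would need, assuming you can write down a reward for your structured outputs. Note the assumptions: gpt2 stands in for davinci (whose weights are not downloadable), and `reward_fn` and the prompt are placeholders you would replace with your own.

```python
# Rough REINFORCE-style loop: sample a completion, score it with a
# hand-written reward, and push the policy toward higher-reward outputs.
# gpt2 stands in for davinci; reward_fn and the prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def reward_fn(text: str) -> float:
    # Placeholder: score how well the completion matches your target structure.
    return 1.0 if text.strip().endswith(".") else -1.0

prompt = "Summarize the report in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

for step in range(100):
    # "Environment" step: sample an action (a completion) from the current policy.
    generated = model.generate(
        **inputs, do_sample=True, max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(generated[0, prompt_len:], skip_special_tokens=True)
    reward = reward_fn(completion)

    # Log-probability of the sampled completion tokens under the current policy.
    logits = model(generated).logits[:, :-1, :]
    targets = generated[:, 1:]
    token_logprobs = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)
    ).squeeze(-1)
    completion_logprob = token_logprobs[:, prompt_len - 1:].sum()

    # REINFORCE update: raise the probability of well-rewarded completions.
    loss = -reward * completion_logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice you would add a KL penalty against your original fine-tuned model and use PPO rather than raw REINFORCE (that is what RLHF pipelines do), but the moving pieces are the same: an environment (the prompt), an action (the sampled completion), and a reward.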
Reinforcement learning will only improve things under two conditions, and I’m going by ab initio principles here:
The first is that your policy/value functions must stay close to the original linguistic rules and sentence-writing tendencies; the reinforcement will fail if the reward evaluation is incorrect. And if you try to auto-deduce your reward function (yes, this is possible), you’ll again need what comes under (2).
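On the auto-deduced reward point: the usual way to do that is to fit a small reward model on pairwise human comparisons between outputs. A minimal sketch, where the encoder choice and the toy comparison pair are both assumptions for illustration:

```python
# Sketch of learning a reward function from pairwise preferences
# (a Bradley-Terry style loss, the standard RLHF reward-model objective).
# The encoder choice and the toy comparison pair are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
value_head = nn.Linear(encoder.config.hidden_size, 1)  # scalar reward
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(value_head.parameters()), lr=2e-5
)

def score(text: str) -> torch.Tensor:
    """Scalar reward: mean-pooled encoder state through a linear head."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = encoder(**enc).last_hidden_state.mean(dim=1)
    return value_head(hidden).squeeze()

# Toy preference data: (preferred output, rejected output) pairs.
comparisons = [
    ("The invoice lists three items totalling $42.", "Invoice stuff, $42 maybe."),
]

for chosen, rejected in comparisons:
    # Bradley-Terry loss: push score(chosen) above score(rejected).
    loss = -F.logsigmoid(score(chosen) - score(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The learned `score(...)` then plugs into the RL loop as the reward, which is exactly why the comparisons need to reflect your original linguistic rules: the policy will chase whatever the reward model actually measures.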
The second is that your small set of examples must be case-wise comprehensive. In the sense that if 3 out of 10 sentences read like “Long has Theoden King ruled over Rohan”, “Dark are the waters of Kheledzaram”, “Cold are the springs of Kibilnala”, your system gets stuck on that inverted construction (excuse the ultra-Tolkienian phrases). So your small set should be piece-wise balanced and yet adequately diverse.
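A quick way to check that balance on your ~300 examples, assuming you can give each one a coarse style/category tag (the tags below are made up):

```python
# Sanity check on the fine-tuning set's balance; each example is assumed
# to carry a coarse category tag (the tags here are invented).
from collections import Counter

examples = [
    {"text": "Long has Theoden King ruled over Rohan", "tag": "inverted"},
    {"text": "The springs of Kibilnala are cold", "tag": "plain"},
    # ... the remaining ~300 examples
]

counts = Counter(ex["tag"] for ex in examples)
total = sum(counts.values())
for tag, n in counts.most_common():
    print(f"{tag}: {n} ({n / total:.0%})")
# If one tag dominates, both the reward model and the policy will latch onto it.
```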
Then yes, with simple semantic parsing and a good value function, reinforcement learning should come back with sane results.
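For the “simple semantic parsing plus a good value function” part, the crudest version for structured output is a reward that parses the completion and grades the pieces you care about. The JSON schema and field names here are purely illustrative:

```python
# Sketch of a parse-based reward: parse the completion and grade it
# against required structure. The schema and field names are invented.
import json

REQUIRED_FIELDS = {"title", "summary", "date"}

def structural_reward(completion: str) -> float:
    try:
        record = json.loads(completion)
    except json.JSONDecodeError:
        return -1.0  # unparseable output gets the harshest penalty
    if not isinstance(record, dict):
        return -0.5
    present = REQUIRED_FIELDS & set(record)
    # Partial credit for each required field the model filled in.
    return len(present) / len(REQUIRED_FIELDS)

print(structural_reward('{"title": "Q3 report", "summary": "Revenue up."}'))  # ~0.67
```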