We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides, the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.
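
To make the supervised step concrete, here is a minimal sketch of dialogue-style fine-tuning of a causal language model. The `gpt2` checkpoint and the `dialogues` list are illustrative placeholders rather than the actual base model or trainer data.

```python
# Minimal SFT sketch: fine-tune a causal LM on trainer-written conversations.
# "gpt2" and the example dialogues below are stand-ins, not the real setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Each example is one trainer-written conversation flattened into plain text,
# with the trainer playing both the user and the assistant.
dialogues = [
    "User: How do I boil an egg?\nAssistant: Place it in boiling water for 7-9 minutes.",
    "User: What is RLHF?\nAssistant: Reinforcement learning from human feedback.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in dialogues:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Standard causal-LM objective: labels are the input ids themselves.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the dialogue data would be batched and formatted with a consistent turn template, but the objective is the same next-token cross-entropy shown here.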
To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them.
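
A reward model can be trained on such rankings with a pairwise objective that scores the preferred response above the rejected one. The sketch below is a simplified illustration under assumed inputs: each ranking is reduced to a (preferred, rejected) pair represented as fixed-size feature vectors, whereas a real reward model would read the full dialogue with a language-model backbone and a scalar head.

```python
# Simplified reward-model training sketch on comparison data.
# The random feature tensors stand in for encoded responses; they are placeholders.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Scalar head mapping a response representation to a reward score.
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Toy stand-ins for encoded (preferred, rejected) completions of the same prompt.
preferred = torch.randn(8, 768)
rejected = torch.randn(8, 768)

# Pairwise ranking loss: push the preferred completion's score above the
# rejected one's, i.e. -log sigmoid(r_preferred - r_rejected).
loss = -torch.nn.functional.logsigmoid(
    reward_model(preferred) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
```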
Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
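
The sketch below illustrates the core of a PPO-style update on a single sampled completion using the clipped surrogate objective. The `gpt2` policy, the constant advantage, and the single-example loop are placeholders; a full setup would compute the advantage from the reward model and a value baseline, add a per-token KL penalty against the supervised model, mask out prompt tokens, and batch many rollouts.

```python
# Heavily simplified PPO-style update sketch (placeholders throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
clip_eps = 0.2

prompt = "User: Summarize RLHF in one sentence.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # Sample a completion from the current policy and record its old log-probs.
    generated = policy.generate(**inputs, max_new_tokens=20, do_sample=True)
    old_logits = policy(generated).logits

def token_logprobs(logits, tokens):
    # Log-probability of each next token under the model.
    # (A real implementation would restrict this to the sampled completion tokens.)
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    return logps.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)

old_logp = token_logprobs(old_logits, generated)
advantage = torch.tensor(1.0)  # placeholder for (reward model score - baseline)

# PPO clipped surrogate objective over the sampled tokens.
new_logp = token_logprobs(policy(generated).logits, generated)
ratio = torch.exp(new_logp - old_logp.detach())
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
loss = -torch.min(unclipped, clipped).mean()
loss.backward()
optimizer.step()
```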
Checkpoint: AbsoluteRealIndian (#Realistic #Photography)