Now we know what OpenAI’s superalignment team has been up to

Can we prevent a ‘superintelligent’ AI (whatever that means) from going ‘rogue’? That’s the question MIT Technology Review asks about current work at OpenAI toward ‘superalignment’:

The question the team wants to answer is how to rein in, or “align,” hypothetical future models that are far smarter than we are, known as superhuman models. Alignment means making sure a model does what you want it to do and does not do what you don’t want it to do. Superalignment applies this idea to superhuman models.

One of the most widespread techniques used to align existing models is called reinforcement learning via human feedback. In a nutshell, human testers score a model’s responses, upvoting behavior that they want to see and downvoting behavior they don’t. This feedback is then used to train the model to produce only the kind of responses that human testers liked. This technique is a big part of what makes ChatGPT so engaging.

The problem is that it requires humans to be able to tell what is and isn’t desirable behavior in the first place. But a superhuman model—the idea goes—might do things that a human tester can’t understand and thus would not be able to score. (It might even try to hide its true behavior from humans, Sutskever told us.)

Lying AIs? Wherever did they learn that from? 🤔

Read the article here.
