
Assessing Advancements in Scalable Oversight of Large Language Models

Developing safe and useful general-purpose AI systems will require progress on scalable oversight: the problem of supervising systems that could surpass human abilities on most relevant tasks. Empirical work on this problem is difficult because we do not yet have systems that broadly exceed human capabilities. This paper outlines our approach to the challenge, focusing on methods that can be studied empirically today. We first propose an experimental design centered on tasks where human specialists succeed but where unaided humans and current general AI systems fail. We then present a proof-of-concept experiment intended to demonstrate a key feature of this design and show its feasibility on two question-answering tasks: MMLU and time-restricted QuALITY. On these tasks, we find that human participants who interact with a large but unreliable language-model dialog assistant through chat, a simple baseline strategy for scalable oversight, substantially outperform both the model alone and their own unaided performance. These results offer encouraging evidence that scalable oversight can be studied with present-day models and reinforce recent findings that large language models can productively assist humans with difficult tasks.
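To make the comparison described above concrete, the following is a minimal, hypothetical sketch (not from the paper) of how per-question accuracy might be aggregated across the three conditions the experiment contrasts: the model alone, unaided humans, and humans assisted by the model through chat. The `Trial` structure, the condition labels, and the toy data are assumptions introduced purely for illustration.

```python
# Hypothetical sketch: aggregate multiple-choice QA accuracy per condition.
# All names and data here are illustrative, not taken from the paper.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    condition: str      # "model_only", "human_only", or "human_plus_assistant"
    chosen_index: int   # answer selected in this trial
    correct_index: int  # gold answer for the question


def accuracy_by_condition(trials: list[Trial]) -> dict[str, float]:
    """Compute multiple-choice accuracy separately for each condition."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for t in trials:
        totals[t.condition] += 1
        hits[t.condition] += int(t.chosen_index == t.correct_index)
    return {c: hits[c] / totals[c] for c in totals}


# Toy usage: the abstract's claim corresponds to the "human_plus_assistant"
# accuracy exceeding both the "model_only" and "human_only" baselines.
trials = [
    Trial("model_only", 2, 1),
    Trial("human_only", 1, 1),
    Trial("human_plus_assistant", 1, 1),
]
print(accuracy_by_condition(trials))
```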