Assessing the Self-Awareness of Language Models: A Deeper Understanding of What They Know
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well calibrated on a diverse range of multiple-choice and true/false questions when these are presented in the appropriate format. This allows us to approach self-evaluation on open-ended sampling tasks by asking models to first propose answers and then to estimate the probability "P(True)" that those answers are correct. We find encouraging performance, calibration, and scaling for P(True) across a varied set of tasks. Self-evaluation improves further when models are allowed to consider many of their own samples before predicting the validity of one specific proposition. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately when relevant source materials are provided in the context and when hints toward the solution of mathematical word problems are given. We hope these observations lay the groundwork for developing more transparent models, and for further investigation into how this transparency generalizes to cases where models are trained on objectives other than imitating human writing.
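
As a rough illustration of the P(True) procedure described above, the sketch below shows one way such a self-evaluation prompt might be assembled in Python. The `generate` and `token_probability` callables, the prompt wording, and the default sample count are hypothetical stand-ins for whatever sampling and scoring interface a given model exposes, not the exact setup used in the experiments.

```python
from typing import Callable, List, Tuple

def p_true_self_eval(
    question: str,
    generate: Callable[[str, int], List[str]],       # hypothetical: returns n sampled completions for a prompt
    token_probability: Callable[[str, str], float],  # hypothetical: returns P(next text == target | prompt)
    num_samples: int = 5,
) -> List[Tuple[str, float]]:
    """Ask the model to propose answers, then score each with P(True).

    Showing the model several of its own samples before it judges one
    specific proposed answer tends to improve self-evaluation.
    """
    # Step 1: sample several candidate answers to the question.
    candidates = generate(f"Question: {question}\nAnswer:", num_samples)

    scored = []
    for proposed in candidates:
        # Step 2: build a self-evaluation prompt that shows the brainstormed
        # samples as context plus the single answer being judged.
        brainstorm = "\n".join(f"Possible Answer: {c}" for c in candidates)
        eval_prompt = (
            f"Question: {question}\n"
            f"{brainstorm}\n"
            f"Proposed Answer: {proposed}\n"
            "Is the proposed answer:\n (A) True\n (B) False\n"
            "The proposed answer is:"
        )
        # Step 3: P(True) is the probability the model assigns to the "True" option.
        p_true = token_probability(eval_prompt, " (A)")
        scored.append((proposed, p_true))
    return scored
```

In this sketch, passing the model's own brainstormed samples into the evaluation prompt mirrors the observation that self-evaluation improves when the model can compare a proposition against several of its own attempts before judging it.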