Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation,
this enhanced safety often comes with the side effect of over-refusal, where LLMs reject innocuous prompts and become less helpful.
Although over-refusal has been observed empirically, measuring it systematically is challenging
because it is difficult to craft prompts that appear harmful yet are actually benign.
We introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses.
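For readers who want to experiment with the benchmark, the sketch below shows one way to load the three prompt sets with the Hugging Face `datasets` library. The repository ID and subset names are assumptions for illustration and may differ from the actual release.

```python
# Minimal sketch for loading the OR-Bench prompt sets.
# NOTE: the repository ID and subset names below are assumptions for
# illustration; check the official release for the exact identifiers.
from datasets import load_dataset

REPO_ID = "bench-llm/or-bench"  # assumed Hugging Face dataset ID

# The three prompt sets described above: 80K seemingly toxic prompts,
# ~1K hard prompts, and 600 truly toxic prompts.
or_bench_80k = load_dataset(REPO_ID, "or-bench-80k", split="train")
or_bench_hard = load_dataset(REPO_ID, "or-bench-hard-1k", split="train")
or_bench_toxic = load_dataset(REPO_ID, "or-bench-toxic", split="train")

print(or_bench_80k)     # inspect the available columns (e.g., prompt, category)
print(or_bench_80k[0])  # look at a single seemingly toxic prompt
```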
We plot the evaluation results in the figure below. The x-axis is the rejection rate on seemingly toxic prompts and the y-axis is the rejection rate on truly toxic prompts. Ideally, a model sits in the top-left corner, rejecting as many toxic prompts as possible while rejecting as few seemingly toxic prompts as possible.
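As a concrete illustration of how a model's position on this plot can be computed, the sketch below scores a set of model responses on both prompt sets. The keyword-based refusal check is a simplification introduced only for illustration; in practice a stronger judge (for example, an LLM-based classifier) would decide whether a response is a refusal.

```python
# Sketch: computing the two rejection rates that define a model's position
# on the plot. The keyword-based refusal check is a simplification used
# only for illustration; it is not the benchmark's official judge.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude heuristic: treat a response as a refusal if it opens with a common refusal phrase."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def rejection_rate(responses: list[str]) -> float:
    """Fraction of responses that are refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)

# These lists would hold the model's answers to the seemingly toxic
# and the truly toxic prompts, respectively (toy examples shown here).
responses_seemingly_toxic = ["I'm sorry, I can't help with that.", "Sure, here is a safe explanation..."]
responses_toxic = ["I cannot assist with that request."]

x = rejection_rate(responses_seemingly_toxic)  # x-axis: lower is better
y = rejection_rate(responses_toxic)            # y-axis: higher is better
print(f"over-refusal rate (x): {x:.2f}, toxic rejection rate (y): {y:.2f}")
```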
Citation
@article{cui2024or,
  title={OR-Bench: An Over-Refusal Benchmark for Large Language Models},
  author={Cui, Justin and Chiang, Wei-Lin and Stoica, Ion and Hsieh, Cho-Jui},
  journal={arXiv preprint arXiv:2405.20947},
  year={2024}
}