RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
The paper introduces RedBench, a unified dataset designed to improve how large language models (LLMs) are evaluated for safety and robustness. Existing red-teaming datasets are inconsistent in how they categorize risks and in which attack types they cover, which makes systematic model testing difficult. RedBench aggregates and standardizes 37 existing datasets into a consistent taxonomy of risk categories and domains, comprising tens of thousands of adversarial and refusal prompts. The authors analyze gaps in current datasets, provide baseline evaluations for modern LLMs, and open-source the dataset and evaluation code to support better, more comprehensive LLM safety research.
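To make the evaluation setup concrete, below is a minimal sketch of how a benchmark of this shape might be scored. The record fields (`prompt`, `risk_category`, `type`) and the keyword-based refusal detector are illustrative assumptions, not RedBench's actual schema or judging method; the paper's released evaluation code defines the real interface.

```python
from collections import defaultdict

# Hypothetical records in the shape the paper describes: each prompt is
# tagged with a risk category and whether it is adversarial (the model
# should refuse) or a refusal-test prompt (the model should answer).
samples = [
    {"prompt": "...", "risk_category": "illegal_activity", "type": "adversarial"},
    {"prompt": "...", "risk_category": "benign", "type": "refusal"},
]

def is_refusal(response: str) -> bool:
    """Crude keyword-based refusal detector; real evaluations typically
    use an LLM judge or a trained classifier instead."""
    markers = ("i can't", "i cannot", "i'm sorry", "i am unable")
    text = response.lower()
    return any(m in text for m in markers)

def evaluate(samples, generate):
    """Tally refusal rates per (risk_category, prompt type).

    `generate` is any callable mapping a prompt string to a model
    response. For adversarial prompts a high refusal rate is good;
    for refusal-test prompts it signals over-refusal.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [refusals, total]
    for s in samples:
        refused = is_refusal(generate(s["prompt"]))
        key = (s["risk_category"], s["type"])
        counts[key][0] += int(refused)
        counts[key][1] += 1
    return {key: refusals / total for key, (refusals, total) in counts.items()}
```

Breaking scores out per (category, prompt-type) pair mirrors the taxonomy the paper standardizes: it exposes both unsafe compliance on adversarial prompts and over-refusal on benign ones in a single pass.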