DecodingTrust is a comprehensive benchmark for evaluating the trustworthiness of GPT models.

This research endeavor is designed to help researchers and practitioners better understand the capabilities, limitations, and potential risks involved in deploying these state-of-the-art Large Language Models (LLMs).
The project is organized around eight primary perspectives of trustworthiness:

  • Toxicity
  • Stereotype and bias
  • Adversarial robustness
  • Out-of-distribution robustness
  • Privacy
  • Robustness to adversarial demonstrations
  • Machine ethics
  • Fairness
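The eight perspectives above can be enumerated as plain data, for example to drive a per-perspective evaluation loop. The sketch below is purely illustrative: the perspective names come from this list, but the `evaluate` stub and its signature are hypothetical assumptions, not the project's actual API.

```python
# Hypothetical sketch: iterate over DecodingTrust's eight trustworthiness
# perspectives. The `evaluate` stub is an illustrative placeholder, not
# the project's real interface.

PERSPECTIVES = [
    "toxicity",
    "stereotype and bias",
    "adversarial robustness",
    "out-of-distribution robustness",
    "privacy",
    "robustness to adversarial demonstrations",
    "machine ethics",
    "fairness",
]

def evaluate(model_name: str, perspective: str) -> dict:
    """Placeholder: a real run would score `model_name` on the
    benchmark tasks associated with `perspective`."""
    return {"model": model_name, "perspective": perspective, "score": None}

if __name__ == "__main__":
    results = [evaluate("some-gpt-model", p) for p in PERSPECTIVES]
    print(len(results))  # one result per perspective
```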

Paper: https://arxiv.org/abs/2306.11698
Repo: https://github.com/AI-secure/DecodingTrust