Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection
By: Trishita Dhara, Siddhesh Sheth
Published: 2026-03-20
Category: cs.AI
Abstract
This research moves beyond traditional accuracy metrics for evaluating harmful-content detection systems by adding an explainability-driven analysis. It investigates how the explanations produced by AI models can reveal biases, vulnerabilities, and failure modes in harmful-content detection, which is crucial for building transparent and responsible AI systems. The findings offer concrete directions for improving the robustness and fairness of moderation tools, contributing to safer online environments.
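The abstract does not include code, but the core idea, that per-token explanations can surface biases an accuracy score hides, can be illustrated with a toy example. The following sketch is entirely hypothetical (the lexicon, weights, and function names are invented, not taken from the paper): a tiny linear classifier whose token-level attributions expose a spurious weight on an identity term.

```python
import math

# Toy linear model: each known token contributes a fixed weight to the logit.
# Weights are invented for illustration only.
WEIGHTS = {
    "hate": 2.0,
    "stupid": 1.5,
    "love": -1.0,
    # A spurious positive weight on an identity term -- exactly the kind of
    # bias an explanation-driven audit is meant to expose.
    "muslim": 1.2,
}
BIAS = -1.0

def predict_with_explanation(text):
    """Return (harmfulness probability, per-token attribution dict)."""
    tokens = text.lower().split()
    contributions = {t: WEIGHTS.get(t, 0.0) for t in tokens}
    logit = BIAS + sum(contributions.values())
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, contributions

prob, expl = predict_with_explanation("muslim people love peace")
# The identity token alone pushes the score toward "harmful" even in a
# benign sentence -- a failure mode the attribution makes visible while
# an aggregate accuracy metric would not.
print(round(prob, 3), expl["muslim"])
```

An accuracy-only evaluation could miss this model's behavior entirely if the test set contains few benign sentences mentioning identity terms; inspecting the attributions exposes the bias directly.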