Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection
By: Trishita Dhara, Siddhesh Sheth
Published: 2026-03-20
Category: cs.AI
Abstract
This research moves beyond traditional accuracy metrics for evaluating harmful-content detection systems by adding an explainability-driven analysis. It investigates how the explanations produced by AI models can reveal biases, vulnerabilities, and failure modes in harmful-content detection, which is crucial for building transparent and responsible AI systems. The findings offer concrete directions for improving the robustness and fairness of moderation tools, contributing to safer online environments.
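The abstract does not include code, but the core idea, that per-token explanations can surface biases an accuracy score hides, can be illustrated with a toy example. The following sketch is entirely hypothetical (the lexicon, weights, and function names are invented, not taken from the paper): a tiny linear classifier whose token-level attributions expose a spurious weight on an identity term.

```python
import math

# Toy linear model: each known token contributes a fixed weight to the logit.
# Weights are invented for illustration only.
WEIGHTS = {
    "hate": 2.0,
    "stupid": 1.5,
    "love": -1.0,
    # A spurious positive weight on an identity term -- exactly the kind of
    # bias an explanation-driven audit is meant to expose.
    "muslim": 1.2,
}
BIAS = -1.0

def predict_with_explanation(text):
    """Return (harmfulness probability, per-token attribution dict)."""
    tokens = text.lower().split()
    contributions = {t: WEIGHTS.get(t, 0.0) for t in tokens}
    logit = BIAS + sum(contributions.values())
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob, contributions

prob, expl = predict_with_explanation("muslim people love peace")
# The identity token alone pushes the score toward "harmful" even in a
# benign sentence -- a failure mode the attribution makes visible while
# an aggregate accuracy metric would not.
print(round(prob, 3), expl["muslim"])
```

An accuracy-only evaluation could miss this model's behavior entirely if the test set contains few benign sentences mentioning identity terms; inspecting the attributions exposes the bias directly.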