Scalable Knowledge Graph Construction from Noisy Text with Large Language Models

By: Dr. Anya Petrova, Prof. Serhii Kovalenko, Dr. Elena Vasylenko, Dmytro Kuzmenko, Olena Mykhailiuk

Published: 2025-12-22

Category: cs.AI

Abstract

This paper presents a novel framework for automatically constructing large-scale knowledge graphs from unstructured, noisy text data by leveraging the advanced capabilities of large language models. It addresses challenges in entity recognition, relation extraction, and knowledge fusion, demonstrating significant improvements in scalability and accuracy compared to previous methods, with clear potential for enterprise data management and semantic search applications.

Impact

practical

💡 Simple Explanation

Imagine trying to build a family tree from a box of messy, handwritten, coffee-stained letters. Traditional computers struggle to read them. This paper proposes a new method using advanced AI (like ChatGPT) to first 'clean up' the digital text and then strictly organize the information into a map (Knowledge Graph). This helps companies turn messy emails, chats, and logs into structured databases they can actually query.

🎯 Problem Statement

Constructing accurate Knowledge Graphs usually requires clean, grammatical text, but real-world data (chats, OCR output, logs) is noisy. Existing methods either fail to parse this noise or, when they lean on LLMs, suffer from hallucinations: the model invents facts or produces plausible-looking but incorrect relationships.

🔬 Methodology

The authors use a multi-stage pipeline. First, a 'Denoising Adapter' (a fine-tuned smaller LLM) rewrites noisy input segments into canonical form. Second, 'Schema-Constrained Instruction Tuning' forces the main LLM to extract entities and relations strictly according to a predefined ontology. Finally, a graph-based consistency check resolves conflicting triples by analyzing the global graph topology.
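
To make the pipeline concrete, here is a minimal, self-contained sketch of the three stages. The paper does not publish code, so every name below (the toy ontology, the stubbed LLM calls, the functional-relation rule) is an illustrative assumption rather than the authors' implementation; the schema gate and majority-vote conflict resolution convey the general idea only.

```python
# Minimal sketch of the three-stage pipeline: denoise -> schema-constrained
# extraction -> graph consistency check. All names and rules here are
# illustrative assumptions; the LLM calls are replaced with trivial stubs.
from collections import Counter
from dataclasses import dataclass

# Hypothetical ontology: the (head_type, relation, tail_type) patterns the
# extractor is allowed to emit. The paper uses a predefined ontology; these
# two patterns are placeholders.
ONTOLOGY = {
    ("Person", "works_for", "Organization"),
    ("Organization", "acquired", "Organization"),
}

@dataclass(frozen=True)
class Triple:
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str

def denoise(segment: str) -> str:
    """Stage 1: the 'Denoising Adapter' (a fine-tuned small LLM) rewrites
    noisy input into canonical form. Stubbed here as trivial whitespace and
    typo normalization."""
    return " ".join(segment.split()).replace("wrks", "works")

def extract(clean_text: str) -> list[Triple]:
    """Stage 2: schema-constrained extraction. A real system would prompt
    the main LLM with the ontology embedded; this stub returns fixed
    candidates and then applies the schema gate, which is the key step."""
    candidates = [
        Triple("Alice", "Person", "works_for", "Acme", "Organization"),
        Triple("Alice", "Person", "married_to", "Bob", "Person"),  # off-schema
    ]
    return [
        t for t in candidates
        if (t.head_type, t.relation, t.tail_type) in ONTOLOGY
    ]

def consistency_filter(triples: list[Triple]) -> list[Triple]:
    """Stage 3 (simplified): the paper resolves conflicts via global graph
    topology; this toy version only enforces functional relations (one tail
    per head), keeping the majority-supported triple."""
    functional = {"works_for"}
    kept, seen = [], set()
    for triple, _count in Counter(triples).most_common():  # best-supported first
        key = (triple.head, triple.relation)
        if triple.relation in functional and key in seen:
            continue  # conflicting, lower-support triple: drop it
        seen.add(key)
        kept.append(triple)
    return kept

if __name__ == "__main__":
    noisy = "Alice   wrks for  Acme"
    for t in consistency_filter(extract(denoise(noisy))):
        print(t)
```

The design point worth noting is that the ontology acts as a hard gate between generation and the graph: anything the LLM emits outside the schema is discarded before the consistency stage ever sees it.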

📊 Results

The proposed framework achieved a 15% increase in F1 score on the 'Noisy-RE' benchmark compared to vanilla GPT-4 extraction. It reduced hallucination rates by 22% via the consistency check module. Scalability tests showed linear time complexity relative to input size, processing 1M documents in under 4 hours on a standard GPU cluster.

Key Takeaways

LLMs can be effectively tamed for structured extraction from messy text if wrapped in a rigorous pipeline of denoising and schema validation. The hybrid approach of 'GenAI + Symbolic Logic' (Graph consistency) is the path forward for reliable enterprise AI.

🔍 Critical Analysis

The paper tackles a very real problem—real-world data is never clean. However, the reliance on LLMs for 'denoising' might introduce subtle semantic shifts that are hard to detect. The claim of scalability is relative; while better than manual annotation, the token costs for processing terabytes of logs would still be prohibitive for many companies. The evaluation on 'noisy' text needs to be scrutinized to ensure it represents true real-world chaos (e.g., ASR errors, slang) rather than just synthetic noise.
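
To put the cost concern in perspective, a rough back-of-envelope estimate (all figures below are assumed round numbers for illustration, not prices or data from the paper): at roughly 4 bytes per token, 1 TB of logs is about 250 billion tokens; at an assumed $1 per million input tokens, that is on the order of $250,000 per pass, before output tokens or retries.

```python
# Back-of-envelope cost of LLM-processing 1 TB of logs.
# Every figure is an assumed round number, not a quoted price.
bytes_per_token = 4                 # rough average for English text
usd_per_million_tokens = 1.0        # assumed input-token price
tokens = 1e12 / bytes_per_token     # 1 TB -> ~2.5e11 tokens
cost = tokens / 1e6 * usd_per_million_tokens
print(f"{tokens:.2e} tokens -> ~${cost:,.0f}")  # 2.50e+11 tokens -> ~$250,000
```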

💰 Practical Applications

  • API-based extraction service charged per document
  • Enterprise on-premise deployment for secure data
  • Consulting for custom schema design and ontology mapping

🏷️ Tags

#Knowledge Graph · #LLM · #NLP · #Data Engineering · #Noise Reduction · #Information Extraction

🏢 Relevant Industries

Business Intelligence · Social Media Analysis · Healthcare · Fintech · Cybersecurity