Scalable Knowledge Graph Construction from Noisy Text with Large Language Models
By: Dr. Anya Petrova, Prof. Serhii Kovalenko, Dr. Elena Vasylenko, Dmytro Kuzmenko, Olena Mykhailiuk
Published: 2025-12-22
Abstract
This paper presents a novel framework for automatically constructing large-scale knowledge graphs from unstructured, noisy text data by leveraging the advanced capabilities of large language models. It addresses challenges in entity recognition, relation extraction, and knowledge fusion, demonstrating significant improvements in scalability and accuracy compared to previous methods, with clear potential for enterprise data management and semantic search applications.
Impact: practical
💡 Simple Explanation
Imagine trying to build a family tree from a box of messy, handwritten, coffee-stained letters. Traditional computers struggle to read them. This paper proposes a new method using advanced AI (like ChatGPT) to first 'clean up' the digital text and then strictly organize the information into a map (Knowledge Graph). This helps companies turn messy emails, chats, and logs into structured databases they can actually query.
🎯 Problem Statement
Constructing accurate Knowledge Graphs usually requires clean, grammatical text. Real-world data (chats, OCR output, logs) is noisy. Existing methods either fail to parse this noise or, when using LLMs, suffer from 'hallucinations', where the model invents incorrect facts or plausible-looking but wrong relationships.
🔬 Methodology
The authors use a three-stage pipeline. First, a 'Denoising Adapter' (a fine-tuned smaller LLM) rewrites noisy input segments into canonical forms. Second, the system applies 'Schema-Constrained Instruction Tuning', forcing the main LLM to extract entities and relations strictly according to a predefined ontology. Finally, a graph-based consistency check resolves conflicting triples by analyzing global graph topology.
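To make the pipeline concrete, here is a minimal, self-contained Python sketch of the three stages. The function names (`denoise`, `extract_triples`, `resolve_conflicts`), the toy `ONTOLOGY`, and the example triple are illustrative assumptions, not the authors' actual API; the two LLM calls are stubbed so the control flow runs end to end.

```python
# Minimal sketch of the three-stage pipeline described above.
# All names and data here are illustrative stand-ins; the LLM calls
# are stubbed out so the script runs as-is.

from collections import Counter

# Stage 2 assumption: a tiny predefined ontology of allowed relations
# with their (subject type, object type) signatures.
ONTOLOGY = {
    "works_for": ("Person", "Organization"),
    "located_in": ("Organization", "Location"),
}

def denoise(segment: str) -> str:
    """Stage 1: rewrite a noisy segment into canonical form.
    Stand-in for the fine-tuned 'Denoising Adapter' LLM."""
    return " ".join(segment.split())  # placeholder normalization only

def extract_triples(text: str) -> list[tuple[str, str, str, str, str]]:
    """Stage 2: schema-constrained extraction.
    Stand-in for the instruction-tuned main LLM; returns
    (subject, subject_type, relation, object, object_type) candidates."""
    return [("Anya Petrova", "Person", "works_for", "AcmeCorp", "Organization")]

def schema_valid(triple) -> bool:
    """Reject any candidate whose relation or type signature
    is not in the ontology (the schema-constrained step)."""
    subj, s_type, rel, obj, o_type = triple
    return ONTOLOGY.get(rel) == (s_type, o_type)

def resolve_conflicts(triples):
    """Stage 3: toy consistency check -- for each (subject, relation)
    pair, keep the most frequently supported object."""
    support = Counter(triples)
    best = {}
    for (s, st, r, o, ot), count in support.items():
        key = (s, r)
        if key not in best or count > best[key][1]:
            best[key] = ((s, st, r, o, ot), count)
    return [t for t, _ in best.values()]

def build_graph(noisy_segments):
    candidates = []
    for seg in noisy_segments:
        clean = denoise(seg)
        candidates.extend(t for t in extract_triples(clean) if schema_valid(t))
    return resolve_conflicts(candidates)

if __name__ == "__main__":
    print(build_graph(["Anya   Petrova works  for AcmeCorp!!"]))
```

The key design point is that the ontology filter and the conflict resolution sit outside the LLM: generation proposes, symbolic checks dispose.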
📊 Results
The proposed framework achieved a 15% increase in F1 score on the 'Noisy-RE' benchmark compared to vanilla GPT-4 extraction. It reduced hallucination rates by 22% via the consistency check module. Scalability tests showed linear time complexity relative to input size, processing 1M documents in under 4 hours on a standard GPU cluster.
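A quick back-of-envelope check on the stated throughput (computed only from the figures above; the paper's cluster size is not given, so this is aggregate throughput, not per-GPU):

```python
# Back-of-envelope: 1M documents in under 4 hours implies roughly
# 69+ documents per second across the whole cluster.
docs, hours = 1_000_000, 4
print(f"{docs / (hours * 3600):.0f} docs/sec aggregate")  # ~69
```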
✨ Key Takeaways
LLMs can be effectively tamed for structured extraction from messy text if wrapped in a rigorous pipeline of denoising and schema validation. The hybrid approach of 'GenAI + Symbolic Logic' (Graph consistency) is the path forward for reliable enterprise AI.
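As a sketch of what the symbolic half of 'GenAI + Symbolic Logic' can look like: after the LLM proposes triples, a plain graph algorithm enforces a hard constraint that prompting alone cannot guarantee. The rule below (the 'located_in' relation must stay acyclic) and the place names are illustrative assumptions, not constraints taken from the paper.

```python
# Enforce a hard graph constraint on LLM-proposed triples:
# reject any (subject, object) edge that would close a directed cycle.
# The acyclicity rule and example data are illustrative assumptions.

def reject_cycle_closing_edges(edges):
    """Scan (subject, object) pairs in order and flag any edge that
    would close a directed cycle; flagged edges are not added."""
    accepted = {}          # adjacency list of edges kept so far
    violations = []
    for s, o in edges:
        # Would adding s -> o create a path back from o to s?
        stack, seen, cyclic = [o], set(), False
        while stack:
            node = stack.pop()
            if node == s:
                cyclic = True
                break
            if node in seen:
                continue
            seen.add(node)
            stack.extend(accepted.get(node, []))
        if cyclic:
            violations.append((s, o))
        else:
            accepted.setdefault(s, []).append(o)
    return violations

if __name__ == "__main__":
    triples = [("Lab-A", "Kyiv"), ("Kyiv", "Ukraine"), ("Ukraine", "Lab-A")]
    print(reject_cycle_closing_edges(triples))  # [('Ukraine', 'Lab-A')]
```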
🔍 Critical Analysis
The paper tackles a very real problem—real-world data is never clean. However, the reliance on LLMs for 'denoising' might introduce subtle semantic shifts that are hard to detect. The claim of scalability is relative; while better than manual annotation, the token costs for processing terabytes of logs would still be prohibitive for many companies. The evaluation on 'noisy' text needs to be scrutinized to ensure it represents true real-world chaos (e.g., ASR errors, slang) rather than just synthetic noise.
💰 Practical Applications
- API-based extraction service charged per document
- Enterprise on-premise deployment for secure data
- Consulting for custom schema design and ontology mapping