Machine Learning–Based Data Extraction Tools in Healthcare: A Systematic Review and Meta-Analysis
Systematic review of ML-based data extraction tools in healthcare.
February 10, 2025
Lifecycle Intelligence
Status: In Publication / Submitted
Authors: Zain Khalpey, Ujjawal Kumar, Nicholas King, Amina H. Khalpey
Abstract
The exponential growth of healthcare data, spanning electronic health records, clinical trial databases, regulatory submissions, and published literature, has created an urgent need for automated data extraction tools that can reliably transform unstructured and semi-structured information into actionable datasets. This systematic review and meta-analysis evaluates machine learning–based data extraction tools developed for healthcare applications.
Our review encompasses studies published between 2018 and 2024, examining tool performance across key metrics including accuracy, recall, precision, and processing efficiency. We analyze applications spanning clinical note extraction, adverse event detection, literature screening, and regulatory document processing.
The meta-analysis reveals that transformer-based architectures consistently outperform traditional NLP approaches, with domain-specific fine-tuning yielding significant accuracy improvements. However, challenges remain in handling complex medical terminology and multilingual documents, and in maintaining performance across diverse healthcare data formats. We propose a framework for evaluating and selecting ML extraction tools based on the requirements of specific healthcare use cases.
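
For readers unfamiliar with the performance metrics named in the abstract, the sketch below shows how accuracy, precision, and recall are conventionally computed from confusion-matrix counts for an extraction tool. It is illustrative only: the class and function names (ExtractionCounts, score) are hypothetical, the F1 score is included as a common companion metric that the abstract does not mention, and the counts in the usage example are made up rather than taken from the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionCounts:
    """Confusion-matrix counts for one tool on one labeled dataset (hypothetical container)."""
    tp: int  # items the tool extracted that are in the reference annotation
    fp: int  # items the tool extracted that are not in the reference annotation
    fn: int  # reference items the tool missed
    tn: int  # candidate items correctly left unextracted


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def score(counts: ExtractionCounts) -> dict:
    """Compute the standard definitions of accuracy, precision, recall, and F1."""
    total = counts.tp + counts.fp + counts.fn + counts.tn
    precision = _safe_div(counts.tp, counts.tp + counts.fp)
    recall = _safe_div(counts.tp, counts.tp + counts.fn)
    return {
        "accuracy": _safe_div(counts.tp + counts.tn, total),
        "precision": precision,
        "recall": recall,
        "f1": _safe_div(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Made-up counts for illustration; not data from the review.
    example = ExtractionCounts(tp=8, fp=2, fn=2, tn=88)
    for name, value in score(example).items():
        print(f"{name}: {value:.3f}")
```

Accuracy depends on the number of true negatives, which is often poorly defined for free-text extraction, so precision and recall are usually the more informative of the three when comparing tools.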