MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

arXiv · March 22, 2024 · Significant research

Summary

The paper introduces MedPromptX, a clinical decision support system using multimodal large language models (MLLMs), few-shot prompting (FP), and visual grounding (VG) for chest X-ray diagnosis, integrating imagery with EHR data. MedPromptX refines few-shot data dynamically for real-time adjustment to new patient scenarios and narrows the search area in X-ray images. The study introduces MedPromptX-VQA, a new visual question answering dataset, and demonstrates state-of-the-art performance with an 11% improvement in F1-score compared to baselines.

Keywords

Chest X-ray · MLLM · Few-shot prompting · Visual grounding · EHR

Read original article →

Get the weekly digest

Top AI stories from the GCC region, every week.

A multimodal approach for developing medical diagnoses with AI

MBZUAI · Invalid Date

MBZUAI doctoral student Mai A. Shaaban and colleagues developed MedPromptX, a system that analyzes chest X-rays and patient data to aid lung disease diagnoses. MedPromptX uses multimodal large language models with visual grounding and few-shot prompting, trained on a new dataset of 6,000 patient records (MedPromptX-VQA) derived from MIMIC-IV and MIMIC-CXR. The system addresses the challenge of incomplete electronic health records by leveraging the knowledge embedded in large language models to interpret lab results. Why it matters: This research advances AI-driven medical diagnostics by integrating diverse data sources and addressing data gaps, potentially leading to quicker and more accurate diagnoses.

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

arXiv · Jun 13

MBZUAI researchers introduce XrayGPT, a conversational medical vision-language model for analyzing chest radiographs and answering open-ended questions. The model aligns a medical visual encoder (MedClip) with a fine-tuned large language model (Vicuna) using a linear transformation. To enhance performance, the LLM was fine-tuned using 217k interactive summaries generated from radiology reports.

MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

arXiv · Jun 28

This paper introduces MOTOR, a multimodal retrieval and re-ranking approach for medical visual question answering (MedVQA) that uses grounded captions and optimal transport to capture relationships between queries and retrieved context, leveraging both textual and visual information. MOTOR identifies clinically relevant contexts to augment VLM input, achieving higher accuracy on MedVQA datasets. Empirical analysis shows MOTOR outperforms state-of-the-art methods by an average of 6.45%.

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

arXiv · Dec 13

MBZUAI researchers introduce UniMed-CLIP, a unified Vision-Language Model (VLM) for diverse medical imaging modalities, trained on the new large-scale, open-source UniMed dataset. UniMed comprises over 5.3 million image-text pairs across six modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus, created using LLMs to transform classification datasets into image-text formats. UniMed-CLIP significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs in zero-shot evaluations, improving over BiomedCLIP by +12.61 on average across 21 datasets while using 3x less training data.