When models see what isn’t there: Reducing hallucinations with FarSight

MBZUAI · Significant research

Summary

MBZUAI researchers developed FarSight, a plugin that reduces hallucinations in Multimodal Large Language Models (MLLMs). FarSight targets a failure mode in which an MLLM loses focus on the relevant image details while generating text, so that one early mistake compounds into further errors, known as snowball hallucinations. In tests on models such as LLaVA-1.5-7B, FarSight reduced these initial mistakes and, with them, the overall number of hallucinations. Why it matters: Reliable MLLMs are essential for applications that demand high accuracy, and fewer hallucinations make these models more useful in real-world scenarios.

Keywords

MLLM · hallucination · FarSight · MBZUAI · LLaVA-1.5

Advancing computer vision with common sense

MBZUAI

MBZUAI researchers are working to improve computer vision models by incorporating common-sense knowledge. They aim to address problems such as the generation of unrealistic human features, for example hands with the wrong number of fingers. By integrating common-sense facts, such as that humans typically have five fingers per hand, they seek to make deep learning models more reliable. Why it matters: This research could improve the accuracy and trustworthiness of AI-generated content, making it more suitable for real-world applications.

FAID: Fine-Grained AI-Generated Text Detection Using Multi-Task Auxiliary and Multi-Level Contrastive Learning

arXiv · May 20

MBZUAI researchers introduce FAID, a fine-grained AI-generated text detection framework capable of classifying text as human-written, LLM-generated, or collaboratively written. FAID utilizes multi-level contrastive learning and multi-task auxiliary classification to capture authorship and model-specific characteristics, and can identify the underlying LLM family. The framework outperforms existing baselines, especially in generalizing to unseen domains and new LLMs, and includes a multilingual, multi-domain dataset called FAIDSet.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · Dec 18

The paper introduces LongShOTBench, a benchmark for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. It addresses the limitations of existing datasets by combining temporal length with multimodal richness and by using human-validated samples. The paper also presents LongShOTAgent, an agentic system for analyzing long videos. Together, the benchmark and the agent highlight the challenges that long videos still pose for state-of-the-art MLLMs.