Middle East AI

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

arXiv · Significant research

Summary

Researchers at MBZUAI have developed GeoChat, a new vision-language model (VLM) specifically designed for remote sensing imagery. GeoChat addresses the limitations of general-domain VLMs in accurately interpreting high-resolution remote sensing data, offering both image-level and region-specific dialogue capabilities. The model is trained on a novel remote sensing multimodal instruction-following dataset and demonstrates strong zero-shot performance across tasks like image captioning and visual question answering.

Keywords

VLM · remote sensing · GeoChat · MBZUAI · multimodal

Related

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

arXiv

MBZUAI researchers introduce TerraFM, a scalable self-supervised learning model for Earth observation that uses Sentinel-1 and Sentinel-2 imagery. The model unifies radar and optical inputs through modality-specific patch embeddings and adaptive cross-attention fusion. TerraFM achieves strong generalization on classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

arXiv

MBZUAI researchers introduce PG-Video-LLaVA, a large multimodal model with pixel-level grounding capabilities for videos that also integrates audio cues for enhanced understanding. The model uses an off-the-shelf tracker and grounding module to localize objects in videos based on user prompts. PG-Video-LLaVA is evaluated on video question-answering and grounding benchmarks, with Vicuna used in place of GPT-3.5 during benchmarking so that results are reproducible.

From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region

arXiv

A new study compares vision-language models (VLMs) with YOLOv8 for identifying wastewater treatment plants (WWTPs) in satellite imagery across the MENA region. VLMs such as Gemma-3 demonstrate stronger zero-shot performance than YOLOv8, which was trained on a dataset of 83,566 satellite images from Egypt, Saudi Arabia, and the UAE. The research suggests that VLMs offer a scalable, annotation-free alternative for remote sensing of WWTPs.