Skip to content
GCC AI Research

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

arXiv · · Significant research

Summary

Researchers at MBZUAI have developed GeoChat, a new vision-language model (VLM) specifically designed for remote sensing imagery. GeoChat addresses the limitations of general-domain VLMs in accurately interpreting high-resolution remote sensing data, offering both image-level and region-specific dialogue capabilities. The model is trained on a novel remote sensing multimodal instruction-following dataset and demonstrates strong zero-shot performance across tasks like image captioning and visual question answering.

Keywords

VLM · remote sensing · GeoChat · MBZUAI · multimodal

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

Changing the landscape: A vision language model to revolutionize remote sensing

MBZUAI ·

MBZUAI, in partnership with IBM Research, is developing GeoChat+, a vision-language model (VLM) for multi-modal, temporal remote sensing image analysis. GeoChat+ builds on the previous GeoChat model, enhancing capabilities with multi-modal images from various Earth observation systems like Sentinel-1, Sentinel-2, Landsat, and high-resolution imagery. GeoChat+ will integrate data from multiple satellites at different times to detect environmental changes and analyze the impact on soil quality, air quality, and erosion. Why it matters: This advancement promises to revolutionize geographic data analysis, providing detailed reports for high-risk regions and aiding reforestation efforts.

A new vision-language model for analyzing remote sensing data | CVPR

MBZUAI ·

Researchers at MBZUAI, IBM Research, and other institutions have developed EarthDial, a new vision-language model (VLM) specifically designed to process geospatial data from remote sensing technologies. EarthDial handles data in multiple modalities and resolutions, processing images captured at different times to observe environmental changes. The model outperformed others on over 40 tasks including image classification, object detection, and change detection. Why it matters: This unified model bridges the gap between generic VLMs and domain-specific models, enabling complex geospatial data analysis for applications like disaster assessment and climate monitoring in the region.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv ·

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

New multimodal model brings pixel-level precision to satellite imagery

MBZUAI ·

MBZUAI researchers have developed GeoPixel, a new multimodal model for pixel grounding in remote sensing images. GeoPixel associates individual pixels with object categories, enabling detailed image analysis by linking language to objects at the pixel level. The model was trained on a new dataset and benchmark, outperforming existing systems in precision. Why it matters: This advancement enhances the utility of remote sensing data for critical applications like environmental management and disaster response by providing more granular and accurate image interpretation.