TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

arXiv · June 6, 2025 · Significant research

Summary

MBZUAI researchers introduce TerraFM, a scalable self-supervised learning model for Earth observation that uses Sentinel-1 and Sentinel-2 imagery. The model unifies radar and optical inputs through modality-specific patch embeddings and adaptive cross-attention fusion. TerraFM achieves strong generalization on classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench.
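To make the fusion idea concrete, here is a minimal NumPy sketch of modality-specific patch embeddings followed by cross-attention fusion. It is an illustration under stated assumptions, not TerraFM's actual implementation: the helper names (`patchify`, `cross_attention`), the use of learned query tokens, the band counts, the patch size, and the random weights are all hypothetical choices made for this example.

```python
import numpy as np


def patchify(image, patch):
    """Split a (C, H, W) array into (N, C * patch * patch) flattened patches."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, tokens, wq, wk, wv):
    """Single-head cross-attention: queries attend over tokens."""
    q = queries @ wq
    k = tokens @ wk
    v = tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
dim, patch = 32, 8

# Toy inputs (assumed shapes): 2 radar polarizations, 13 optical bands.
s1 = rng.normal(size=(2, 32, 32))   # stands in for a Sentinel-1 tile
s2 = rng.normal(size=(13, 32, 32))  # stands in for a Sentinel-2 tile

# One projection per modality, since the channel counts differ.
embed_s1 = rng.normal(scale=0.02, size=(2 * patch * patch, dim))
embed_s2 = rng.normal(scale=0.02, size=(13 * patch * patch, dim))

tokens_s1 = patchify(s1, patch) @ embed_s1  # (16, dim)
tokens_s2 = patchify(s2, patch) @ embed_s2  # (16, dim)

# Fusion (assumed design): query tokens attend over both modalities' tokens.
all_tokens = np.concatenate([tokens_s1, tokens_s2], axis=0)  # (32, dim)
queries = rng.normal(scale=0.02, size=(16, dim))
wq, wk, wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

fused, weights = cross_attention(queries, all_tokens, wq, wk, wv)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1 and shows how much a fused token draws from each radar or optical patch. In a trained model the projections and queries would be learned rather than sampled at random.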

Keywords

Earth observation · Foundation Model · Multisensor · Self-supervised learning · Remote sensing


Changing the landscape: A vision language model to revolutionize remote sensing

MBZUAI

MBZUAI, in partnership with IBM Research, is developing GeoChat+, a vision-language model (VLM) for multi-modal, temporal remote sensing image analysis. GeoChat+ builds on the earlier GeoChat model, adding support for multi-modal imagery from Earth observation sources such as Sentinel-1, Sentinel-2, Landsat, and high-resolution sensors. GeoChat+ will integrate data captured by multiple satellites at different times to detect environmental changes and analyze their impact on soil quality, air quality, and erosion. Why it matters: The model is intended to produce detailed reports for high-risk regions and to aid reforestation efforts, which could substantially change how geographic data is analyzed.

A new vision-language model for analyzing remote sensing data | CVPR

MBZUAI

Researchers at MBZUAI, IBM Research, and other institutions have developed EarthDial, a new vision-language model (VLM) specifically designed to process geospatial data from remote sensing technologies. EarthDial handles data in multiple modalities and resolutions, and processes images captured at different times to observe environmental changes. The model outperformed others on more than 40 tasks, including image classification, object detection, and change detection. Why it matters: This unified model bridges the gap between generic VLMs and domain-specific models, enabling complex geospatial data analysis for applications like disaster assessment and climate monitoring in the region.

New multimodal model brings pixel-level precision to satellite imagery

MBZUAI

MBZUAI researchers have developed GeoPixel, a new multimodal model for pixel grounding in remote sensing images. GeoPixel associates individual pixels with object categories, enabling detailed image analysis by linking language to objects at the pixel level. The model was trained on a new dataset and benchmark, outperforming existing systems in precision. Why it matters: This advancement enhances the utility of remote sensing data for critical applications like environmental management and disaster response by providing more granular and accurate image interpretation.

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

arXiv · Nov 24

Researchers at MBZUAI have developed GeoChat, a new vision-language model (VLM) specifically designed for remote sensing imagery. GeoChat addresses the limitations of general-domain VLMs in accurately interpreting high-resolution remote sensing data, offering both image-level and region-specific dialogue capabilities. The model is trained on a novel remote sensing multimodal instruction-following dataset and demonstrates strong zero-shot performance across tasks like image captioning and visual question answering.
