Skip to content
GCC AI Research

Search

Results for "SM-ViT"

Making computer vision more efficient with state-space models

MBZUAI ·

MBZUAI researchers developed GroupMamba, a new set of state-space models (SSMs) for computer vision that addresses limitations in existing SSMs related to computational efficiency and optimization challenges. GroupMamba introduces a new layer called modulated group mamba, improving efficiency and stability. In benchmark tests, GroupMamba performed as well as similar SSM systems, but more efficiently, offering a backbone for tasks like image classification, object detection, and segmentation. Why it matters: This research aims to bridge the gap between vision transformers and CNNs by improving SSMs, potentially leading to more efficient and powerful computer vision models.

MBZUAI team win industry computer vision award for best student paper

MBZUAI ·

An MBZUAI team led by Ph.D. student Dmitry Demidov won the Best Student Paper Award at VISAPP 2023 for their work on fine-grained visual classification. Their paper, 'Salient Mask-Guided Vision Transformer for Fine-Grained Classification,' introduces SM-ViT, a technique using a salient mask to improve Vision Transformer accuracy. The model focuses on defining characteristics of objects, outperforming standard ViT architecture, even with fewer or lower-resolution images. Why it matters: This award recognizes MBZUAI's contribution to advancing computer vision, particularly in applications requiring nuanced object recognition, such as robotics and automated systems.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv ·

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv ·

The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.

Generative models, manifolds and symmetries: From QFT to molecules

MBZUAI ·

A DeepMind researcher presented work on incorporating symmetries into machine learning models, with applications to lattice-QCD and molecular dynamics. The work includes permutation and translation-invariant normalizing flows for free-energy estimation in molecular dynamics. They also presented U(N) and SU(N) Gauge-equivariant normalizing flows for pure Gauge simulations and its extensions to incorporate fermions in lattice-QCD. Why it matters: Applying symmetry principles to generative models could improve AI's ability to model complex physical systems relevant to materials science and other fields in the region.

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

arXiv ·

MBZUAI researchers introduce VideoGPT+, a novel video Large Multimodal Model (LMM) that integrates image and video encoders to leverage both spatial and temporal information in videos. They also introduce VCGBench-Diverse, a comprehensive benchmark for evaluating video LMMs across 18 video categories. VideoGPT+ demonstrates improved performance on multiple video benchmarks, including VCGBench and MVBench.

Multimodality for story-level understanding and generation of visual data

MBZUAI ·

Vicky Kalogeiton from École Polytechnique discussed the importance of multimodality for story-level recognition and generation using video, audio, text, masks and clinical data. She presented on multimodal video understanding using FunnyNet-W and Short Film Dataset. She further showed examples of visual generation from text and other modalities (ET, CAD, DynamicGuidance). Why it matters: Multimodal AI research is growing globally, and this talk highlights the potential of combining different data types for enhanced understanding and generation, which could have implications for various applications, including those relevant to the Middle East.