Results for "SM-ViT"

MBZUAI team win industry computer vision award for best student paper

MBZUAI

An MBZUAI team led by Ph.D. student Dmitry Demidov won the Best Student Paper Award at VISAPP 2023 for their work on fine-grained visual classification. Their paper, 'Salient Mask-Guided Vision Transformer for Fine-Grained Classification,' introduces SM-ViT, a technique that uses a salient mask to improve Vision Transformer accuracy. The model focuses on the defining characteristics of objects and outperforms the standard ViT architecture, even with fewer or lower-resolution images. Why it matters: This award recognizes MBZUAI's contribution to advancing computer vision, particularly in applications requiring nuanced object recognition, such as robotics and automated systems.

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

arXiv · Jun 13

MBZUAI researchers introduce VideoGPT+, a novel video Large Multimodal Model (LMM) that integrates image and video encoders to leverage both spatial and temporal information in videos. They also introduce VCGBench-Diverse, a comprehensive benchmark for evaluating video LMMs across 18 video categories. VideoGPT+ demonstrates improved performance on multiple video benchmarks, including VCGBench and MVBench.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv · Jun 8

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

Making computer vision more efficient with state-space models

MBZUAI

MBZUAI researchers developed GroupMamba, a new set of state-space models (SSMs) for computer vision that addresses the computational-efficiency and optimization limitations of existing SSMs. GroupMamba introduces a new layer, called modulated group mamba, that improves efficiency and stability. In benchmark tests, GroupMamba performed as well as similar SSM systems while being more efficient, offering a backbone for tasks like image classification, object detection, and segmentation. Why it matters: This research aims to bridge the gap between vision transformers and CNNs by improving SSMs, potentially leading to more efficient and powerful computer vision models.

Continuous Saudi Sign Language Recognition: A Vision Transformer Approach

arXiv · Sep 3

The researchers introduce KAU-CSSL, the first continuous Saudi Sign Language (SSL) dataset focusing on complete sentences. They propose a transformer-based model using ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies. The model achieved 99.02% accuracy in signer-dependent mode and 77.71% in signer-independent mode, advancing communication tools for the SSL community.

Deep Surface Meshes

MBZUAI

Pascal Fua from EPFL presented an approach to implementing convolutional neural nets that output complex 3D surface meshes. The method overcomes limitations in converting implicit representations to explicit surface representations. Applications include single-view reconstruction, physically driven shape optimization, and biomedical image segmentation. Why it matters: This research advances geometric deep learning by enabling end-to-end trainable models for 3D surface mesh generation, with potential impact on various applications in computer vision and biomedical imaging in the region.