GCC AI Research

Results for "UniMorph"

A Glass Bead Game of *-ology: Contemporary Computational Approaches to Linguistic Morphology, Typology and Social Psychology

MBZUAI ·

Ekaterina Vylomova from the University of Melbourne gave a talk on using NLP models to advance research in linguistic morphology, typology, and social psychology. The talk covered using models to study morphology, phonetic changes in words over time, and diachronic changes in language semantics. Vylomova presented the UniMorph project, a cross-lingual annotation schema and database with morphological paradigms for over 150 languages. Why it matters: This research demonstrates the potential of NLP to contribute to a deeper understanding of language evolution and structure, with applications in linguistic research and the study of social and cultural changes.
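As an illustration of what a morphological paradigm database looks like in practice, here is a minimal sketch that reads UniMorph-style entries. It assumes the commonly distributed tab-separated layout (lemma, inflected form, semicolon-separated feature tags). The function name `parse_unimorph` and the sample rows are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def parse_unimorph(lines):
    """Group UniMorph-style rows (lemma<TAB>form<TAB>features) by lemma.

    Returns a dict mapping each lemma to a list of (form, features) pairs,
    where features is a tuple of tags such as ("V", "PST").
    """
    paradigms = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        lemma, form, feats = line.split("\t")
        paradigms[lemma].append((form, tuple(feats.split(";"))))
    return dict(paradigms)


# Hypothetical sample rows in the assumed three-column layout.
sample = [
    "walk\twalked\tV;PST",
    "walk\twalking\tV;V.PTCP;PRS",
    "go\twent\tV;PST",
]

paradigms = parse_unimorph(sample)
for lemma, entries in paradigms.items():
    print(lemma, entries)
```

Keeping a list per lemma, rather than a dict keyed by feature bundle, avoids silently dropping entries when a language lists several forms for the same features.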

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

arXiv ·

MBZUAI researchers introduce UniMed-CLIP, a unified Vision-Language Model (VLM) for diverse medical imaging modalities, trained on the new large-scale, open-source UniMed dataset. UniMed comprises over 5.3 million image-text pairs across six modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus, created using LLMs to transform classification datasets into image-text formats. UniMed-CLIP significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs in zero-shot evaluations, improving over BiomedCLIP by +12.61 on average across 21 datasets while using 3x less training data.

Teaching AI to predict what cells will look like before running any experiments

MBZUAI ·

MBZUAI researchers have developed MorphDiff, a diffusion model that predicts cell morphology from gene expression data. MorphDiff uses the transcriptome to generate realistic post-perturbation images, either from scratch or by transforming a control image. The model combines a Morphology Variational Autoencoder (MVAE) with a Latent Diffusion Model, enabling both gene-to-image generation and image-to-image transformation. Why it matters: This could significantly accelerate drug discovery and biological research by allowing scientists to preview cellular changes before conducting experiments.

User-Centric Gender Rewriting

MBZUAI ·

NYU and NYU Abu Dhabi researchers are working on user-centric gender rewriting in NLP, especially for Arabic. They are building an Arabic Parallel Gender Corpus and developing models for gender rewriting tasks. The work aims to address representational harms caused by NLP systems that don't account for user preferences regarding grammatical gender. Why it matters: This research promotes fairness and inclusivity in Arabic NLP by enabling systems to generate gender-specific outputs based on user preferences, mitigating biases present in training data.

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

arXiv ·

This paper explores language-independent alternatives to morphological segmentation for Arabic NLP using data-driven sub-word units, characters as a unit of learning, and word embeddings learned using a character CNN. The study evaluates these methods on machine translation and POS tagging tasks. Results show these methods achieve performance close to or surpassing state-of-the-art approaches. Why it matters: By offering simpler, more adaptable segmentation techniques, this research can help improve Arabic NLP applications across diverse domains and dialects.
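To make "data-driven sub-word units" concrete, the sketch below learns merge rules with byte-pair encoding (BPE), one common way to derive such units. The summary does not say which algorithm the paper used, so BPE is an assumption here, and `learn_bpe` is a hypothetical helper written for illustration only.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words.

    Each word starts as a sequence of characters; at every step the most
    frequent adjacent symbol pair is merged into one new symbol.
    """
    vocab = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["aab", "aab", "ab"], 5))
```

No language-specific analyzer is involved: the merge rules come only from symbol frequencies in the training text, which is what makes this family of methods language-independent.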

An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes

arXiv ·

This paper introduces a new non-statistical Arabic lemmatizer algorithm designed for information retrieval systems. The lemmatizer leverages Arabic language knowledge resources to generate accurate lemma forms and relevant features. The algorithm reaches a maximum accuracy of 94.8%, and 89.15% on documents it has not seen before, compared with 76.7% for the Stanford Arabic model on the same dataset. Why it matters: Accurate Arabic lemmatization is crucial for improving the performance of Arabic information retrieval systems, which can enhance access to Arabic language content.

Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling

arXiv ·

The paper introduces UAE-3D, a multi-modal VAE for 3D molecule generation that compresses molecules into a unified latent space, maintaining near-zero reconstruction error. This approach simplifies latent diffusion modeling by eliminating the need to handle multi-modality and equivariance separately. Experiments on GEOM-Drugs and QM9 datasets show UAE-3D establishes new benchmarks in de novo and conditional 3D molecule generation, with significant improvements in efficiency and quality.

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv ·

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The paper also introduces the Language Extension Pipeline (LEP) to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, which can improve LLM performance on Arabic text, and the released resources support further Arabic NLP research.
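Fertility is usually taken to mean the average number of tokens a tokenizer produces per word, so lower values mean more compact encodings. The sketch below computes it under that assumption, splitting words on whitespace. The paper's exact definition may differ, and the two toy tokenizers are hypothetical stand-ins, not AraToken or any baseline from the paper.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        words = text.split()
        n_words += len(words)
        n_tokens += sum(len(tokenize(w)) for w in words)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def char_tokenize(word):
    """Toy tokenizer: one token per character (high fertility)."""
    return list(word)


def chunk_tokenize(word, size=3):
    """Toy tokenizer: fixed-size chunks of characters (lower fertility)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


sample = ["كتب الطالب الدرس"]
baseline = fertility(sample, char_tokenize)
candidate = fertility(sample, chunk_tokenize)
# Relative reduction, the same kind of figure as the "18% lower" claim.
print(f"relative fertility reduction: {1 - candidate / baseline:.1%}")
```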