Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
Summary
This research presents a post-hoc framework to explain, verify, and align semantic hierarchies within vision-language model (VLM) encoders such as CLIP. The method applies agglomerative clustering to class centroids, names the resulting internal nodes via dictionary matching, and quantifies the plausibility of the induced hierarchy against human ontologies. Findings across 13 VLMs and 4 datasets show that image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies, exposing a trade-off between zero-shot accuracy and ontological plausibility.
Why it matters: This work provides critical insights into the internal organization of VLM embeddings, which is essential for improving their explainability, reliability, and alignment with human knowledge, leading to more trustworthy AI systems.
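To make the clustering step concrete, here is a minimal sketch of how class centroids can be organized into a hierarchy. It assumes precomputed, L2-normalized class centroids (in practice, mean VLM image or text embeddings per class); the class names, array sizes, and random placeholder data are illustrative, not taken from the paper's code.

```python
# Sketch: agglomerative clustering of class centroids into a binary hierarchy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Placeholder centroids: one L2-normalized vector per class. Real centroids
# would come from averaging a VLM's embeddings for each class.
class_names = ["cat", "dog", "car", "truck"]
centroids = rng.normal(size=(len(class_names), 512))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Agglomerative (average-linkage) clustering over cosine distances.
dists = pdist(centroids, metric="cosine")
tree = linkage(dists, method="average")

# The linkage matrix defines a hierarchy whose internal nodes can then be
# named (e.g., by matching merged-cluster centroids against a dictionary of
# concept embeddings) and compared to a human ontology such as WordNet.
print(dendrogram(tree, labels=class_names, no_plot=True)["ivl"])
```

The dendrogram produced this way is what the paper's later steps, node naming and plausibility scoring against a reference taxonomy, would operate on.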
Keywords
Vision-language models · Semantic hierarchies · Embedding spaces · Ontology alignment · Explainability