Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
Summary
This research presents a post-hoc framework to explain, verify, and align semantic hierarchies within vision-language model (VLM) encoders such as CLIP. The method applies agglomerative clustering to class centroids, names the resulting internal nodes via dictionary matching, and quantifies the plausibility of the induced hierarchy against human ontologies. Findings across 13 VLMs and 4 datasets show that image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies, exposing a trade-off between zero-shot accuracy and ontological plausibility.
Why it matters: This work provides critical insights into the internal organization of VLM embeddings, which is essential for improving their explainability, reliability, and alignment with human knowledge, leading to more trustworthy AI systems.
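To make the clustering step concrete, here is a minimal sketch of how class centroids can be organized into a hierarchy. It assumes precomputed, L2-normalized class centroids (in practice, mean VLM image or text embeddings per class); the class names, array sizes, and random placeholder data are illustrative, not taken from the paper's code.

```python
# Sketch: agglomerative clustering of class centroids into a binary hierarchy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Placeholder centroids: one L2-normalized vector per class. Real centroids
# would come from averaging a VLM's embeddings for each class.
class_names = ["cat", "dog", "car", "truck"]
centroids = rng.normal(size=(len(class_names), 512))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Agglomerative (average-linkage) clustering over cosine distances.
dists = pdist(centroids, metric="cosine")
tree = linkage(dists, method="average")

# The linkage matrix defines a hierarchy whose internal nodes can then be
# named (e.g., by matching merged-cluster centroids against a dictionary of
# concept embeddings) and compared to a human ontology such as WordNet.
print(dendrogram(tree, labels=class_names, no_plot=True)["ivl"])
```

The dendrogram produced this way is what the paper's later steps, node naming and plausibility scoring against a reference taxonomy, would operate on.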
Keywords
Vision-language models · Semantic hierarchies · Embedding spaces · Ontology alignment · Explainability