Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

arXiv · Mar 26

This research presents a post-hoc framework to explain, verify, and align the semantic hierarchies encoded by vision-language model (VLM) encoders such as CLIP. The method builds a tree by agglomerative clustering of class centroids, names the internal nodes via dictionary matching, and quantifies how plausible the resulting hierarchy is against human ontologies. Findings across 13 VLMs and 4 datasets show that image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies, which points to a trade-off between zero-shot accuracy and ontological plausibility.

Why it matters: The work shows how VLM embeddings are organized internally, which bears on their explainability, their reliability, and their alignment with human knowledge.
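The clustering step can be illustrated with a small sketch. It is not the paper's implementation: the summary does not state the linkage criterion or distance metric, so average linkage over cosine distance is an assumption here, and the function names (`centroid`, `cosine_distance`, `build_hierarchy`) and the three-dimensional toy embeddings are hypothetical. Node naming and plausibility scoring are left out.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_hierarchy(class_embeddings):
    """Agglomerative clustering of class centroids.

    class_embeddings maps a class name to a list of embedding vectors.
    Returns a nested tuple: leaves are class names, internal nodes are pairs.
    Assumption: average linkage with cosine distance.
    """
    # Each cluster is (subtree, list of member centroids).
    clusters = [
        (name, [centroid(vecs)])
        for name, vecs in sorted(class_embeddings.items())
    ]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average pairwise distance between the two clusters' centroids.
            dist = sum(
                cosine_distance(a, b) for a in members_i for b in members_j
            ) / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical toy embeddings: two animal classes and two vehicle classes.
toy_embeddings = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "dog": [[0.9, 0.3, 0.1], [1.0, 0.2, 0.0]],
    "car": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
    "truck": [[0.1, 0.2, 1.0], [0.0, 0.1, 0.9]],
}

tree = build_hierarchy(toy_embeddings)
print(tree)
```

On this toy data the two animal classes end up under one internal node and the two vehicle classes under another. In the framework described above, those internal nodes would then be named by dictionary matching and compared against a human ontology.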
