Jan Buchmann from TU Darmstadt presented research on NLP for long, structured documents at MBZUAI. The research addresses two gaps: making use of document structure and improving the verifiability of language model (LM) responses. Experiments showed that models learn to represent document structure during pre-training, and that larger models can cite sources well. Why it matters: This research contributes to making NLP more effective for complex documents such as scientific articles and legal texts, which is crucial for information accessibility.
The InterText project, funded by the European Research Council, aims to advance NLP by developing a framework for modeling fine-grained relationships between texts. This approach enables tracing the origin and evolution of texts and ideas. Iryna Gurevych from the Technical University of Darmstadt presented the intertextual approach to NLP, covering data modeling, representation learning, and practical applications. Why it matters: This research could enable a new generation of AI applications for text work and critical reading, with potential applications in collaborative knowledge construction and document revision assistance.
This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.
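The three CDTA steps (identify topics, map segments to topics, synthesize unified chunks) can be illustrated with a toy sketch. This is not the paper's implementation: the keyword lexicon `TOPIC_KEYWORDS`, the function names, and the sample documents are all hypothetical, keyword matching stands in for topic identification, and plain concatenation stands in for synthesis.

```python
from collections import defaultdict

# Hypothetical topic lexicon; a real system would discover topics itself.
TOPIC_KEYWORDS = {
    "licensing": {"licence", "license", "permit"},
    "penalties": {"fine", "penalty", "sanction"},
}


def assign_topics(segment: str) -> set[str]:
    """Return every topic whose keywords appear in the segment."""
    words = {w.strip(".,;:").lower() for w in segment.split()}
    return {topic for topic, keys in TOPIC_KEYWORDS.items() if words & keys}


def topic_aligned_chunks(documents: dict[str, list[str]]) -> dict[str, dict]:
    """Group segments from all documents by topic, keeping their sources."""
    grouped = defaultdict(list)
    for doc_id, segments in documents.items():
        for index, segment in enumerate(segments):
            for topic in assign_topics(segment):
                grouped[topic].append((doc_id, index, segment))

    chunks = {}
    for topic, items in grouped.items():
        chunks[topic] = {
            # Assumption: concatenation stands in for the synthesis step.
            "text": " ".join(segment for _, _, segment in items),
            # Source pointers let a generator cite where each part came from.
            "sources": [(doc_id, index) for doc_id, index, _ in items],
        }
    return chunks


# Made-up sample documents, split into segments.
documents = {
    "doc_a": ["A licence is required to operate.", "Late filing incurs a fine."],
    "doc_b": ["The penalty doubles on repeat offences.", "Each permit lasts one year."],
}
chunks = topic_aligned_chunks(documents)
```

Each resulting chunk gathers related material that was scattered across both documents and records the `(document, segment)` pairs it was built from, so a retriever can return one chunk per topic instead of several fragments.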
This paper introduces FIRE, an agent-based framework for fact-checking long-form text. FIRE integrates evidence retrieval and claim verification in a single iterative loop: at each step it decides whether to provide a final answer or to generate another search query. Experiments show that FIRE performs comparably to existing methods while reducing LLM costs by 7.6x and search costs by 16.5x.
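The answer-or-search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not FIRE's actual code: `Decision`, `Trace`, `check_claim`, the `max_steps` budget, and the "unverified" fallback are hypothetical, and the two stub functions stand in for an LLM call and a web search.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    kind: str   # "answer" or "search"
    value: str  # the verdict, or the next search query


@dataclass
class Trace:
    verdict: str
    queries: list[str] = field(default_factory=list)


def check_claim(claim, decide, search, max_steps=5):
    """Alternate between deciding and searching until a verdict is reached."""
    evidence = []
    queries = []
    for _ in range(max_steps):
        decision = decide(claim, evidence)
        if decision.kind == "answer":
            # Stopping early is what saves model and search calls.
            return Trace(decision.value, queries)
        queries.append(decision.value)
        evidence.extend(search(decision.value))
    # Assumption: when the step budget runs out, report "unverified".
    return Trace("unverified", queries)


def stub_decide(claim, evidence):
    """Stand-in for the LLM: answer once any evidence mentions the year."""
    if any("1889" in passage for passage in evidence):
        return Decision("answer", "supported")
    return Decision("search", "Eiffel Tower completion year")


def stub_search(query):
    """Stand-in for a search tool returning one canned passage."""
    return ["The Eiffel Tower was completed in 1889."]


result = check_claim("The Eiffel Tower was completed in 1889.", stub_decide, stub_search)
```

In this sketch a claim the model can already verify costs one decision call and no searches, while harder claims pay only for the queries they actually need.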
This paper introduces a Regulatory Knowledge Graph (RKG) for the Abu Dhabi Global Market (ADGM) regulations, constructed using language models and graph technologies. A portion of the regulations was manually tagged to train BERT-based models, which were then applied to the rest of the corpus. The resulting knowledge graph, stored in Neo4j, and the accompanying code are open-sourced on GitHub to promote advancements in compliance automation.
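As a rough picture of how tagged text might become graph edges, the sketch below turns tagger output into typed edges and runs one lookup. The summary does not describe the RKG schema, so the tag labels, edge types, rule identifiers, and function names are all hypothetical, and an in-memory dictionary stands in for Neo4j.

```python
from collections import defaultdict

# Hypothetical tagger output: (text span, label) pairs per regulation section.
tagged_sections = {
    "Rule 1.1": [("Authorised Person", "ENTITY"), ("annual report", "OBLIGATION")],
    "Rule 1.2": [("Authorised Person", "ENTITY"), ("Rule 1.1", "REFERENCE")],
}

# Hypothetical mapping from tag labels to edge types.
EDGE_TYPES = {"ENTITY": "MENTIONS", "OBLIGATION": "IMPOSES", "REFERENCE": "REFERS_TO"}


def build_graph(sections):
    """Turn tagged spans into (edge type, target) pairs keyed by section."""
    graph = defaultdict(list)
    for section_id, spans in sections.items():
        for text, label in spans:
            graph[section_id].append((EDGE_TYPES[label], text))
    return dict(graph)


def sections_linked_to(graph, target):
    """List the sections that have any edge pointing at the target node."""
    return sorted(
        section
        for section, edges in graph.items()
        if any(node == target for _, node in edges)
    )


graph = build_graph(tagged_sections)
linked = sections_linked_to(graph, "Authorised Person")
```

A lookup such as `sections_linked_to` is the kind of question a compliance tool might ask of the graph, for example which rules concern a given type of regulated entity.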