Jan Buchmann from TU Darmstadt presented research on NLP for long, structured documents at MBZUAI. The research addresses gaps in using document structure and improving the verifiability of LM responses. Experiments showed that models learn to represent document structure during pre-training, and larger models can cite sources well. Why it matters: This research contributes to making NLP more effective for complex documents like scientific articles and legal texts, which is crucial for information accessibility.
This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.
The InterText project, funded by the European Research Council, aims to advance NLP by developing a framework for modeling fine-grained relationships between texts. This approach enables tracing the origin and evolution of texts and ideas. Iryna Gurevych from the Technical University of Darmstadt presented the intertextual approach to NLP, covering data modeling, representation learning, and practical applications. Why it matters: This research could enable a new generation of AI applications for text work and critical reading, with potential applications in collaborative knowledge construction and document revision assistance.