NLP for Long, Structured Documents

MBZUAI · Notable

Summary

Jan Buchmann of TU Darmstadt presented research on NLP for long, structured documents at MBZUAI. The research addresses two gaps: making use of document structure and improving the verifiability of language model (LM) responses. Experiments showed that models learn to represent document structure during pre-training and that larger models can cite their sources well. Why it matters: This research contributes to making NLP more effective for complex documents such as scientific articles and legal texts, which is crucial for information accessibility.

Keywords

NLP · long documents · document structure · language models · MBZUAI

Related

Cross-Document Topic-Aligned Chunking for Retrieval-Augmented Generation

arXiv · Nov 8

This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.

Modeling Text as a Living Object

MBZUAI

The InterText project, funded by the European Research Council, aims to advance NLP by developing a framework for modeling fine-grained relationships between texts. This approach enables tracing the origin and evolution of texts and ideas. Iryna Gurevych of the Technical University of Darmstadt presented the intertextual approach to NLP, covering data modeling, representation learning, and practical applications. Why it matters: This research could enable a new generation of AI applications for text work and critical reading, with potential uses in collaborative knowledge construction and document revision assistance.
