Researchers from Alexandria University introduce AlexU-Word, a new dataset for offline Arabic handwriting recognition. The dataset contains 25,114 samples of 109 unique Arabic words, which together cover all letter shapes, collected from 907 writers. It is designed for closed-vocabulary word recognition and also supports systems based on segmented letter recognition. Why it matters: The dataset can help advance Arabic handwriting recognition systems, addressing a need for high-quality Arabic datasets in NLP research.
The Qatar Computing Research Institute (QCRI) has introduced PDNS-Net, a large heterogeneous graph dataset for malicious domain classification, containing 447K nodes and 897K edges. It is significantly larger than existing heterogeneous graph datasets such as IMDB and DBLP. Preliminary evaluations using graph neural networks indicate that further research is needed to improve model performance on large heterogeneous graphs. Why it matters: The dataset will enable researchers to develop and benchmark graph learning algorithms at a scale relevant to real-world cybersecurity applications, particularly for identifying and mitigating malicious online activity.
Palm is a new culturally inclusive and linguistically diverse dataset for Arabic LLMs, covering 22 Arab countries and featuring instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long community-driven project involving 44 researchers from across the Arab world. Evaluating frontier LLMs on the dataset reveals limitations in cultural and dialectal understanding, with some countries better represented than others.