News from Sapienza NLP

Sapienza NLP @ CLiC-it 2023

4 papers at CLiC-it!

Sapienza NLP will be at CLiC-it 2023 with 4 papers! Our works cover a diverse set of topics: entity disambiguation, the meaning of superhuman performance in NLU, multilingual relation extraction, and multilingual word alignment.
  • ExtEnD: Extractive Entity Disambiguation
  • What's the Meaning of Superhuman Performance in Today's NLU?
  • REDFM: a Filtered and Multilingual Relation Extraction Dataset
  • XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs

ExtEnD: Extractive Entity Disambiguation

by E. Barba, L. Procopio, R. Navigli

Local models for Entity Disambiguation (ED) have today become extremely powerful, in large part thanks to the advent of large pre-trained language models. However, despite their significant performance achievements, most of these approaches frame ED through classification formulations that have intrinsic limitations, both computationally and from a modeling perspective. In contrast with this trend, here we propose ExtEnD, a novel local formulation for ED where we frame this task as a text extraction problem, and present two Transformer-based architectures that implement it. Based on experiments in and out of domain, and training over two different data regimes, we find our approach surpasses all its competitors in terms of both data efficiency and raw performance. ExtEnD outperforms its alternatives by as few as 6 F1 points on the more constrained of the two data regimes and, when moving to the other, higher-resourced regime, sets a new state of the art on 4 out of 6 benchmarks under consideration, with average improvements of 0.7 F1 points overall and 1.1 F1 points out of domain. In addition, to gain better insights from our results, we also perform a fine-grained evaluation of our performance on different classes of label frequency, along with an ablation study of our architectural choices and an error analysis. We release our code and models for research purposes at https://
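The extractive formulation above can be illustrated with a minimal sketch: the mention's context and the textual descriptions of its candidate entities are concatenated into one input, so disambiguation reduces to extracting the span of the correct candidate. The input template, separator token, and candidate descriptions below are hypothetical placeholders, not the paper's exact format.

```python
# Sketch of ED as text extraction: build one input from context + candidates,
# then map a predicted character span back to the candidate it falls inside.

def build_input(context: str, candidates: list[str]) -> str:
    """Concatenate the context with all candidate descriptions."""
    return context + " [SEP] " + " [SEP] ".join(candidates)

def decode_span(text: str, start: int, end: int, candidates: list[str]) -> str:
    """Map a predicted character span back to the candidate containing it."""
    span = text[start:end]
    for cand in candidates:
        if span in cand:
            return cand
    raise ValueError("span does not match any candidate")

context = "Jordan played for the Bulls in <mention> Chicago </mention>."
candidates = [
    "Chicago: city in Illinois, United States",
    "Chicago: 2002 musical film",
]
text = build_input(context, candidates)
# Pretend the model extracted the span "city in Illinois".
start = text.index("city in Illinois")
prediction = decode_span(text, start, start + len("city in Illinois"), candidates)
print(prediction)  # resolves the mention to the city reading, not the film
```

In this framing there is no fixed output vocabulary of entities, which is one reason an extractive model can generalize better across label frequencies than a classification head.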

What's the Meaning of Superhuman Performance in Today's NLU?

by S. Tedeschi, J. Bos, T. Declerck, J. Hajič, D. Hershcovich, E. H. Hovy, A. Koller, S. Krek, S. Schockaert, R. Sennrich, E. Shutova, R. Navigli

In the last five years, there has been a significant focus in Natural Language Processing (NLP) on developing larger Pretrained Language Models (PLMs) and introducing benchmarks such as SuperGLUE and SQuAD to measure their abilities in language understanding, reasoning, and reading comprehension. These PLMs have achieved impressive results on these benchmarks, even surpassing human performance in some cases. This has led to claims of superhuman capabilities and the provocative idea that certain tasks have been solved. In this position paper, we take a critical look at these claims and ask whether PLMs truly have superhuman abilities and what the current benchmarks are really evaluating. We show that these benchmarks have serious limitations affecting the comparison between humans and PLMs and provide recommendations for fairer and more transparent benchmarks.

REDFM: a Filtered and Multilingual Relation Extraction Dataset

by P.-L. Huguet Cabot, S. Tedeschi, A.-C. Ngonga Ngomo, R. Navigli

Relation Extraction (RE) is a task that identifies relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address the above issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SREDFM, an automatically annotated dataset covering 18 languages, 400 relation types, and 13 entity types, totaling more than 40 million triplet instances. Second, we propose REDFM, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, that extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at
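An end-to-end RE model of this kind produces typed triplets of the form (subject, subject type, relation, object, object type). The sketch below parses a linearized seq2seq output into such triplets; the `<t>` marker and pipe-delimited fields are a hypothetical linearization for illustration, not mREBEL's actual output vocabulary.

```python
# Parse a linearized "<t> subj | subj_type | relation | obj | obj_type"
# decoder output into a list of typed triplets.
from typing import NamedTuple

class Triplet(NamedTuple):
    subj: str
    subj_type: str
    relation: str
    obj: str
    obj_type: str

def parse_linearized(output: str) -> list[Triplet]:
    """Split the decoded string on triplet markers, then on field separators."""
    triplets = []
    for chunk in output.split("<t>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        fields = [f.strip() for f in chunk.split("|")]
        if len(fields) == 5:  # ignore malformed generations
            triplets.append(Triplet(*fields))
    return triplets

decoded = "<t> Rome | LOC | capital of | Italy | LOC"
print(parse_linearized(decoded))
# [Triplet(subj='Rome', subj_type='LOC', relation='capital of', obj='Italy', obj_type='LOC')]
```

Keeping entity types inside the triplet, rather than predicting them separately, is what makes the extraction end-to-end: a single decoded sequence yields the complete structured fact.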

XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs

by F. Martelli, A. S. Bejgu, C. Campagnano, J. Čibej, R. Costa, A. Gantar, J. Kallas, S. Koeva, K. Koppel, S. Krek, M. Langemets, V. Lipp, S. Nimb, S. Olsen, B. S. Pedersen, V. Quochi, A. Salgado, L. Simon, C. Tiberius, R. Ureña-Ruiz, R. Navigli

Word alignment plays a crucial role in several NLP tasks, such as lexicon injection and cross-lingual label projection. The evaluation of word alignment systems relies heavily on manually-curated datasets, which are not always available, especially for mid- and low-resource languages. To address this limitation, we propose XL-WA, a novel, entirely manually-curated evaluation benchmark for word alignment covering 14 language pairs. We illustrate the creation process of our benchmark and compare statistical and neural approaches to word alignment in both language-specific and zero-shot settings, thus investigating the ability of state-of-the-art models to generalize to unseen language pairs. We release our new benchmark at:
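A word alignment is a set of (source index, target index) links between the words of a sentence pair, and aligners are commonly scored against a gold benchmark with Alignment Error Rate (AER). The sketch below shows the simple form of AER in which all gold links are "sure" links; the toy sentence pair and predictions are invented for illustration.

```python
# Score a predicted word alignment against gold links with Alignment Error
# Rate: AER = 1 - 2|gold ∩ pred| / (|gold| + |pred|). Lower is better.

def alignment_error_rate(gold: set[tuple[int, int]],
                         pred: set[tuple[int, int]]) -> float:
    """AER over (source index, target index) link sets, all links 'sure'."""
    if not gold and not pred:
        return 0.0
    overlap = len(gold & pred)
    return 1.0 - 2.0 * overlap / (len(gold) + len(pred))

# "the red house" -> "la casa rossa": gold links word positions across languages.
gold = {(0, 0), (1, 2), (2, 1)}
pred = {(0, 0), (1, 1), (2, 1)}  # the aligner mislinked "red"
print(round(alignment_error_rate(gold, pred), 3))  # 0.333
```

With a fully gold, manually-curated benchmark like the one described above, this metric can be computed directly, without the noise introduced by automatically projected alignments.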