AMR metrics: a suite

This repo collects AMR graph matching metrics.

Content of this repo

  1. Scripts for calculating Smatch and S2match (Soft Semantic match, pronounced [estuːmætʃ]), and scripts for calculating the structure error between two AMR graph banks.

  2. Current state-of-the-art performance of AMR parsers with respect to these metrics, plus some additional information.

Run Smatch or S2match with Python 2 and Python 3

Preparations

For S2match, word vectors (e.g., GloVe) need to be downloaded and stored in the vectors/ directory, for example with:

./download_glove.sh

The Python packages scipy and numpy (for similarity calculation) as well as networkx and penman (for the graph structure error) also need to be installed.
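To illustrate what the word vectors are used for, here is a minimal sketch of a cosine-similarity lookup over vectors in GloVe's plain-text format (one token per line, followed by its space-separated float components). The function names `load_vectors` and `cosine` are hypothetical, and the sketch uses only the standard library; it is not the repo's implementation, which relies on scipy and numpy.

```python
import math


def load_vectors(lines):
    """Parse GloVe-style text lines: a token followed by float components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (made up for illustration, not real GloVe values).
toy = ["cat 1.0 0.0 1.0", "kitten 1.0 0.0 0.5", "car 0.0 1.0 0.0"]
vecs = load_vectors(toy)
print(cosine(vecs["cat"], vecs["kitten"]) > cosine(vecs["cat"], vecs["car"]))  # True
```

With real vectors, `load_vectors` would be fed the lines of a file from the vectors/ directory instead of the toy list.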

Quickstart, extensive AMR evaluation

Using Python 2.7:

./easy_eval_py2.sh <file1> <file2>

Using Python 3.x:

./easy_eval_py3.sh <file1> <file2>

<file1> and <file2> are files in standard AMR format, i.e., AMRs separated by an empty line. See examples/ for sample files.
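As a rough illustration of this input format, the sketch below splits file content into one block per AMR, using empty lines as separators. The helper name `split_amr_blocks` and the sample graphs are hypothetical; any metadata lines inside a block (for example lines starting with `#`) are simply kept as part of that block.

```python
def split_amr_blocks(text):
    """Split file content into AMR blocks separated by one or more empty lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the block collected so far.
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


sample = """(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
      :ARG0 b))

(s / sleep-01
   :ARG0 (c / cat))
"""
blocks = split_amr_blocks(sample)
print(len(blocks))  # 2
```

A quick check like this can confirm that both input files contain the same number of graphs before running an evaluation.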

More details (e.g., running only S2match with different vectors)

See py2-Smatch-and-S2match or py3-Smatch-and-S2match.

AMR state-of-the-art

System IDs and short description

Evaluation Results on AMR 2.0 test (higher=better)

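| System | Smatch | S2match | Year | Code available |
|---|---|---|---|---|
| S2S-pretrainM | 81.4 | 82.5 | 2020 | yes |
| GSII | 80.3 | 81.5 | 2020 | yes |
| S2S-pretrain | 80.2 | 81.5 | 2020 | yes |
| GSII-noRecat | 78.6 | 79.9 | 2020 | yes |
| TBWT | 77.0 | 78.3 | 2020 | yes |
| STOG-BERT | 76.3 | 77.9 | 2019 | yes |
| STOG | 74.6 | ? | 2019 | yes |
| GPLA | 74.5 | 76.2 | 2018 | yes |
| TOP-DOWN | 73.2 | 75.0 | 2019 | yes |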

Structure error evaluation on AMR 2.0 test (lower=better)

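| System | Degree | Density | size(V) | size(E) |
|---|---|---|---|---|
| S2S-pretrainM | 0.070 | 0.0059 | 1.86 | 2.65 |
| GSII | 0.071 | 0.0070 | 1.87 | 2.59 |
| S2S-pretrain | 0.071 | 0.0062 | 2.03 | 2.80 |
| TBWT | 0.100 | 0.0058 | 2.76 | 4.17 |
| GPLA | 0.083 | 0.0068 | 1.99 | 2.90 |
| GSII-noRecat | 0.102 | 0.0073 | 2.14 | 2.75 |
| STOG-BERT | 0.082 | 0.0069 | 2.42 | 3.19 |
| STOG | ? | ? | ? | ? |
| TOP-DOWN | 0.110 | 0.0078 | 2.37 | 3.32 |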

Structure error is defined as the mean absolute deviation from the gold graph over all graph pairs.
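The definition can be sketched as follows: compute a statistic for each predicted graph and its gold counterpart, then average the absolute differences over all pairs. This is only an illustration under stated assumptions, not the repo's script (which uses networkx and penman): graphs are given as plain lists of (source, target) edges, the helper names are hypothetical, and density is assumed to be the directed density |E| / (|V| (|V| - 1)). The degree statistic is left out because its exact definition is not given here.

```python
def graph_stats(edges):
    """Return (size_V, size_E, density) for a directed graph given as (source, target) pairs."""
    nodes = {node for edge in edges for node in edge}
    n_nodes, n_edges = len(nodes), len(edges)
    # Assumption: directed density |E| / (|V| * (|V| - 1)); 0.0 for graphs with fewer than 2 nodes.
    density = n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
    return n_nodes, n_edges, density


def mean_absolute_deviation(predicted, gold):
    """Mean of |predicted_i - gold_i| over all aligned graph pairs."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("need one predicted value per gold value, and at least one pair")
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)


# Two toy graph pairs; the edges are made up for illustration.
pred_graphs = [[("w", "b"), ("w", "g")], [("s", "c")]]
gold_graphs = [[("w", "b"), ("w", "g"), ("g", "b")], [("s", "c")]]

pred_stats = [graph_stats(g) for g in pred_graphs]
gold_stats = [graph_stats(g) for g in gold_graphs]

# Structure error for size(E): the first pair differs by one edge, the second by none.
size_e_error = mean_absolute_deviation(
    [s[1] for s in pred_stats], [s[1] for s in gold_stats]
)
print(size_e_error)  # 0.5
```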

System dependencies

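| System | external data | word-embedding type | copying (src) | copying (tgt) | attention (src) | attention (tgt) | PrePro | recategorize | anon | notes |
|---|---|---|---|---|---|---|---|---|---|---|
| GSII | no | BERT | yes | no | yes | yes | CoreNLP, lemma/pos/ner | yes | yes | same pre/post proc as STOG |
| S2S-pretrainM | MT(de-en), AMR-silver | random | no | no | yes | yes | no | no | no | |
| S2S-pretrain | MT(de-en), AMR-silver, Constituency | random | no | no | yes | yes | no | no | no | |
| GSII-noRecat | no | BERT | yes | no | yes | yes | CoreNLP, lemma/pos/ner | no | no | |
| TBWT | no | BERT | no | no | yes | no | AMR2tree decomp (Lindemann 2019) | no(?) | no(?) | |
| STOG-BERT | no | BERT | yes | yes | yes | yes | CoreNLP, lemma/pos/ner | yes | yes | |
| STOG | no | GloVe 300d | yes | yes | yes | yes | CoreNLP, lemma/pos/ner | yes | yes | |
| GPLA | no | GloVe 300d | yes | no | no | no | CoreNLP, lemma/pos/ner | yes | no | |
| TOP-DOWN | no | GloVe 300d | yes | no | yes | no | CoreNLP, lemma/pos/ner | no | no | |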

References

S2S-pretrain: Dongqin Xu et al. "Improving AMR Parsing with Sequence-to-Sequence Pre-training" arXiv preprint arXiv:2010.01771 (2020). github

GSII: Deng Cai and Wai Lam. "AMR Parsing via Graph-Sequence Iterative Inference". arXiv preprint arXiv:2004.05572 (2020). github

TBWT: Lindemann et al. "Fast semantic parsing with well-typedness guarantees". arXiv preprint arXiv:2009.07365 (2020). github

GPLA: Chunchuan Lyu and Ivan Titov. "AMR Parsing as Graph Prediction with Latent Alignment." arXiv preprint arXiv:1805.05286 (2018). github

STOG: Sheng Zhang et al. "AMR Parsing as Sequence-to-Graph Transduction." arXiv preprint arXiv:1905.08704 (2019). github

TOP-DOWN: Deng Cai and Wai Lam. "Core Semantic First: A Top-down Approach for AMR Parsing." arXiv preprint arXiv:1909.04303 (2019). github

Smatch: Shu Cai and Kevin Knight. "Smatch: an evaluation metric for semantic feature structures." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2013.

S2match: Juri Opitz et al. "AMR Similarity Metrics from Principles" arXiv preprint arXiv:2001.10929 (2020).

Citation

If you like the idea, please consider citing:

@article{TACL2205,
	author = {Juri Opitz and Anette Frank and Letitia Parcalabescu},
	title = {AMR Similarity Metrics from Principles},
	journal = {Transactions of the Association for Computational Linguistics},
	volume = {8},
	number = {0},
	year = {2020},
	keywords = {},
	abstract = {Different metrics have been proposed to compare Abstract Meaning Representation (AMR) graphs. The canonical SMATCH metric (Cai and Knight, 2013) aligns the variables of two graphs and assesses triple matches. The recent SEMBLEU metric (Song and Gildea, 2019) is based on the machine-translation metric BLEU (Papineni et al., 2002) and increases computational efficiency by ablating the variable-alignment. In this paper, i) we establish criteria that enable researchers to perform a principled assessment of metrics comparing meaning representations like AMR; ii) we undertake a thorough analysis of SMATCH and SEMBLEU where we show that the latter exhibits some undesirable properties. For example, it does not conform to the identity of indiscernibles rule and introduces biases that are hard to control; iii) we propose a novel metric S2MATCH that is more benevolent to only very slight meaning deviations and targets the fulfilment of all established criteria. We assess its suitability and show its advantages over SMATCH and SEMBLEU.},
	issn = {2307-387X},
	pages = {522--538},
	url = {https://transacl.org/index.php/tacl/article/view/2205}
}

Changelog

Basis: the extended Smatch metrics were cloned from Lyu's extended Smatch metric repository, which in turn had been adapted from Marco Damonte's extended metrics.

Major consecutive changes

  1. added s2match

  2. added script to calculate graph structure deviation (see graph-strucutre-error)

  3. adapted for calculating extended metrics with s2match alignment (see scores-enhanced-s2align.py)

  4. Improved extended metrics for single-graph comparison. Previously, if a feature was absent from both A (e.g., the predicted graph) and B (e.g., the gold graph), for instance when neither contains a polarity edge, the default score for that feature was 0.0. Now this default score is 1.0, since A and B agree on the absence of the feature. This is highly unlikely to have an effect on corpus-level evaluation.

  5. copied all Python 2 scripts and made them Python 3 compatible
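The changed default from item 4 can be sketched as an F-score with a special case for features that are missing from both graphs. This is a hedged illustration only: the function name `f_score` and its count-based signature are hypothetical and do not mirror the repo's scripts.

```python
def f_score(n_match, n_pred, n_gold):
    """F1 over matched items; 1.0 when the feature is absent from both graphs."""
    if n_pred == 0 and n_gold == 0:
        return 1.0  # both graphs agree that the feature is absent
    if n_match == 0:
        return 0.0  # also covers the case where only one graph has the feature
    precision = n_match / n_pred
    recall = n_match / n_gold
    return 2 * precision * recall / (precision + recall)


print(f_score(0, 0, 0))  # 1.0: no polarity edge in either graph
print(f_score(0, 0, 1))  # 0.0: gold has a polarity edge, prediction does not
print(f_score(1, 1, 1))  # 1.0: the single polarity edge matches
```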