
🎋BAMBOO Meaning Graph Similarity Benchmark

➡️ Here are the updated results on BAMBOO.

Contribute your results:

  1. evaluate your metric
  2. open an issue or a pull request
    • pull request: update both tables below
    • issue: report your evaluation results, a link to your metric, and the commit hash (optional: paper link)

What do you need for evaluation on BAMBOO? It's simple.

You need a metric that takes as input two files, each containing n parallel AMR graphs (in the usual AMR SemBank Penman format), and outputs n meaning similarity scores (one per line). That's it.
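For illustration, here is a minimal sketch of that interface in Python. The details are our assumptions, not part of the benchmark: graphs separated by blank lines, '#'-prefixed metadata lines, and a trivial stand-in `similarity` that you would replace with your metric.

```python
import sys

def read_amr_bank(path):
    """Read a file of Penman AMR graphs separated by blank lines.

    Metadata/comment lines starting with '#' (e.g. '# ::id', '# ::snt') are skipped.
    """
    graphs, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                if current:
                    graphs.append("\n".join(current))
                    current = []
            elif not line.startswith("#"):
                current.append(line)
    if current:
        graphs.append("\n".join(current))
    return graphs

def similarity(amr_a, amr_b):
    # Stand-in score (exact string match after whitespace normalization);
    # replace this with your own metric.
    return float(" ".join(amr_a.split()) == " ".join(amr_b.split()))

if __name__ == "__main__":
    bank_a = read_amr_bank(sys.argv[1])
    bank_b = read_amr_bank(sys.argv[2])
    assert len(bank_a) == len(bank_b), "input files must be parallel"
    for amr_a, amr_b in zip(bank_a, bank_b):
        print(similarity(amr_a, amr_b))  # one score per line
```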

Evaluation example based on a simple BOW toy metric

We have prepared an example that shows how to test your AMR metric on the full benchmark.

See evaluation-suite/README.md
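As a rough illustration of what such a BOW toy metric could look like (this is our own sketch, not necessarily the exact toy metric used in evaluation-suite/README.md), the `similarity` stand-in from the sketch above could be replaced by a token-overlap score over the Penman strings:

```python
import re

def tokens(penman_str):
    """Crude bag of words: variables, concepts, roles, and constants in a Penman string."""
    return set(re.findall(r"[A-Za-z0-9\-]+", penman_str))

def bow_similarity(amr_a, amr_b):
    """Toy BOW metric: Dice overlap between the token sets of two Penman strings."""
    tok_a, tok_b = tokens(amr_a), tokens(amr_b)
    if not tok_a or not tok_b:
        return 0.0
    return 2 * len(tok_a & tok_b) / (len(tok_a) + len(tok_b))

# e.g. bow_similarity("(c / cat)", "(c / cat :mod (b / black))") == 4/7 ≈ 0.57
```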

Alternatively, metrics can be tested on parts of the benchmark, e.g., STS MAIN.

Version notes

Benchmark Results of Current Metrics and Evaluation Setup

latest update: 07/12/2023:

| Metric | STS | SICK | PARA | STS(reify) | SICK(reify) | PARA(reify) | STS(Syno) | SICK(Syno) | PARA(Syno) | STS(role) | SICK(role) | PARA(role) | AMEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SemBleu(k1) | 66.03 | 62.88 | 39.72 | 61.76 | 62.10 | 38.17 | 61.83 | 58.83 | 37.10 | 7.59 | 3.36 | 17.48 | 43.07 |
| Sema | 55.90 | 53.32 | 33.43 | 55.51 | 56.16 | 32.33 | 50.16 | 48.87 | 29.11 | 78.48 | 90.76 | 74.93 | 54.91 |
| SemBleu(k3) | 56.54 | 58.06 | 32.82 | 54.96 | 58.53 | 33.66 | 53.19 | 53.72 | 28.96 | 81.01 | 93.28 | 77.79 | 56.88 |
| SemBleu(k2) | 60.62 | 59.86 | 36.88 | 57.68 | 59.64 | 36.24 | 57.34 | 56.18 | 33.26 | 81.01 | 93.28 | 77.88 | 59.16 |
| WLK-k2 | 65.57 | 61.36 | 36.21 | 63.77 | 62.55 | 36.23 | 60.14 | 56.40 | 32.51 | 79.75 | 90.76 | 77.61 | 60.24 |
| Smatch | 58.39 | 59.75 | 41.32 | 58.03 | 61.79 | 39.47 | 56.13 | 57.37 | 39.54 | 89.87 | 98.32 | 88.14 | 62.34 |
| SmatchPP | 58.54 | 59.75 | 41.38 | 58.35 | 59.75 | 41.39 | 56.28 | 57.37 | 39.66 | 89.87 | 98.32 | 88.31 | 62.41 |
| S2match(def) | 56.39 | 58.11 | 42.40 | 55.78 | 59.97 | 40.67 | 56.04 | 57.15 | 40.93 | 93.67 | 98.32 | 91.26 | 62.56 |
| S2match | 58.70 | 60.47 | 42.52 | 58.19 | 62.37 | 40.55 | 56.62 | 57.88 | 41.15 | 89.87 | 98.32 | 92.24 | 63.24 |
| WWLK-k2 | 67.31 | 67.53 | 38.37 | 64.56 | 67.16 | 37.17 | 62.10 | 61.89 | 34.30 | 92.41 | 99.16 | 86.53 | 64.87 |
| WWLK-k2-train | 67.90 | 67.89 | 38.62 | 64.95 | 67.38 | 37.78 | 62.42 | 62.25 | 34.44 | 92.41 | 100.00 | 91.26 | 65.61 |
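AMEAN is the arithmetic mean over the twelve task columns, as a quick check against the Smatch row confirms:

```python
# Smatch row from the table above; AMEAN = arithmetic mean of the 12 task scores.
smatch = [58.39, 59.75, 41.32, 58.03, 61.79, 39.47,
          56.13, 57.37, 39.54, 89.87, 98.32, 88.14]
print(round(sum(smatch) / len(smatch), 2))  # 62.34
```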

For the setup of the W(W)LK metrics, see this info.

Metric versions:

| Metric | commit |
|---|---|
| SemBleu | 530bc05 |
| Sema | 9a4911c |
| Smatch, S2match | 711a231 |
| SmatchPP | f7e206d |
| (W)WLK | 51624e2 |

Citation

@article{10.1162/tacl_a_00435,
    author = {Opitz, Juri and Daza, Angel and Frank, Anette},
    title = "{Weisfeiler-Leman in the Bamboo: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {9},
    pages = {1425-1441},
    year = {2021},
    month = {12},
    abstract = "{Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: Some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step. In this work we propose new Weisfeiler-Leman AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses. Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes. Furthermore, we introduce a Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the first benchmark to support empirical assessment of graph-based MR similarity metrics. Bamboo maximizes the interpretability of results by defining multiple overt objectives that range from sentence similarity objectives to stress tests that probe a metric’s robustness against meaning-altering and meaning-preserving graph transformations. We show the benefits of Bamboo by profiling previous metrics and our own metrics. Results indicate that our novel metrics may serve as a strong baseline for future work.}",
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00435},
    url = {https://doi.org/10.1162/tacl\_a\_00435},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00435/1979290/tacl\_a\_00435.pdf},
}