Awesome
Membership Inference Attacks on Sequence-to-Sequence Models
Public data from the paper "Membership Inference Attacks on Sequence-to-Sequence Models: Is My Data In Your Machine Translation System?" (Sorami Hisamoto, Matt Post, and Kevin Duh. TACL 2020).
- Paper: https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00299
- Data Documentation: https://github.com/sorami/tacl-membership (The latest edition of this README)
Data
You can download the data from the following URL: http://www.cs.jhu.edu/~kevinduh/a/tacl-membership.tar.gz
Directory
We have two main directories, alice/
and bob/
, for Alice and Bob.
For alice/
, we have four subdirectories, MT model training data (train/
), in probes (in_probe/
), out probes (out_probe/
), and out-of-domain probes (ood/
). In each probe subdirectories, there are original/
for the original corpus data, and mt/
which contains the translation text file (*.de-en.en
) and the MT model score file (*.de-en.score.en
) from the Alice MT model.
alice
|-- in_probe
| |-- mt
| `-- original
|-- ood
| |-- mt
| `-- original
|-- out_probe
| |-- mt
| `-- original
`-- train
For bob/
, we have all/
for the entire data Bob gets that he can use for whatever way he wants for his attacks. We also have 10 subdirectories shadow_{1..5}{a,b}/
from the "shadow models" we used for the attacks in our experiments. Similar to Alice each shadow model subdirectory contains train
, in_probe/
, out_probe/
, and ood
, and each probe subdirectories contain both original/
and mt/
.
bob
|-- all
|-- shadow_1a
| |-- in_probe
| | |-- mt
| | `-- original
| |-- ood
| | |-- mt
| | `-- original
| |-- out_probe
| | |-- mt
| | `-- original
| `-- train
|-- shadow_1b
| |-- in_probe
| | |-- mt
| | `-- original
...
File name
Each terminal subdirectories contain text files (and MT model score files if applicable). The file name indicates the subcorpus, file type, and the language.
For example, commoncrawl.de-en.tok.de
is a CommonCrawl (commoncrawl.de-en
) tokenized (.tok
) German (.de
) data.
- Subcorpus
- Main Subcorpora
paracrawl.filtered
commoncrawl.de-en
europarl-v7.de-en
news-commentary-v13.de-en
rapid2016.de-en
- Out-of-Domain Subcorpora
emea.de-en
koran.de-en
subtitles18.de-en
ted.de-en
- Main Subcorpora
- File Type
.tok
: Tokenized text.score
: MT model score
- Language
de
: German (source)en
: English (target)
All text are tokenized. We used the same tokenziation for both Alice and Bob.
Detail of Data Splits
Corpus
<img src="figures/data_and_splits.png" width="600px">For our experiment, we had 5 main subcorpora. We used corpora from the Conference on Machine Translation (WMT18) (Bojar et al., 2018).
Carol, the judge, gives Alice 4 subcorpora as her MT model training data, namely CommonCrawl
, Europarl v7
, News Commentary v13
, and Rapid 2016
. Carol excludes 5,000 samples from each training sets as out probes, and select 5,000 samples within the training sets as in probes.
Bob gets a subset of the data Alice has, excluding Alice in probes. He can use this data in whatever way he wants for his attack.
In addition, Carol gives Alice ParaCrawl
, but not to Bob. We can think of it as an in-house data the MT service provider holds.
Out-of-domain Subcorpora
Additionally, we have 4 out-of-domain (OOD) subcorpora, namely EMEA
and Subtitles 18
(Tiedemann, 2012), Koran
(Tanzil), and TED
(Duh, 2018).