Practical Comparable Data Collection for Low-Resource Languages via Images

Source

python src/make_dict.py -i data/alignments/wys.fastalign.input -a data/alignments/symmetric.align -l1 hin -l2 eng 
python src/find_token_alignments.py data/alignments/wys.fastalign.input data/alignments/symmetric.align output_path
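
For readers unfamiliar with the file formats, here is a minimal sketch of how a dictionary could be derived from these two inputs. It is not the code in src/make_dict.py. It assumes the standard fast_align conventions: each line of the input file is "source ||| target", and each line of the alignment file is a list of "i-j" index pairs. It also assumes Hindi is the first side of each pair. The function name build_dictionary, the min_count parameter, and the toy romanized sentences are hypothetical.

```python
from collections import Counter, defaultdict


def build_dictionary(parallel_lines, alignment_lines, min_count=1):
    """Map each source word to the target word it is aligned to most often.

    Hypothetical helper for illustration; not the repository's make_dict.py.
    """
    counts = defaultdict(Counter)
    for sent_pair, links in zip(parallel_lines, alignment_lines):
        if "|||" not in sent_pair:
            continue  # skip lines that are not in "source ||| target" form
        src_text, tgt_text = sent_pair.split("|||", 1)
        src, tgt = src_text.split(), tgt_text.split()
        for link in links.split():
            # Each link is "i-j": source token i aligns to target token j.
            i, j = (int(x) for x in link.split("-"))
            if i < len(src) and j < len(tgt):
                counts[src[i]][tgt[j]] += 1

    dictionary = {}
    for word, targets in counts.items():
        best, freq = targets.most_common(1)[0]
        if freq >= min_count:
            dictionary[word] = best
    return dictionary


# Toy data (assumed, not taken from data/alignments/).
toy_pairs = [
    "kutta daudta hai ||| the dog runs",
    "kutta sota hai ||| the dog sleeps",
]
toy_links = ["0-1 1-2", "0-1 1-2"]
toy_dict = build_dictionary(toy_pairs, toy_links)
print(toy_dict)
```

Keeping only the most frequent target per source word is one simple choice; raising min_count trades coverage for precision by dropping pairs seen only a few times.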

Data

data/
├── alignments
│   ├── forward.align
│   ├── reverse.align
│   ├── symmetric.align
│   └── wys.fastalign.input
├── captions.tsv
└── dict.hin-eng.txt

Contents                                                  Path
Captions in both English and Hindi, as well as image ids  data/captions.tsv
Flickr8k                                                  http://academictorrents.com/details/9dea07ba660a722ae1008c4c8afdd303b6f6e53b
Generated Dictionary                                      data/dict.hin-eng.txt
Fastalign input/output                                    data/alignments/

Task Instructions

Citation

If you use our work, please cite:

@inproceedings{madaan2020practical,
  title={Practical Comparable Data Collection for Low-Resource Languages via Images},
  author={Madaan, Aman and Rijhwani, Shruti and Anastasopoulos, Antonios and Yang, Yiming and Neubig, Graham},
  booktitle={Proceedings of the Practical ML for Developing Countries Workshop, ICLR 2020},
  year={2020}
}