Machine Translation Corpus for Turkic Languages

<p align="center"> <kbd> <img width="300" height="300" src="./logo.jpg"> </kbd> </p>

Getting started

Simplest option!

If you are using a GPU-enabled (local) machine

Install the necessary libraries

pip install -r requirements.txt

Run the baseline script with two language codes. It automatically downloads the data, processes it, installs the necessary libraries and framework, and starts training. The script assumes you are on a GPU-enabled device with CUDA support.

bash train_baseline.sh <source_language> <target_language>
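Because the script assumes a CUDA-capable GPU, a quick preflight check can save a failed run. The sketch below is an illustration and not part of this repository: it treats the presence of `nvidia-smi` on the PATH as a rough proxy for a working NVIDIA driver (an assumption; it does not prove CUDA works), and `baseline_command` is a hypothetical helper that only assembles the command shown above.

```python
import shutil
import subprocess
from typing import List


def has_nvidia_driver() -> bool:
    """Heuristic only: nvidia-smi on PATH suggests an NVIDIA driver is installed."""
    return shutil.which("nvidia-smi") is not None


def baseline_command(source: str, target: str) -> List[str]:
    """Assemble the train_baseline.sh call for one language pair (hypothetical helper)."""
    if source == target:
        raise ValueError("source and target languages must differ")
    return ["bash", "train_baseline.sh", source, target]


def run_baseline(source: str, target: str) -> None:
    """Refuse to start when no NVIDIA driver is visible, otherwise run the script."""
    if not has_nvidia_driver():
        raise RuntimeError("nvidia-smi not found; the baseline script expects a CUDA GPU")
    subprocess.run(baseline_command(source, target), check=True)
```

Calling `run_baseline("uz", "ru")` from the repository root would then either start training or fail early with a readable message.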

If you are using free preemptible GPUs on Google Colab

You can download the file joeynmt_colab_bilingual.ipynb and upload it to Google Colab. Change the language codes in the notebook and the data will be downloaded automatically. We recommend connecting your Google Drive account to Colab to save your progress, because Google Colab periodically (roughly every 12 hours) deletes everything in its workspace.
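Connecting Google Drive can be done from a notebook cell. The following is a minimal sketch, not taken from the notebook itself: it assumes Colab's usual `/content/drive` mount point and `MyDrive` folder, and the `turkic_mt` directory name is a hypothetical placeholder. Outside Colab the import fails and the helper falls back to the current directory.

```python
def mount_drive(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def checkpoint_dir(mounted: bool, name: str = "turkic_mt") -> str:
    """Pick a directory that survives a workspace reset when Drive is mounted.

    The MyDrive path and the default folder name are assumptions.
    """
    base = "/content/drive/MyDrive" if mounted else "."
    return f"{base}/{name}"


# Typical use in a notebook cell:
#   save_to = checkpoint_dir(mount_drive())
```

Pointing the experiment output at `save_to` keeps checkpoints on Drive, so a preempted session does not lose them.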

Visualize your results

Make sure TensorBoard is installed, then launch the visualization server (for example, for uz and ru):

pip install tensorboard

tensorboard --logdir=experiments/uz-ru-bilingual_baseline/models/uzru_transformer/tensorboard

After launching the visualization server, you can view your visualizations in a web browser at http://localhost:6006.

You should see something like this: (TensorBoard screenshot)
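For other language pairs, the log directory presumably follows the same layout as the uz-ru example. The helper below is a sketch under that assumption; the path pattern is inferred from the single example above, and `tensorboard_logdir` and `tensorboard_command` are hypothetical names.

```python
from typing import List


def tensorboard_logdir(source: str, target: str) -> str:
    """Build the log directory, assuming every pair mirrors the uz-ru layout."""
    return (
        f"experiments/{source}-{target}-bilingual_baseline/"
        f"models/{source}{target}_transformer/tensorboard"
    )


def tensorboard_command(source: str, target: str) -> List[str]:
    """Assemble the tensorboard call for one language pair."""
    return ["tensorboard", f"--logdir={tensorboard_logdir(source, target)}"]
```

If your experiment directory is named differently, pass its actual path to `--logdir` instead.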

Create a submission to the leaderboard

Once you have your amazing model ready, you can create a submission (.zip file) by running the create_submission.sh script with the following parameters:

bash create_submission.sh <path_to_joeynmt_config.yaml> <source_language_code> <target_language_code>
# For example:
bash create_submission.sh joeynmt/configs/transformer_uzru.yaml uz ru

The script automatically downloads the required test files, loads the model specified in the config file, runs the test and writes the predictions to the submissions folder.
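Before uploading, it can be worth checking that the archive is readable. This is a generic sketch using Python's standard `zipfile` module; it makes no assumption about the file names the script produces, and the path you pass in is whatever .zip the script wrote.

```python
import zipfile
from typing import List


def submission_members(zip_path: str) -> List[str]:
    """Return the file names inside a submission archive, failing on a corrupt zip."""
    with zipfile.ZipFile(zip_path) as archive:
        bad = archive.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            raise ValueError(f"corrupt member in archive: {bad}")
        return archive.namelist()
```

An empty list or a `zipfile.BadZipFile` error indicates that the submission should be regenerated.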

Useful scripts

Download the parallel data

To get started, download the data for a language pair that you are interested in:

python download_data.py --source_language=<language_code> --target_language=<language_code> --split=<train,dev,test,all>
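To fetch every direction between several languages, the command above can be generated in a loop. The sketch below only builds the argument lists; it assumes `--split` accepts exactly one of `train`, `dev`, `test` or `all`, and `download_commands` is a hypothetical helper, not part of the repository.

```python
from itertools import permutations
from typing import Iterable, List

VALID_SPLITS = {"train", "dev", "test", "all"}  # assumed from the usage line


def download_commands(languages: Iterable[str], split: str = "all") -> List[List[str]]:
    """Build one download_data.py call per ordered language pair."""
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split: {split}")
    return [
        [
            "python",
            "download_data.py",
            f"--source_language={src}",
            f"--target_language={tgt}",
            f"--split={split}",
        ]
        for src, tgt in permutations(languages, 2)
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.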

Download the monolingual data

You can also download monolingual data for any of the languages in the table below. The monolingual data come from our parallel corpus, Wikipedia dumps, news websites and, where possible, a few manual crawls.

python download_monolingual.py --language=<language_code>

Install JoeyNMT

git clone https://github.com/joeynmt/joeynmt.git
cd joeynmt; pip3 install .
pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
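
| Rank | Source | Target | Training size |
|------|--------|--------|---------------|
| 1 | en | tr | 35879592 |
| 2 | tr | en | 35879592 |
| 3 | ru | tr | 15092464 |
| 4 | tr | ru | 15092464 |
| 5 | kk | ru | 4403385 |
| 6 | ru | kk | 4403385 |
| 7 | ru | uz | 1321013 |
| 8 | uz | ru | 1321013 |
| 9 | cv | ru | 794654 |
| 10 | ru | cv | 794654 |
| 11 | en | kk | 564760 |
| 12 | kk | en | 564760 |
| 13 | az | en | 548901 |
| 14 | en | az | 548901 |
| 15 | en | uz | 529574 |
| 16 | uz | en | 529574 |
| 17 | ba | ru | 523719 |
| 18 | ru | ba | 523719 |
| 19 | az | tr | 410140 |
| 20 | tr | az | 410140 |
| 21 | az | ru | 331144 |
| 22 | ru | az | 331144 |
| 23 | en | tt | 320323 |
| 24 | tt | en | 320323 |
| 25 | en | ky | 312644 |
| 26 | ky | en | 312644 |
| 27 | ky | ru | 293652 |
| 28 | ru | ky | 293652 |
| 29 | tr | tt | 289604 |
| 30 | tt | tr | 289604 |
| 31 | ky | tr | 275028 |
| 32 | tr | ky | 275028 |
| 33 | ru | tt | 270462 |
| 34 | tt | ru | 270462 |
| 35 | ky | tt | 220203 |
| 36 | tt | ky | 220203 |
| 37 | az | uz | 217159 |
| 38 | uz | az | 217159 |
| 39 | tr | uz | 217078 |
| 40 | uz | tr | 217078 |
| 41 | az | ky | 205758 |
| 42 | ky | az | 205758 |
| 43 | az | tt | 201280 |
| 44 | tt | az | 201280 |
| 45 | en | tk | 130480 |
| 46 | tk | en | 130480 |
| 47 | tk | tr | 126803 |
| 48 | tr | tk | 126803 |
| 49 | tt | uz | 126249 |
| 50 | uz | tt | 126249 |
| 51 | ky | uz | 119946 |
| 52 | uz | ky | 119946 |
| 53 | tk | tt | 118578 |
| 54 | tt | tk | 118578 |
| 55 | az | tk | 114895 |
| 56 | tk | az | 114895 |
| 57 | ru | tk | 111913 |
| 58 | tk | ru | 111913 |
| 59 | kk | uz | 111519 |
| 60 | uz | kk | 111519 |
| 61 | ky | tk | 110942 |
| 62 | tk | ky | 110942 |
| 63 | en | ug | 96898 |
| 64 | ug | en | 96898 |
| 65 | cv | tt | 85317 |
| 66 | tt | cv | 85317 |
| 67 | cv | tr | 81451 |
| 68 | tr | cv | 81451 |
| 69 | cv | ky | 79700 |
| 70 | ky | cv | 79700 |
| 71 | az | cv | 78310 |
| 72 | cv | az | 78310 |
| 73 | cv | en | 78288 |
| 74 | en | cv | 78288 |
| 75 | cv | tk | 71263 |
| 76 | tk | cv | 71263 |
| 77 | kaa | uz | 65527 |
| 78 | uz | kaa | 65527 |
| 79 | kjh | ru | 60295 |
| 80 | ru | kjh | 60295 |
| 81 | tr | ug | 58083 |
| 82 | ug | tr | 58083 |
| 83 | cv | uz | 57451 |
| 84 | uz | cv | 57451 |
| 85 | kk | tr | 55815 |
| 86 | tr | kk | 55815 |
| 87 | ru | ug | 41867 |
| 88 | ug | ru | 41867 |
| 89 | ba | tt | 40086 |
| 90 | tt | ba | 40086 |
| 91 | ba | tr | 35910 |
| 92 | tr | ba | 35910 |
| 93 | ba | ky | 35651 |
| 94 | ky | ba | 35651 |
| 95 | ba | en | 34308 |
| 96 | en | ba | 34308 |
| 97 | ba | cv | 33001 |
| 98 | cv | ba | 33001 |
| 99 | az | ba | 32184 |
| 100 | ba | az | 32184 |
| 101 | kaa | ru | 29882 |
| 102 | ru | kaa | 29882 |
| 103 | ba | tk | 28528 |
| 104 | tk | ba | 28528 |
| 105 | ba | uz | 27478 |
| 106 | uz | ba | 27478 |
| 107 | ug | uz | 17661 |
| 108 | uz | ug | 17661 |
| 109 | en | kaa | 17071 |
| 110 | kaa | en | 17071 |
| 111 | crh | en | 15377 |
| 112 | en | crh | 15377 |
| 113 | crh | tr | 14497 |
| 114 | tr | crh | 14497 |
| 115 | alt | ba | 12613 |
| 116 | ba | alt | 12613 |
| 117 | ba | tyv | 12531 |
| 118 | tyv | ba | 12531 |
| 119 | crh | ru | 12401 |
| 120 | ru | crh | 12401 |
| 121 | alt | tt | 12372 |
| 122 | tt | alt | 12372 |
| 123 | kk | ky | 12216 |
| 124 | ky | kk | 12216 |
| 125 | alt | ky | 12123 |
| 126 | ky | alt | 12123 |
| 127 | tr | tyv | 12065 |
| 128 | tyv | tr | 12065 |
| 129 | ky | tyv | 12053 |
| 130 | tyv | ky | 12053 |
| 131 | tt | tyv | 11929 |
| 132 | tyv | tt | 11929 |
| 133 | alt | tr | 11768 |
| 134 | tr | alt | 11768 |
| 135 | en | tyv | 11482 |
| 136 | tyv | en | 11482 |
| 137 | alt | uz | 11394 |
| 138 | uz | alt | 11394 |
| 139 | cv | tyv | 11338 |
| 140 | tyv | cv | 11338 |
| 141 | alt | en | 11174 |
| 142 | en | alt | 11174 |
| 143 | alt | cv | 11033 |
| 144 | cv | alt | 11033 |
| 145 | alt | az | 10738 |
| 146 | az | alt | 10738 |
| 147 | alt | tk | 10553 |
| 148 | tk | alt | 10553 |
| 149 | az | tyv | 10352 |
| 150 | tyv | az | 10352 |
| 151 | tk | tyv | 9881 |
| 152 | tyv | tk | 9881 |
| 153 | crh | kaa | 9377 |
| 154 | kaa | crh | 9377 |
| 155 | crh | tt | 9362 |
| 156 | tt | crh | 9362 |
| 157 | crh | uz | 9299 |
| 158 | uz | crh | 9299 |
| 159 | kjh | krc | 9254 |
| 160 | krc | kjh | 9254 |
| 161 | ru | sah | 9237 |
| 162 | sah | ru | 9237 |
| 163 | kaa | kum | 9173 |
| 164 | kum | kaa | 9173 |
| 165 | kjh | sah | 9162 |
| 166 | sah | kjh | 9162 |
| 167 | az | crh | 9153 |
| 168 | crh | az | 9153 |
| 169 | kum | uz | 9131 |
| 170 | uz | kum | 9131 |
| 171 | crh | kjh | 9107 |
| 172 | kjh | crh | 9107 |
| 173 | kk | tt | 9103 |
| 174 | tt | kk | 9103 |
| 175 | az | kk | 9093 |
| 176 | kk | az | 9093 |
| 177 | crh | tk | 9066 |
| 178 | tk | crh | 9066 |
| 179 | kaa | ky | 9064 |
| 180 | ky | kaa | 9064 |
| 181 | crh | kum | 9041 |
| 182 | kum | crh | 9041 |
| 183 | alt | kaa | 9012 |
| 184 | kaa | alt | 9012 |
| 185 | crh | cv | 9003 |
| 186 | cv | crh | 9003 |
| 187 | kum | tr | 9000 |
| 188 | tr | kum | 9000 |
| 189 | tk | ug | 8997 |
| 190 | ug | tk | 8997 |
| 191 | en | kjh | 8991 |
| 192 | kjh | en | 8991 |
| 193 | az | krc | 8988 |
| 194 | krc | az | 8988 |
| 195 | cv | krc | 8981 |
| 196 | krc | cv | 8981 |
| 197 | tk | uz | 8966 |
| 198 | uz | tk | 8966 |
| 199 | kaa | tt | 8962 |
| 200 | tt | kaa | 8962 |
| 201 | crh | ug | 8958 |
| 202 | ug | crh | 8958 |
| 203 | ba | kaa | 8947 |
| 204 | kaa | ba | 8947 |
| 205 | kaa | tk | 8939 |
| 206 | tk | kaa | 8939 |
| 207 | kjh | kum | 8938 |
| 208 | kum | kjh | 8938 |
| 209 | kaa | ug | 8927 |
| 210 | ug | kaa | 8927 |
| 211 | cv | kjh | 8921 |
| 212 | kjh | cv | 8921 |
| 213 | kk | kum | 8921 |
| 214 | kum | kk | 8921 |
| 215 | az | kum | 8908 |
| 216 | kum | az | 8908 |
| 217 | az | kaa | 8902 |
| 218 | kaa | az | 8902 |
| 219 | kaa | tr | 8894 |
| 220 | tr | kaa | 8894 |
| 221 | kjh | ug | 8884 |
| 222 | ug | kjh | 8884 |
| 223 | kaa | kk | 8874 |
| 224 | kk | kaa | 8874 |
| 225 | alt | crh | 8867 |
| 226 | crh | alt | 8867 |
| 227 | az | gag | 8848 |
| 228 | gag | az | 8848 |
| 229 | kjh | tr | 8845 |
| 230 | tr | kjh | 8845 |
| 231 | ba | kjh | 8814 |
| 232 | kjh | ba | 8814 |
| 233 | kum | tt | 8813 |
| 234 | tt | kum | 8813 |
| 235 | kum | ru | 8807 |
| 236 | ru | kum | 8807 |
| 237 | alt | kum | 8798 |
| 238 | crh | gag | 8798 |
| 239 | gag | crh | 8798 |
| 240 | kum | alt | 8798 |
| 241 | crh | kk | 8795 |
| 242 | kk | crh | 8795 |
| 243 | kum | ky | 8795 |
| 244 | ky | kum | 8795 |
| 245 | kjh | tt | 8793 |
| 246 | tt | kjh | 8793 |
| 247 | ba | kum | 8777 |
| 248 | kum | ba | 8777 |
| 249 | ba | crh | 8774 |
| 250 | crh | ba | 8774 |
| 251 | krc | tk | 8753 |
| 252 | tk | krc | 8753 |
| 253 | kjh | tk | 8736 |
| 254 | tk | kjh | 8736 |
| 255 | krc | sah | 8732 |
| 256 | sah | krc | 8732 |
| 257 | kjh | uz | 8727 |
| 258 | uz | kjh | 8727 |
| 259 | kaa | kjh | 8722 |
| 260 | kjh | kaa | 8722 |
| 261 | az | ug | 8717 |
| 262 | ug | az | 8717 |
| 263 | ky | ug | 8706 |
| 264 | ug | ky | 8706 |
| 265 | krc | uz | 8693 |
| 266 | uz | krc | 8693 |
| 267 | alt | krc | 8684 |
| 268 | krc | alt | 8684 |
| 269 | az | kjh | 8677 |
| 270 | kjh | az | 8677 |
| 271 | gag | tr | 8670 |
| 272 | tr | gag | 8670 |
| 273 | alt | kjh | 8663 |
| 274 | kjh | alt | 8663 |
| 275 | cv | ug | 8662 |
| 276 | ug | cv | 8662 |
| 277 | kk | ug | 8650 |
| 278 | ug | kk | 8650 |
| 279 | kaa | krc | 8644 |
| 280 | krc | kaa | 8644 |
| 281 | alt | ru | 8643 |
| 282 | ru | alt | 8643 |
| 283 | tyv | uz | 8637 |
| 284 | uz | tyv | 8637 |
| 285 | kum | tk | 8572 |
| 286 | tk | kum | 8572 |
| 287 | cv | kaa | 8568 |
| 288 | kaa | cv | 8568 |
| 289 | cv | gag | 8556 |
| 290 | gag | cv | 8556 |
| 291 | crh | ky | 8552 |
| 292 | en | kum | 8552 |
| 293 | kum | en | 8552 |
| 294 | ky | crh | 8552 |
| 295 | krc | tr | 8550 |
| 296 | tr | krc | 8550 |
| 297 | ba | ug | 8526 |
| 298 | kjh | ky | 8526 |
| 299 | ky | kjh | 8526 |
| 300 | ug | ba | 8526 |
| 301 | cv | kum | 8520 |
| 302 | kum | cv | 8520 |
| 303 | kum | ug | 8519 |
| 304 | ug | kum | 8519 |
| 305 | ba | kk | 8518 |
| 306 | gag | tk | 8518 |
| 307 | kk | ba | 8518 |
| 308 | tk | gag | 8518 |
| 309 | ba | krc | 8509 |
| 310 | krc | ba | 8509 |
| 311 | alt | kk | 8506 |
| 312 | kk | alt | 8506 |
| 313 | crh | krc | 8501 |
| 314 | krc | crh | 8501 |
| 315 | cv | sah | 8488 |
| 316 | cv | kk | 8488 |
| 317 | kk | cv | 8488 |
| 318 | sah | cv | 8488 |
| 319 | tt | ug | 8488 |
| 320 | ug | tt | 8488 |
| 321 | krc | kum | 8473 |
| 322 | kum | krc | 8473 |
| 323 | krc | tt | 8471 |
| 324 | tt | krc | 8471 |
| 325 | gag | kjh | 8468 |
| 326 | kjh | gag | 8468 |
| 327 | kjh | kk | 8452 |
| 328 | kk | kjh | 8452 |
| 329 | kk | tk | 8448 |
| 330 | tk | kk | 8448 |
| 331 | krc | ru | 8442 |
| 332 | ru | krc | 8442 |
| 333 | gag | uz | 8419 |
| 334 | uz | gag | 8419 |
| 335 | az | sah | 8403 |
| 336 | sah | az | 8403 |
| 337 | crh | sah | 8367 |
| 338 | sah | crh | 8367 |
| 339 | gag | krc | 8358 |
| 340 | krc | gag | 8358 |
| 341 | en | gag | 8350 |
| 342 | gag | en | 8350 |
| 343 | gag | kaa | 8342 |
| 344 | kaa | gag | 8342 |
| 345 | sah | uz | 8339 |
| 346 | uz | sah | 8339 |
| 347 | alt | gag | 8328 |
| 348 | gag | alt | 8328 |
| 349 | krc | ug | 8327 |
| 350 | ug | krc | 8327 |
| 351 | sah | tt | 8293 |
| 352 | tt | sah | 8293 |
| 353 | alt | ug | 8270 |
| 354 | ug | alt | 8270 |
| 355 | en | krc | 8220 |
| 356 | krc | en | 8220 |
| 357 | sah | tr | 8210 |
| 358 | tr | sah | 8210 |
| 359 | ba | gag | 8167 |
| 360 | gag | ba | 8167 |
| 361 | alt | sah | 8163 |
| 362 | sah | alt | 8163 |
| 363 | gag | tt | 8143 |
| 364 | tt | gag | 8143 |
| 365 | sah | tk | 8134 |
| 366 | tk | sah | 8134 |
| 367 | kk | krc | 8114 |
| 368 | krc | kk | 8114 |
| 369 | en | sah | 8106 |
| 370 | sah | en | 8106 |
| 371 | ba | sah | 8082 |
| 372 | sah | ba | 8082 |
| 373 | gag | ru | 8080 |
| 374 | ru | gag | 8080 |
| 375 | krc | ky | 8069 |
| 376 | ky | krc | 8069 |
| 377 | kaa | sah | 8021 |
| 378 | sah | kaa | 8021 |
| 379 | gag | kum | 7999 |
| 380 | kum | gag | 7999 |
| 381 | gag | ug | 7988 |
| 382 | ug | gag | 7988 |
| 383 | kk | sah | 7970 |
| 384 | sah | kk | 7970 |
| 385 | sah | ug | 7904 |
| 386 | ug | sah | 7904 |
| 387 | kum | sah | 7853 |
| 388 | sah | kum | 7853 |
| 389 | gag | sah | 7852 |
| 390 | sah | gag | 7852 |
| 391 | ky | sah | 7820 |
| 392 | sah | ky | 7820 |
| 393 | gag | ky | 7795 |
| 394 | ky | gag | 7795 |
| 395 | gag | kk | 7694 |
| 396 | kk | gag | 7694 |
| 397 | alt | tyv | 2922 |
| 398 | tyv | alt | 2922 |
| 399 | cjs | ru | 2294 |
| 400 | ru | cjs | 2294 |
| 401 | en | slr | 766 |
| 402 | slr | en | 766 |
| 403 | en | uum | 491 |
| 404 | uum | en | 491 |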
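
Training sizes span several orders of magnitude, so it can help to shortlist pairs by size before choosing a baseline. The snippet below is an illustrative sketch: `PAIRS` holds a handful of rows copied from the table, and `pairs_with_at_least` is a hypothetical helper rather than a script shipped with the corpus.

```python
from typing import List, Tuple

# A few rows copied from the table: (source, target, training size).
PAIRS = [
    ("en", "tr", 35879592),
    ("kk", "ru", 4403385),
    ("uz", "ru", 1321013),
    ("ba", "ru", 523719),
    ("en", "ug", 96898),
    ("sah", "ru", 9237),
    ("en", "uum", 491),
]


def pairs_with_at_least(
    pairs: List[Tuple[str, str, int]], min_size: int
) -> List[Tuple[str, str]]:
    """Keep the (source, target) pairs whose training size is at least min_size."""
    return [(src, tgt) for src, tgt, size in pairs if size >= min_size]


high_resource = pairs_with_at_least(PAIRS, 500_000)
```

With the rows listed here, `high_resource` contains en-tr, kk-ru, uz-ru and ba-ru; the threshold of 500,000 is arbitrary and only serves the example.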