Summary
A treebank of Scottish Gaelic based on the Annotated Reference Corpus of Scottish Gaelic (ARCOSG).
Introduction
The Scottish Gaelic treebank takes its data from ARCOSG, the Annotated Reference Corpus of Scottish Gaelic (Lamb et al. 2016), with an annotation scheme based on that of the Irish UD treebank. Full bibliographic details are available there.
It contains eight subcorpora, each made up of a varying number of original files of approximately 1000 tokens each. All files listed below are in the training set unless they are explicitly marked as being in test or dev. In the ARCOSG documentation the names of contributors are largely given in Gaelic; I have kept these and glossed them with the English form of the name where that will be familiar to non-Gaelic speakers.
- Conversation. c01 is in test, c03 in dev and the rest in train. These are transcripts of interviews in the Western Isles from 1998 to 2000. In c03 and c04 speakers 2, 4 and 5 are children.
- Sport. s06 is in test, s08 in dev and the rest in train. s01 to s05 are Radio nan Gàidheal commentary on a match between Scotland and Australia; s06 to s10 on Scotland vs. Yugoslavia.
- Oral narrative.
- n01: Na Trì Leinntean Canaich (test)
- n02: Conall Gulban (dev)
- n03: Na Fiantaichean
- n04: Gille an Fheadain Duibh
- n05: Bodach Ròcabarraigh
- n06: Iain Beag MacAnndra
- n07: Fear a' Churracain Ghlais
- n08: Boban Saor
- n09: Bean 'ic Odrum
- n10: Blàr Chàirinis
- News scripts from Radio nan Gàidheal in the early 1990s.
- ns01: Màiri Anna NicUalraig (Mary Ann Kennedy)
- ns02: Dòmhnall Moireasdan
- ns03: Iseabail NicIllinnein
- ns04: Innes Rothach
- ns05: Innes Rothach (test)
- ns06: Pàdraig MacAmhlaigh (dev)
- ns07: Dòmhnall Moireasdan (test)
- ns08: Màiri Anna NicUalraig (dev)
- ns09: Seumas Domhnallach
- ns10: Seumas Domhnallach
- Public interview
- p01: Peataichean, conversation on Coinneach MacÌomhair's programme
- p02: Fred MacAulay and Martin MacDonald
- p03: John MacInnes and William Matheson
- p04: Geamaichean Sholais 1, conversation on Coinneach MacÌomhair's programme (test)
- p05: Geamaichean Sholais 2 (dev)
- p06: Bonn Comhraidh, 1980s political discussion programme
- p07: Conversation on Coinneach MacÌomhair's programme 2000-01-17 part 1
- p08: Conversation on Coinneach MacÌomhair's programme 2000-01-17 part 2
- Fiction
- f01: Am Fainne by Eilidh Watt
- f02: from Cùmhnantan by Tormod MacGill-Eain
- f03: Droch Àm by Pòl MacAonghais (test)
- f04: Spàl Tìm by Cailean T. MacCoinneach
- f05: Teine a Loisgeas by Eilidh Watt
- f06: Beul na h-Oidhche by Somhairle MacGill-Eain (Sorley Maclean)
- f07: from An t-Aonaran by Iain Mac a' Ghobhainn (Iain Crichton Smith)
- f08: Briseadh na Cloiche by Iain Moireach (dev)
- Formal prose:
- fp01: Trì Ginealaichean by D. E. Dòmhnallach
- fp02: Nua-Bhàrdachd Ghàidhlig by Dòmhnall MacAmhlaigh (Donald MacAulay)
- fp03: Mairead N. Lachlainn by Somhairle MacGill-Eain (test)
- fp04: from Bith-eòlas ('Biology'), a translation by Ruairidh MacThòmais (Derick Thomson)
- fp05: Aramach am Bearnaraidh
- fp06: Blàr a' Chumhaing by Iain A. MacDonald
- fp07: Na Marbhrannan by Coinneach D. MacDhòmhnaill
- fp08: Cainnt is Cànan by J. MacInnes
- fp09: from Dòmhnall Uilleam Stiùbhart (Donald William Stewart)'s unpublished PhD thesis (dev)
- Popular writing: columns from The Scotsman:
- pw01: An Cuir am Papa... by Aileig O Hianlaidh (Alex O'Henley)
- pw02: A bith mar Chorra... by Joina NicDhomnaill (test)
- pw03: Pàdraig Sellar by Ùisdean MacIllinnein
- pw04: A' Cur Às Dhuinn Fhìn by Aonghas Mac-a-Phì
- pw05: Aon Dùthaich by Murchadh MacLeòid
- pw06: Blas a' Ghuga by Coinneach MacLeòid (dev)
- pw07: Luchd-ciùil by Criosaidh Dick
- pw08: Na Gàidheil Ùra by Criosaidh Dick
- pw09: A' Siubhail gu Rèidh by Tormod Domhnallach (dev)
- pw10: Poileaticeans by Niall M. Brownlie
- pw11: Oifigeir Gàidhlig by Aileig O Hianlaidh (test)
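The split assignments in the list above can be collected into a small lookup. This is only a sketch: the function name `split_for` is hypothetical, and it assumes that a document is identified by the short file code used in the list (for example `ns05`), which may not match how identifiers appear in the released files.

```python
# Test and dev assignments copied from the list above; every other file is train.
TEST_DOCS = {"c01", "s06", "n01", "ns05", "ns07", "p04", "f03", "fp03", "pw02", "pw11"}
DEV_DOCS = {"c03", "s08", "n02", "ns06", "ns08", "p05", "f08", "fp09", "pw06", "pw09"}


def split_for(doc_id: str) -> str:
    """Return 'test', 'dev' or 'train' for a file code such as 'ns05'.

    Hypothetical helper: it does not check that doc_id is a real file code,
    so any unknown code falls through to 'train'.
    """
    if doc_id in TEST_DOCS:
        return "test"
    if doc_id in DEV_DOCS:
        return "dev"
    return "train"


print(split_for("ns05"), split_for("pw09"), split_for("n03"))
```

Keeping the two sets explicit, rather than deriving them from a naming pattern, reflects the fact that the assignments do not follow a regular rule across subcorpora.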
See https://universaldependencies.org/gd/index.html for detailed linguistic documentation.
Acknowledgments
We wish to thank all of the contributors to ARCOSG and fellow Celtic language UD developers Teresa Lynn, Kevin Scannell, Johannes Heinecke and Fran Tyers.
References
- Colin Batchelor, 2019. Universal dependencies for Scottish Gaelic: syntax, in Proceedings of CLTW2019 at Machine Translation Summit XVII, Dublin, August.
- Lamb, William, Sharon Arbuthnot, Susanna Naismith, and Samuel Danso. 2016. Annotated Reference Corpus of Scottish Gaelic (ARCOSG), 1997–2016 [dataset]. Technical report, University of Edinburgh; School of Literatures, Languages and Cultures; Celtic and Scottish Studies. https://doi.org/10.7488/ds/1411.
- Lynn, Teresa and Jennifer Foster, [Universal Dependencies for Irish](http://www.nclt.dcu.ie/~tlynn/Lynn_CLTW2016.pdf), CLTW 2016, Paris, France, July 2016.
Changelog
- 2024-11-15 v2.15
- Added PronType, VerbForm and Mood features systematically.
- 2024-05-15 v2.14
- Restricted the use of flat:foreign to where there are extended phrases rather than just two-word expressions.
- Fixed some appositions.
- 2023-11-15 v2.13
- Particles and numbers now lemmatised.
- NumType and NumForm features added.
- PartType on cha, chan and nach fixed.
- dè cho annotated consistently.
- 2023-05-15 v2.12
- Content clauses should all be `acl` now.
- Anonymised places and people have `Anonymised=Yes` in the MISC column.
- 2022-11-15 v2.11
- Passives formed with rach 'to go' now mirror those in English and other languages using the `aux:pass`, `nsubj:pass` and very occasionally the `nsubj:outer` deprels.
- 2022-05-15 v2.10
- All of ARCOSG now in the treebank.
- 2021-11-15 v2.9
- Small fixes to README.md
- Some missing sentences added.
- Added `PronType=Int` for interrogative pronouns and `PronType=Art` for articles.
- Made sure interrogative pronouns were all pronouns and adjusted trees and documentation accordingly.
- 2021-05-15 v2.8
- ri linn 's is a fixed expression now.
- the ' in, for example, 'dol is no longer a separate token.
- `flat` has been replaced with `flat:name` in personal names and `flat:foreign` in foreign expressions. It remains for placenames, dates and telephone numbers.
- `nmod` and `obl` have been reviewed and corrected throughout the corpus and now replace `compound` for f(h)(è)in and a/ri chèile.
- Documents identified with `newdoc`.
- 2020-11-15 v2.7
- `Poss=Yes` added in line with Irish.
- Tokens in the original with XPOS beginning `Sap` and `Spp` are divided into their component words.
- Systematic tidying of `acl:relcl`, `advcl` and `ccomp`.
- `PronType=Emp` replaced with `Form=Emp` in line with Irish and extended to other parts of speech.
- `PART`s with XPOS `Qa` now tagged correctly `PartType=Cmpl`.
- Words with UPOS `AUX` now have full features.
- The English borrowing so is `CCONJ` not `SCONJ`.
- 's in fad 's and the like is now related to fad or o chionn by `fixed`.
- Cosubordinative agus and is are now `SCONJ` like in Irish.
- ach is `PART` where it is a focus particle rather than a preposition or a conjunction.
- Use of `xcomp:pred` consistent in the sport subcorpora where the root is a footballer rather than bi.
- 2020-05-15 v2.6
- Small fixes to README.md.
- Some missing sentences added to dev and test, bringing them both over 10000 words.
- 2019-11-15 v2.5
- Initial release in Universal Dependencies.
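The changelog mentions `newdoc` document markers (v2.8) and `Anonymised=Yes` in the MISC column (v2.12). As a rough illustration of how the two might be read together from CoNLL-U text, here is a minimal sketch. The function name `anonymised_tokens` is hypothetical, and the sample sentence is invented for the example (forms `aaa` and `bbb`, document id `x01`); it is not taken from the treebank. The sketch relies only on the general CoNLL-U layout: comment lines start with `#`, and token lines have ten tab-separated columns with MISC last.

```python
def anonymised_tokens(conllu_text):
    """Yield (doc_id, form) for tokens whose MISC column contains Anonymised=Yes."""
    doc_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# newdoc"):
            # "# newdoc id = X" gives "X"; a bare "# newdoc" leaves doc_id as None.
            _, _, value = line.partition("=")
            doc_id = value.strip() or None
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # MISC is the tenth column; its entries are separated by "|".
            if len(cols) == 10 and "Anonymised=Yes" in cols[9].split("|"):
                yield doc_id, cols[1]


# Invented sample, not treebank data.
SAMPLE = "\n".join([
    "# newdoc id = x01",
    "# sent_id = x01_1",
    "\t".join(["1", "aaa", "aaa", "X", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "bbb", "bbb", "X", "_", "_", "1", "dep", "_", "Anonymised=Yes"]),
    "",
])

print(list(anonymised_tokens(SAMPLE)))
```

For real work a dedicated CoNLL-U reader would be a safer choice; the hand-rolled parser here is only meant to show where the two annotations live in the file.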