Polish ASR speech datasets survey and catalog.

Polish ASR speech data survey goals

Visit the Hugging Face Space for the interactive data catalog.

Results overview:

How to cite?

If you use only the survey results, please cite the corresponding article:

@article{Junczyk+2024+27+52,
url = {https://doi.org/10.1515/psicl-2023-0019},
title = {A survey of Polish ASR speech datasets},
author = {Micha{\l} Junczyk},
pages = {27--52},
volume = {60},
number = {1},
journal = {Poznan Studies in Contemporary Linguistics},
doi = {10.1515/psicl-2023-0019},
year = {2024},
lastchecked = {2024-03-10}
}

If you use the raw data from the catalog, please cite:

@Misc{pl-asr-speech-data-survey,
  author =       {Micha{\l} Junczyk},
  title =        {Polish ASR speech data catalog},
  howpublished = {GitHub},
  year =         {2023},
  url =          {https://github.com/goodmike31/pl-asr-speech-data-survey}
}

If you use the interactive speech data catalog, please cite:

@Misc{pl-asr-speech-data-survey,
  author =       {Micha{\l} Junczyk},
  title =        {Polish ASR speech survey},
  howpublished = {Hugging Face},
  year =         {2024},
  url =          {https://huggingface.co/spaces/amu-cai/pl-asr-survey}
}

How to contribute to the Polish ASR speech datasets catalog

1. To report a bug in the catalog:

2. To propose adding or modifying catalog attributes:

How to identify public domain PL ASR speech datasets

  1. Open the catalog.
  2. Set the filter on the "Usage cost" column to "free".
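The same filter can also be applied offline to a local export of the catalog. The following is a minimal sketch, assuming the catalog has been saved as a tab-separated file: only the "Usage cost" column and the "free" value come from the steps above, while the "Dataset name" column, the "paid" value, and the sample rows are hypothetical placeholders.

```python
import csv
import io

# Hypothetical excerpt of a catalog export. The real catalog has more columns,
# and the dataset names and the "paid" value are placeholders for illustration.
SAMPLE_CATALOG_TSV = (
    "Dataset name\tUsage cost\n"
    "Example corpus A\tfree\n"
    "Example corpus B\tpaid\n"
    "Example corpus C\tFree\n"
)


def free_datasets(tsv_file, cost_column="Usage cost", free_value="free"):
    """Return the rows whose cost column equals free_value (case-insensitive)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [
        row
        for row in reader
        if (row.get(cost_column) or "").strip().lower() == free_value
    ]


rows = free_datasets(io.StringIO(SAMPLE_CATALOG_TSV))
names = [row["Dataset name"] for row in rows]
print(names)
```

To run this against a real export, pass an open file handle instead of the in-memory sample, and adjust the column names to match the actual header row.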

Request for feedback

Before getting familiar with the Polish ASR speech data survey and catalog, please consider completing the short (5 min) feedback form. Your feedback will help assess the state of Polish ASR datasets from the community perspective. Each response is rewarded with a donation of 50 PLN to the chosen charity organization. Thank you! Feedback form link

# Addendum - Survey design

Dataset characteristics analysis process

Information sources

Attributes

Limitations

The metadata collected for the catalog is primarily based on information available online in language repositories and scientific articles. When feasible, it was further verified through manual inspection of dataset content and discussion with dataset authors. Despite the author's best efforts, some essential metadata values may be missing. If a metadata source contained errors, the resulting values in the catalog, metrics, and derived observations may be biased or inaccurate. Additionally, despite best efforts to manually validate all collected data before publication, the catalog may contain inaccuracies resulting from curation-related errors.

Hopefully, making the catalog publicly available will enable collective curation of the catalog and taxonomy by the community, particularly by addressing any errors in the catalog metadata that have a practical impact. Engaging in discussions is also crucial for filling in missing values, establishing the availability of datasets with unknown status, and determining the licensing and re-use terms of existing datasets for Polish ASR benchmarking purposes.

Lastly, all datasets available online are subject to automatic analysis to draw new insights from the data and to verify values reported by the authors in the documentation. This process, however, may introduce additional limitations in terms of the accuracy and reliability of the resulting insights.

Changelog

| Version | Date | Scope |
|---|---|---|
| 1.1 | 29 March 2024 | Added exhaustive list of URLs LINK for batch download of freely available datasets |
| 1.0 | 10 March 2024 | Added citation information. Added reference to interactive browser and dashboard with survey insights on Hugging Face. |
| 0.8 | 19 December 2023 | Updated SpokesBiz corpus metadata (hours, speakers, recordings) based on http://docs.pelcra.pl/doku.php?id=spokesbiz |
| 0.7 | 23 July 2023 | Added form for collecting feedback and assessment of catalog usability from the community perspective |
| 0.6 | 22 July 2023 | Added FLEURS and SpokesBiz corpora to the catalog. Updated survey summary statistics. Added "speakers education" attribute. |
| 0.5 | 6 July 2023 | Changed "Creator or publisher" attribute to "UL" (University of Łódź) for corpora created by the PELCRA group. |
| 0.4 | 21 March 2023 | Fixed taxonomy link |
| 0.3 | 12 March 2023 | Added github.io page |
| 0.2 | 22 February 2023 | Added datasets from Shaip company catalog |
| 0.1 | 15 January 2023 | Initial release |