Skip to content

Survey of available speech datasets for Polish ASR development

Notifications You must be signed in to change notification settings

goodmike31/pl-asr-speech-data-survey

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

92 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Polish ASR speech datasets survey and catalog.

Polish ASR speech data survey goals

  • Organization of publicly available information about Polish ASR speech datasets into the catalog.
  • Providing the ASR community with a collectively maintained, up-to-date source of information about available datasets.
  • Identification of ASR speech datasets for Polish, which are freely available for research and commercial purposes.

Request for feedback

Before getting familiar with the Polish ASR speech data survey and catalog, please consider completing the short (5 min) feedback form.
Your feedback will help to assess the state of Polish ASR datasets from the community perspective. Each response is awarded by donation of 50 PLN to the chosen charity organization. Thank you! Feedback form link

How to cite?

If you use the raw data from the catalog, please cite.

@Misc{pl-asr-speech-data-survey,
  author =       {Micha{\l} Junczyk},
  title =        {Polish ASR speech data catalog},
  howpublished = {Github},
  year =         {2023},
  url =          {https://github.com/goodmike31/pl-asr-speech-data-survey}
}

If you use the survey results, please cite the corresponding article:

@article{Junczyk+2024+27+52, <br>
url = {https://doi.org/10.1515/psicl-2023-0019}, <br>
title = {A survey of Polish ASR speech datasets}, <br>
author = {Micha{\l} Junczyk},<br>
pages = {27--52},<br>
volume = {60},<br>
number = {1},<br>
journal = {Poznan Studies in Contemporary Linguistics},<br>
doi = {doi:10.1515/psicl-2023-0019},<br>
year = {2024},<br>
lastchecked = {2024-03-10}<br>
}

Survey results:

  • Fifty-three datasets (53) intended for Polish ASR development were identified based on the information available in the public domain.

  • Fourty-four (44) out of 53 identified datasets are accessible.

  • The total size of documented Polish ASR speech data amounts to approx. 26860 hours of audio and 5744 hours of transcribed speech.

  • For thousand five hundred and seventy (4570) hours of speech is validated as accessible (79.5% of identified and cataloged speech material)

  • One thousand and four hundred (1400) hours of transcribed speech data is available in a public domain.

  • Three thousand one hundred seventy-one (3171) hours of transcribed speech is available through commercial providers.

  • Each dataset in the catalog is described with up to 65 attributes. See the Catalog attributes for details.

  • Detailed survey results can be found:
    • In the article published in PSICL journal (Volume 60 Issue 1) LINK
    • On interactive dashboard available on Hugging Face.
  • Resulting Polish ASR speech datasets catalog is available in 3 formats:

How to contribute to the Polish ASR speech datasets catalog

1. To report a bug in the catalog:

  • Open an issue, describe it, and label it as a "bug".
  • (alternatively) Leave a comment in the Google Sheet.

2. To propose adding or modifying catalog attributes:

  • Open an issue, describe your proposal (attribute name, meaning, and motivation), and label it as "enhancement".
  • (alternatively) Leave a comment in the Google Sheet and describe your proposal.

How to identify public domain PL ASR speech datasets

  1. Open the catalog
  2. Set filter on Usage cost column to free

# Addendum - Survey design

Dataset characteristics analysis process

  • Publicly available information sources were analyzed, annotated, and standardized with 60 attributes.
  • Additionally, the content of publicly available datasets was analyzed automatically.

Information sources

Attributes

  • Each dataset is described with up to 60 attributes covering:
    • authorship, creation date, funding, etc.
    • availability, licensing, pricing, etc.
    • content e.g. audio format, number of speakers recorded, meta-data distributions, etc.
  • Values of attributes not reported by dataset authors, publishers, or researchers are recreated, when feasible (e.g. number of recordings)
  • Full list of attributes, their purpose, and allowed values are available as:

Limitations

The metadata collected for the catalog is primarily based on information available online in language repositories and scientific articles. When feasible, it was further verified through manual inspection of dataset content and discussion with dataset authors. Despite author's best efforts, some essential metadata values may be missing. Naturally, if the metadata source contained errors, the resulting values in the catalog, metrics, and derived observations may be biased or inaccurate. Additionally, despite best efforts to manually validate all collected data before publication, the catalog may contain inaccuracies resulting from curation-related errors. Hopefully making the catalog publicly available to the community will enable collective curation of the catalog and taxonomy, particularly by addressing any errors in the catalog metadata that have a practical impact. Engaging in discussions is also crucial for filling in missing values, establishing the availability of datasets with unknown status, and determining licensing and re-usage of existing datasets for Polish ASR benchmarking purposes. Lastly, all datasets available online are subject to automatic analysis to draw new insights from the data and verify values reported by the authors in the documentation. This process, however, may introduce additional limitations in terms of the accuracy and reliability of the resulting insights.

Changelog

Version Date Scope
1.1 29 March 2024 Added exhaustive list of URLs LINK for batch download of freely available datasets
1.0 10 March 2024 Added citation information. Added reference to interactive browser and dashboard with survey insights on Hugging Face.
0.8 19 December 2023 Updated SpokeBiz corpus metadata (hours, speakers, recordings) based on http://docs.pelcra.pl/doku.php?id=spokesbiz
0.7 23 July 2023 Added form for collecting feedback and assessment of catalog usability from the community perspective
0.6 22 July 2023 Added FLEURS and SpokesBiz corpora to the catalog. Updated survey summary statistics. Added "speakers education" attribute.
0.5 6 July 2023 Changed "Creator or publisher" attribute to "UL" (University of Łódź) for corpora created by the PELCRA group.
0.4 21 March 2023 Fixed taxonomy link
0.3 12 March 2023 Added github.io page
0.2 22 February 2023 Added datasets from Shaip company catalog
0.1 15 January 2023 Initial release