Language Resources and Tools

Datasets and scripts for basic natural language and speech processing.

This is not an official Google product.

Natural Languages

Directory | Language
--------- | --------
af | Afrikaans
bn | Bengali / Bangla
hi_ur | Hindi & Urdu
is | Icelandic
jv | Javanese
km | Khmer
lo | Lao
my | Burmese / Myanmar
ne | Nepali
si | Sinhala
su | Sundanese
xh | Xhosa
zu | Zulu

Tools

This repository includes a few tools for working with the natural language datasets. The tools are written in C++ and Python and are built with Bazel. To compile and use them, install a recent version of Bazel (release 0.4.5 or later is required).
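
As a rough illustration of the version requirement, the sketch below checks whether an installed Bazel meets the 0.4.5 minimum. It is not part of this repository: the function names are hypothetical, and it assumes that `bazel version` prints a `Build label: X.Y.Z` line, which release builds of Bazel do but development builds may not.

```python
import re
import subprocess

# Minimum Bazel release stated in this README.
MIN_BAZEL = (0, 4, 5)

# Release builds of Bazel report their version on a "Build label:" line.
BUILD_LABEL = re.compile(r"Build label:\s*(\d+)\.(\d+)\.(\d+)")


def parse_bazel_version(text):
    """Return the Build label version in `text` as a tuple of ints, or None."""
    match = BUILD_LABEL.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_supported(text, minimum=MIN_BAZEL):
    """True if `text` reports a Bazel version at or above `minimum`."""
    version = parse_bazel_version(text)
    # Tuples compare numerically per component, so (0, 10, 0) > (0, 4, 5).
    return version is not None and version >= minimum


def installed_bazel_is_supported():
    """Run `bazel version` and check its output (requires Bazel on PATH)."""
    result = subprocess.run(
        ["bazel", "version"], capture_output=True, text=True, check=True
    )
    return is_supported(result.stdout)
```

Comparing integer tuples rather than raw strings avoids ranking a release such as 0.10.0 below 0.4.5.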

Open-Sourced Audio Data

Resource | Link
-------- | ----
Sinhala TTS recordings (~3K) | https://www.openslr.org/30/
TTS recordings for four South African languages (af, st, tn, xh) | https://www.openslr.org/32/
Large Javanese ASR training data set (~185K) | https://www.openslr.org/35/
Large Sundanese ASR training data set (~220K) | https://www.openslr.org/36/
High quality TTS data for Bengali languages | https://www.openslr.org/37/
High quality TTS data for Javanese | https://www.openslr.org/41/
High quality TTS data for Khmer | https://www.openslr.org/42/
High quality TTS data for Nepali | https://www.openslr.org/43/
High quality TTS data for Sundanese | https://www.openslr.org/44/
Large Sinhala ASR training data set | https://www.openslr.org/52/
Large Bengali ASR training data set | https://www.openslr.org/53/
Large Nepali ASR training data set | https://www.openslr.org/54/
Crowdsourced high-quality Argentinian Spanish speech data set | https://www.openslr.org/61/
Crowdsourced high-quality Malayalam multi-speaker speech data set | https://www.openslr.org/63/
Crowdsourced high-quality Marathi multi-speaker speech data set | https://www.openslr.org/64/
Crowdsourced high-quality Tamil multi-speaker speech data set | https://www.openslr.org/65/
Crowdsourced high-quality Telugu multi-speaker speech data set | https://www.openslr.org/66/
Data set which contains recordings of Catalan | https://www.openslr.org/69
Crowdsourced high-quality Nigerian English speech data set | https://www.openslr.org/70
Crowdsourced high-quality Chilean Spanish speech data set | https://www.openslr.org/71
Crowdsourced high-quality Colombian Spanish speech data set | https://www.openslr.org/72
Crowdsourced high-quality Peruvian Spanish speech data set | https://www.openslr.org/73
Crowdsourced high-quality Puerto Rico Spanish speech data set | https://www.openslr.org/74
Crowdsourced high-quality Venezuelan Spanish speech data set | https://www.openslr.org/75
Crowdsourced high-quality Basque speech data set | https://www.openslr.org/76
Crowdsourced high-quality Galician speech data set | https://www.openslr.org/77
Crowdsourced high-quality Gujarati multi-speaker speech data set | https://www.openslr.org/78
Crowdsourced high-quality Kannada multi-speaker speech data set | https://www.openslr.org/79
Crowdsourced high-quality Burmese speech data set | https://www.openslr.org/80
Data set which contains male and female recordings of English from various dialects of the UK and Ireland | https://www.openslr.org/83
Crowdsourced high-quality Yoruba speech data set | https://www.openslr.org/86
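
The links above appear both with and without a trailing slash. For scripts that index these resources, a minimal sketch along the following lines could extract the numeric resource ID and normalize the URL. The helper names `openslr_id` and `canonical_url` are hypothetical and not part of this repository; the only assumption about OpenSLR is the `https://www.openslr.org/<number>` pattern visible in the table.

```python
import re

# Matches the link pattern used in the table, with or without a trailing slash.
OPENSLR_URL = re.compile(r"^https://www\.openslr\.org/(\d+)/?$")


def openslr_id(url):
    """Return the numeric resource ID from an OpenSLR link in the table."""
    match = OPENSLR_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not an OpenSLR resource URL: {url!r}")
    return int(match.group(1))


def canonical_url(url):
    """Rewrite a link to the trailing-slash form used by most table rows."""
    return f"https://www.openslr.org/{openslr_id(url)}/"
```

For example, `openslr_id("https://www.openslr.org/69")` and `openslr_id("https://www.openslr.org/69/")` both yield the same ID, so the two link styles can be compared or de-duplicated safely.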

Other reading resources

SLTU 2016 Tutorial - https://sites.google.com/site/sltututorial/overview

Publications

License

Unless otherwise noted, all original files are licensed under the Apache License, Version 2.0.

Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

The directory third_party/ contains third-party works, which we are including under the respective licenses of the upstream projects. See third_party/README.md for further details.
