Google Research Datasets

Datasets released by Google Research

Mountain View, CA
http://research.google.com

Pinned

natural-questions Public

Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

Python 686 132
conceptual-captions Public

Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

Shell 311 16
ToTTo Public

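ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.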

293 27
dakshina Public

The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia tex…

131 14
tydiqa Public

TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the trai…

Python 209 32
gap-coreference Public

GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practica…

Python 197 72

Repositories

WikipediaAbbreviationData Public
This data set consists of 24,000 English sentences, extracted from Wikipedia in 2017, annotated to support development of an abbreviation expansion system for text-to-speech synthesis (e.g., a systm tht cn prnounc txt lk ths).

Python 2 Apache-2.0 0 0 0 Updated Oct 28, 2021
EnronPersonalizationValidation Public
Enron Personalization Validation Set

Starlark 0 0 0 0 Updated Oct 26, 2021
dstc8-schema-guided-dialogue Public
The Schema-Guided Dialogue Dataset

Python 331 CC-BY-SA-4.0 87 2 0 Updated Oct 20, 2021
ToTTo Public
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

293 27 4 1 Updated Oct 12, 2021
cats4ml-2021-dataset Public
This dataset is a result of the CATS4ML (Crowdsourcing Adverse Test Sets for Machine Learning) Data Challenge - an adversarial test-set sampling images and labels from the Open Images Dataset for state-of-the-art image classification models. The challenge invited participants to sample this existing publicly available dataset for images that are…

1 0 0 0 Updated Sep 28, 2021
wit Public
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

564 16 1 0 Updated Sep 24, 2021
RxR Public
Room-across-Room (RxR) is a large-scale, multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. It contains 126k navigation instructions in English, Hindi and Telugu, and 126k navigation following demonstrations. Both annotation types include dense spatiotemporal alignments between the text and the visual per…

HTML 56 CC-BY-4.0 5 1 1 Updated Sep 15, 2021
C4_200M-synthetic-dataset-for-grammatical-error-correction Public
This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/).

Python 58 CC-BY-4.0 15 1 1 Updated Sep 15, 2021
numbert Public

0 0 0 0 Updated Sep 11, 2021
wikipedia-intrinsic-capitalization Public
This is a corpus of sentences extracted from Wikipedia edit history with the sentence-initial words restored to their non-positional cases manually. For example, in "during the Soviet era...", the word "during" is in lowercase, while in "United will enter the Europa League at the group stage.", the word "United" is in uppercase, according to the…

0 CC0-1.0 0 0 0 Updated Sep 1, 2021

Top languages

Python Jupyter Notebook Shell HTML Starlark

Most used topics

deep-learning nlp nlp-machine-learning deep-neural-networks wikipedia
