The Wayback Machine - https://web.archive.org/web/20211104043010/https://github.com/google-research-datasets
@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 686 132

  2. Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.

    Shell 311 16

  3. ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, p…

    293 27

  4. dakshina Public

    The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia tex…

    131 14

  5. tydiqa Public

    TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the trai…

    Python 209 32

  6. GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practica…

    Python 197 72

Repositories

  • WikipediaAbbreviationData Public

    This dataset consists of 24,000 English sentences, extracted from Wikipedia in 2017 and annotated to support development of an abbreviation expansion system for text-to-speech synthesis (e.g., a systm tht cn prnounc txt lk ths).

    Python 2 Apache-2.0 0 0 0 Updated Oct 28, 2021
  • EnronPersonalizationValidation Public

    Enron Personalization Validation Set

    Starlark 0 0 0 0 Updated Oct 26, 2021
  • dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 331 CC-BY-SA-4.0 87 2 0 Updated Oct 20, 2021
  • ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

    293 27 4 1 Updated Oct 12, 2021
  • cats4ml-2021-dataset Public

    This dataset is a result of the CATS4ML (Crowdsourcing Adverse Test Sets for Machine Learning) Data Challenge: an adversarial test set of images and labels sampled from the Open Images Dataset, targeting state-of-the-art image classification models. The challenge invited participants to sample this existing publicly available dataset for images that are…

    1 0 0 0 Updated Sep 28, 2021
  • wit Public

    The WIT (Wikipedia-based Image Text) dataset is a large multimodal, multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    564 16 1 0 Updated Sep 24, 2021
  • RxR Public

    Room-across-Room (RxR) is a large-scale, multilingual dataset for Vision-and-Language Navigation (VLN) in Matterport3D environments. It contains 126k navigation instructions in English, Hindi, and Telugu, and 126k navigation-following demonstrations. Both annotation types include dense spatiotemporal alignments between the text and the visual per…

    HTML 56 CC-BY-4.0 5 1 1 Updated Sep 15, 2021
  • C4_200M-synthetic-dataset-for-grammatical-error-correction Public

    This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021): https://www.aclweb.org/anthology/2021.bea-1.4/

    Python 58 CC-BY-4.0 15 1 1 Updated Sep 15, 2021
  • numbert Public
    0 0 0 0 Updated Sep 11, 2021
  • wikipedia-intrinsic-capitalization Public

    This is a corpus of sentences extracted from Wikipedia edit history in which the sentence-initial words have been manually restored to their non-positional case. For example, in "during the Soviet era...", the word "during" is in lowercase, while in "United will enter the Europa League at the group stage.", the word "United" is in uppercase, according to the…

    0 CC0-1.0 0 0 0 Updated Sep 1, 2021
