The Wayback Machine - https://web.archive.org/web/20200617211028/https://github.com/topics/nlp-datasets
Here are 61 public repositories matching this topic...
- Curated collection of papers for the NLP practitioner 📖 👩‍🔬
- TensorFlow and Keras implementation of Very Deep Convolutional Neural Networks for Text Classification (Python, updated Nov 14, 2019)
- Chinese and English NER datasets and an English-Chinese machine translation dataset (Python, updated May 20, 2019)
- multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks (Python, updated Jun 15, 2020)
- TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020) (Python, updated Apr 17, 2020)
- What Twitter reveals about the differences between cities and the monoculture of the Bay Area (Jupyter Notebook, updated May 31, 2019)
- The release of the FreebaseQA data set (NAACL 2019)
- A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning (Python, updated Jan 30, 2020)
- Turkish writings dataset that promotes creativity, content, composition, grammar, spelling, and punctuation (Jupyter Notebook, updated Feb 4, 2018)
- Extracts transcripts and summaries (abstractive and extractive) from the AMI Meeting Corpus (Python, updated Dec 4, 2019)
- Reading the data from OPIEC, an Open Information Extraction corpus (Java, updated Jun 12, 2019)
- Model training and a custom generative function for raplyrics.eu, a rap-lyrics generation project (Python, updated Oct 20, 2019)
- Datasets with text data for use in NLP, text analysis, information extraction, and ML research (Jupyter Notebook, updated Feb 1, 2019)
- The Mueller Report Corpus v0.1 (Java, updated Jun 12, 2019)
- A Python solution submitted to the Analytics Vidhya contest "Identify the Sentiments"; the submission ranked 118th on the public leaderboard (Python, updated May 29, 2020)
- Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data (Python, updated May 6, 2020)
- The E2E Dataset, packed as a PyTorch Dataset subclass (Python, updated Jul 12, 2018)
- A Typed Event-Focused Lexical Inference Benchmark for Evaluating Natural Language Inference (Python, updated Apr 17, 2020)
- Use of the state-of-the-art FLAIR library on NLP datasets (Jupyter Notebook, updated Apr 22, 2020)
- Library for generating Russian names (Python, updated Apr 23, 2019)
- Question-answering system using the BiDAF model on SQuAD v2.0 (Python, updated Apr 24, 2020)
- English loanwords in Japanese
- Extracts abstract and title datasets from arXiv articles (Python, updated Dec 9, 2019)
- Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data (Python, updated May 7, 2020)
Rather than the current system, where each sub-corpus is its own folder with its own code, create a top-level downloads.sh that can re-assemble the sub-corpora. Separately, have the downloaded and pre-processed sub-corpora ready to be referenced from the ADR and NMT repos as submodules, etc.
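The proposed downloads.sh could be sketched roughly as below. This is a minimal sketch under stated assumptions: the sub-corpus names and URLs are illustrative placeholders (the note does not specify the actual sources), and curl and tar are assumed to be available.

```shell
#!/usr/bin/env sh
# downloads.sh -- re-assemble the sub-corpora into a single top-level
# data/ directory. The names and URLs below are hypothetical
# placeholders; substitute the project's real sources.
set -eu

DATA_DIR="${DATA_DIR:-data}"
mkdir -p "$DATA_DIR"

# One line per sub-corpus: <name> <archive-url>  (placeholder entries)
SUBCORPORA="
ner https://example.org/corpora/ner.tar.gz
nmt https://example.org/corpora/nmt.tar.gz
"

fetch_subcorpus() {
    name=$1
    url=$2
    dest="$DATA_DIR/$name"
    # Idempotent: skip sub-corpora that are already assembled.
    if [ -d "$dest" ]; then
        echo "skipping $name: already present"
        return 0
    fi
    mkdir -p "$dest"
    # curl is assumed; swap in wget if preferred.
    curl -L "$url" | tar -xz -C "$dest"
}

main() {
    echo "$SUBCORPORA" | while read -r name url; do
        if [ -n "$name" ]; then
            fetch_subcorpus "$name" "$url"
        fi
    done
}

# Only fetch when invoked with --run, so the functions can be sourced
# without touching the network.
if [ "${1:-}" = "--run" ]; then
    main
fi
```

A pre-processed sub-corpus assembled this way could then be wired into the ADR and NMT repos with `git submodule add <repo-url> data/<name>`, so each downstream repo pins an exact version of the data.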