The Wayback Machine - https://web.archive.org/web/20200812013855/https://github.com/topics/corpus
Here are 497 public repositories matching this topic...
Large Scale Chinese Corpus for NLP
A collection of small corpuses of interesting data for the creation of bots and similar stuff.
Updated Aug 4, 2020 · JavaScript
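Chinese personal name corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, English names. Can be used for Chinese word segmentation and person-name entity recognition.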
Deep learning and deep reinforcement learning research papers and some code
Final Weibo Crawler: scrape anything from Weibo, including comments, Weibo contents, followers, anything. The Terminator
Updated Oct 25, 2019 · Python
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Updated Feb 10, 2020 · Python
Corpora for training Chinese and English dialogue systems. Datasets for Training Chatbot System
Updated Nov 13, 2017 · Python
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Updated Jul 15, 2020 · Python
Updated Mar 1, 2020 · Python
A multilingual dialog corpus
Updated Aug 9, 2020 · Python
Open data in the insurance domain for machine learning tasks; an insurance industry corpus
Updated Jul 13, 2018 · Python
Chatbot in 200 lines of code using TensorLayer
Updated Oct 6, 2019 · Python
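Company name corpus. Organization name corpus. Company short names, abbreviations, brand terms, enterprise names. Can be used for Chinese word segmentation and organization-name entity recognition.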
An R package for the Quantitative Analysis of Textual Data
Some useful Chinese corpus datasets (small Chinese corpus data)
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Fuzzing resources for feeding various fuzzers with input. 🔧
Updated Apr 28, 2020 · HTML
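A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and specialized similarity models
Updated Jul 8, 2020 · Python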
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Updated Aug 9, 2020 · Python
Collections of Chinese NLP corpus
Updated Jul 21, 2020 · Python
Updated May 20, 2020 · Python
A dataset of millions of news articles scraped from a curated list of data sources.
Large-scale Pre-training Corpus for Chinese 100G (Chinese pre-training corpus)
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Updated Sep 11, 2019 · HTML
WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata.
Updated Jan 10, 2018 · Ruby
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Updated Jul 13, 2020 · Python
Updated Sep 9, 2019 · Jupyter Notebook
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Updated Oct 30, 2019 · Python
❤️ Emotional First Aid Dataset, a psychological counseling question-and-answer corpus
Updated Jun 17, 2020 · Python