language-model

First good issue

A current source of errors is that a user forwards a batched tensor of input_ids that includes padding tokens, e.g. input_ids = torch.tensor([["hello", "this", "is", "a", "long", "string"], ["hello", "<pad>", "<pad>", "<pad>", "<pad>", "<pad>"]])

In this case, the attention_mask should be provided as well. Otherwise the output hidden_states will be incorrectly computed. This is
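
The effect of the mask can be illustrated with a minimal, library-free sketch. It is not the transformers implementation: `PAD_ID`, `build_attention_mask`, and `masked_softmax` are hypothetical names, and the token ids are made up. It only shows that padded positions receive zero attention weight once a mask is applied.

```python
import math

PAD_ID = 0  # hypothetical padding token id, chosen for illustration only


def build_attention_mask(input_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in input_ids]


def masked_softmax(scores, mask):
    """Softmax over scores, with masked positions forced to zero weight.

    Assumes at least one position in the row is unmasked.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Two sequences padded to the same length; ids are illustrative.
input_ids = [[11, 12, 13, 14, 15, 16], [11, 0, 0, 0, 0, 0]]
attention_mask = build_attention_mask(input_ids)

# With equal raw scores, all weight goes to the single real token in row 2.
weights = masked_softmax([0.5] * 6, attention_mask[1])
```

Deriving the mask from the pad id is only safe when that id is never used for a real token; when a model reuses another special token for padding, the mask should come from the tokenizer instead.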

The paper states:

Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my
dog is hairy it chooses hairy.

This implies that exactly 15% of the tokens are always chosen.

In https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68, however,
every single token independently has a 15% chance of going through the follow-up procedure.
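
The difference between the two readings can be sketched as follows. Neither function is taken from the paper or from the BERT-pytorch repository; `choose_exact` and `choose_per_token` are hypothetical names, and the rounding rule in `choose_exact` is an assumption.

```python
import random

MASK_RATE = 0.15


def choose_exact(num_tokens, rng, rate=MASK_RATE):
    """Pick a fixed number of positions: about 15% of the sequence.

    Assumption: round to the nearest integer and select at least one token.
    """
    k = max(1, round(num_tokens * rate))
    return sorted(rng.sample(range(num_tokens), k))


def choose_per_token(num_tokens, rng, rate=MASK_RATE):
    """Give every position an independent 15% chance of being picked.

    The number of selected positions varies from call to call and can be zero.
    """
    return [i for i in range(num_tokens) if rng.random() < rate]


rng = random.Random(0)
exact = choose_exact(100, rng)          # always 15 positions for 100 tokens
per_token = choose_per_token(100, rng)  # 15 positions only on average
```

For long sequences the two schemes select similar numbers of tokens, but for short sentences the per-token variant can easily select none at all.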

Problem
Currently, FARMReader asks users to raise max_seq_length whenever some samples are longer than the configured value. This is confusing when max_seq_length is already set to the maximum the model allows, because raising it further causes hard-to-read CUDA errors.

See #2177.

Solution
We should find a way to query the model for the maximum va
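
One possible shape for such a check is sketched below. It is not Haystack code: `ModelConfig` is a hypothetical stand-in for a model configuration object, and `resolve_max_seq_length` is an invented helper. The attribute name `max_position_embeddings` mirrors the field that Hugging Face configs for BERT-style models expose, but whether every supported model provides it is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Hypothetical stand-in for a model configuration object."""
    max_position_embeddings: int


def resolve_max_seq_length(requested, config):
    """Validate the requested length against the model's own limit.

    Fails early with a readable message instead of letting an oversized
    sequence reach the model.
    """
    limit = config.max_position_embeddings
    if requested > limit:
        raise ValueError(
            f"max_seq_length={requested} exceeds the model limit of {limit}; "
            "split or truncate long samples instead of raising it further."
        )
    return requested


config = ModelConfig(max_position_embeddings=512)
ok_length = resolve_max_seq_length(384, config)
```

With a check like this, the "raise max_seq_length" hint could be shown only while the current value is still below the limit.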

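Polyphonic characters are currently handled with pypinyin or g2pM, whose accuracy is limited. The idea is to build a polyphone prediction model based on BERT (or ERNIE). Put simply, suppose a language has 100 polyphonic characters, each with at most 3 pronunciations; then 100 three-way classifiers (simple fc layers are enough) can be attached on top of BERT, and at prediction time the classifier for the character in question is looked up and used for classification.
Reference paper: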
tencent_polyphone.pdf

Data: the dataset provided by https://github.com/kakaobrain/g2pM can be used.
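
The per-character classifier lookup can be sketched without any deep learning library. Everything here is illustrative: the `POLYPHONES` inventory, the tiny `HIDDEN_SIZE`, and the random, untrained weights are assumptions, and a random vector stands in for the encoder output at the character's position.

```python
import random

HIDDEN_SIZE = 8         # stand-in for the encoder's hidden size
MAX_PRONUNCIATIONS = 3  # each polyphonic character has at most 3 readings

# Hypothetical inventory: character -> candidate pronunciations.
POLYPHONES = {
    "行": ["xing2", "hang2"],
    "长": ["chang2", "zhang3"],
    "乐": ["le4", "yue4"],
}


def make_head(rng):
    """One fc layer: MAX_PRONUNCIATIONS rows of HIDDEN_SIZE weights plus biases."""
    return {
        "weight": [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
                   for _ in range(MAX_PRONUNCIATIONS)],
        "bias": [0.0] * MAX_PRONUNCIATIONS,
    }


def predict(char, hidden, heads):
    """Look up the head for this character and return its best-scoring reading."""
    head = heads[char]
    candidates = POLYPHONES[char]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(head["weight"], head["bias"])]
    # Ignore output slots that have no pronunciation for this character.
    logits = logits[:len(candidates)]
    best = max(range(len(logits)), key=logits.__getitem__)
    return candidates[best]


rng = random.Random(0)
heads = {char: make_head(rng) for char in POLYPHONES}  # one classifier per character
hidden = [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]  # stand-in for a BERT vector
reading = predict("行", hidden, heads)
```

In a trained model, the heads would share the encoder and only the head matching the target character would contribute to the loss for a given example.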

Advanced: multi-task BERT
![image](https://user-images.githubusercontent.com/24568452

Issue to track tutorial requests:

Deep Learning with PyTorch: A 60 Minute Blitz - #69
Sentence Classification - #79

language-model

Here are 864 public repositories matching this topic...

huggingface / transformers

brightmart / nlp_chinese_corpus

EleutherAI / gpt-neo

huggingface / tokenizers

codertimo / BERT-pytorch

deepset-ai / haystack

speechbrain / speechbrain

PaddlePaddle / PaddleSpeech

CLUEbenchmark / CLUE

tensorflow / lingvo

CyberZHG / keras-bert

zzw922cn / awesome-speech-recognition-speech-synthesis-papers

chiphuyen / lazynlp

Separius / awesome-sentence-embedding

salesforce / awd-lstm-lm

EleutherAI / gpt-neox

NVIDIA / OpenSeq2Seq

huggingface / pytorch-openai-transformer-lm

prabhuomkar / pytorch-cpp

nlpodyssey / spago

ymcui / Chinese-ELECTRA

explosion / spacy-transformers

mihail911 / nlp-library

brightmart / bert_language_understanding

microsoft / DeBERTa

pykaldi / pykaldi

LiyuanLucasLiu / LM-LSTM-CRF

SKTBrain / KoBERT

smilelight / lightNLP

IsaacChanghau / DL-NLP-Readings
