The Wayback Machine - https://web.archive.org/web/20200528215536/https://github.com/google-research/bert/issues/1019
Explain the variables in the checkpoint #1019

Open
dhruvsakalley opened this issue Mar 2, 2020 · 2 comments

@dhruvsakalley dhruvsakalley commented Mar 2, 2020

When you look at the variables in the pretrained BERT base uncased checkpoint, they look like List 1. When you pretrain from scratch, two additional variables appear for each model variable, with the suffixes adam_m and adam_v. It would be nice if someone could explain what these variables are and what their significance is to the training process.
If one were to manually initialize variables from a prior checkpoint, how would those variables translate into these?
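For context on what such slots usually mean: adam_m and adam_v are conventionally the Adam optimizer's first-moment (running mean of gradients) and second-moment (running mean of squared gradients) accumulators, saved per model variable so training can resume exactly. Below is a minimal sketch of that idea, not code from this repo; the function names `adam_update` and `model_variables` are illustrative, and the bias-correction step shown is classic Adam (the repo's own optimizer may differ):

```python
def adam_update(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-6):
    """One classic Adam step.

    m is the running mean of gradients (what a .../adam_m slot would hold);
    v is the running mean of squared gradients (a .../adam_v slot).
    t is the 1-based step count, used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad           # first moment  -> adam_m
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second moment -> adam_v
    m_hat = m / (1.0 - beta1 ** t)                 # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v


def model_variables(checkpoint_names):
    """Keep only model weights, dropping the optimizer's slot variables.

    Useful when initializing from a prior checkpoint: the weights map
    name-for-name, while adam_m/adam_v can be discarded (the optimizer
    state is then re-accumulated from zero).
    """
    return [n for n in checkpoint_names
            if not n.endswith(("/adam_m", "/adam_v"))]
```

Under this reading, translating a from-scratch checkpoint back to the pretrained layout amounts to filtering out the `/adam_m` and `/adam_v` entries; dropping them only resets the optimizer's momentum, not the learned weights.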

List 1: Variables in pretrained BERT base uncased
[('bert/embeddings/LayerNorm/beta', [768]),
('bert/embeddings/LayerNorm/gamma', [768]),
('bert/embeddings/position_embeddings', [512, 768]),
('bert/embeddings/token_type_embeddings', [2, 768]),
('bert/embeddings/word_embeddings', [30522, 768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/attention/output/dense/bias', [768]),
('bert/encoder/layer_0/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/key/bias', [768]),
('bert/encoder/layer_0/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/query/bias', [768]),
('bert/encoder/layer_0/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/value/bias', [768]),
('bert/encoder/layer_0/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_0/intermediate/dense/bias', [3072]),
('bert/encoder/layer_0/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_0/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/output/dense/bias', [768]),
('bert/encoder/layer_0/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/attention/output/dense/bias', [768]),
('bert/encoder/layer_1/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/key/bias', [768]),
('bert/encoder/layer_1/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/query/bias', [768]),
('bert/encoder/layer_1/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/value/bias', [768]),
('bert/encoder/layer_1/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_1/intermediate/dense/bias', [3072]),
('bert/encoder/layer_1/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_1/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/output/dense/bias', [768]),
('bert/encoder/layer_1/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/attention/output/dense/bias', [768]),
('bert/encoder/layer_10/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/key/bias', [768]),
('bert/encoder/layer_10/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/query/bias', [768]),
('bert/encoder/layer_10/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/value/bias', [768]),
('bert/encoder/layer_10/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_10/intermediate/dense/bias', [3072]),
('bert/encoder/layer_10/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_10/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/output/dense/bias', [768]),
('bert/encoder/layer_10/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/attention/output/dense/bias', [768]),
('bert/encoder/layer_11/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/key/bias', [768]),
('bert/encoder/layer_11/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/query/bias', [768]),
('bert/encoder/layer_11/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/value/bias', [768]),
('bert/encoder/layer_11/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_11/intermediate/dense/bias', [3072]),
('bert/encoder/layer_11/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_11/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/output/dense/bias', [768]),
('bert/encoder/layer_11/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/attention/output/dense/bias', [768]),
('bert/encoder/layer_2/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/key/bias', [768]),
('bert/encoder/layer_2/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/query/bias', [768]),
('bert/encoder/layer_2/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/value/bias', [768]),
('bert/encoder/layer_2/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_2/intermediate/dense/bias', [3072]),
('bert/encoder/layer_2/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_2/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/output/dense/bias', [768]),
('bert/encoder/layer_2/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/attention/output/dense/bias', [768]),
('bert/encoder/layer_3/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/key/bias', [768]),
('bert/encoder/layer_3/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/query/bias', [768]),
('bert/encoder/layer_3/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/value/bias', [768]),
('bert/encoder/layer_3/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_3/intermediate/dense/bias', [3072]),
('bert/encoder/layer_3/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_3/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/output/dense/bias', [768]),
('bert/encoder/layer_3/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/attention/output/dense/bias', [768]),
('bert/encoder/layer_4/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/key/bias', [768]),
('bert/encoder/layer_4/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/query/bias', [768]),
('bert/encoder/layer_4/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/value/bias', [768]),
('bert/encoder/layer_4/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_4/intermediate/dense/bias', [3072]),
('bert/encoder/layer_4/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_4/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/output/dense/bias', [768]),
('bert/encoder/layer_4/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/attention/output/dense/bias', [768]),
('bert/encoder/layer_5/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/key/bias', [768]),
('bert/encoder/layer_5/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/query/bias', [768]),
('bert/encoder/layer_5/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/value/bias', [768]),
('bert/encoder/layer_5/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_5/intermediate/dense/bias', [3072]),
('bert/encoder/layer_5/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_5/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/output/dense/bias', [768]),
('bert/encoder/layer_5/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/attention/output/dense/bias', [768]),
('bert/encoder/layer_6/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/key/bias', [768]),
('bert/encoder/layer_6/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/query/bias', [768]),
('bert/encoder/layer_6/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/value/bias', [768]),
('bert/encoder/layer_6/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_6/intermediate/dense/bias', [3072]),
('bert/encoder/layer_6/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_6/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/output/dense/bias', [768]),
('bert/encoder/layer_6/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/attention/output/dense/bias', [768]),
('bert/encoder/layer_7/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/key/bias', [768]),
('bert/encoder/layer_7/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/query/bias', [768]),
('bert/encoder/layer_7/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/value/bias', [768]),
('bert/encoder/layer_7/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_7/intermediate/dense/bias', [3072]),
('bert/encoder/layer_7/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_7/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/output/dense/bias', [768]),
('bert/encoder/layer_7/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/attention/output/dense/bias', [768]),
('bert/encoder/layer_8/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/key/bias', [768]),
('bert/encoder/layer_8/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/query/bias', [768]),
('bert/encoder/layer_8/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/value/bias', [768]),
('bert/encoder/layer_8/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_8/intermediate/dense/bias', [3072]),
('bert/encoder/layer_8/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_8/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/output/dense/bias', [768]),
('bert/encoder/layer_8/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/attention/output/dense/bias', [768]),
('bert/encoder/layer_9/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/key/bias', [768]),
('bert/encoder/layer_9/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/query/bias', [768]),
('bert/encoder/layer_9/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/value/bias', [768]),
('bert/encoder/layer_9/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_9/intermediate/dense/bias', [3072]),
('bert/encoder/layer_9/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_9/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/output/dense/bias', [768]),
('bert/encoder/layer_9/output/dense/kernel', [3072, 768]),
('bert/pooler/dense/bias', [768]),
('bert/pooler/dense/kernel', [768, 768]),
('cls/predictions/output_bias', [30522]),
('cls/predictions/transform/LayerNorm/beta', [768]),
('cls/predictions/transform/LayerNorm/gamma', [768]),
('cls/predictions/transform/dense/bias', [768]),
('cls/predictions/transform/dense/kernel', [768, 768]),
('cls/seq_relationship/output_bias', [2]),
('cls/seq_relationship/output_weights', [2, 768])]

List 2: Variables in BERT base uncased pretrained from scratch with a new vocabulary:
[('bert/embeddings/LayerNorm/beta', [768]),
('bert/embeddings/LayerNorm/beta/adam_m', [768]),
('bert/embeddings/LayerNorm/beta/adam_v', [768]),
('bert/embeddings/LayerNorm/gamma', [768]),
('bert/embeddings/LayerNorm/gamma/adam_m', [768]),
('bert/embeddings/LayerNorm/gamma/adam_v', [768]),
('bert/embeddings/position_embeddings', [512, 768]),
('bert/embeddings/position_embeddings/adam_m', [512, 768]),
('bert/embeddings/position_embeddings/adam_v', [512, 768]),
('bert/embeddings/token_type_embeddings', [2, 768]),
('bert/embeddings/token_type_embeddings/adam_m', [2, 768]),
('bert/embeddings/token_type_embeddings/adam_v', [2, 768]),
('bert/embeddings/word_embeddings', [60812, 768]),
('bert/embeddings/word_embeddings/adam_m', [60812, 768]),
('bert/embeddings/word_embeddings/adam_v', [60812, 768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_0/attention/output/dense/bias', [768]),
('bert/encoder/layer_0/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_0/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/attention/self/key/bias', [768]),
('bert/encoder/layer_0/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/attention/self/query/bias', [768]),
('bert/encoder/layer_0/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/attention/self/value/bias', [768]),
('bert/encoder/layer_0/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/intermediate/dense/bias', [3072]),
('bert/encoder/layer_0/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_0/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_0/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_0/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_0/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_0/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_0/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_0/output/dense/bias', [768]),
('bert/encoder/layer_0/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_0/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_0/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_0/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_0/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_1/attention/output/dense/bias', [768]),
('bert/encoder/layer_1/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_1/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/attention/self/key/bias', [768]),
('bert/encoder/layer_1/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/attention/self/query/bias', [768]),
('bert/encoder/layer_1/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/attention/self/value/bias', [768]),
('bert/encoder/layer_1/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/intermediate/dense/bias', [3072]),
('bert/encoder/layer_1/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_1/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_1/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_1/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_1/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_1/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_1/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_1/output/dense/bias', [768]),
('bert/encoder/layer_1/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_1/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_1/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_1/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_1/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_10/attention/output/dense/bias', [768]),
('bert/encoder/layer_10/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_10/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/attention/self/key/bias', [768]),
('bert/encoder/layer_10/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/attention/self/query/bias', [768]),
('bert/encoder/layer_10/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/attention/self/value/bias', [768]),
('bert/encoder/layer_10/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/intermediate/dense/bias', [3072]),
('bert/encoder/layer_10/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_10/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_10/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_10/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_10/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_10/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_10/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_10/output/dense/bias', [768]),
('bert/encoder/layer_10/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_10/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_10/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_10/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_10/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_11/attention/output/dense/bias', [768]),
('bert/encoder/layer_11/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_11/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/attention/self/key/bias', [768]),
('bert/encoder/layer_11/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/attention/self/query/bias', [768]),
('bert/encoder/layer_11/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/attention/self/value/bias', [768]),
('bert/encoder/layer_11/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/intermediate/dense/bias', [3072]),
('bert/encoder/layer_11/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_11/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_11/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_11/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_11/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_11/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_11/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_11/output/dense/bias', [768]),
('bert/encoder/layer_11/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_11/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_11/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_11/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_11/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_2/attention/output/dense/bias', [768]),
('bert/encoder/layer_2/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_2/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/attention/self/key/bias', [768]),
('bert/encoder/layer_2/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/attention/self/query/bias', [768]),
('bert/encoder/layer_2/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/attention/self/value/bias', [768]),
('bert/encoder/layer_2/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/intermediate/dense/bias', [3072]),
('bert/encoder/layer_2/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_2/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_2/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_2/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_2/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_2/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_2/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_2/output/dense/bias', [768]),
('bert/encoder/layer_2/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_2/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_2/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_2/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_2/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_3/attention/output/dense/bias', [768]),
('bert/encoder/layer_3/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_3/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/attention/self/key/bias', [768]),
('bert/encoder/layer_3/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/attention/self/query/bias', [768]),
('bert/encoder/layer_3/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/attention/self/value/bias', [768]),
('bert/encoder/layer_3/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/intermediate/dense/bias', [3072]),
('bert/encoder/layer_3/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_3/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_3/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_3/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_3/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_3/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_3/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_3/output/dense/bias', [768]),
('bert/encoder/layer_3/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_3/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_3/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_3/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_3/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_4/attention/output/dense/bias', [768]),
('bert/encoder/layer_4/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_4/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/attention/self/key/bias', [768]),
('bert/encoder/layer_4/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/attention/self/query/bias', [768]),
('bert/encoder/layer_4/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/attention/self/value/bias', [768]),
('bert/encoder/layer_4/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/intermediate/dense/bias', [3072]),
('bert/encoder/layer_4/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_4/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_4/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_4/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_4/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_4/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_4/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_4/output/dense/bias', [768]),
('bert/encoder/layer_4/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_4/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_4/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_4/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_4/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_5/attention/output/dense/bias', [768]),
('bert/encoder/layer_5/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_5/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/attention/self/key/bias', [768]),
('bert/encoder/layer_5/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/attention/self/query/bias', [768]),
('bert/encoder/layer_5/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/attention/self/value/bias', [768]),
('bert/encoder/layer_5/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/intermediate/dense/bias', [3072]),
('bert/encoder/layer_5/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_5/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_5/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_5/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_5/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_5/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_5/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_5/output/dense/bias', [768]),
('bert/encoder/layer_5/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_5/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_5/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_5/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_5/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_6/attention/output/dense/bias', [768]),
('bert/encoder/layer_6/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_6/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/attention/self/key/bias', [768]),
('bert/encoder/layer_6/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/attention/self/query/bias', [768]),
('bert/encoder/layer_6/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/attention/self/value/bias', [768]),
('bert/encoder/layer_6/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/intermediate/dense/bias', [3072]),
('bert/encoder/layer_6/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_6/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_6/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_6/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_6/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_6/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_6/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_6/output/dense/bias', [768]),
('bert/encoder/layer_6/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_6/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_6/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_6/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_6/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_7/attention/output/dense/bias', [768]),
('bert/encoder/layer_7/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_7/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/attention/self/key/bias', [768]),
('bert/encoder/layer_7/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/attention/self/query/bias', [768]),
('bert/encoder/layer_7/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/attention/self/value/bias', [768]),
('bert/encoder/layer_7/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/intermediate/dense/bias', [3072]),
('bert/encoder/layer_7/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_7/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_7/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_7/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_7/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_7/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_7/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_7/output/dense/bias', [768]),
('bert/encoder/layer_7/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_7/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_7/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_7/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_7/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_8/attention/output/dense/bias', [768]),
('bert/encoder/layer_8/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_8/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/attention/self/key/bias', [768]),
('bert/encoder/layer_8/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/attention/self/query/bias', [768]),
('bert/encoder/layer_8/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/attention/self/value/bias', [768]),
('bert/encoder/layer_8/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/intermediate/dense/bias', [3072]),
('bert/encoder/layer_8/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_8/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_8/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_8/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_8/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_8/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_8/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_8/output/dense/bias', [768]),
('bert/encoder/layer_8/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_8/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_8/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_8/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_8/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_9/attention/output/dense/bias', [768]),
('bert/encoder/layer_9/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_9/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/attention/self/key/bias', [768]),
('bert/encoder/layer_9/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/attention/self/query/bias', [768]),
('bert/encoder/layer_9/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/attention/self/value/bias', [768]),
('bert/encoder/layer_9/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/intermediate/dense/bias', [3072]),
('bert/encoder/layer_9/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_9/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_9/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_9/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_9/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_9/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_9/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_9/output/dense/bias', [768]),
('bert/encoder/layer_9/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_9/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_9/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_9/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_9/output/dense/kernel/adam_v', [3072, 768]),
('bert/pooler/dense/bias', [768]),
('bert/pooler/dense/bias/adam_m', [768]),
('bert/pooler/dense/bias/adam_v', [768]),
('bert/pooler/dense/kernel', [768, 768]),
('bert/pooler/dense/kernel/adam_m', [768, 768]),
('bert/pooler/dense/kernel/adam_v', [768, 768]),
('cls/predictions/output_bias', [60812]),
('cls/predictions/output_bias/adam_m', [60812]),
('cls/predictions/output_bias/adam_v', [60812]),
('cls/predictions/transform/LayerNorm/beta', [768]),
('cls/predictions/transform/LayerNorm/beta/adam_m', [768]),
('cls/predictions/transform/LayerNorm/beta/adam_v', [768]),
('cls/predictions/transform/LayerNorm/gamma', [768]),
('cls/predictions/transform/LayerNorm/gamma/adam_m', [768]),
('cls/predictions/transform/LayerNorm/gamma/adam_v', [768]),
('cls/predictions/transform/dense/bias', [768]),
('cls/predictions/transform/dense/bias/adam_m', [768]),
('cls/predictions/transform/dense/bias/adam_v', [768]),
('cls/predictions/transform/dense/kernel', [768, 768]),
('cls/predictions/transform/dense/kernel/adam_m', [768, 768]),
('cls/predictions/transform/dense/kernel/adam_v', [768, 768]),
('cls/seq_relationship/output_bias', [2]),
('cls/seq_relationship/output_bias/adam_m', [2]),
('cls/seq_relationship/output_bias/adam_v', [2]),
('cls/seq_relationship/output_weights', [2, 768]),
('cls/seq_relationship/output_weights/adam_m', [2, 768]),
('cls/seq_relationship/output_weights/adam_v', [2, 768]),
('global_step', [])]

@dhruvsakalley
Author

@dhruvsakalley dhruvsakalley commented Mar 6, 2020

Alright, after some digging around, this is what I found; I'd love some confirmation on it:

Based on a response from a member of the BERT team, the fine-tuned model is roughly three times larger than the distributed checkpoint because it also stores the Adam first-moment and second-moment variables (the `adam_m` and `adam_v` suffixes) for each weight variable. Both are needed to be able to pause and resume training. In other words, if you intend to serve your fine-tuned model without any further training, you can remove both sets of variables and the size will be more or less the same as the distributed model.
https://towardsdatascience.com/3-ways-to-optimize-and-export-bert-model-for-online-serving-8f49d774a501

So it seems I can ignore those for the init checkpoint; the code appears to handle these variables being absent. I'd still love confirmation from the BERT folks, though.
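For what it's worth, the optimizer slots are easy to identify by name alone, so a serving-only variable list can be built by dropping everything whose name ends in `adam_m` or `adam_v` (plus `global_step`). Here is a minimal, hypothetical sketch of just the name filter; the actual checkpoint surgery would go through something like `tf.train.load_checkpoint` and a `Saver`, which I'm omitting:

```python
# Sketch: separate model weights from Adam optimizer slots by name.
# The (name, shape) tuples mirror the listing above; only the suffix matters.
ADAM_SUFFIXES = ("/adam_m", "/adam_v")

def is_optimizer_slot(name):
    """True for Adam moment variables and the global step counter."""
    return name.endswith(ADAM_SUFFIXES) or name == "global_step"

def serving_variables(variables):
    """Keep only the weights a serving-only checkpoint needs."""
    return [(name, shape) for name, shape in variables
            if not is_optimizer_slot(name)]

variables = [
    ("bert/pooler/dense/kernel", [768, 768]),
    ("bert/pooler/dense/kernel/adam_m", [768, 768]),
    ("bert/pooler/dense/kernel/adam_v", [768, 768]),
    ("global_step", []),
]
print(serving_variables(variables))
# -> [('bert/pooler/dense/kernel', [768, 768])]
```

Since each weight gains exactly two same-shaped Adam companions, dropping them should recover roughly the distributed checkpoint's size.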

@mananeau

@mananeau mananeau commented Mar 24, 2020

These variables are indeed the first-moment (mean) and second-moment (uncentered variance) estimates kept by the Adam optimizer. They are created during training and are needed to pause and resume training. The distributed checkpoints (the ones you can download from the repos) don't contain these Adam variables, which is why they are three times smaller. See Jacob Devlin's answer here.

I personally ran into issues trying to initialize from the distributed checkpoints because of the absence of these Adam variables; happy to hear more if you managed to do it! :)
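To make "momentum and variance" concrete: `adam_m` and `adam_v` hold exponential moving averages of the gradient and of the squared gradient, one pair per weight tensor, which is why every weight in the listing gains exactly two same-shaped companions. A toy scalar sketch of a single Adam step (textbook Adam; BERT's own `AdamWeightDecayOptimizer` additionally skips bias correction and adds weight decay):

```python
# Toy scalar Adam step: m and v are what the checkpoint stores as
# <var>/adam_m and <var>/adam_v for every trainable variable.
def adam_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6):
    m = beta1 * m + (1.0 - beta1) * g      # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # running mean of squared gradients
    w = w - lr * m / (v ** 0.5 + eps)      # parameter update
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, g=0.5, m=m, v=v)
# After one step: m = 0.05, v = 0.00025, and w has moved slightly below 1.0.
```

Resuming training from a checkpoint that lacks `m` and `v` effectively restarts them at zero, which changes the first few updates; for inference they are never read at all.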
