Explain the variables in the checkpoint #1019
Comments
Alright, after some digging around, this is what I found; I would love some confirmation on it:
So it seems I can ignore those variables for the init checkpoint model: the code appears to handle them being absent. I would still love confirmation from the BERT folks, though.
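To illustrate what "handles those variables being absent" could mean, here is a minimal sketch in plain Python, not the actual BERT code. It assumes the restore step only copies variables whose names exist in the init checkpoint and leaves everything else at its fresh initial value. The helper name `split_by_checkpoint` and the shortened variable lists are hypothetical.

```python
def split_by_checkpoint(graph_var_names, checkpoint_var_names):
    """Split graph variables into those restorable from the checkpoint and the rest."""
    in_checkpoint = set(checkpoint_var_names)
    restored = [name for name in graph_var_names if name in in_checkpoint]
    fresh = [name for name in graph_var_names if name not in in_checkpoint]
    return restored, fresh


# Variables in the training graph: one weight plus its two Adam slots.
graph_var_names = [
    "bert/embeddings/LayerNorm/beta",
    "bert/embeddings/LayerNorm/beta/adam_m",
    "bert/embeddings/LayerNorm/beta/adam_v",
]

# Names stored in an init checkpoint that has no Adam slots.
checkpoint_var_names = [
    "bert/embeddings/LayerNorm/beta",
]

restored, fresh = split_by_checkpoint(graph_var_names, checkpoint_var_names)
print("restored from checkpoint:", restored)
print("left at initial value:", fresh)
```

Under that assumption the weight is restored and the two Adam slots simply start from their initial values, so nothing fails when they are missing from the checkpoint.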
These variables are indeed the momentum (adam_m) and variance (adam_v) estimates kept by the Adam optimizer. They are created during training and are needed to pause and resume training. The distributed checkpoints (the ones you can download from the repos) don't contain these Adam variables, which is why they are three times smaller. See the answer of Jacob Devlin here. I personally faced issues trying to initialize from distributed checkpoints because of the absence of these Adam variables, so I'd be happy to know more if you managed to do it! :)
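The "three times smaller" figure follows from each weight having two extra tensors of the same shape. A minimal sketch of that arithmetic, using two variables and their shapes from the lists in this issue; the helper names `strip_adam_slots` and `num_params` are made up for the example, and a real checkpoint may hold a few other entries, so the ratio is only approximately 3 in practice.

```python
from math import prod

ADAM_SUFFIXES = ("/adam_m", "/adam_v")


def strip_adam_slots(variables):
    """Keep only the model weights, dropping the Adam slot variables."""
    return [(name, shape) for name, shape in variables
            if not name.endswith(ADAM_SUFFIXES)]


def num_params(variables):
    """Total number of scalar values across all listed variables."""
    return sum(prod(shape) for _, shape in variables)


# A small excerpt in the same (name, shape) format as the lists in this issue.
full_checkpoint = [
    ("bert/embeddings/LayerNorm/beta", [768]),
    ("bert/embeddings/LayerNorm/beta/adam_m", [768]),
    ("bert/embeddings/LayerNorm/beta/adam_v", [768]),
    ("bert/embeddings/position_embeddings", [512, 768]),
    ("bert/embeddings/position_embeddings/adam_m", [512, 768]),
    ("bert/embeddings/position_embeddings/adam_v", [512, 768]),
]

weights_only = strip_adam_slots(full_checkpoint)
print(num_params(full_checkpoint), num_params(weights_only))
print(num_params(full_checkpoint) / num_params(weights_only))
```

The same filter applied to the output of a checkpoint listing would give the weight-only view that matches the released checkpoints.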


When you look at the variables in the pretrained BERT base uncased model, they look like List1. When you train from scratch, 2 additional variables are introduced for each model variable, with the suffixes adam_m and adam_v. It would be nice if someone could explain what these variables are and what their significance is to the training process.
If one were to manually initialize variables from a prior checkpoint, what would be the translation into these variables?
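For readers unfamiliar with Adam, here is a minimal sketch of one update step in plain Python. It is a simplified form that leaves out bias correction and weight decay, and it is not taken from the BERT code. The function name `adam_step` and the hyperparameter values are illustrative assumptions. It shows that adam_m and adam_v are running averages of the gradient and the squared gradient, have the same shape as the weight they belong to, and start at zero when no earlier optimizer state is available.

```python
import math


def adam_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6):
    """One simplified Adam step on flat lists of floats.

    Bias correction and weight decay are intentionally omitted.
    """
    # Running average of the gradient (the "adam_m" slot).
    new_m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    # Running average of the squared gradient (the "adam_v" slot).
    new_v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Each weight moves by its averaged gradient, scaled by the root of adam_v.
    new_param = [p - lr * mi / (math.sqrt(vi) + eps)
                 for p, mi, vi in zip(param, new_m, new_v)]
    return new_param, new_m, new_v


param = [0.5, -0.25]           # weights, e.g. restored from a checkpoint
adam_m = [0.0] * len(param)    # no prior optimizer state: start at zero
adam_v = [0.0] * len(param)
grad = [0.1, -0.2]             # made-up gradient for the example

param, adam_m, adam_v = adam_step(param, grad, adam_m, adam_v)
print(param)
print(adam_m)
print(adam_v)
```

Read this way, there is no translation to perform when starting from a weight-only checkpoint: the weights are copied and the two slots begin at zero, at the cost of losing the optimizer's accumulated history.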
List1: Pretrained BERT base uncased variables
[('bert/embeddings/LayerNorm/beta', [768]),
('bert/embeddings/LayerNorm/gamma', [768]),
('bert/embeddings/position_embeddings', [512, 768]),
('bert/embeddings/token_type_embeddings', [2, 768]),
('bert/embeddings/word_embeddings', [30522, 768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/attention/output/dense/bias', [768]),
('bert/encoder/layer_0/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/key/bias', [768]),
('bert/encoder/layer_0/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/query/bias', [768]),
('bert/encoder/layer_0/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/value/bias', [768]),
('bert/encoder/layer_0/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_0/intermediate/dense/bias', [3072]),
('bert/encoder/layer_0/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_0/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/output/dense/bias', [768]),
('bert/encoder/layer_0/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/attention/output/dense/bias', [768]),
('bert/encoder/layer_1/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/key/bias', [768]),
('bert/encoder/layer_1/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/query/bias', [768]),
('bert/encoder/layer_1/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/value/bias', [768]),
('bert/encoder/layer_1/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_1/intermediate/dense/bias', [3072]),
('bert/encoder/layer_1/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_1/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/output/dense/bias', [768]),
('bert/encoder/layer_1/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/attention/output/dense/bias', [768]),
('bert/encoder/layer_10/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/key/bias', [768]),
('bert/encoder/layer_10/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/query/bias', [768]),
('bert/encoder/layer_10/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/value/bias', [768]),
('bert/encoder/layer_10/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_10/intermediate/dense/bias', [3072]),
('bert/encoder/layer_10/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_10/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/output/dense/bias', [768]),
('bert/encoder/layer_10/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/attention/output/dense/bias', [768]),
('bert/encoder/layer_11/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/key/bias', [768]),
('bert/encoder/layer_11/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/query/bias', [768]),
('bert/encoder/layer_11/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/value/bias', [768]),
('bert/encoder/layer_11/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_11/intermediate/dense/bias', [3072]),
('bert/encoder/layer_11/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_11/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/output/dense/bias', [768]),
('bert/encoder/layer_11/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/attention/output/dense/bias', [768]),
('bert/encoder/layer_2/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/key/bias', [768]),
('bert/encoder/layer_2/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/query/bias', [768]),
('bert/encoder/layer_2/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/value/bias', [768]),
('bert/encoder/layer_2/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_2/intermediate/dense/bias', [3072]),
('bert/encoder/layer_2/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_2/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/output/dense/bias', [768]),
('bert/encoder/layer_2/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/attention/output/dense/bias', [768]),
('bert/encoder/layer_3/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/key/bias', [768]),
('bert/encoder/layer_3/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/query/bias', [768]),
('bert/encoder/layer_3/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/value/bias', [768]),
('bert/encoder/layer_3/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_3/intermediate/dense/bias', [3072]),
('bert/encoder/layer_3/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_3/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/output/dense/bias', [768]),
('bert/encoder/layer_3/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/attention/output/dense/bias', [768]),
('bert/encoder/layer_4/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/key/bias', [768]),
('bert/encoder/layer_4/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/query/bias', [768]),
('bert/encoder/layer_4/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/value/bias', [768]),
('bert/encoder/layer_4/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_4/intermediate/dense/bias', [3072]),
('bert/encoder/layer_4/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_4/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/output/dense/bias', [768]),
('bert/encoder/layer_4/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/attention/output/dense/bias', [768]),
('bert/encoder/layer_5/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/key/bias', [768]),
('bert/encoder/layer_5/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/query/bias', [768]),
('bert/encoder/layer_5/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/value/bias', [768]),
('bert/encoder/layer_5/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_5/intermediate/dense/bias', [3072]),
('bert/encoder/layer_5/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_5/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/output/dense/bias', [768]),
('bert/encoder/layer_5/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/attention/output/dense/bias', [768]),
('bert/encoder/layer_6/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/key/bias', [768]),
('bert/encoder/layer_6/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/query/bias', [768]),
('bert/encoder/layer_6/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/value/bias', [768]),
('bert/encoder/layer_6/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_6/intermediate/dense/bias', [3072]),
('bert/encoder/layer_6/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_6/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/output/dense/bias', [768]),
('bert/encoder/layer_6/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/attention/output/dense/bias', [768]),
('bert/encoder/layer_7/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/key/bias', [768]),
('bert/encoder/layer_7/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/query/bias', [768]),
('bert/encoder/layer_7/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/value/bias', [768]),
('bert/encoder/layer_7/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_7/intermediate/dense/bias', [3072]),
('bert/encoder/layer_7/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_7/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/output/dense/bias', [768]),
('bert/encoder/layer_7/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/attention/output/dense/bias', [768]),
('bert/encoder/layer_8/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/key/bias', [768]),
('bert/encoder/layer_8/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/query/bias', [768]),
('bert/encoder/layer_8/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/value/bias', [768]),
('bert/encoder/layer_8/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_8/intermediate/dense/bias', [3072]),
('bert/encoder/layer_8/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_8/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/output/dense/bias', [768]),
('bert/encoder/layer_8/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/attention/output/dense/bias', [768]),
('bert/encoder/layer_9/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/key/bias', [768]),
('bert/encoder/layer_9/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/query/bias', [768]),
('bert/encoder/layer_9/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/value/bias', [768]),
('bert/encoder/layer_9/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_9/intermediate/dense/bias', [3072]),
('bert/encoder/layer_9/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_9/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/output/dense/bias', [768]),
('bert/encoder/layer_9/output/dense/kernel', [3072, 768]),
('bert/pooler/dense/bias', [768]),
('bert/pooler/dense/kernel', [768, 768]),
('cls/predictions/output_bias', [30522]),
('cls/predictions/transform/LayerNorm/beta', [768]),
('cls/predictions/transform/LayerNorm/gamma', [768]),
('cls/predictions/transform/dense/bias', [768]),
('cls/predictions/transform/dense/kernel', [768, 768]),
('cls/seq_relationship/output_bias', [2]),
('cls/seq_relationship/output_weights', [2, 768])]
List2: Variables of BERT base uncased pretrained from scratch with a new vocabulary
[('bert/embeddings/LayerNorm/beta', [768]),
('bert/embeddings/LayerNorm/beta/adam_m', [768]),
('bert/embeddings/LayerNorm/beta/adam_v', [768]),
('bert/embeddings/LayerNorm/gamma', [768]),
('bert/embeddings/LayerNorm/gamma/adam_m', [768]),
('bert/embeddings/LayerNorm/gamma/adam_v', [768]),
('bert/embeddings/position_embeddings', [512, 768]),
('bert/embeddings/position_embeddings/adam_m', [512, 768]),
('bert/embeddings/position_embeddings/adam_v', [512, 768]),
('bert/embeddings/token_type_embeddings', [2, 768]),
('bert/embeddings/token_type_embeddings/adam_m', [2, 768]),
('bert/embeddings/token_type_embeddings/adam_v', [2, 768]),
('bert/embeddings/word_embeddings', [60812, 768]),
('bert/embeddings/word_embeddings/adam_m', [60812, 768]),
('bert/embeddings/word_embeddings/adam_v', [60812, 768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_0/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_0/attention/output/dense/bias', [768]),
('bert/encoder/layer_0/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_0/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/attention/self/key/bias', [768]),
('bert/encoder/layer_0/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/attention/self/query/bias', [768]),
('bert/encoder/layer_0/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/attention/self/value/bias', [768]),
('bert/encoder/layer_0/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_0/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_0/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_0/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_0/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_0/intermediate/dense/bias', [3072]),
('bert/encoder/layer_0/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_0/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_0/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_0/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_0/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_0/output/LayerNorm/beta', [768]),
('bert/encoder/layer_0/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_0/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_0/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_0/output/dense/bias', [768]),
('bert/encoder/layer_0/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_0/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_0/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_0/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_0/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_1/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_1/attention/output/dense/bias', [768]),
('bert/encoder/layer_1/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_1/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/attention/self/key/bias', [768]),
('bert/encoder/layer_1/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/attention/self/query/bias', [768]),
('bert/encoder/layer_1/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/attention/self/value/bias', [768]),
('bert/encoder/layer_1/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_1/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_1/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_1/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_1/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_1/intermediate/dense/bias', [3072]),
('bert/encoder/layer_1/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_1/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_1/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_1/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_1/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_1/output/LayerNorm/beta', [768]),
('bert/encoder/layer_1/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_1/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_1/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_1/output/dense/bias', [768]),
('bert/encoder/layer_1/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_1/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_1/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_1/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_1/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_10/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_10/attention/output/dense/bias', [768]),
('bert/encoder/layer_10/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_10/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/attention/self/key/bias', [768]),
('bert/encoder/layer_10/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/attention/self/query/bias', [768]),
('bert/encoder/layer_10/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/attention/self/value/bias', [768]),
('bert/encoder/layer_10/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_10/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_10/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_10/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_10/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_10/intermediate/dense/bias', [3072]),
('bert/encoder/layer_10/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_10/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_10/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_10/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_10/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_10/output/LayerNorm/beta', [768]),
('bert/encoder/layer_10/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_10/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_10/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_10/output/dense/bias', [768]),
('bert/encoder/layer_10/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_10/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_10/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_10/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_10/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_11/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_11/attention/output/dense/bias', [768]),
('bert/encoder/layer_11/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_11/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/attention/self/key/bias', [768]),
('bert/encoder/layer_11/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/attention/self/query/bias', [768]),
('bert/encoder/layer_11/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/attention/self/value/bias', [768]),
('bert/encoder/layer_11/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_11/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_11/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_11/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_11/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_11/intermediate/dense/bias', [3072]),
('bert/encoder/layer_11/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_11/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_11/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_11/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_11/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_11/output/LayerNorm/beta', [768]),
('bert/encoder/layer_11/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_11/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_11/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_11/output/dense/bias', [768]),
('bert/encoder/layer_11/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_11/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_11/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_11/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_11/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_2/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_2/attention/output/dense/bias', [768]),
('bert/encoder/layer_2/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_2/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/attention/self/key/bias', [768]),
('bert/encoder/layer_2/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/attention/self/query/bias', [768]),
('bert/encoder/layer_2/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/attention/self/value/bias', [768]),
('bert/encoder/layer_2/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_2/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_2/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_2/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_2/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_2/intermediate/dense/bias', [3072]),
('bert/encoder/layer_2/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_2/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_2/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_2/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_2/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_2/output/LayerNorm/beta', [768]),
('bert/encoder/layer_2/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_2/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_2/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_2/output/dense/bias', [768]),
('bert/encoder/layer_2/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_2/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_2/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_2/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_2/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_3/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_3/attention/output/dense/bias', [768]),
('bert/encoder/layer_3/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_3/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/attention/self/key/bias', [768]),
('bert/encoder/layer_3/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/attention/self/query/bias', [768]),
('bert/encoder/layer_3/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/attention/self/value/bias', [768]),
('bert/encoder/layer_3/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_3/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_3/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_3/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_3/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_3/intermediate/dense/bias', [3072]),
('bert/encoder/layer_3/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_3/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_3/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_3/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_3/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_3/output/LayerNorm/beta', [768]),
('bert/encoder/layer_3/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_3/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_3/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_3/output/dense/bias', [768]),
('bert/encoder/layer_3/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_3/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_3/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_3/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_3/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_4/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_4/attention/output/dense/bias', [768]),
('bert/encoder/layer_4/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_4/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/attention/self/key/bias', [768]),
('bert/encoder/layer_4/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/attention/self/query/bias', [768]),
('bert/encoder/layer_4/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/attention/self/value/bias', [768]),
('bert/encoder/layer_4/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_4/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_4/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_4/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_4/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_4/intermediate/dense/bias', [3072]),
('bert/encoder/layer_4/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_4/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_4/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_4/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_4/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_4/output/LayerNorm/beta', [768]),
('bert/encoder/layer_4/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_4/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_4/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_4/output/dense/bias', [768]),
('bert/encoder/layer_4/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_4/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_4/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_4/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_4/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_5/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_5/attention/output/dense/bias', [768]),
('bert/encoder/layer_5/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_5/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/attention/self/key/bias', [768]),
('bert/encoder/layer_5/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/attention/self/query/bias', [768]),
('bert/encoder/layer_5/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/attention/self/value/bias', [768]),
('bert/encoder/layer_5/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_5/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_5/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_5/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_5/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_5/intermediate/dense/bias', [3072]),
('bert/encoder/layer_5/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_5/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_5/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_5/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_5/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_5/output/LayerNorm/beta', [768]),
('bert/encoder/layer_5/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_5/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_5/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_5/output/dense/bias', [768]),
('bert/encoder/layer_5/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_5/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_5/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_5/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_5/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_6/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_6/attention/output/dense/bias', [768]),
('bert/encoder/layer_6/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_6/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/attention/self/key/bias', [768]),
('bert/encoder/layer_6/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/attention/self/query/bias', [768]),
('bert/encoder/layer_6/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/attention/self/value/bias', [768]),
('bert/encoder/layer_6/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_6/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_6/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_6/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_6/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_6/intermediate/dense/bias', [3072]),
('bert/encoder/layer_6/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_6/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_6/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_6/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_6/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_6/output/LayerNorm/beta', [768]),
('bert/encoder/layer_6/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_6/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_6/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_6/output/dense/bias', [768]),
('bert/encoder/layer_6/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_6/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_6/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_6/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_6/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_7/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_7/attention/output/dense/bias', [768]),
('bert/encoder/layer_7/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_7/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/attention/self/key/bias', [768]),
('bert/encoder/layer_7/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/attention/self/query/bias', [768]),
('bert/encoder/layer_7/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/attention/self/value/bias', [768]),
('bert/encoder/layer_7/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_7/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_7/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_7/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_7/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_7/intermediate/dense/bias', [3072]),
('bert/encoder/layer_7/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_7/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_7/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_7/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_7/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_7/output/LayerNorm/beta', [768]),
('bert/encoder/layer_7/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_7/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_7/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_7/output/dense/bias', [768]),
('bert/encoder/layer_7/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_7/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_7/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_7/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_7/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_8/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_8/attention/output/dense/bias', [768]),
('bert/encoder/layer_8/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_8/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/attention/self/key/bias', [768]),
('bert/encoder/layer_8/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/attention/self/query/bias', [768]),
('bert/encoder/layer_8/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/attention/self/value/bias', [768]),
('bert/encoder/layer_8/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_8/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_8/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_8/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_8/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_8/intermediate/dense/bias', [3072]),
('bert/encoder/layer_8/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_8/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_8/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_8/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_8/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_8/output/LayerNorm/beta', [768]),
('bert/encoder/layer_8/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_8/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_8/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_8/output/dense/bias', [768]),
('bert/encoder/layer_8/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_8/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_8/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_8/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_8/output/dense/kernel/adam_v', [3072, 768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_9/attention/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_9/attention/output/dense/bias', [768]),
('bert/encoder/layer_9/attention/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/output/dense/kernel', [768, 768]),
('bert/encoder/layer_9/attention/output/dense/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/output/dense/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/attention/self/key/bias', [768]),
('bert/encoder/layer_9/attention/self/key/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/self/key/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/self/key/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/key/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/self/key/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/attention/self/query/bias', [768]),
('bert/encoder/layer_9/attention/self/query/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/self/query/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/self/query/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/query/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/self/query/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/attention/self/value/bias', [768]),
('bert/encoder/layer_9/attention/self/value/bias/adam_m', [768]),
('bert/encoder/layer_9/attention/self/value/bias/adam_v', [768]),
('bert/encoder/layer_9/attention/self/value/kernel', [768, 768]),
('bert/encoder/layer_9/attention/self/value/kernel/adam_m', [768, 768]),
('bert/encoder/layer_9/attention/self/value/kernel/adam_v', [768, 768]),
('bert/encoder/layer_9/intermediate/dense/bias', [3072]),
('bert/encoder/layer_9/intermediate/dense/bias/adam_m', [3072]),
('bert/encoder/layer_9/intermediate/dense/bias/adam_v', [3072]),
('bert/encoder/layer_9/intermediate/dense/kernel', [768, 3072]),
('bert/encoder/layer_9/intermediate/dense/kernel/adam_m', [768, 3072]),
('bert/encoder/layer_9/intermediate/dense/kernel/adam_v', [768, 3072]),
('bert/encoder/layer_9/output/LayerNorm/beta', [768]),
('bert/encoder/layer_9/output/LayerNorm/beta/adam_m', [768]),
('bert/encoder/layer_9/output/LayerNorm/beta/adam_v', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma/adam_m', [768]),
('bert/encoder/layer_9/output/LayerNorm/gamma/adam_v', [768]),
('bert/encoder/layer_9/output/dense/bias', [768]),
('bert/encoder/layer_9/output/dense/bias/adam_m', [768]),
('bert/encoder/layer_9/output/dense/bias/adam_v', [768]),
('bert/encoder/layer_9/output/dense/kernel', [3072, 768]),
('bert/encoder/layer_9/output/dense/kernel/adam_m', [3072, 768]),
('bert/encoder/layer_9/output/dense/kernel/adam_v', [3072, 768]),
('bert/pooler/dense/bias', [768]),
('bert/pooler/dense/bias/adam_m', [768]),
('bert/pooler/dense/bias/adam_v', [768]),
('bert/pooler/dense/kernel', [768, 768]),
('bert/pooler/dense/kernel/adam_m', [768, 768]),
('bert/pooler/dense/kernel/adam_v', [768, 768]),
('cls/predictions/output_bias', [60812]),
('cls/predictions/output_bias/adam_m', [60812]),
('cls/predictions/output_bias/adam_v', [60812]),
('cls/predictions/transform/LayerNorm/beta', [768]),
('cls/predictions/transform/LayerNorm/beta/adam_m', [768]),
('cls/predictions/transform/LayerNorm/beta/adam_v', [768]),
('cls/predictions/transform/LayerNorm/gamma', [768]),
('cls/predictions/transform/LayerNorm/gamma/adam_m', [768]),
('cls/predictions/transform/LayerNorm/gamma/adam_v', [768]),
('cls/predictions/transform/dense/bias', [768]),
('cls/predictions/transform/dense/bias/adam_m', [768]),
('cls/predictions/transform/dense/bias/adam_v', [768]),
('cls/predictions/transform/dense/kernel', [768, 768]),
('cls/predictions/transform/dense/kernel/adam_m', [768, 768]),
('cls/predictions/transform/dense/kernel/adam_v', [768, 768]),
('cls/seq_relationship/output_bias', [2]),
('cls/seq_relationship/output_bias/adam_m', [2]),
('cls/seq_relationship/output_bias/adam_v', [2]),
('cls/seq_relationship/output_weights', [2, 768]),
('cls/seq_relationship/output_weights/adam_m', [2, 768]),
('cls/seq_relationship/output_weights/adam_v', [2, 768]),
('global_step', [])]
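
For anyone who wants to compare the two lists programmatically, here is a minimal sketch that separates the model weights from the `adam_m` / `adam_v` entries. It assumes the list is a plain Python list of `(name, shape)` tuples, as printed above (the format that `tf.train.list_variables` returns for a checkpoint). The helper names `split_adam_slots` and `count_params` are made up for illustration and are not part of the BERT code.

```python
from math import prod

# Suffixes taken from the variable names in the list above.
ADAM_SUFFIXES = ("/adam_m", "/adam_v")


def split_adam_slots(variables):
    """Split (name, shape) pairs into model weights and Adam slot variables."""
    weights, slots = [], []
    for name, shape in variables:
        if name.endswith(ADAM_SUFFIXES):
            slots.append((name, shape))
        else:
            weights.append((name, shape))
    return weights, slots


def count_params(variables):
    """Total number of scalar values; a scalar with shape [] counts as 1."""
    return sum(prod(shape) for _, shape in variables)


# A few entries copied from the list above.
sample = [
    ('bert/pooler/dense/bias', [768]),
    ('bert/pooler/dense/bias/adam_m', [768]),
    ('bert/pooler/dense/bias/adam_v', [768]),
    ('bert/pooler/dense/kernel', [768, 768]),
    ('bert/pooler/dense/kernel/adam_m', [768, 768]),
    ('bert/pooler/dense/kernel/adam_v', [768, 768]),
    ('global_step', []),
]

weights, slots = split_adam_slots(sample)
print(len(weights), "weight variables,", count_params(weights), "values")
print(len(slots), "Adam slot variables,", count_params(slots), "values")
```

Every trainable variable in the list has exactly one `adam_m` and one `adam_v` entry of the same shape, so the slot variables hold about twice as many values as the weights themselves.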