When using the torch.nn.modules.transformer.Transformer module/object, the first layer is encoder.layers.0.self_attn, which is of class MultiheadAttention, i.e.
from torch.nn.modules.transformer import Transformer
bumblebee = Transformer()
bumblebee.parameters
[out]:
<bound method Module.parameters of Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): Linear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
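For reference, the 512 and 2048 in those shapes come from the constructor defaults; a sketch that spells out the documented defaults explicitly (same model as Transformer() above) would be:

from torch.nn.modules.transformer import Transformer

# Spelling out the documented defaults that produce the 512 / 2048 sizes above.
bumblebee = Transformer(d_model=512, nhead=8,
                        num_encoder_layers=6, num_decoder_layers=6,
                        dim_feedforward=2048, dropout=0.1)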
If we print out the shapes of the layer parameters, we see:
for name in bumblebee.encoder.state_dict():
    print(name, bumblebee.encoder.state_dict()[name].shape)
[out]:
layers.0.self_attn.in_proj_weight torch.Size([1536, 512])
layers.0.self_attn.in_proj_bias torch.Size([1536])
layers.0.self_attn.out_proj.weight torch.Size([512, 512])
layers.0.self_attn.out_proj.bias torch.Size([512])
layers.0.linear1.weight torch.Size([2048, 512])
layers.0.linear1.bias torch.Size([2048])
layers.0.linear2.weight torch.Size([512, 2048])
layers.0.linear2.bias torch.Size([512])
layers.0.norm1.weight torch.Size([512])
layers.0.norm1.bias torch.Size([512])
layers.0.norm2.weight torch.Size([512])
layers.0.norm2.bias torch.Size([512])
It seems that 1536 is 512 * 3, so it looks like the layers.0.self_attn.in_proj_weight parameter stores all 3 of the Q, K and V projection matrices stacked into a single matrix.
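As a quick sanity check of that reading, the packed weight can be split into three equally sized blocks along dim 0 (a sketch; it only compares shapes, and the q/k/v ordering of the blocks is my assumption here):

import torch
from torch.nn.modules.transformer import Transformer

bumblebee = Transformer()
attn = bumblebee.encoder.layers[0].self_attn

# in_proj_weight has shape (3 * embed_dim, embed_dim); chunk it into three
# (embed_dim, embed_dim) blocks along dim 0, assumed to be q, k, v in that order.
w_q, w_k, w_v = attn.in_proj_weight.chunk(3, dim=0)
b_q, b_k, b_v = attn.in_proj_bias.chunk(3, dim=0)

print(w_q.shape, w_k.shape, w_v.shape)  # each torch.Size([512, 512])
print(b_q.shape, b_k.shape, b_v.shape)  # each torch.Size([512])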
From https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/activation.py#L649
class MultiheadAttention(Module):
    def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None):
        super(MultiheadAttention, self).__init__()
        self.embed_dim = embed_dim
        self.kdim = kdim if kdim is not None else embed_dim
        self.vdim = vdim if vdim is not None else embed_dim
        self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim

        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"

        if self._qkv_same_embed_dim is False:
            self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
            self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
            self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))
        else:
            self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))
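Conversely, per the branch above, passing kdim/vdim different from embed_dim should produce separate q_proj_weight / k_proj_weight / v_proj_weight parameters instead of one packed in_proj_weight. A small sketch to illustrate (the dims 512/256 and num_heads=8 are arbitrary choices):

import torch.nn as nn

# Same embedding dim for q/k/v -> one packed (3 * 512, 512) in_proj_weight.
same = nn.MultiheadAttention(embed_dim=512, num_heads=8)
# Different key/value dims -> separate per-projection weight matrices.
diff = nn.MultiheadAttention(embed_dim=512, num_heads=8, kdim=256, vdim=256)

for name, p in same.named_parameters():
    print("same:", name, tuple(p.shape))
for name, p in diff.named_parameters():
    print("diff:", name, tuple(p.shape))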
The note in the docstring of MultiheadAttention says:
Note: if kdim and vdim are None, they will be set to embed_dim such that query, key, and value have the same number of features.
Is that correct, i.e. does in_proj_weight really store the Q, K and V projection weights stacked together in a single matrix?