This is my understanding of how the training process works inside `nn.MultiheadAttention`. Let's ignore positional encodings and focus only on `Q`.
batch = 1, num_heads = 2, seq_len = 5, problem_dim = 4

word_embedding: shape [5, 4]
q_weight: shape [4, 4]
Q = word_embedding @ q_weight
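The shapes above can be sketched with concrete (randomly initialized, hypothetical) tensors:

```python
import torch

torch.manual_seed(0)
seq_len, problem_dim = 5, 4

# Hypothetical tensors matching the shapes described above
word_embedding = torch.randn(seq_len, problem_dim)  # [5, 4]
q_weight = torch.randn(problem_dim, problem_dim)    # [4, 4]

Q = word_embedding @ q_weight                       # [5, 4]
print(Q.shape)
```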
Assume:

import torch
import torch.nn as nn

class MultiHeadAttentionModel(nn.Module):
    def __init__(self, problem_dim, num_heads):
        super().__init__()
        self.multihead_attn = nn.MultiheadAttention(
            embed_dim=problem_dim, num_heads=num_heads, batch_first=True
        )

    def forward(self, query, key, value):
        attn_output, attn_output_weights = self.multihead_attn(query, key, value)
        return attn_output, attn_output_weights

model = MultiHeadAttentionModel(problem_dim=problem_dim, num_heads=num_heads)
model.eval()  # <---------------- forward pass
attn_output, attn_output_weights = model(Q, K, V)
attn_output.sum().backward()  # <--------------- training (backward pass); backward() needs a scalar
final_linear_weight = model.multihead_attn.out_proj.weight
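Alongside `out_proj.weight`, the module's full set of learnable parameters can be listed with `named_parameters()`; a minimal sketch:

```python
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=4, num_heads=2, batch_first=True)

# Print every learnable parameter the module registers
for name, p in mha.named_parameters():
    print(name, tuple(p.shape))
# in_proj_weight  (12, 4)  -> stacked W_q, W_k, W_v projections
# in_proj_bias    (12,)
# out_proj.weight (4, 4)
# out_proj.bias   (4,)
```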
output = softmax(Q.dot(K_trans)).dot(V)
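As a sanity check on that formula, here is a minimal sketch of single-head scaled dot-product attention with hypothetical `Q`, `K`, `V` (note PyTorch additionally scales the scores by `1/sqrt(d_k)` and splits the embedding across heads, which the line above omits):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 5, 4

# Hypothetical query/key/value tensors, just to show the shapes involved
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / d_k ** 0.5    # [5, 5]
attn = F.softmax(scores, dim=-1)  # each row sums to 1
output = attn @ V                 # [5, 4]
```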
My question is: is `final_linear_weight` the only weight that is learned during the training phase?