What are the gradients computed when doing backpropagation (training) in nn.MultiheadAttention?
  • Asked: 2024-02-11 16:39:19
  • Tags: pytorch

Here is my understanding of how the training process works inside nn.MultiheadAttention. Let's ignore the positional encoding and focus only on Q.

batch = 1, num_heads = 2, seq_len = 5, problem_dim = 4.

word_embedding = [5, 4], q_weight = [4, 4], Q = word_embedding * q_weight
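For concreteness, a minimal sketch of these shapes in PyTorch (the k_weight / v_weight equivalents are hypothetical here, only so that K and V exist for the call further down):

    import torch

    seq_len, problem_dim, num_heads = 5, 4, 2
    word_embedding = torch.randn(seq_len, problem_dim)            # [5, 4]
    q_weight = torch.randn(problem_dim, problem_dim)              # [4, 4]
    Q = word_embedding @ q_weight                                 # [5, 4]
    # K and V are built the same way with their own (hypothetical) weight matrices
    K = word_embedding @ torch.randn(problem_dim, problem_dim)    # [5, 4]
    V = word_embedding @ torch.randn(problem_dim, problem_dim)    # [5, 4]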

Assume,

    import torch
    import torch.nn as nn

    class MultiHeadAttentionModel(nn.Module):
        def __init__(self, problem_dim, num_heads):
            super().__init__()
            self.multihead_attn = nn.MultiheadAttention(
                embed_dim=problem_dim, num_heads=num_heads, batch_first=True
            )

        def forward(self, query, key, value):
            attn_output, attn_output_weights = self.multihead_attn(query, key, value)
            return attn_output, attn_output_weights

    model = MultiHeadAttentionModel(problem_dim=problem_dim, num_heads=num_heads)
    model.eval()                                        # <---- forward pass
    attn_output, attn_output_weights = model(Q, K, V)
    attn_output.sum().backward()                        # <---- training (backward pass); backward() needs a scalar, so reduce first

final_linear_weight = model.multihead_attn.out_proj.weight

output = softmax(Q.dot(K_trans)).dot(V)  # ignore the scale for now
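As a sketch of that formula with the tensors from above (including the 1/sqrt(d) scale that the line above leaves out), the attention output can be computed by hand:

    import torch.nn.functional as F

    scale = problem_dim ** -0.5                                           # 1/sqrt(d)
    attn_weights = F.softmax(Q @ K.transpose(-2, -1) * scale, dim=-1)     # [5, 5]
    output = attn_weights @ V                                             # [5, 4]

Note that this mirrors the single-head formula in the question; nn.MultiheadAttention additionally splits Q, K, V across num_heads and applies its own input and output projections internally.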

My question is: is final_linear_weight the only weight that is learned during the training phase?
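One way to check, assuming the setup above: after the backward pass, iterate over the attention module's parameters and see which ones have a gradient populated (the names below are what a default nn.MultiheadAttention registers).

    # after attn_output.sum().backward()
    for name, param in model.multihead_attn.named_parameters():
        print(name, tuple(param.shape), param.grad is not None)
    # expected names: in_proj_weight, in_proj_bias, out_proj.weight, out_proj.bias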

Answers



