PyTorch Implementation of Seq2Seq (Attention) (Super Detailed)

    Tech  2022-07-16

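    This article shows how to reproduce Seq2Seq (with Attention) in PyTorch for a simple machine translation task. Please first read the paper Neural Machine Translation by Jointly Learning to Align and Translate, then spend 15 minutes on my two articles "Seq2Seq and the Attention Mechanism" and "Attention Illustrated". Coming back to this article afterwards, everything should click with far less effort.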

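    Data preprocessing

    The preprocessing code is mostly a series of API calls. I don't want readers to be distracted by these less important parts, so I won't paste that code here and will only describe it briefly.

    As the figure below shows, this article uses a German→English dataset. The input is German, and every input sentence starts and ends with a special token. The output is English, and every output sentence also starts and ends with a special token.

    Sentences in both English and German vary in length, so within each batch I pad the sentences with <PAD> until they have the same length. In other words, all sentences inside one batch share the same length, while sentences in different batches may not. The tensors in the figure below have shape [seq_len, batch_size].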
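    The per-batch padding described above can be sketched in plain Python. This is only an illustration under assumptions: the helper name pad_batch, the <SOS>/<EOS> tokens and the two toy sentences are made up here, and the article's real pipeline relies on library calls that are not shown.

```python
# Hypothetical helper (not from the article): pad one batch to a common length.
PAD = "<PAD>"

def pad_batch(sentences, pad_token=PAD):
    """Pad every sentence in a batch to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    padded = [s + [pad_token] * (max_len - len(s)) for s in sentences]
    # Transpose from [batch_size, seq_len] to [seq_len, batch_size]
    return [list(step) for step in zip(*padded)]

# Toy batch; the special tokens are assumed names
batch = [
    ["<SOS>", "guten", "morgen", "<EOS>"],
    ["<SOS>", "hallo", "<EOS>"],
]
time_major = pad_batch(batch)
print(len(time_major), len(time_major[0]))  # 4 2
```

    A different batch would be padded to its own longest sentence, which is why seq_len can change from batch to batch.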

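    Print an arbitrary sample to see how the data is packaged

    During preprocessing, the source and target sentences need separate vocabularies: one vocabulary is built for German and another one for English.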
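    A minimal sketch of building the two vocabularies independently, assuming whitespace-tokenized toy sentences. The function build_vocab, the list of special tokens and the sorting rule are hypothetical choices for illustration, not the article's preprocessing code.

```python
from collections import Counter

def build_vocab(sentences, specials=("<PAD>", "<SOS>", "<EOS>", "<UNK>")):
    """Map each token of ONE language to an integer index."""
    counts = Counter(tok for s in sentences for tok in s)
    itos = list(specials) + sorted(counts)          # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}   # token -> index
    return stoi, itos

german = [["guten", "morgen"], ["hallo", "welt"]]
english = [["good", "morning"], ["hello", "world"]]

# One vocabulary per language, never a shared one
src_stoi, src_itos = build_vocab(german)
trg_stoi, trg_itos = build_vocab(english)
print(src_stoi["hallo"], trg_stoi["hello"])  # 5 5
```

    The same index means different words in the two vocabularies, which is why the Encoder and Decoder each get their own embedding layer sized by their own vocabulary.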

    Encoder

    For the Encoder I use a single-layer bidirectional GRU.

    The hidden state that a bidirectional GRU outputs at each time step is the concatenation of two vectors, for example $h_1=[\overrightarrow{h_1};\overleftarrow{h_T}]$, $h_2=[\overrightarrow{h_2};\overleftarrow{h}_{T-1}]$, and so on. The last-layer hidden states of all time steps make up the GRU's output

    $output=\{h_1,h_2,...h_T\}$

    Suppose this is an m-layer GRU. Then the hidden states of all layers at the last time step make up the GRU's final hidden states $hidden=\{h^1_T,h^2_T,...,h^m_T\}$, where $h^i_T=[\overrightarrow{h^i_T};\overleftarrow{h^i_0}]$, so $hidden=\{[\overrightarrow{h^1_T};\overleftarrow{h^1_0}],[\overrightarrow{h^2_T};\overleftarrow{h^2_0}],...,[\overrightarrow{h^m_T};\overleftarrow{h^m_0}]\}$. According to the paper (or my "Attention Illustrated" article), what we need is the last layer of hidden, both the forward and the backward part. We can therefore take the last layer's hidden states with hidden[-2,:,:] and hidden[-1,:,:] and concatenate them; call the result $s_0$.

    One last detail: $s_0$ has shape [batch_size, enc_hid_dim*2]. Even without an Attention mechanism, using $s_0$ directly as the Decoder's initial hidden state would be wrong because the dimensions do not match. $s_0$ has to become [batch_size, src_len, dec_hid_dim]. Leaving the src_len in the middle aside for now, the first step is to get to [batch_size, dec_hid_dim], so $s_0$ is passed through a fully connected layer that converts the dimension.
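    The indexing claim above can be checked numerically. The sketch below is not the article's code: it uses two layers instead of one (so that picking the last layer actually matters), random inputs, and small made-up sizes. It relies only on the documented layout of nn.GRU's return values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
src_len, batch_size, emb_dim, enc_hid_dim, dec_hid_dim = 5, 3, 8, 6, 4

# Two layers here (the article uses one) so the [-2] / [-1] indexing is meaningful
rnn = nn.GRU(emb_dim, enc_hid_dim, num_layers=2, bidirectional=True)
fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

embedded = torch.randn(src_len, batch_size, emb_dim)
# enc_output = [src_len, batch_size, enc_hid_dim * 2]
# enc_hidden = [num_layers * 2, batch_size, enc_hid_dim]
enc_output, enc_hidden = rnn(embedded)

# Last-layer forward state == forward half of the output at the LAST time step
fwd_matches = torch.allclose(enc_hidden[-2], enc_output[-1, :, :enc_hid_dim])
# Last-layer backward state == backward half of the output at the FIRST time step
bwd_matches = torch.allclose(enc_hidden[-1], enc_output[0, :, enc_hid_dim:])

s0 = torch.cat((enc_hidden[-2], enc_hidden[-1]), dim=1)  # [batch_size, enc_hid_dim * 2]
s = torch.tanh(fc(s0))                                   # [batch_size, dec_hid_dim]
print(fwd_matches, bwd_matches, tuple(s0.shape), tuple(s.shape))
```

    The backward direction finishes at the first source position, which is why its final state lines up with the output at index 0 rather than index -1.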

    That is all there is to the Encoder, so here is the code. My code style is: comment above, code below.

        import torch
        import torch.nn as nn

        class Encoder(nn.Module):
            def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
                super().__init__()
                self.embedding = nn.Embedding(input_dim, emb_dim)
                self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
                self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
                self.dropout = nn.Dropout(dropout)

            def forward(self, src):
                '''
                src = [src_len, batch_size]
                '''
                src = src.transpose(0, 1) # src = [batch_size, src_len]
                embedded = self.dropout(self.embedding(src)).transpose(0, 1) # embedded = [src_len, batch_size, emb_dim]

                # enc_output = [src_len, batch_size, hid_dim * num_directions]
                # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
                enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not given, it defaults to zeros

                # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
                # enc_output always comes from the last layer

                # enc_hidden[-2, :, :] is the last state of the forward RNN
                # enc_hidden[-1, :, :] is the last state of the backward RNN

                # the initial decoder hidden state is the final hidden state of the forward
                # and backward encoder RNNs fed through a linear layer
                # s = [batch_size, dec_hid_dim]
                s = torch.tanh(self.fc(torch.cat((enc_hidden[-2,:,:], enc_hidden[-1,:,:]), dim = 1)))

                return enc_output, s

    Attention

    Attention is nothing more than three formulas

    $$E_t=\tanh(attn(s_{t-1},H))\\ \tilde{a_t}=vE_t\\ a_t=softmax(\tilde{a_t})$$

    where $s_{t-1}$ is the variable s (at the first step the s returned by the Encoder, afterwards the Decoder's updated hidden state), $H$ is the Encoder's enc_output, and $attn()$ is simply a fully connected layer.

    We can work backwards from the last formula to find each variable's shape, or the constraints on it.

    First, $a_t$ must have shape [batch_size, src_len]; there is no doubt about that. So $\tilde{a_t}$ should also be [batch_size, src_len], or it may be three-dimensional with one dimension of size 1 that squeeze() can remove. Here we assume $\tilde{a_t}$ has shape [batch_size, src_len, 1]; I will explain the reason for this assumption in a moment.

    Working further back, $v$ must have shape [?, 1], where ? means I do not yet know what the value should be, and $E_t$ must have shape [batch_size, src_len, ?].

    We already know that $H$ has shape [batch_size, src_len, enc_hid_dim*2] and that $s_{t-1}$ currently has shape [batch_size, dec_hid_dim]. These two variables are concatenated and fed to the fully connected layer, so $s_{t-1}$ first has to become [batch_size, src_len, dec_hid_dim]. After concatenation the shape is [batch_size, src_len, enc_hid_dim*2+dec_hid_dim], which gives us the input and output sizes of $attn()$

        attn = nn.Linear(enc_hid_dim*2+dec_hid_dim, ?)

    At this point every dimension has been derived except ?. Looking back, there does not seem to be any constraint on ?, so it can be set to any value (in the code I set ? to dec_hid_dim).
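    To make the "? can be anything" point concrete, here is a shape walk-through with ? set to 7, a value deliberately different from dec_hid_dim. All sizes and the name attn_dim are assumptions chosen for the demonstration; the tensors are random stand-ins for the real s and enc_output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, src_len, enc_hid_dim, dec_hid_dim = 2, 5, 6, 4
attn_dim = 7  # the "?" from the text, chosen arbitrarily

attn = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, attn_dim, bias=False)
v = nn.Linear(attn_dim, 1, bias=False)

s_prev = torch.randn(batch_size, dec_hid_dim)          # s_{t-1}
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)  # encoder outputs, batch first

s_rep = s_prev.unsqueeze(1).repeat(1, src_len, 1)      # [batch_size, src_len, dec_hid_dim]
E = torch.tanh(attn(torch.cat((s_rep, H), dim=2)))     # [batch_size, src_len, attn_dim]
a_tilde = v(E)                                         # [batch_size, src_len, 1]
a = F.softmax(a_tilde.squeeze(2), dim=1)               # [batch_size, src_len]
print(tuple(E.shape), tuple(a_tilde.shape), tuple(a.shape))
```

    Whatever attn_dim is, v collapses it to 1, so the weights always come out as [batch_size, src_len] and each row sums to 1.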

    That is all there is to Attention, so here is the code

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Attention(nn.Module):
            def __init__(self, enc_hid_dim, dec_hid_dim):
                super().__init__()
                self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False)
                self.v = nn.Linear(dec_hid_dim, 1, bias = False)

            def forward(self, s, enc_output):

                # s = [batch_size, dec_hid_dim]
                # enc_output = [src_len, batch_size, enc_hid_dim * 2]

                batch_size = enc_output.shape[1]
                src_len = enc_output.shape[0]

                # repeat decoder hidden state src_len times
                # s = [batch_size, src_len, dec_hid_dim]
                # enc_output = [batch_size, src_len, enc_hid_dim * 2]
                s = s.unsqueeze(1).repeat(1, src_len, 1)
                enc_output = enc_output.transpose(0, 1)

                # energy = [batch_size, src_len, dec_hid_dim]
                energy = torch.tanh(self.attn(torch.cat((s, enc_output), dim = 2)))

                # attention = [batch_size, src_len]
                attention = self.v(energy).squeeze(2)

                return F.softmax(attention, dim=1)

    Seq2Seq(with Attention)

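    I will change the order here: Seq2Seq first, then the Decoder.

    A traditional Seq2Seq feeds all the words of a sentence into the Decoder in one go during training. With Attention, I need to control the input manually, one word at a time, because some extra computation happens for every word fed to the Decoder. That is why the code uses a for loop that runs trg_len-1 times (the leading <SOS> is fed in by hand, so there is one iteration fewer).

    During training I also use a mechanism called Teacher Forcing, which keeps training fast while making the model more robust. If you are not familiar with Teacher Forcing, see my article about it.

    What needs to happen inside the for loop? First the variables are passed to the Decoder. Because Attention is computed inside the Decoder, I need to pass in three variables: dec_input, s and enc_output. The Decoder returns dec_output and the new s. After that, Teacher Forcing is applied to dec_output with a given probability.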
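    The Teacher Forcing decision for a single step can be isolated as a tiny function. This is a sketch with hypothetical names (next_decoder_input, the toy tokens "cat" and "dog"); it only mirrors the coin flip against teacher_forcing_ratio and leaves out the model entirely.

```python
import random

def next_decoder_input(gold_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick what the decoder sees at the next step."""
    use_gold = rng.random() < teacher_forcing_ratio
    return gold_token if use_gold else predicted_token

rng = random.Random(0)
always_gold = next_decoder_input("cat", "dog", 1.0, rng)  # ratio 1.0: always the ground truth
never_gold = next_decoder_input("cat", "dog", 0.0, rng)   # ratio 0.0: always the model's guess
picks = [next_decoder_input("cat", "dog", 0.5, rng) for _ in range(1000)]
gold_fraction = picks.count("cat") / len(picks)
print(always_gold, never_gold, gold_fraction)
```

    With a ratio of 0.5 roughly half of the steps see the ground-truth token. A ratio of 0 is the natural choice at evaluation time, when no target sentence is available.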

    That is all there is to Seq2Seq, so here is the code

        import random
        import torch
        import torch.nn as nn

        class Seq2Seq(nn.Module):
            def __init__(self, encoder, decoder, device):
                super().__init__()
                self.encoder = encoder
                self.decoder = decoder
                self.device = device

            def forward(self, src, trg, teacher_forcing_ratio = 0.5):

                # src = [src_len, batch_size]
                # trg = [trg_len, batch_size]
                # teacher_forcing_ratio is the probability of using teacher forcing

                batch_size = src.shape[1]
                trg_len = trg.shape[0]
                trg_vocab_size = self.decoder.output_dim

                # tensor to store decoder outputs
                outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

                # enc_output is all hidden states of the input sequence, backward and forward
                # s is the final forward and backward hidden states, passed through a linear layer
                enc_output, s = self.encoder(src)

                # first input to the decoder is the <sos> tokens
                dec_input = trg[0,:]

                for t in range(1, trg_len):

                    # insert dec_input token embedding, previous hidden state and all encoder hidden states
                    # receive output tensor (predictions) and new hidden state
                    dec_output, s = self.decoder(dec_input, s, enc_output)

                    # place predictions in a tensor holding predictions for each token
                    outputs[t] = dec_output

                    # decide if we are going to use teacher forcing or not
                    teacher_force = random.random() < teacher_forcing_ratio

                    # get the highest predicted token from our predictions
                    top1 = dec_output.argmax(1)

                    # if teacher forcing, use actual next token as next input
                    # if not, use predicted token
                    dec_input = trg[t] if teacher_force else top1

                return outputs

    Decoder

    For the Decoder I use a single-layer unidirectional GRU.

    The Decoder also boils down to three formulas

    $$c=a_tH\\ s_t=GRU(emb(y_t), c, s_{t-1})\\ \hat{y_t}=f(emb(y_t), c, s_t)$$

    $H$ is the Encoder's enc_output, $emb(y_t)$ is dec_input after the word embedding, and $f()$ is really just a dimension conversion, because the output needs to be of size TRG_VOCAB_SIZE. One detail: a GRU takes only two arguments, an input and a hidden state, but the formula above has three variables. So we pick one as the hidden state and "merge" the other two into the input.

    Let's work forwards from the first formula to find each variable's shape.

    The Decoder starts by calling Attention once, which gives the weights $a_t$ with shape [batch_size, src_len]. $H$ has shape [src_len, batch_size, enc_hid_dim*2]. The two are multiplied while keeping the batch_size dimension, so first add a dimension to $a_t$, then swap the dimensions of $H$, and finally do a batched matrix multiplication (one matrix product per sample in the batch)

        a = a.unsqueeze(1)     # [batch_size, 1, src_len]
        H = H.transpose(0, 1)  # [batch_size, src_len, enc_hid_dim*2]
        c = torch.bmm(a, H)    # [batch_size, 1, enc_hid_dim*2]
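    As a sanity check on what bmm computes here, the snippet below compares it with an explicit weighted sum of the encoder states. The sizes are arbitrary and the tensors are random stand-ins, so this is an illustration rather than part of the model.

```python
import torch

torch.manual_seed(0)
src_len, batch_size, enc_hid_dim = 5, 2, 3

a = torch.softmax(torch.randn(batch_size, src_len), dim=1)  # attention weights
H = torch.randn(src_len, batch_size, enc_hid_dim * 2)       # encoder outputs

# Batched product, as in the three lines above
c = torch.bmm(a.unsqueeze(1), H.transpose(0, 1))            # [batch_size, 1, enc_hid_dim*2]

# The same result written as a weighted sum over source positions
c_manual = torch.zeros(batch_size, enc_hid_dim * 2)
for i in range(src_len):
    c_manual += a[:, i].unsqueeze(1) * H[i]

same = torch.allclose(c.squeeze(1), c_manual, atol=1e-6)
print(tuple(c.shape), same)
```

    So the context vector c is just the attention-weighted average of the encoder hidden states, computed for every sample in the batch at once.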

    As mentioned above, the GRU does not take three variables, so $emb(y_t)$ and $c$ have to be merged. $y_t$ is the dec_input variable of the Seq2Seq class and has shape [batch_size], so first add a dimension to $y_t$ and then pass it through the word embedding, which gives [batch_size, 1, emb_dim]. Finally, concat $c$ and $emb(y_t)$

        y = y.unsqueeze(1)                        # [batch_size, 1]
        emb_y = self.emb(y)                       # [batch_size, 1, emb_dim]
        rnn_input = torch.cat((emb_y, c), dim=2)  # [batch_size, 1, emb_dim+enc_hid_dim*2]

    $s_{t-1}$ has shape [batch_size, dec_hid_dim], so it needs an extra leading dimension first, because nn.GRU expects its initial hidden state as [n_layers * num_directions, batch_size, hid_dim]

        rnn_input = rnn_input.transpose(0, 1)  # [1, batch_size, emb_dim+enc_hid_dim*2]
        s = s.unsqueeze(0)                     # [1, batch_size, dec_hid_dim]

        # dec_output = [1, batch_size, dec_hid_dim]
        # dec_hidden = [1, batch_size, dec_hid_dim] = s (the new s, not the previous one)
        dec_output, dec_hidden = self.rnn(rnn_input, s)

    The last formula concatenates all three variables and passes them through a fully connected layer to get the final prediction. Let's look at the three shapes first: $emb(y_t)$ is [batch_size, 1, emb_dim], $c$ is [batch_size, 1, enc_hid_dim*2], and $s_t$ is [1, batch_size, dec_hid_dim], so they can all be concatenated like this

        emb_y = emb_y.squeeze(1)   # [batch_size, emb_dim]
        c = c.squeeze(1)           # [batch_size, enc_hid_dim*2]
        s = dec_hidden.squeeze(0)  # [batch_size, dec_hid_dim]

        fc_input = torch.cat((emb_y, c, s), dim=1)  # [batch_size, emb_dim+enc_hid_dim*2+dec_hid_dim]

    Those are the details of the Decoder, so here is the code (the snippets above are only illustrative, and their variable names may differ from the code below)

        import torch
        import torch.nn as nn

        class Decoder(nn.Module):
            def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
                super().__init__()
                self.output_dim = output_dim
                self.attention = attention
                self.embedding = nn.Embedding(output_dim, emb_dim)
                self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
                self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
                self.dropout = nn.Dropout(dropout)

            def forward(self, dec_input, s, enc_output):

                # dec_input = [batch_size]
                # s = [batch_size, dec_hid_dim]
                # enc_output = [src_len, batch_size, enc_hid_dim * 2]

                dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]

                embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]

                # a = [batch_size, 1, src_len]
                a = self.attention(s, enc_output).unsqueeze(1)

                # enc_output = [batch_size, src_len, enc_hid_dim * 2]
                enc_output = enc_output.transpose(0, 1)

                # c = [1, batch_size, enc_hid_dim * 2]
                c = torch.bmm(a, enc_output).transpose(0, 1)

                # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
                rnn_input = torch.cat((embedded, c), dim = 2)

                # dec_output = [src_len(=1), batch_size, dec_hid_dim]
                # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
                dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0))

                # embedded = [batch_size, emb_dim]
                # dec_output = [batch_size, dec_hid_dim]
                # c = [batch_size, enc_hid_dim * 2]
                embedded = embedded.squeeze(0)
                dec_output = dec_output.squeeze(0)
                c = c.squeeze(0)

                # pred = [batch_size, output_dim]
                pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))

                return pred, dec_hidden.squeeze(0)

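    Defining the model

        import torch
        import torch.nn as nn
        import torch.optim as optim

        # SRC and TRG are the vocabulary fields built during preprocessing (not shown)
        # assumed definition: the original snippet uses `device` without showing it
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        INPUT_DIM = len(SRC.vocab)
        OUTPUT_DIM = len(TRG.vocab)
        ENC_EMB_DIM = 256
        DEC_EMB_DIM = 256
        ENC_HID_DIM = 512
        DEC_HID_DIM = 512
        ENC_DROPOUT = 0.5
        DEC_DROPOUT = 0.5

        attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
        enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
        dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

        model = Seq2Seq(enc, dec, device).to(device)
        TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
        criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
        optimizer = optim.Adam(model.parameters(), lr=1e-3)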

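    The argument passed to CrossEntropyLoss() in the second-to-last line is rarely seen: ignore_index=TRG_PAD_IDX. It tells the loss to ignore one class and not compute any loss for it. Note that what is ignored is the class in the ground truth, not in the prediction. In the code below every true label is 1, every prediction picks class 2 (indices start at 0), and the loss function is set to ignore class index 1. Every sample is therefore skipped. When I ran it, this printed 0; newer PyTorch releases may print nan instead, since the mean is then taken over zero remaining samples.

        import torch
        import torch.nn as nn

        label = torch.tensor([1, 1, 1])
        pred = torch.tensor([[0.1, 0.2, 0.6], [0.2, 0.1, 0.8], [0.1, 0.1, 0.9]])
        loss_fn = nn.CrossEntropyLoss(ignore_index=1)
        print(loss_fn(pred, label).item()) # 0 in the author's run; may be nan on newer PyTorch

    If the loss function is set to ignore class index 2 instead, the loss is no longer 0, because no true label equals 2

        import torch
        import torch.nn as nn

        label = torch.tensor([1, 1, 1])
        pred = torch.tensor([[0.1, 0.2, 0.6], [0.2, 0.1, 0.8], [0.1, 0.1, 0.9]])
        loss_fn = nn.CrossEntropyLoss(ignore_index=2)
        print(loss_fn(pred, label).item()) # 1.359844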
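    To see where these numbers come from, the same computation can be written out by hand in plain Python. This is a sketch of the arithmetic only; the function name per_sample_losses is made up here and this is not how PyTorch implements the loss internally.

```python
import math

def per_sample_losses(logits, labels, ignore_index=None):
    """Cross entropy of each row whose TRUE label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # skipped because of the true label, whatever was predicted
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        losses.append(log_sum_exp - row[label])
    return losses

pred = [[0.1, 0.2, 0.6], [0.2, 0.1, 0.8], [0.1, 0.1, 0.9]]
label = [1, 1, 1]

kept = per_sample_losses(pred, label, ignore_index=2)     # nothing is skipped
skipped = per_sample_losses(pred, label, ignore_index=1)  # every sample is skipped
mean_kept = sum(kept) / len(kept)
print(len(kept), len(skipped), mean_kept)
```

    Ignoring index 2 leaves all three samples in place and their mean reproduces the roughly 1.3598 printed above. Ignoring index 1 leaves nothing to average, which is exactly the situation padding positions create in a real batch.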

    Finally, here is the link to the complete code (access may require a VPN). GitHub project: nlp-tutorial
