Installing PyTorch
pip3 install torch torchvision torchaudio
Alternatively, in a Conda environment you can use the following command:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
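To confirm the installation succeeded, a quick optional sanity check is to import torch and print the version and CUDA availability (the exact version string depends on your environment):

# Optional sanity check: prints the installed PyTorch version and whether a
# CUDA-capable GPU is visible (False is expected on CPU-only machines).
import torch

print(torch.__version__)
print(torch.cuda.is_available())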
Steps to Build a PyTorch-Based Transformer Model
Step 1: Import the required packages and modules
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy
Step 2: Define the basic building blocks: multi-head attention, position-wise feed-forward network, and positional encoding
Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        '''The initialization checks if d_model is divisible by num_heads, and then
        defines the transformation weights for query, key, value, and output.'''
        super(MultiHeadAttention, self).__init__()
        # Ensure that the model dimension (d_model) is divisible by the number of heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        # Initialize dimensions
        self.d_model = d_model            # Model's dimension
        self.num_heads = num_heads        # Number of attention heads
        self.d_k = d_model // num_heads   # Dimension of each head's key, query, and value

        # Linear layers for transforming inputs
        self.W_q = nn.Linear(d_model, d_model)  # Query transformation
        self.W_k = nn.Linear(d_model, d_model)  # Key transformation
        self.W_v = nn.Linear(d_model, d_model)  # Value transformation
        self.W_o = nn.Linear(d_model, d_model)  # Output transformation

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores: the dot product of queries (Q) and keys (K),
        # scaled by the square root of the key dimension (d_k)
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        # Apply mask if provided (useful for preventing attention to certain parts like padding)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

        # Softmax converts the scores into attention probabilities that sum to 1
        attn_probs = torch.softmax(attn_scores, dim=-1)

        # Multiply the attention weights by the values (V) to obtain the final output
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        '''Reshape the input x into the shape (batch_size, num_heads, seq_length, d_k).
        This enables the model to process multiple attention heads concurrently,
        allowing for parallel computation.'''
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        '''After applying attention to each head separately, combine the heads back
        into a single tensor of shape (batch_size, seq_length, d_model), preparing
        the result for further processing.'''
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        '''The forward method is where the actual computation happens.'''
        # Apply linear transformations and split heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        # Perform scaled dot-product attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)

        # Combine heads and apply output transformation
        output = self.W_o(self.combine_heads(attn_output))
        return output
In summary, the MultiHeadAttention class encapsulates the multi-head attention mechanism commonly used in Transformer models. It splits the input into multiple attention heads, applies attention to each head, and then combines the results, so the model can capture a variety of relationships in the input data at different scales and thereby improve its expressive power.
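As a quick smoke test, the following minimal sketch runs a random tensor through the module in self-attention mode (Q, K, and V are the same tensor) and checks that the output keeps the input shape. The sizes (d_model=512, num_heads=8, a batch of 2, sequence length 10) are illustrative choices, not values required by the code above.

import torch

# Instantiate the multi-head attention module defined above
mha = MultiHeadAttention(d_model=512, num_heads=8)

x = torch.rand(2, 10, 512)   # (batch_size, seq_length, d_model)
out = mha(x, x, x)           # self-attention: query, key, and value are the same tensor
print(out.shape)             # torch.Size([2, 10, 512]) -- shape is preserved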
Position-wise Feed-Forward Networks(位置向前馈网路)
class PositionWiseFeedForward(nn.Module):
    '''A subclass of PyTorch's nn.Module, so it inherits all functionality required
    to work as a neural network layer.'''
    def __init__(self, d_model, d_ff):
        '''d_model: dimensionality of the model's input and output.
        d_ff: dimensionality of the inner layer in the feed-forward network.
        fc1 and fc2: two fully connected (linear) layers with input and output
        dimensions defined by d_model and d_ff.
        relu: ReLU (Rectified Linear Unit) activation function, which introduces
        non-linearity between the two linear layers.'''
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        '''The input x is first passed through the first linear layer (fc1), then
        through a ReLU activation (which replaces negative values with zeros,
        introducing non-linearity), and finally through the second linear layer
        (fc2) to produce the output.'''
        return self.fc2(self.relu(self.fc1(x)))
The PositionWiseFeedForward class defines a position-wise feed-forward neural network consisting of two linear layers with a ReLU activation in between. In the context of the Transformer model, this feed-forward network is applied to each position separately and identically. It helps transform the features learned by the attention mechanism inside the Transformer, acting as an additional processing step on the attention output.
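A minimal usage sketch, again with illustrative sizes (d_ff=2048 follows the original "Attention Is All You Need" configuration, but any value works): the network maps each position's d_model-dimensional vector through the hidden layer and back, so the output shape matches the input shape.

import torch

ffn = PositionWiseFeedForward(d_model=512, d_ff=2048)
x = torch.rand(2, 10, 512)   # (batch_size, seq_length, d_model)
print(ffn(x).shape)          # torch.Size([2, 10, 512]) -- applied independently at each position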
Positional Encoding
class PositionalEncoding(nn.Module):
    '''A subclass of PyTorch's nn.Module, allowing it to be used as a standard PyTorch layer.'''
    def __init__(self, d_model, max_seq_length):
        '''d_model: the dimension of the model's input.
        max_seq_length: the maximum sequence length for which positional encodings are pre-computed.
        pe: a tensor filled with zeros, which will be populated with positional encodings.
        position: a tensor containing the position index of each position in the sequence.
        div_term: a term used to scale the position indices.
        The sine function is applied to the even indices of pe and the cosine function
        to the odd indices. Finally, pe is registered as a buffer: part of the module's
        state, but not a trainable parameter.'''
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        '''Add the positional encodings to the input x, using only the first x.size(1)
        entries of pe so the encodings match the actual sequence length of x.'''
        return x + self.pe[:, :x.size(1)]
The PositionalEncoding class adds information about the position of each token in the sequence. Because the Transformer model has no built-in notion of token order (a consequence of its self-attention mechanism), positional encodings help the model take a token's position in the sequence into account. The sinusoidal functions are chosen so the model can easily learn to attend to relative positions, since they produce a unique, smooth encoding for every position in the sequence.
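The short sketch below (illustrative sizes only) feeds a batch of all-zero "embeddings" through the module, which makes the added encodings directly visible: every batch element receives the same encodings, while different positions receive different encoding vectors.

import torch

pos_enc = PositionalEncoding(d_model=512, max_seq_length=100)
x = torch.zeros(2, 10, 512)   # zero embeddings make the added encodings visible
encoded = pos_enc(x)
print(encoded.shape)                               # torch.Size([2, 10, 512])
print(torch.allclose(encoded[0], encoded[1]))      # True: same encodings for every batch element
print(torch.equal(encoded[0, 0], encoded[0, 1]))   # False: each position gets a unique encoding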
Step 3: Build the encoder layer
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        '''d_model: the dimensionality of the input.
        num_heads: the number of attention heads in the multi-head attention.
        d_ff: the dimensionality of the inner layer in the position-wise feed-forward network.
        dropout: the dropout rate used for regularization.
        self_attn: multi-head attention mechanism.
        feed_forward: position-wise feed-forward neural network.
        norm1 and norm2: layer normalization, applied to smooth the layer's input.
        dropout: dropout layer, used to prevent overfitting by randomly setting some
        activations to zero during training.'''
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        '''x: the input to the encoder layer.
        mask: optional mask to ignore certain parts of the input.
        1. Self-attention: x is passed through the multi-head self-attention mechanism.
        2. Add & normalize: the attention output is added to the original input
           (residual connection), followed by dropout and normalization with norm1.
        3. Feed-forward: the result is passed through the position-wise feed-forward network.
        4. Add & normalize: the feed-forward output is added to its input (residual
           connection), followed by dropout and normalization with norm2.
        5. The processed tensor is returned as the output of the encoder layer.'''
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
The EncoderLayer class defines a single encoder layer of the Transformer. It encapsulates a multi-head self-attention mechanism and a position-wise feed-forward neural network, with residual connections, layer normalization, and dropout applied where needed. Together, these components allow the encoder to capture complex relationships in the input data and transform them into useful representations for downstream tasks. Typically, several such encoder layers are stacked to form the complete encoder part of the Transformer model.
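A minimal sketch of a single encoder layer in isolation (illustrative sizes; the all-ones mask simply means "attend to every position", whereas in the full model of Step 5 the mask is derived from padding):

import torch

enc_layer = EncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.rand(2, 10, 512)            # (batch_size, seq_length, d_model)
src_mask = torch.ones(2, 1, 1, 10)    # broadcastable over heads and query positions; 1 = attend
print(enc_layer(x, src_mask).shape)   # torch.Size([2, 10, 512])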
Step 4: Build the decoder layer
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        '''d_model: the dimensionality of the input.
        num_heads: the number of attention heads in the multi-head attention.
        d_ff: the dimensionality of the inner layer in the feed-forward network.
        dropout: the dropout rate for regularization.
        self_attn: multi-head self-attention mechanism for the target sequence.
        cross_attn: multi-head attention mechanism that attends to the encoder's output.
        feed_forward: position-wise feed-forward neural network.
        norm1, norm2, norm3: layer normalization components.
        dropout: dropout layer for regularization.'''
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        '''x: the input to the decoder layer.
        enc_output: the output from the corresponding encoder (used in the cross-attention step).
        src_mask: source mask to ignore certain parts of the encoder's output.
        tgt_mask: target mask to ignore certain parts of the decoder's input.
        1. Self-attention on the target sequence: x is processed through a self-attention mechanism.
        2. Add & normalize: the self-attention output is added to the original x,
           followed by dropout and normalization with norm1.
        3. Cross-attention with the encoder output: the normalized output is processed
           through a cross-attention mechanism that attends to enc_output.
        4. Add & normalize: the cross-attention output is added to its input,
           followed by dropout and normalization with norm2.
        5. Feed-forward: the result is passed through the feed-forward network.
        6. Add & normalize: the feed-forward output is added to its input,
           followed by dropout and normalization with norm3.
        7. The processed tensor is returned as the output of the decoder layer.'''
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x
The DecoderLayer class defines a single decoder layer of the Transformer. It consists of a multi-head self-attention mechanism, a multi-head cross-attention mechanism (which attends to the encoder's output), a position-wise feed-forward neural network, and the corresponding residual connections, layer-normalization layers, and dropout layers. This combination lets the decoder generate meaningful output based on the encoder's representations while taking both the target and source sequences into account. As with the encoder, multiple decoder layers are usually stacked to form the complete decoder part of the Transformer model.
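A minimal sketch of a single decoder layer in isolation. It needs the decoder input, an encoder output to cross-attend to, and the two masks; here the source mask allows every encoder position and the target mask is the usual causal ("no peek") mask. All sizes are illustrative.

import torch

dec_layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
tgt = torch.rand(2, 12, 512)          # decoder input embeddings (batch, tgt_len, d_model)
enc_output = torch.rand(2, 10, 512)   # stand-in for the encoder stack's output
src_mask = torch.ones(2, 1, 1, 10)                        # attend to all encoder positions
tgt_mask = torch.tril(torch.ones(1, 1, 12, 12)).bool()    # causal mask for self-attention
print(dec_layer(tgt, enc_output, src_mask, tgt_mask).shape)   # torch.Size([2, 12, 512])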
Step 5: Combine the encoder and decoder layers into the complete Transformer network
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        '''src_vocab_size: source vocabulary size.
        tgt_vocab_size: target vocabulary size.
        d_model: the dimensionality of the model's embeddings.
        num_heads: number of attention heads in the multi-head attention mechanism.
        num_layers: number of layers for both the encoder and the decoder.
        d_ff: dimensionality of the inner layer in the feed-forward network.
        max_seq_length: maximum sequence length for positional encoding.
        dropout: dropout rate for regularization.
        encoder_embedding / decoder_embedding: embedding layers for the source and target sequences.
        positional_encoding: positional encoding component.
        encoder_layers / decoder_layers: lists of encoder and decoder layers.
        fc: final fully connected (linear) layer mapping to the target vocabulary size.
        dropout: dropout layer.'''
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        '''Create masks for the source and target sequences, ensuring that padding tokens
        are ignored and that future tokens are not visible during training for the
        target sequence.'''
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        '''The forward pass of the Transformer: takes source and target sequences and
        produces the output predictions.
        1. Input embedding and positional encoding: the source and target sequences are
           embedded with their respective embedding layers and combined with their
           positional encodings.
        2. Encoder layers: the source sequence is passed through the encoder layers;
           the final encoder output represents the processed source sequence.
        3. Decoder layers: the target sequence and the encoder's output are passed
           through the decoder layers, producing the decoder's output.
        4. Final linear layer: the decoder's output is mapped to the target vocabulary
           size with a fully connected (linear) layer.'''
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output
The Transformer class brings together the components of the Transformer model: the embeddings, positional encoding, encoder layers, and decoder layers. It provides a convenient interface for training and inference, encapsulating the complexity of multi-head attention, feed-forward networks, and layer normalization. The implementation follows the standard Transformer architecture, making it suitable for sequence-to-sequence tasks such as machine translation and text summarization. The masks ensure that the model respects the causal dependencies within a sequence, ignores padding tokens, and prevents information from leaking in from future tokens. Taken together, these steps let the Transformer model process input sequences efficiently and produce the corresponding output sequences.
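To illustrate how the assembled model might be exercised end to end, here is a minimal training-step sketch on random token data. The hyperparameters and the random integer "sentences" are illustrative only; token index 0 is treated as padding, matching the masking logic in generate_mask, and standard teacher forcing is used (feed tgt[:, :-1], predict tgt[:, 1:]).

import torch
import torch.nn as nn

src_vocab_size, tgt_vocab_size = 5000, 5000
model = Transformer(src_vocab_size, tgt_vocab_size, d_model=512, num_heads=8,
                    num_layers=2, d_ff=2048, max_seq_length=100, dropout=0.1)

# Random token ids in [1, vocab_size); 0 is reserved for padding
src = torch.randint(1, src_vocab_size, (8, 16))   # (batch_size, src_seq_length)
tgt = torch.randint(1, tgt_vocab_size, (8, 16))   # (batch_size, tgt_seq_length)

criterion = nn.CrossEntropyLoss(ignore_index=0)   # ignore padding positions in the loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
optimizer.zero_grad()
output = model(src, tgt[:, :-1])                  # (batch, tgt_len - 1, tgt_vocab_size)
loss = criterion(output.reshape(-1, tgt_vocab_size), tgt[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(loss.item())                                # roughly ln(5000) ≈ 8.5 before any training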