Preparing Sample Data
'''
Hyperparameters: These values define the architecture and behavior of the transformer model:

src_vocab_size, tgt_vocab_size: Vocabulary sizes for source and target sequences, both set to 5000.
d_model: Dimensionality of the model's embeddings, set to 512.
num_heads: Number of attention heads in the multi-head attention mechanism, set to 8.
num_layers: Number of layers for both the encoder and the decoder, set to 6.
d_ff: Dimensionality of the inner layer in the feed-forward network, set to 2048.
max_seq_length: Maximum sequence length for positional encoding, set to 100.
dropout: Dropout rate for regularization, set to 0.1.
'''
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

'''
This line creates an instance of the Transformer class, initializing it with the given
hyperparameters. The instance will have the architecture and behavior defined by these hyperparameters.
'''
transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

'''
Generate random sample data.

src_data: Random integers between 1 and src_vocab_size, representing a batch of source sequences with shape (64, max_seq_length).
tgt_data: Random integers between 1 and tgt_vocab_size, representing a batch of target sequences with shape (64, max_seq_length).

These random sequences can be used as inputs to the transformer model, simulating a batch of data with 64 examples and sequences of length 100.
'''
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
This code snippet demonstrates how to initialize a Transformer model and generate random source and target sequences that can be fed into it. The chosen hyperparameters determine the exact structure and properties of the transformer. This setup could be part of a larger script in which the model is trained and evaluated on a real sequence-to-sequence task, such as machine translation or text summarization.
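As a quick sanity check (not part of the original snippet), you can run a single forward pass and confirm that the model produces one set of logits over the target vocabulary for every target position. This sketch assumes the Transformer class defined earlier returns a tensor of shape (batch_size, tgt_seq_length, tgt_vocab_size):

# Sketch of a forward-pass sanity check. It assumes the Transformer defined above
# returns logits of shape (batch_size, tgt_seq_length, tgt_vocab_size).
with torch.no_grad():
    logits = transformer(src_data, tgt_data[:, :-1])  # target shifted right by one, as in training

print(logits.shape)  # expected: torch.Size([64, 99, 5000])
assert logits.shape == (64, max_seq_length - 1, tgt_vocab_size)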
Training the Model
'''
criterion = nn.CrossEntropyLoss(ignore_index=0): Defines the loss function as cross-entropy loss. The ignore_index argument is set to 0, meaning the loss will not consider targets with an index of 0 (typically reserved for padding tokens).

optimizer = optim.Adam(...): Defines the optimizer as Adam with a learning rate of 0.0001 and specific beta values.
'''
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

'''
transformer.train(): Sets the transformer model to training mode, enabling behaviors like dropout that only apply during training.
'''
transformer.train()

'''
The code snippet trains the model for 100 epochs using a typical training loop:

for epoch in range(100): Iterates over 100 training epochs.
optimizer.zero_grad(): Clears the gradients from the previous iteration.
output = transformer(src_data, tgt_data[:, :-1]): Passes the source data and the target data (excluding the last token in each sequence) through the transformer. This is common in sequence-to-sequence tasks where the target is shifted by one token.
loss = criterion(...): Computes the loss between the model's predictions and the target data (excluding the first token in each sequence). The loss is calculated by reshaping the data into one-dimensional tensors and using the cross-entropy loss function.
loss.backward(): Computes the gradients of the loss with respect to the model's parameters.
optimizer.step(): Updates the model's parameters using the computed gradients.
print(f"Epoch: {epoch+1}, Loss: {loss.item()}"): Prints the current epoch number and the loss value for that epoch.
'''
for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")
This code snippet trains the Transformer model on the randomly generated source and target sequences for 100 epochs. It uses the Adam optimizer and the cross-entropy loss function. The loss is printed for each epoch, so you can monitor training progress. In a real scenario, you would replace the random source and target sequences with actual data from your task, such as machine translation.
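In practice you would also iterate over mini-batches of tokenized data rather than a single fixed batch. The following is a minimal sketch using torch.utils.data.TensorDataset and DataLoader; it reuses the random tensors above purely as stand-ins for real tokenized sentence pairs, and the batch size of 16 and 10 epochs are arbitrary illustrative choices:

from torch.utils.data import TensorDataset, DataLoader

# Minimal mini-batch training sketch: wrap (source, target) pairs in a DataLoader.
# The random tensors stand in for real tokenized sentence pairs.
dataset = TensorDataset(src_data, tgt_data)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

transformer.train()
for epoch in range(10):  # fewer epochs, shown only for illustration
    epoch_loss = 0.0
    for src_batch, tgt_batch in loader:
        optimizer.zero_grad()
        output = transformer(src_batch, tgt_batch[:, :-1])
        loss = criterion(output.contiguous().view(-1, tgt_vocab_size),
                         tgt_batch[:, 1:].contiguous().view(-1))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch: {epoch+1}, Avg Loss: {epoch_loss / len(loader)}")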
Evaluating Model Performance
'''
transformer.eval(): Puts the transformer model in evaluation mode. This is important because
it turns off certain behaviors like dropout that are only used during training.
'''
transformer.eval()

'''
Generate random sample validation data.

val_src_data: Random integers between 1 and src_vocab_size, representing a batch of validation source sequences with shape (64, max_seq_length).
val_tgt_data: Random integers between 1 and tgt_vocab_size, representing a batch of validation target sequences with shape (64, max_seq_length).
'''
val_src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
val_tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

'''
Validation Loop:

with torch.no_grad(): Disables gradient computation, as we don't need to compute gradients during validation. This can reduce memory consumption and speed up computations.
val_output = transformer(val_src_data, val_tgt_data[:, :-1]): Passes the validation source data and the validation target data (excluding the last token in each sequence) through the transformer.
val_loss = criterion(...): Computes the loss between the model's predictions and the validation target data (excluding the first token in each sequence). The loss is calculated by reshaping the data into one-dimensional tensors and using the previously defined cross-entropy loss function.
print(f"Validation Loss: {val_loss.item()}"): Prints the validation loss value.
'''
with torch.no_grad():
    val_output = transformer(val_src_data, val_tgt_data[:, :-1])
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size), val_tgt_data[:, 1:].contiguous().view(-1))
    print(f"Validation Loss: {val_loss.item()}")
This code snippet evaluates the Transformer model on a randomly generated validation dataset, computes the validation loss, and prints it. In a real scenario, the random validation data would be replaced with actual validation data from the task you are working on. The validation loss gives you an indication of how well the model performs on unseen data, which is a key measure of its ability to generalize.
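Beyond the validation loss, a trained sequence-to-sequence model is normally used by decoding one token at a time. The snippet below is a minimal greedy-decoding sketch, not part of the original code: the start_token id and max_len are hypothetical placeholders (real data would reserve a specific start-of-sequence id; the random data above defines none), and the model's own predictions are fed back in position by position.

def greedy_decode(model, src, start_token=1, max_len=20):
    """Minimal greedy-decoding sketch. `start_token` is a hypothetical
    start-of-sequence id; the random data in this tutorial does not define one."""
    model.eval()
    # Begin every target sequence with the start token.
    tgt = torch.full((src.size(0), 1), start_token, dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_len - 1):
            logits = model(src, tgt)                   # (batch, current_len, tgt_vocab_size)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            tgt = torch.cat([tgt, next_token], dim=1)  # append the most likely next token
    return tgt

generated = greedy_decode(transformer, val_src_data[:4])
print(generated.shape)  # torch.Size([4, 20])

A real decoder would also stop early at an end-of-sequence token and map the generated ids back to text using the tokenizer's vocabulary.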