Sequence to Sequence Learning.

Sequence to Sequence Learning.

There are many implementtion of the sequence to sequence on internet, Then why this one even exist? Because I believe in “What I cannot create, I do not understand.”. I have put my efforts in understanding it from scratch and finished its implementation.I have taken care and made this implementation easy to understand.This implementation of sequence to sequence is without batching. Follow through below given steps to understand it better.

Environment Setup

!pip install torch==0.4.1 -q
!pip install torchtext -q

The data for this tutorial is a set of many thousands of French to English translation pairs. These pairs are shared from

Fra Eng
Va ! Go.
Cours ! Run!
Saute. Jump.
Ça suffit ! Stop!
Stop ! Stop!
Arrête-toi ! Stop!
Attends ! Wait!
J’ai froid. I am cold.

Such 11,000+ pairs for french to English traslation are given in original dataset.

!wget -c
import random
import re
import string
import unicodedata

import torch
from import Dataset

import time

import matplotlib.pyplot as plt
import numpy as np
from torch import nn, optim
from torch.autograd import Variable
from import DataLoader
import torch.nn.functional as F

Data Preparation

It better to reuse, I am reusing official text preprocessing code written for seq2seq tutorial.

SOS_token = 0
EOS_token = 1

To keep track of all this we will use a helper class called Lang which has word → index (word2index) and index → word (index2word) dictionaries, as well as a count of each word word2count to use to later replace rare words.

class Lang(object):
    def __init__(self, name): = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
            self.word2count[word] += 1

The files are all in Unicode, to simplify we will turn Unicode characters to ASCII, make everything lowercase, and trim most punctuation.

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

# Lowercase, trim, and remove non-letter characters

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

To read the data file we will split the file into lines, and then split lines into pairs. The files are all English → Other Language, so if we want to translate from Other Language → English I added the reverse flag to reverse the pairs.

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('fra.txt', encoding='utf-8').read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

Since there are a lot of example sentences and we want to train something quickly, we’ll trim the data set to only relatively short and simple sentences. Here the maximum length is 10 words (that includes ending punctuation) and we’re filtering to sentences that translate to the form “I am” or “He is” etc. (accounting for apostrophes replaced earlier).

eng_prefixes = ("i am ", "i m ", "he is", "he s ", "she is", "she s",
                "you are", "you re ", "we are", "we re ", "they are",
                "they re ")

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

The full process for preparing the data is:

  • Read text file and split into lines, split lines into pairs
  • Normalize text, filter by length and content
  • Make word lists from sentences in pairs
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    pairs = filterPairs(pairs)
    for pair in pairs:
    print("Input Language : ",, ", Number of words : " ,input_lang.n_words)
    print("Target Language : ",, ", Number of words : " ,output_lang.n_words)
    print("A Random Pair : ",random.choice(pairs))
    return input_lang, output_lang, pairs

def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    result = torch.LongTensor(indexes)
    return result

def tensorFromPair(input_lang, output_lang, pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return input_tensor, target_tensor

To train, for each pair we will need an input tensor (indexes of the words in the input sentence) and target tensor (indexes of the words in the target sentence). While creating these vectors we will append the EOS token to both sequences.

class TextDataset(Dataset):
    def __init__(self, dataload=prepareData, lang=['eng', 'fra']):
        self.input_lang, self.output_lang, self.pairs = dataload(
            lang[0], lang[1], reverse=True)
        self.input_lang_words = self.input_lang.n_words
        self.output_lang_words = self.output_lang.n_words

    def __getitem__(self, index):
        return tensorFromPair(self.input_lang, self.output_lang,

    def __len__(self):
        return len(self.pairs)
lang_dataset = TextDataset()
lang_dataloader = DataLoader(lang_dataset, shuffle=True)

Reading lines… Input Language : fra , Number of words : 4714 Target Language : eng , Number of words : 3081 A Random Pair : [‘je suis chatouilleuse .’, ‘i m ticklish .’]

One can access pairs of French and English by usinf iterator lang_dataloader.

for i, data in enumerate(lang_dataloader):
    in_lang, out_lang = data
    print(in_lang, out_lang)

tensor([[132, 14, 133, 36, 5, 1]]) tensor([[81, 82, 23, 4, 1]])

input_size = lang_dataset.input_lang_words
hidden_size = 256
output_size = lang_dataset.output_lang_words

Some Learning

**Some basic understanding regarding PyTorch componenets used in Encoder or Decoder are explianed below : **


To learn language one must convert words to fixed size vectors. This can be done is two ways:

  1. Using pretrained word vectors like Word2vec or Glove Vectors
  2. Learning Vector from Scratch

Here we will be using learning Vector from Scratch, Pytorch has the Embedding function for the same. These embeddings will be trained as when learning takes place. Below given is the example how one can insert PyTorch Embedding layer in the model. For Example: We have two words in vocab “hello” and “world” and we want to have 5 dimentional vector for each word then PyTorch Embedding can be defined in following way: [2]

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)

tensor([[-0.2056, 1.0885, 2.1518, -0.5143, -0.9437]], grad_fn=)

Unsqueeze & Squeeze


unsqueeze() inserts singleton dim at position given as parameter. Insert a new axis that will appear at the axis position in the expanded array shape.[3]

input = torch.Tensor(2, 4, 3) # input: 2 x 4 x 3

torch.Size([1, 2, 4, 3])


Returns a tensor with all the dimensions of input of size 1 removed. For example, if input is of shape: $(A×1×B×C×1×D)$ then the out tensor will be of shape: $(A×B×C×D)$.[4]

x = torch.zeros(2, 1, 2, 1, 2) # input: 2 x 4 x 3
y = torch.squeeze(x)

torch.Size([2, 2, 2])


Interchange different axis. [5]

x = torch.randn(2, 3, 5)
torch.Size([2, 3, 5])
print(x.permute(2, 0, 1).size())

torch.Size([5, 2, 3])


Log Softmax applies logarithm after softmax : $log( exp(x_i) / exp(x).sum() )$ [6]


Returns the k largest elements of the given input tensor along a given dimension. If dim is not given, the last dimension of the input is chosen. If largest is False then the k smallest elements are returned.A tuple of (values, indices) is returned, where the indices are the indices of the elements in the original input tensor. [7]

x = torch.arange(1., 10.) #tensor([ 1.,  2.,  3.,  4.,  5.])
top_value, top_index = torch.topk(x, 3)
print(top_value, top_index)
tensor([9., 8., 7.]) tensor([8, 7, 6])


Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence. [8]

Learning Model

With understanding of above elements, we are good to go with defining Encoder and Decoder Network.

Figure 1. Encoder Decoder architecure while training

An Encoder takes input sequence passes through the Embedding Layer. Embedding Layer converts word in to a fixed size vector, this Embedding Layer is trainable and get trained along with encoder and decoder. Here onward each word is in form of Embedded representation and passed to GRU. GRU takes 2 inputs 1) Encoder Hidden states and 2) Embedded word input. All words of the input sentence are passed sequentially to GRU so that it takes for a word it takes hidden state of previous state as input and gives one output. For now I will be ignoring encoder outputs shown by :negative_squared_cross_mark: . (These Encoder outputs, we will be using when we implement attention mechanism)

Encoder and Decoder states are represented by Dotted Blue arrow throughout. Embedding representaion are shown in pink.

After Encoder phase has finished, we her final encoder hidden state as output. This encoder hidden state is of value and it is estimated that as it has seen all the word of the sequence it will be having information regarding entire sequence.

This Encoder hidden state is passed to Decoder as first hidden state.

Concept Of Teacher Forcing :

Teacher Forcing is kind of Hint in training recurrent neural networks that uses model output from a prior time step as an input to next time step. For Example,

  1. Input sentence for encoder is “Mon nom est Sunil” <- (French)
  2. Actual Target Sentence is “My name is Sunil” <- (English)
  3. In Decoding step at first time step <Go> token is given as input to GRU along with context vector.<GO> **or **<SOS> mark begining of decoding and looking at context vector it will generate first output word in English.Then GRU will generate (probably)“My”.
  4. Now in second time step “My” will be given as input along with previous time step hidden step. This will give output as “name”.
  5. Generation wil continues as long as <EOS> token is encountered.

At training time we know the output sequence for the input sequecne so we provide words of actual output for teacher forcing. At test time we dont know the output sequence for the input sequecne so output token generated at time step $t$ is given as input to $t+1$ step (As shown in Figure 2)


class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        input = input.unsqueeze(1)
        embedded = self.embedding(input)  # batch, hidden
        output = embedded.permute(1, 0, 2)
        for i in range(self.n_layers):
            output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        return result

Testing Encoder

encoder_test = EncoderRNN(input_size, hidden_size)
test_encoder_input = torch.tensor([15])
encoder_hidden = encoder_test.initHidden()
encoder_output, encoder_hidden = encoder_test(test_encoder_input,encoder_hidden)

torch.Size([1, 1, 256])


class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1):
        super(DecoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax()

    def forward(self, input, hidden):
        output = self.embedding(input)  # batch, 1, hidden
        output = output.permute(1, 0, 2)  # 1, batch, hidden
        for i in range(self.n_layers):
            output = F.relu(output)
            output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        return result

Testing Decoder

decoder_test = DecoderRNN(hidden_size,output_size)
decoder_hidden = decoder_test.initHidden()
test_decoder_input = torch.tensor([[13]])
decoder_output, test_hidden = decoder_test(test_decoder_input,encoder_hidden)

torch.Size([1, 3081])

Plotting Function

def showPlot(points):
    x = np.arange(len(points))
    plt.plot(x, points)

Actual Training

Starting with actual training, making two object of encoder and decoder class respectively.

encoder = EncoderRNN(input_size, hidden_size)
decoder = DecoderRNN(hidden_size, output_size, n_layers=2)
def train(encoder, decoder, total_epoch):
    # Collecting all trianable parameters
    param = list(encoder.parameters()) + list(decoder.parameters())
    # Defining Optimizer
    optimizer = optim.Adam(param, lr=1e-3)
    # Defining Loss Function
    criterion = nn.NLLLoss()

    plot_losses = []

    # Learning for defined total_epoch
    for epoch in range(total_epoch):
        since = time.time()
        running_loss = 0
        print_loss_total = 0
        total_loss = 0

        # getting french and english pairs by iterating over data iterator
        for i, data in enumerate(lang_dataloader):
            in_lang, out_lang = data

            in_lang = Variable(in_lang)
            out_lang = Variable(out_lang)

            # Passing data to Encoder 
            encoder_outputs = Variable(torch.zeros(MAX_LENGTH, encoder.hidden_size))
            encoder_hidden = encoder.initHidden()

            for encoder_input in range(in_lang.size(1)):
                encoder_output, encoder_hidden = encoder(torch.tensor([in_lang[0][encoder_input]]), encoder_hidden)

            # Passing data to Decoder 
            decoder_input = Variable(torch.LongTensor([[SOS_token]]))
            # using Encoder's last state as Decoder's last state
            decoder_hidden = encoder_hidden
            loss = 0
            for decoder_input_no in range(out_lang.size(1)):
                decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
                loss += criterion(decoder_output, torch.tensor([out_lang[0][decoder_input_no]]))
                topv, topi =
                ni = int(topi[0][0])
                decoder_input = Variable(torch.LongTensor([[ni]]))
                if ni == EOS_token:

            # backward operations

            running_loss +=[0]
            print_loss_total +=[0]
            total_loss +=[0]
            if (i + 1) % 500 == 0:
                print('{}/{}, Loss:{:.6f}'.format(i + 1, len(lang_dataloader), running_loss / 5000))
                running_loss = 0
            if (i + 1) % 100 == 0:
                plot_loss = print_loss_total / 100
                print_loss_total = 0
        during = time.time() - since
        print('Epoch {}/{} , Loss:{:.6f}, Time:{:.0f}s'.format(epoch + 1, total_epoch, total_loss / len(lang_dataset),during))
    #plotting loss
# running training
total_epoch = 1
train(encoder, decoder, total_epoch)

#progress 500/11885, Loss:2.483365 1000/11885, Loss:2.187418 1500/11885, Loss:2.180427 2000/11885, Loss:2.149146 2500/11885, Loss:2.095273 3000/11885, Loss:2.079026 3500/11885, Loss:2.097628 4000/11885, Loss:1.930167 4500/11885, Loss:1.936091 5000/11885, Loss:1.890468 5500/11885, Loss:1.921667 6000/11885, Loss:1.914081 6500/11885, Loss:1.873955 7000/11885, Loss:1.770381 7500/11885, Loss:1.719582 8000/11885, Loss:1.736032 8500/11885, Loss:1.735520 9000/11885, Loss:1.710942 9500/11885, Loss:1.607517 10000/11885, Loss:1.697581 10500/11885, Loss:1.636639 11000/11885, Loss:1.693851 11500/11885, Loss:1.698137

Epoch 1/1 , Loss:18.915344, Time:2059s

Figure 1.1. Decrese in error noted at every 500 iterations.


Evaluation module follows the same flow as training, it takes pretrained encoder and decoder and generate translated text. The main thing here is teacher forcing is not followed while evaluation. The decoder generate a word at each time step and the same word is feed as input to next time step.

Figure 2. Encoder Decoder architecure while testing.

At test time we dont know the output sequence for the input sequecne so output token generated at time step $t$ is given as input to $t+1$ step

def evaluate(encoder, decoder, in_lang, max_length=MAX_LENGTH):
    input_variable = Variable(in_lang)
    input_variable = input_variable.unsqueeze(0)
    input_length = input_variable.size(1)
    # Encoder Phase
    encoder_hidden = encoder.initHidden()
    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs

    for encoder_input in range(input_length):
        encoder_output, encoder_hidden = encoder(torch.tensor([input_variable[0][encoder_input]]), encoder_hidden)
    #Decoder Phase 
    decoder_input = Variable(torch.LongTensor([[SOS_token]]))
    decoder_input = decoder_input
    decoder_hidden = encoder_hidden
    decoded_words = []
    decoder_attentions = torch.zeros(max_length, max_length)

    for decoder_input_no in range(max_length):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden)
        topv, topi =
        ni = int(topi[0][0])
        # if decoder gives End Of Sequence token then break the sequence generation else continue.
        if ni == EOS_token:

        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input
    return decoded_words

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair_idx = random.choice(list(range(len(lang_dataset))))
        pair = lang_dataset.pairs[pair_idx]
        in_lang, out_lang = lang_dataset[pair_idx]
        output_words = evaluate(encoder, decoder, in_lang)
        output_sentence = ' '.join(output_words)
        print('Input : ', pair[0], ' | Desired Output : ', pair[1], ' | Generated Output : ', output_sentence)

evaluateRandomly(encoder, decoder)
Input Desired Output Generated Output
je suis en retard pour l entrainement . i m late for practice . i m late for late late .
je suis touche ! i m hit ! i m a . .
vous avez probablement tort . you are probably wrong . you re probably . .
ils ne sont pas encore chez eux . Desired Output : they re not home yet . Generated Output : they re not always yet .
je ne te drague pas . i m not hitting on you . i m not helping you you .
tu es fort elegant . you re very sophisticated . you re very funny .
je suis veinarde . i m in luck . i m a . .
vous etes vraiment geniales . you re really awesome . you re really very .
je suis juste curieux . i m just curious . i m right behind . .
tu es tres indelicate . you are very insensitive . you re very observant .

You can access original google colab notebbok here.

Author face

Sunil Patel

I'm Sunil Patel, to me working with Deep Learning is a part of hobby more than just a "job". With this blog, I am keeping all my experience with AI open to entire Community. #DeepLearningEnthusiastic #Opportunistic

Recent post