From RNN to LSTM


RNN and LSTM

nn.Embedding

nn.Embedding builds a lookup table of size vocab_size, where each vocabulary entry is represented by a vector of dimension hidden_size.

>>> embedding = nn.Embedding(10, 3)
>>> embedding.weight

# Parameter containing:          
# tensor([[ 1.2402, -1.0914, -0.5382],
#         [-1.1031, -1.2430, -0.2571],
#         [ 1.6682, -0.8926, 1.4263],
#         [ 0.8971, 1.4592, 0.6712],
#         [-1.1625, -0.1598, 0.4034],
#         [-0.2902, -0.0323, -2.2259],
#         [ 0.8332, -0.2452, -1.1508],
#         [ 0.3786, 1.7752, -0.0591],
#         [-1.8527, -2.5141, -0.4990],
#         [-0.6188, 0.5902, -0.0860]], requires_grad=True)

>>> embedding.weight.size()
# torch.Size([10, 3])
# note that input is indices, the size of which is [2,4]
input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])

a = embedding(input)   
print(a)
# tensor([[[-1.1031, -1.2430, -0.2571],   
#          [ 1.6682, -0.8926, 1.4263],  
#          [-1.1625, -0.1598, 0.4034],
#          [-0.2902, -0.0323, -2.2259]],
#         [[-1.1625, -0.1598, 0.4034],
#          [ 0.8971, 1.4592, 0.6712],
#          [ 1.6682, -0.8926, 1.4263],
#          [-0.6188, 0.5902, -0.0860]]], grad_fn=<EmbeddingBackward>)

print(a.size())
# torch.Size([2, 4, 3])

a = embedding(input) fetches, for each index in input, the corresponding word vector from embedding.weight!

Look at the first row of a: the first input index is 1, so the row at index 1 of weight is retrieved, i.e. word vectors are looked up by index.
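
A minimal check of this index lookup, reusing the embedding, input, and a defined above (the exact values depend on the random initialization):

# embedding(input) is just row selection from embedding.weight
# input[0, 0] == 1, so the result equals embedding.weight[1]
print(torch.equal(a[0, 0], embedding.weight[1]))   # True
print(torch.equal(a[1, 3], embedding.weight[9]))   # True, since input[1, 3] == 9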

Vanilla RNN

RNN

import torch
import torch.nn as nn

bs = 20
vocab_size = 10000

class three_layer_recurrent_net(nn.Module):

    def __init__(self, hidden_size):
        super(three_layer_recurrent_net, self).__init__()
        self.layer1 = nn.Embedding( vocab_size  , hidden_size )
        self.layer2 = nn.RNN(       hidden_size , hidden_size )
        self.layer3 = nn.Linear(    hidden_size , vocab_size  )

    def forward(self, word_seq, h_init):
        g_seq = self.layer1( word_seq )
        h_seq , h_final = self.layer2( g_seq , h_init )
        score_seq = self.layer3( h_seq )
        return score_seq, h_final

nn.RNN(input_size, hidden_size, num_layers=1, nonlinearity='tanh', bias=True, batch_first=False, dropout=0, bidirectional=False)

input_size: the dimension of the input features. For an RNN the input is usually a word vector, so input_size equals the dimension of one word vector (the word-vector dimension defined in nn.Embedding()), i.e. the feature size of the word vector.

hidden_size: the number of hidden units, also called the output dimension (since the RNN outputs the hidden state at each time step). It does not have to equal input_size; the network handles the change of dimension internally.

num_layers: Number of recurrent layers. Setting num_layers=2 would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results. Default: 1

nonlinearity: the activation function; only 'relu' or 'tanh' (the default) can be used.

bias: whether to use bias terms.

batch_first: the layout of the input data; if True, the input and output tensors are given as (batch, seq, feature) instead of (seq, batch, feature).

dropout: whether to apply dropout after each recurrent (RNN) layer except the last one; off by default. To use it, set it to a number between 0 and 1.

bidirectional: whether to use a bidirectional RNN; default is False. A bidirectional RNN affects the input/output dimensions (see the sketch after this list).
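
A minimal sketch of how these arguments change the expected shapes; the sizes here are illustrative assumptions, not taken from the examples in this post:

import torch
import torch.nn as nn

# batch_first=True   -> input is (batch, seq, feature) instead of (seq, batch, feature)
# bidirectional=True -> output feature dimension doubles, and the hidden state has
#                       num_layers * 2 entries
rnn = nn.RNN(input_size=5, hidden_size=6, num_layers=2,
             nonlinearity='tanh', batch_first=True, dropout=0.5, bidirectional=True)

x  = torch.randn(8, 21, 5)        # (batch_size=8, seq_length=21, input_size=5)
h0 = torch.zeros(2 * 2, 8, 6)     # (num_layers * num_directions, batch_size, hidden_size)

output, hn = rnn(x, h0)
print(output.size())              # torch.Size([8, 21, 12])  -> 2 directions * hidden_size
print(hn.size())                  # torch.Size([4, 8, 6])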

# Input

The input shape is [seq_length, batch_size, hidden_size] (here the feature dimension equals hidden_size because the input comes from the embedding).

In the example, g_seq is a tensor whose size is word_seq's size followed by hidden_size, i.e. [seq_length, batch_size, hidden_size].

The shape of h_init is [num_layers, batch_size, hidden_size].

# Output

The forward pass returns the output and the hidden state h_n. The output is the hidden state computed by the hidden layer at every time step; these usually serve as the input to a subsequent output layer. Note that this "output" does not itself involve any output-layer computation; its size is [seq_length, batch_size, hidden_size]. The hidden state h_n is the hidden state at the last time step: when the hidden layer has multiple layers, the hidden state of every layer is recorded in this variable, and its size is [num_layers, batch_size, hidden_size].

# input_size = 5 (word vector dimension / input feature), hidden_size = 6, num_layers = 7
rnn = nn.RNN(5, 6, 7)

input = torch.randn(21, 8, 5)

# seq_length = 21, (21 words, 21 time_steps), batch_size = 8, (8 sentences), every single word vector’s size = 5
# Suppose the batch contains 8 sentences. At time step 17, the RNN processes the 17th word
# of each of the 8 sequences in parallel; the resulting 8 hidden states are passed to the
# cell at the next time step, which then processes the 18th word of each sequence.
# num_layers = 7, batch_size = 8, hidden_size = 6

h0 = torch.randn(7, 8, 6)
output, hn = rnn(input, h0)

# output.size() = torch.Size([21, 8, 6])
# hn.size() = torch.Size([7, 8, 6])

Why tanh?

An RNN shares the same weights across all time steps, so backpropagation amounts to multiplying by the same weight matrix over and over, which makes vanishing and exploding gradients more likely than in a CNN; this is why tanh is commonly used. tanh can still suffer from vanishing and exploding gradients, but its derivative range (0, 1] is better than Sigmoid's (0, 0.25].
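
A quick numerical check of these derivative ranges (the grid of points is an arbitrary choice):

import torch

# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))  peaks at 0.25 (at x = 0)
# tanh'(x)    = 1 - tanh(x)^2                  peaks at 1.0  (at x = 0)
x = torch.linspace(-5, 5, 1001, requires_grad=True)

torch.sigmoid(x).sum().backward()
print(x.grad.max())               # tensor(0.2500)

x.grad = None
torch.tanh(x).sum().backward()
print(x.grad.max())               # tensor(1.)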

RNN training in 10 epochs

import math
import time

hidden_size = 150
net = three_layer_recurrent_net( hidden_size )
criterion = nn.CrossEntropyLoss()
my_lr = 1
seq_length = 35
start = time.time()

for epoch in range(10):

    # keep the learning rate at 1 during the first 4 epochs, then divide by 1.1 at every epoch
    if epoch >= 4:
        my_lr = my_lr / 1.1

    # create a new optimizer and give it the current learning rate
    optimizer = torch.optim.SGD( net.parameters() , lr=my_lr )

    # set the running quantities to zero at the beginning of the epoch
    running_loss = 0
    num_batches = 0

    # set the initial h to be the zero vector
    h = torch.zeros(1, bs, hidden_size)

    # send it to the gpu
    h = h.to(device)

    for count in range( 0 , 46478-seq_length , seq_length ):

        # set the gradients to zeros
        optimizer.zero_grad()

        # create a minibatch; every unit output should be compared with the next step,
        # so minibatch_label should be [ count+1 : count+seq_length+1 ]
        minibatch_data  = train_data[ count   : count+seq_length   ]
        minibatch_label = train_data[ count+1 : count+seq_length+1 ]

        # send them to the gpu
        minibatch_data = minibatch_data.to(device)
        minibatch_label = minibatch_label.to(device)

        # detach to prevent backpropagating all the way to the beginning,
        # then tell PyTorch to start tracking all operations that will be done on h
        h = h.detach()
        h = h.requires_grad_()

        # forward the minibatch through the net
        scores, h = net( minibatch_data, h )

        # reshape the scores and labels to one huge batch of size bs*seq_length
        scores = scores.view( bs*seq_length , vocab_size )
        minibatch_label = minibatch_label.view( bs*seq_length )

        # compute the average of the losses of the data points in this huge batch
        loss = criterion( scores , minibatch_label )

        # backward pass to compute dL/dR, dL/dV and dL/dW
        loss.backward()

        # do one step of stochastic gradient descent: R=R-lr(dL/dR), V=V-lr(dL/dV), ...
        utils.normalize_gradient(net)
        optimizer.step()

        # update the running loss
        running_loss += loss.item()
        num_batches += 1

    # compute stats for the full training set
    total_loss = running_loss/num_batches
    elapsed = time.time()-start
    print('')
    print('epoch=', epoch, '\t time=', elapsed, '\t lr=', my_lr, '\t exp(loss)=', math.exp(total_loss))
    eval_on_test_set()
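
The loop relies on utils.normalize_gradient(net) and eval_on_test_set(), helpers from the course notebook that are not shown in this post. A plausible sketch of the gradient normalization, purely as an assumption about what such a helper might do (rescale all gradients to unit global L2 norm so that lr=1 steps stay bounded):

def normalize_gradient(net):
    # hypothetical reimplementation: rescale every parameter gradient so that
    # the global L2 norm over all gradients equals 1
    grad_norm_sq = 0.0
    for p in net.parameters():
        if p.grad is not None:
            grad_norm_sq += p.grad.pow(2).sum().item()
    grad_norm = grad_norm_sq ** 0.5
    if grad_norm > 0:
        for p in net.parameters():
            if p.grad is not None:
                p.grad.div_(grad_norm)
    return grad_norm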

LSTM

LSTMs help alleviate the vanishing and exploding gradient problems of a vanilla RNN.

In a long short-term memory (LSTM) network, the hidden state is a tuple (h, c), i.e. a hidden state and a cell state.

Its input:

input(seq_len, batch_size, input_size)

seq_len: in text processing, if a sentence has 7 words, then seq_len = 7; in time-series forecasting, if we use the load of the previous 24 hours to predict the load at the next time step, then seq_len = 24.

batch_size: the number of samples fed into the LSTM at once. In text processing many sentences can be fed in at once; in time-series forecasting many windows of data can likewise be fed in at once.

input_size: in text processing a word cannot take part in computation directly, so we embed each word as a vector (e.g. with Word2Vec); then input_size = embedding_size. For example, if each word in a sentence is represented by a 100-dimensional vector, input_size = 100. In time-series forecasting, e.g. load forecasting, each load value is a single scalar that can be used directly, so no embedding is needed and input_size = 1. But if we predict with multiple variables, e.g. using [load, wind speed, temperature, pressure, humidity, weather, holiday info] at each of the previous 24 hours to predict the next load, then input_size = 7 (a shape sketch follows below).
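
A minimal shape sketch for that multivariate time-series case; the hidden size and batch size here are arbitrary assumptions:

import torch
import torch.nn as nn

# seq_len = 24 hours, input_size = 7 features per hour (assumed hidden_size = 32)
lstm = nn.LSTM(input_size=7, hidden_size=32)   # num_layers=1, batch_first=False by default

batch_size = 16
x = torch.randn(24, batch_size, 7)             # (seq_len, batch_size, input_size)

output, (hn, cn) = lstm(x)                     # h0 and c0 default to zeros
print(output.size())                           # torch.Size([24, 16, 32])
print(hn.size(), cn.size())                    # torch.Size([1, 16, 32]) each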

class three_layer_recurrent_net(nn.Module):

    def __init__(self, hidden_size):
        super(three_layer_recurrent_net, self).__init__()
        self.layer1 = nn.Embedding( vocab_size  , hidden_size )
        self.layer2 = nn.LSTM(      hidden_size , hidden_size )
        self.layer3 = nn.Linear(    hidden_size , vocab_size  )

    def forward(self, word_seq, h_init, c_init):
        g_seq = self.layer1( word_seq )
        h_seq , (h_final, c_final) = self.layer2( g_seq , (h_init, c_init) )
        score_seq = self.layer3( h_seq )
        return score_seq, h_final, c_final

lstm = nn.LSTM(10, 20, 2)
input = torch.randn(5, 3, 10)
h0 = torch.randn(2, 3, 20)
c0 = torch.randn(2, 3, 20)
output, (hn, cn) = lstm(input, (h0, c0))
