Hierarchical Recurrent Attention Network for Response Generation

Preface

Before reading this paper, you should be familiar with the attention mechanism and with hierarchical encoders.

Introduction

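Target task of this paper: open-domain conversation for chatbots
Purpose:
- Reason: not all information in the context is relevant to the response, at either the word level or the utterance level
- Solution: use attention to find the relevant information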

Problem Formalization

Dataset: $\mathscr{D}=\left\{\left(\mathbf{U}_{i}, \mathbf{Y}_{i}\right)\right\}_{i=1}^{N}$, where for every $i$ the pair $\left(\mathbf{U}_{i}, \mathbf{Y}_{i}\right)$ consists of a response and its context:

$\mathbf{Y}_{i}=\left(y_{i, 1}, \ldots, y_{i, T_{i}}\right)$ : response
$\mathbf{U}_{i}=\left(u_{i, 1}, \dots, u_{i, m_{i}}\right)$ : context
- $u_{i, j}=\left(w_{i, j, 1}, \dots, w_{i, j, T_{i, j}}\right)$ : each $w$ is a word
- In this paper, $m_{i} \geq 2$

Target probability:

$p\left(y_{1}, \dots, y_{T} | \mathbf{U}\right)$
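
By the chain rule of probability, this joint distribution can be written as a product of per-token conditionals, which matches the per-step form used in the Decoder section. The shorthand $y_{<t}$ for the prefix $y_{1}, \dots, y_{t-1}$ is introduced here for brevity and is not notation from the paper.

$p\left(y_{1}, \dots, y_{T} | \mathbf{U}\right)=\prod_{t=1}^{T} p\left(y_{t} | y_{<t}, \mathbf{U}\right)$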

Model

Word Level Encoder

A BiGRU is used as the encoder, and the forward and backward hidden states are concatenated:

$\mathbf{h}_{i, k}=\operatorname{concat}\left(\overrightarrow{\mathbf{h}}_{i, k}, \overleftarrow{\mathbf{h}}_{i, k}\right)$

The hidden state is initialized from an isotropic Gaussian distribution.
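
The following is a minimal sketch of a bidirectional GRU pass with concatenation, in plain Python. The weight scale, the initial-state standard deviation `sigma`, the dimensions, and all function names (`make_gru`, `gru_step`, `bigru_encode`) are assumptions for illustration; the GRU update uses one common gate convention and omits biases, so it may differ in detail from the paper's implementation.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def make_gru(in_dim, hid_dim, rng):
    """Random GRU weights (hypothetical scale 0.1, no biases)."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {name: mat(hid_dim, in_dim) for name in ("Wz", "Wr", "Wh")}
    p.update({name: mat(hid_dim, hid_dim) for name in ("Uz", "Ur", "Uh")})
    return p

def gru_step(p, x, h):
    """One GRU update: gates z and r, candidate state, then interpolation."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def bigru_encode(words, fwd, bwd, hid_dim, rng, sigma=0.1):
    """Return one concatenated [forward; backward] state per word."""
    # Initial states drawn from an isotropic Gaussian (sigma is an assumption).
    h = [rng.gauss(0.0, sigma) for _ in range(hid_dim)]
    forward = []
    for x in words:
        h = gru_step(fwd, x, h)
        forward.append(h)
    h = [rng.gauss(0.0, sigma) for _ in range(hid_dim)]
    backward = []
    for x in reversed(words):
        h = gru_step(bwd, x, h)
        backward.append(h)
    backward.reverse()  # realign backward states with word positions
    return [f + b for f, b in zip(forward, backward)]  # list + list = concat

rng = random.Random(0)
in_dim, hid_dim = 4, 5
words = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(3)]
states = bigru_encode(words, make_gru(in_dim, hid_dim, rng),
                      make_gru(in_dim, hid_dim, rng), hid_dim, rng)
```

Each entry of `states` has length `2 * hid_dim`, since it holds the forward and backward states for the same word position side by side.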

Utterance Encoder

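A unidirectional GRU running in reverse order (from the last utterance to the first) is used as the encoder.
Its input is the word-level attention context vector $r$ together with the hidden state $l$ of the next utterance.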
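
The reverse-order recurrence can be sketched as a short loop. This is an illustration under assumptions: `attend` and `cell` are hypothetical callables standing in for word-level attention and the GRU step, and `l_init` stands in for the state that precedes the last utterance.

```python
def encode_utterances(word_states, s_prev, l_init, attend, cell):
    """Run the utterance-level recurrence from the last utterance to the first.

    word_states: list of m utterances, each a list of word hidden states.
    attend(s_prev, l_next, H) -> r : word-level attention (hypothetical signature).
    cell(r, l_next) -> l           : recurrent step (a GRU in these notes).
    Returns [l_1, ..., l_m] in the original utterance order.
    """
    l_next = l_init                    # plays the role of l_{m+1}
    states = []
    for H in reversed(word_states):    # i = m, m-1, ..., 1
        r = attend(s_prev, l_next, H)  # r_i is computed using l_{i+1}
        l_next = cell(r, l_next)       # l_i = cell(r_i, l_{i+1})
        states.append(l_next)
    states.reverse()
    return states

# Toy stand-ins so the processing order is visible: "attention" returns the
# first word state, and the "cell" simply adds its two inputs.
toy_attend = lambda s, l, H: H[0]
toy_cell = lambda r, l: [r[0] + l[0]]
out = encode_utterances([[[1.0]], [[2.0]], [[3.0]]], None, [0.0], toy_attend, toy_cell)
# out == [[6.0], [5.0], [3.0]]: the first utterance's state has seen all later ones.
```

Because the recurrence starts from the end of the context, the state for an earlier utterance summarizes everything that follows it.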

Hierarchical Attention

Word attention

As in standard attention, the weights $\alpha$ select among the word hidden states $h_{i,j}$:

$\mathbf{r}_{i, t}=\sum_{j=1}^{T_{i}} \alpha_{i, t, j} \mathbf{h}_{i, j}$

The resulting $r_{i,t}$ is the input to the utterance encoder.

$\alpha$ is computed as follows:

$\begin{aligned} &e_{i, t, j}=\eta\left(\mathbf{s}_{t-1}, \mathbf{l}_{i+1, t}, \mathbf{h}_{i, j}\right)\\ &\alpha_{i, t, j}=\frac{\exp \left(e_{i, t, j}\right)}{\sum_{k=1}^{T_{i}} \exp \left(e_{i, t, k}\right)} \end{aligned}$

$s_{t-1}$ is the decoder hidden state at step $t-1$, and $l_{i+1, t}$ is the utterance-level encoder hidden state of the next utterance.

$\eta$ is a multi-layer perceptron with tanh as its activation function.

Applying softmax to $e$ yields $\alpha$.
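
The word-attention step above can be sketched in plain Python as follows. The exact shape of $\eta$ is an assumption here: a single tanh hidden layer applied to the concatenation of the three inputs, followed by a dot product with a vector `v`. The parameter names, dimensions, and random values are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def eta(params, s_prev, l_next, h):
    """Assumed score form: v . tanh(W [s_prev; l_next; h])."""
    x = s_prev + l_next + h  # list concatenation acts as vector concatenation
    hidden = [math.tanh(a) for a in matvec(params["W"], x)]
    return sum(v * a for v, a in zip(params["v"], hidden))

def word_attention(params, s_prev, l_next, H):
    """Return (alpha, r) for one utterance whose word states are H."""
    e = [eta(params, s_prev, l_next, h) for h in H]
    alpha = softmax(e)
    r = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return alpha, r

rng = random.Random(0)
dim_s, dim_l, dim_h, dim_mlp = 3, 3, 4, 5
params = {
    "W": [[rng.gauss(0.0, 0.1) for _ in range(dim_s + dim_l + dim_h)]
          for _ in range(dim_mlp)],
    "v": [rng.gauss(0.0, 0.1) for _ in range(dim_mlp)],
}
s_prev = [0.1] * dim_s
l_next = [0.2] * dim_l
H = [[rng.gauss(0.0, 1.0) for _ in range(dim_h)] for _ in range(6)]
alpha, r = word_attention(params, s_prev, l_next, H)
```

`alpha` holds one positive weight per word and sums to 1, and `r` has the same dimensionality as a single word state.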

Utterance Attention

This is similar to word attention:

$\mathbf{c}_{t}=\sum_{i=1}^{m} \beta_{i, t} \mathbf{l}_{i, t}$

$\begin{aligned} e_{i, t}^{\prime} &=\eta\left(\mathbf{s}_{t-1}, \mathbf{l}_{i, t}\right) \\ \beta_{i, t} &=\frac{\exp \left(e_{i, t}^{\prime}\right)}{\sum_{k=1}^{m} \exp \left(e_{k, t}^{\prime}\right)} \end{aligned}$
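
As a small worked example of the utterance-level weighting, the sketch below takes the scores $e'_{i,t}$ as given (the scoring network is left out) and computes $\beta$ and $\mathbf{c}_t$. The function name and the toy numbers are assumptions for illustration.

```python
import math

def utterance_attention(scores, L):
    """Softmax the scores e'_{i,t} into beta, then return (beta, c_t)."""
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    beta = [x / total for x in exps]
    c = [sum(b * l[d] for b, l in zip(beta, L)) for d in range(len(L[0]))]
    return beta, c

# Two utterance states with equal scores: the context vector is their average.
beta, c = utterance_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# beta == [0.5, 0.5], c == [0.5, 0.5]
```

When one score is much larger than the others, `beta` concentrates on that utterance and `c` approaches its state.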

Decoder

The decoder is a standard one, with nothing special about it:

$\begin{array}{c} \mathbf{s}_{t}=f\left(\mathbf{e}_{y_{t-1}}, \mathbf{s}_{t-1}, \mathbf{c}_{t}\right) \\ p\left(y_{t} | \mathbf{c}_{t}, y_{1}, \ldots, y_{t-1}\right)=\mathbb{I}_{y_{t}} \cdot \operatorname{softmax}\left(\mathbf{s}_{t}, \mathbf{e}_{y_{t-1}}\right) \\ \hat{\Theta}=\underset{\Theta}{\arg \min }-\sum_{i=1}^{N} \log \left(p\left(y_{i, 1}, \ldots, y_{i, T_{i}} | \mathbf{U}_{\mathbf{i}}\right)\right) \end{array}$

$f$ is a GRU.
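
The training objective is the negative log-likelihood of each response, summed over its tokens. A minimal sketch for a single response, assuming the per-step vocabulary logits are already available (how they are produced from $\mathbf{s}_t$ and $\mathbf{e}_{y_{t-1}}$ is left out, and the function names are hypothetical):

```python
import math

def log_softmax(logits):
    """Stable log-softmax: subtract log of the sum of exponentials."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]

def sequence_nll(step_logits, targets):
    """-sum_t log p(y_t | ...), selecting the target word's entry at each step."""
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, targets))

# Uniform logits over a 4-word vocabulary for 3 steps: NLL = 3 * ln(4).
nll = sequence_nll([[0.0] * 4] * 3, [2, 0, 1])
```

Indexing the log-probabilities with the target id plays the role of the one-hot vector $\mathbb{I}_{y_t}$ in the formula above; summing this quantity over all responses in the dataset gives the objective minimized over $\Theta$.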

Reference

paper link