Generative Model for text: An overview of recent advancements

The rapid development of GANs has led to many impressive applications on continuous data such as images. GANs can generate high-quality images of people or objects and translate pictures between domains. Recently, GANs have even started to serve as a tool for artists to create their work.

Top: Celebrities and objects generated in [1]. Bottom-left: Multimodal dog-to-cat image translation from MUNIT [2]. Bottom-right: Latent space of a GAN trained on different paintings by Helena Sarin [3].

With such breakthroughs in computer vision applications, it is natural to ask:

Can we apply this powerful framework to other fields, for example, text?

Unfortunately, GANs are not a natural fit for discrete data such as text, and their use for text remains far from fully explored. One reason is that the discrete nature of text makes the generator difficult to train.

According to Goodfellow, the generator learns how to slightly change the synthetic data through the gradient information from the discriminator. Such a slight change is only meaningful for continuous data. For example, it is reasonable to change a pixel value from 1 to 1 + 0.001, but it is not meaningful to change “penguin” to “penguin” + 0.001. As a result, we have to investigate new methods.

The purpose of this article is to summarize (hopefully not at too great a length) some text generation models developed in recent years, for people who are excited by the results of GANs and are starting to consider using them to explore the world of text. To make the exploration less scary, I will start with GAN, our familiar friend from computer vision, then gradually shift to recent approaches designed specifically for text generation.

I hope that by the end of this journey you will have a picture of the challenges we face and the methods we have, which should save you some time otherwise spent buried in papers.

Model for text generation

Improved training of GAN

Perhaps the most basic task for studying the ability to generate meaningful text is to generate simple character sequences that follow a certain pattern. Kusner et al. [4] conduct an experiment in which an RNN is fitted to character patterns (e.g., x+x+x+x x). Instead of using the conventional multinomial distribution in the RNN output layer, they propose to relax it with the Gumbel-softmax, a continuous approximation, at each step $t$.

\begin{align} \label{eq:1} \boldsymbol{y}_{t} = \textrm{softmax}(\frac{ \boldsymbol{h}_t + \boldsymbol{g}_t}{\tau}) \end{align} where $\boldsymbol{h}_t$ is the hidden state of the RNN generator at step $t$ and $\boldsymbol{g}_t$ is a sample from the Gumbel distribution. $\tau$ is a temperature parameter that controls how close the continuous approximation is to the discrete distribution. When $\tau \rightarrow 0$, the output approaches a discrete (one-hot) sample. When $\tau \rightarrow \infty$, it approaches a uniform distribution. This is called the Gumbel-softmax trick. There are some great tutorials by Eric Jang and Gonzalo Mena if you want to learn more in depth.
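To make the effect of $\tau$ concrete, here is a minimal NumPy sketch of the relaxation in the equation above. The function name `gumbel_softmax`, the three-token scores in `h`, and the two temperatures are made up for illustration; this is not the code used in [4].

```python
import numpy as np

def gumbel_softmax(h, tau, rng):
    """Relaxed one-hot sample: softmax((h + g) / tau) with g ~ Gumbel(0, 1)."""
    u = rng.uniform(low=1e-10, high=1.0, size=h.shape)
    g = -np.log(-np.log(u))                  # Gumbel(0, 1) noise via the inverse CDF
    z = (h + g) / tau
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h = np.array([2.0, 0.5, -1.0])               # hypothetical scores for 3 tokens
y_sharp = gumbel_softmax(h, tau=0.1, rng=rng)    # low temperature: typically near one-hot
y_flat = gumbel_softmax(h, tau=100.0, rng=rng)   # high temperature: near uniform
print(y_sharp, y_flat)
```

Because the output is a differentiable function of `h`, the gradient from the discriminator can flow back into the generator, which is exactly what a hard sample from the multinomial distribution would block.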

Fig. 2 shows the generated tokens. Using Gumbel-softmax seems to help GAN training in discrete space. The samples generated by MLE are treated as real samples (left), and those generated by the RNN with the Gumbel-softmax trick (right) are the generated samples, which share a similar pattern to a certain degree, especially in the 4th, 10th, and 17th rows. Combining GAN with the Gumbel-softmax trick is the earliest attempt to apply GAN directly to discrete token generation.

Left: Samples generated by MLE. Right: Samples generated by RNN + Gumbel-softmax.
Gulrajani et al. later propose applying a gradient penalty [5] to improve the training of Wasserstein GAN [6a, 6b]. The proposed improvement (WGAN-GP) not only enhances the quality of image generation but also opens the door to generating more complex character sequences. The authors directly train a character-level language model with a 1D-CNN on the Billion Word dataset [7] to transform a latent vector into a sequence of 32 one-hot characters, as shown in Fig. 3. During training, the softmax output vector is fed directly to the discriminator. At inference, the argmax operation is applied to each output vector.
Training and inference of the 1D-CNN. D is an arbitrarily chosen dimension and C is the number of characters.
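The difference between what the discriminator sees during training and what we read out at inference can be sketched in a few lines of NumPy. The random `logits` below are only a stand-in for the 1D-CNN generator output, and the 5-character vocabulary is a toy assumption.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

seq_len, n_chars = 32, 5      # 32 positions as in the text; 5 characters is a toy vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(seq_len, n_chars))   # stand-in for the generator output

soft_out = softmax(logits)                # training: passed to the discriminator as-is
char_ids = soft_out.argmax(axis=-1)       # inference: most likely character per position
one_hot = np.eye(n_chars)[char_ids]       # discrete one-hot sequence
```

The soft vectors keep the whole pipeline differentiable during training, while the argmax is only needed when we want actual characters.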
Fig. 4 compares samples generated by WGAN-GP (top) and the original GAN (bottom). WGAN-GP generates more complex character patterns and even common words such as “In”, “the” and “was”, while the original GAN fails to generate any meaningful pattern. Although the generated text is still not readable, WGAN-GP is the first text generation model trained purely adversarially without resorting to MLE pre-training.
Samples generated by WGAN-GP (top) and standard GAN (bottom).
Several recently proposed approaches have further improved GAN training in discrete space. Hjelm et al. propose boundary-seeking GAN (BGAN) [8], which uses a discriminator optimized for estimating an f-divergence. They use policy gradients and importance weights derived from the discriminator to guide the generator. Concurrently, Mroueh et al. propose Sobolev GAN [9], which trains GAN with a new objective based on a Sobolev norm. Their results are still not plausible sentences.

The methods above are trained from scratch, which is a more challenging setting for applying GAN to text generation. What if we relax this setting and allow the generator to be pre-trained on some corpus?

Zhang et al. propose textGAN [10] under this setting. They first initialize the weights of the LSTM generator from a pre-trained CNN-LSTM autoencoder (i.e., the weights of the LSTM part are used to initialize the generator). Inspired by the feature matching technique used to improve GAN training early in its development [11], they train the generator to synthesize data that matches the empirical distribution of the real data in feature space by minimizing the maximum mean discrepancy (MMD), which can be understood as matching the moments of the two distributions. If a Gaussian kernel is used (which is the case in the paper), minimizing MMD has been shown to match moments of all orders of the two distributions [12].

TextGAN architecture: The pre-trained generator G aims to generate samples that match real ones under MMD. A reconstruction loss $||z - \tilde{z}||$ is also minimized during training.
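For intuition about the MMD objective, here is a small NumPy sketch of a (biased) squared-MMD estimate with a Gaussian kernel. The feature vectors are random toy data, and the bandwidth `sigma` and the function names are my own choices rather than anything taken from textGAN.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Pairwise Gaussian kernel values between rows of a and rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 4))        # toy "real" features
fake_close = rng.normal(0.0, 1.0, size=(200, 4))  # same distribution as real
fake_far = rng.normal(3.0, 1.0, size=(200, 4))    # shifted distribution
print(mmd2(real, fake_close), mmd2(real, fake_far))
```

The estimate is close to zero when the two sample sets come from the same distribution and grows as they move apart, which is what makes it usable as a training signal for the generator.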

Fig. 6 shows the results of textGAN. The generator can output some meaningful sentences. Also, the trajectory through latent space from one sentence to another is smoother and more reasonable than with a vanilla autoencoder, which suggests textGAN learns a better representation of text.

Left: Samples generated by textGAN. Right: Comparison of latent space transitions between textGAN and an autoencoder.

A standard GAN needs to differentiate through the generated data, so applying it to discrete data is difficult. Is there a generative model that naturally fits discrete input data?

Variational Autoencoder

Unlike GAN, the Variational Autoencoder (VAE) [13] can work with both continuous and discrete input data directly. For readers who are not familiar with VAE, I recommend the tutorials (here and here) written by Jaan Altosaar and Blei et al. (if you want to go deeper). Here I only give a brief introduction.

Given data $\boldsymbol{x}$, latent variable $\boldsymbol{z}$, encoder parameters $\boldsymbol{\theta}$ and decoder parameters $\boldsymbol{\phi}$, the goal of VAE is to approximate the posterior $p(\boldsymbol{z}|\boldsymbol{x})$ by a family of distributions $q_\lambda(\boldsymbol{z}|\boldsymbol{x})$, where $\lambda$ indexes the members of the family. For a Gaussian family, for example, $\boldsymbol{\lambda} = (\boldsymbol{\mu}, \boldsymbol{\sigma})$. \begin{align} \textrm{argmin}_{\lambda}\textrm{KL}(q_{\lambda}(z|x)||p(z|x)) \end{align} Minimizing this KL divergence is equivalent to maximizing the evidence lower bound (ELBO). For a single data point $x_i$, the ELBO is \begin{align} \textrm{ELBO}_i = \mathbb{E}_{q_{\theta}(z|x_i)}(\textrm{log}p_{\phi}(x_i|z)) - \textrm{KL}(q_{\theta}(z|x_i)||p(z)) \end{align} The first term measures how well the model can reconstruct the data given the learned representation. The second term acts as a regularizer that keeps the learned posterior close to the prior. Maximizing the ELBO encourages the model to learn a useful latent representation that explains the data well. Note that we write $q_{\theta}$ in place of $q_{\lambda}$ since the encoder is used to infer $\lambda$.
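As a small worked example of the two ELBO terms, the sketch below assumes a diagonal Gaussian posterior and a standard normal prior $p(z) = \mathcal{N}(0, I)$ (the prior is my assumption here, it is not fixed by the text above), for which the KL term has a closed form. The value of `log_px` is a made-up stand-in for the decoder log-likelihood.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))

def elbo_single(log_px_given_z, mu, sigma):
    """One-sample ELBO estimate: reconstruction term minus KL term."""
    return log_px_given_z - kl_to_standard_normal(mu, sigma)

rng = np.random.default_rng(0)
mu, sigma = np.array([0.5, -0.3]), np.array([0.8, 1.2])   # hypothetical encoder outputs
z = mu + sigma * rng.normal(size=mu.shape)   # reparameterized sample z ~ q(z|x)
log_px = -12.0                               # hypothetical value of log p(x|z)
print(elbo_single(log_px, mu, sigma))
```

The KL term is exactly zero when `mu` is 0 and `sigma` is 1, that is, when the posterior equals the prior and therefore carries no information about the input.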

Recent VAE-based generative models for text build on the model proposed by Bowman et al. [14]. They propose an RNN-based variational autoencoder which captures global features of sentences (e.g., topic, style) in continuous variables. The RNN encoder infers the $\mu$ and $\sigma$ of a Gaussian, and a sample $z$ is drawn from the posterior for the decoder to generate the sentence. The architecture is shown below, and many subsequent works follow a similar architecture with some modifications.

Main architecture proposed by Bowman et al. The code $\boldsymbol{z}$ is only concatenated to the first hidden state of the decoder

The core issue in using VAE for text generation is KL collapse (also called latent variable collapse in other literature), which means the KL term in the ELBO becomes zero during optimization. This happens because the posterior $q_{\theta}(z|x)$ is relatively weak during the early steps of optimization, so it tends to stay close to $p(z)$ to maximize the ELBO. Looking at the KL term more closely, a zero KL term means that the posterior $q_{\theta}(z|x_i)$ equals the prior $p(z)$, so the posterior is independent of the input data! Such collapse prevents us from learning a useful latent representation of the input data, which is exactly what we aim to achieve. This is a common case in language modelling, as discussed in [14]. Once it happens, the RNN decoder completely ignores the latent representation and simply relies on its own capacity to reach the optimum. The difficulty is how to maximize the ELBO while preventing the KL term from going to zero.

Bowman et al. propose KL annealing and word dropout to alleviate this issue: the former gradually increases the weight of the KL term during training, and the latter randomly replaces word tokens with <UNK> to force the decoder to rely on the global representation $z$ instead of the learned language model. However, these techniques do not fully solve the issue. A VAE trained in this way is slightly worse than a language model (RNNLM) in NLL, although it can generate more plausible sentences and smoother transitions than an autoencoder when moving through the latent space. Therefore, many subsequent works focus on finding better techniques to address KL collapse.
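Both tricks are simple enough to sketch. The snippet below is only an illustration: the linear schedule, the `warmup_steps` parameter and the function names are my own assumptions and need not match the exact schedule used in [14].

```python
import numpy as np

def kl_weight(step, warmup_steps):
    """Linear KL annealing: the weight grows from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def word_dropout(tokens, keep_prob, rng, unk="<UNK>"):
    """Replace each decoder input token by unk with probability 1 - keep_prob."""
    keep = rng.random(len(tokens)) < keep_prob
    return [t if k else unk for t, k in zip(tokens, keep)]

rng = np.random.default_rng(0)
tokens = "the dog runs on the grass".split()
dropped = word_dropout(tokens, keep_prob=0.5, rng=rng)
print(dropped)
# Annealed objective at a given step:
#   reconstruction_term - kl_weight(step, warmup_steps) * kl_term
```

Early in training the KL term is almost free, so the encoder can put information into $z$; word dropout then weakens the decoder's access to the previous ground-truth words so that it has a reason to use that information.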

Yang et al. hypothesize, and validate experimentally, that the contextual capacity of the decoder is related to KL collapse [15]. They replace the original RNN decoder with a dilated convolutional neural network, as shown in Fig. 8, which makes it easy to control the contextual capacity by changing the dilation.

Architecture proposed by Yang et al. The encoder-decoder on the left is similar to Fig. 7, and the right shows the details of the dilated CNN decoder. Note that the code $\boldsymbol{z}$ is concatenated to each word embedding.

The dilated CNN [16] used in the decoder is a 1D dilated CNN, which enlarges the receptive field without increasing the computational cost. The contextual capacity is controlled through the depth and dilation sizes of the CNN. The authors study four decoder configurations with the following dilations per layer: [1, 2, 4], [1, 2, 4, 8, 16], [1, 2, 4, 8, 16, 1, 2, 4, 8, 16] and [1, 2, 4, 8, 16, 1, 2, 4, 8, 16, 1, 2, 4, 8, 16], which are denoted SCNN, MCNN, LCNN and VLCNN. They also compare with the vanilla VAE proposed in [14] (denoted LSTM). For each configuration, they compare three training methods: language model (LM), VAE (the architecture proposed in this paper) and VAE + init, which initializes the encoder’s weights from a language model. The results are shown in Fig. 9.

NLL and KL term of all methods on the Yahoo dataset. Each group has three bars: LM, VAE, and VAE+init. The red bar is the proportion of the KL term in the reconstruction loss.
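To get a feeling for how much context each of the four decoder configurations listed above can see, we can compute the receptive field of a stack of stride-1 dilated convolutions, which is $1 + (k - 1)\sum_i d_i$ for kernel size $k$ and dilations $d_i$. The kernel size of 3 below is an assumption for illustration.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in tokens) of stacked stride-1 1D dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

configs = {
    "SCNN": [1, 2, 4],
    "MCNN": [1, 2, 4, 8, 16],
    "LCNN": [1, 2, 4, 8, 16] * 2,
    "VLCNN": [1, 2, 4, 8, 16] * 3,
}
fields = {name: receptive_field(d) for name, d in configs.items()}
print(fields)
```

With a kernel size of 3, the four stacks cover 15, 63, 125 and 187 tokens respectively, so the dilation pattern gives a direct handle on how much of the preceding text the decoder can exploit without looking at $z$.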

The vanilla VAE (LSTM) has worse NLL as its KL term goes to zero, which agrees with the negative results found in [14]. For the remaining CNN-based models with different configurations (i.e., dilation, depth), the improvement over the pure LM is largest for the small model (SCNN) and gradually diminishes as the model becomes larger (VLCNN). This finding suggests that using a dilated CNN as the decoder can alleviate KL collapse if we choose the decoder’s contextual capacity carefully. Using a pretrained language model also helps learning.

The latent space learned by the proposed VAE is visualized in Fig. 10 by setting the dimension of $z$ to 2 on the Yahoo and Yelp review datasets. It’s interesting to see that different topics are mapped to different regions and that the sentiments of reviews (i.e., ratings from 1 to 5) are spread horizontally. The authors also extend the proposed method to conditional text generation. Some examples are presented in Fig. 11.

Regions of different topic and sentiments mapped in the latent space.
Text samples generated by conditioning on the topic (top) and sentiment labels (bottom).

Concurrently, Semeniuta et al. propose a convolutional-deconvolutional VAE with a recurrent model on top of the output of the deconvolutional layers (Hybrid VAE) [17] for text generation, shown in Fig. 12.

Left: The convolutional-deconvolutional encoder-decoder part of the proposed Hybrid VAE. Right: Two variants of the recurrent model appended on top of the output of the deconvolutional layers, which are a traditional LSTM and ByteNet [18].

They further introduce an auxiliary loss $J_{aux} = -\alpha\mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}\textrm{log}p_{\phi}(\boldsymbol{x}|\boldsymbol{z})$ into the optimization of the ELBO to force the decoding process to rely on the latent representation $\boldsymbol{z}$. $\alpha$ is a parameter that controls the weight of the auxiliary loss.

In their experiments, they compare the decoding performance of the proposed method with the vanilla VAE under historyless and history settings. In other words, they test the decoding performance of the models with the word dropout rate $p$ varied from 1.0 down to a certain ratio during training (i.e., randomly replacing the ground-truth token with an <UNK> token). They also study the benefit of the auxiliary loss for decoding performance as the expressive power of the decoder becomes stronger (i.e., more layers). The results are shown in Fig. 13.

Top: Performance of Hybrid VAE and LSTM VAE on historyless decoding. Bottom-left: The benefit of the auxiliary loss for decoders of various depths. Bottom-right: Performance comparison of Hybrid VAE and LSTM VAE with different word dropout rates.

We can see that the Hybrid VAE converges far faster and to a better result than the LSTM VAE [14] regardless of the text length (top). The proposed auxiliary loss keeps the KL term away from zero, which allows us to train deeper (more powerful) decoders without latent variable collapse (bottom left). The last figure (bottom right) shows that the Hybrid VAE performs similarly to the LSTM VAE in terms of bits-per-character. However, the KL term of the Hybrid VAE is non-zero while it is almost zero for the LSTM VAE, which suggests that the LSTM VAE suffers from more severe KL collapse than the Hybrid VAE.

Developing new methods to address KL collapse is still a very active research area. At ICML 2018, Kim et al. [19] propose semi-amortized inference, which initializes the variational parameters by amortized inference and then applies stochastic variational inference to refine them. Dieng et al. [20] introduce skip connections between the latent variables $\boldsymbol{z}$ and the decoder to enforce the relationship between the latent variables and the reconstruction loss. Both methods are shown experimentally to outperform previous methods. One of the most recent and exciting advances is using the von Mises-Fisher distribution as the prior instead of a Gaussian, which was first proposed for text generation in [21] and carefully studied by Xu et al. [22]. A VAE built this way is termed a hyperspherical VAE, and its KL term can be controlled by the hyper-parameter $\kappa$, thus averting KL collapse entirely. What’s more exciting is that $\kappa$ is not sensitive to the task, as reported in [22], which suggests broader applicability.

While GAN dominates the world of continuous data, the world of discrete data is largely dominated by VAE. Here is an interesting analogy from a historical perspective: if we look back at the development of GAN, people searched for architectural designs to stabilize GAN training [23], and the same happened in the development of VAE for text. Only after a theoretical flaw was pointed out and mitigated in Wasserstein GAN [6a, 6b] did GAN take a huge leap in generating high-quality samples. Although it might be too strong to claim that the technique used in hyperspherical VAE is analogous to that in Wasserstein GAN, we now have at least one theoretically motivated approach to address KL collapse.

Adversarial Regularized Autoencoder

We have seen two main categories of generative models in text, VAE and GAN. Before diving into another main line of research, I would like to deviate a little bit and introduce an interesting work for a break. You might be familiar with the figure below:

Vector arithmetic example of images in [23]
This is vector arithmetic in the latent space of DCGAN [23], which can generate an image with desired attributes using offset vectors. In the above example, we generate an image by performing the following vector arithmetic:

smiling woman - normal woman + normal man = smiling man

Is there a recent text generation model that can achieve a similar effect? Zhao et al. propose Adversarially Regularized Autoencoders (ARAE) [24], which support such vector arithmetic in the latent space of text. Similar to the above example, the authors change an “attribute” of a sentence (i.e., subject, verb, or modifier) and generate a sentence with the desired attribute by vector arithmetic. They first generate 1M sentences with ARAE and parse them to obtain the subject, verb, and modifiers. To substitute a verb, say sleeping, they subtract the mean latent vector of all sentences containing sleeping from the latent vector of the original sentence, then add the mean latent vector of all sentences containing running. The results are shown in Fig. 15.

Vector arithmetic examples of ARAE in text. The right column lists the attributes to change, and the top and bottom sub-rows on the left are the generated text before and after vector arithmetic.
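The verb substitution described above boils down to one line of vector arithmetic. Here is a toy NumPy sketch; the latent codes are random stand-ins, and `swap_attribute` is a hypothetical helper name, not something from the ARAE code.

```python
import numpy as np

def swap_attribute(z, codes_with_old, codes_with_new):
    """Shift z by the difference between the mean codes of the two attributes."""
    return z - codes_with_old.mean(axis=0) + codes_with_new.mean(axis=0)

rng = np.random.default_rng(0)
dim = 8
z_sleeping = rng.normal(size=(100, dim)) + 1.0   # toy codes of sentences containing "sleeping"
z_running = rng.normal(size=(100, dim)) - 1.0    # toy codes of sentences containing "running"

z = z_sleeping[0]                                 # code of the sentence we want to edit
z_new = swap_attribute(z, z_sleeping, z_running)  # would then be passed to the decoder
```

In the real model the edited code `z_new` is decoded back into a sentence, and the hope is that only the targeted attribute changes.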

Although the text generated after vector arithmetic is not always plausible, the same is true of this task on image data, as shown in Fig. 14. I find this result very interesting because it reminds me of the early development of GAN for images, and I expect future improvements to generate more fluent sentences.

The high-level idea of ARAE is illustrated in Fig. 16. Given an encoder, a generator, a decoder, and a critic parameterized by $\phi$, $\theta$, $\psi$, and $w$ respectively, the goal is to learn a model which maps real discrete data $x \sim \mathbb{P}_{\star}$ to a latent code $z$ that is indistinguishable from a prior sample $\tilde{z}$ generated from noise $s$. This is achieved by minimizing the reconstruction loss of the autoencoder regularized with a prior $\mathbb{P}_{z}$, which is

\begin{align} \textrm{min}_{\phi, \psi}~\mathcal{L}_{rec} + \lambda W(\mathbb{P}_{Q}, \mathbb{P}_{z})~, \end{align} where $\mathbb{P}_{Q}$ is the distribution of the encoded latent codes and $W$ is the Wasserstein distance between two distributions, computed by a critic function $f_w$ that is trained adversarially against the encoder $\phi$ and the generator $\theta$. $\lambda$ is a hyperparameter that controls the strength of the regularization. For simplicity, $w$ and $\theta$ are not shown above, but they are trained during the optimization of the critic and the encoder.

The architecture of ARAE.

Reinforcement Learning

Before we start the discussion about this section, let’s play a cloze game. Consider filling the blanks in the sentence below:

A dog __ on the grass, chasing the __.

Which words come to your mind?

Scanning from left to right, the words that come to mind for the first blank could be run, sit or lie, because we know a dog can take these actions, while fly barely comes to mind since a dog cannot fly. For the next blank, frisbee, balls or insects are possible, since the information we have so far is “dog on the grass” and “chasing”. Fish is implausible, as the information we have seen tells us that whatever fills the blank should be on the ground.

In the cloze above, we “decide” which word fills the blank depending on the “previous words” we see. Such a sequential procedure is very similar to reinforcement learning, where the “agent” (generator) takes an “action” (word) based on the “previous state information” (previous words, or context). From now on, I will use generator to refer to the agent, word to refer to the action, and context to refer to the previous state.

In fact, this is the key concept of modeling text generation as a reinforcement learning problem. This concept was first studied in [25,26], which demonstrate that sequence generation can be formulated as the sequential decision-making process above. A key question here is:

How do we provide the reward to guide the generator?

Most methods developed in this line of research center around this question. A naive approach is to use a task-specific score as the reward [26]. However, such a task-specific score is sometimes hard to define. Many recent works instead train the text generation model in a GAN-like setup, in which a generator competes with a discriminator; I introduce them below.

Yu et al. propose SeqGAN [27], which uses the prediction score (real/fake) from the discriminator as the reward to guide the generator. In SeqGAN, the goal of the generator $G_\theta$ is to generate a sequence $\tilde{s} \sim p_{G}$ that is predicted as real (i.e., receives a higher reward) by the discriminator. The discriminator $D_\phi$ tries to distinguish the real sequences $s \sim p_{data}$ from the generated ones. Formally,

\begin{align} G_{\theta}: \textrm{ argmax}_{\theta}R(\tilde{s}) = \textrm{argmax}_{\theta}\sum_{t=0}^{n-1}r_{t}\textrm{log}p_{G_{\theta}}(w_{t} | w_{0:t-1}) \end{align} \begin{align} D_{\phi}: \textrm{ argmin}_{\phi}-\mathbb{E}_{s \sim p_{data}}\textrm{log}D_{\phi}(s) - \mathbb{E}_{\tilde{s} \sim p_{G}}\textrm{log}(1 - D_{\phi}(\tilde{s})) \end{align}

A question that comes up here is how to obtain the intermediate reward before a sentence is completed. It is like the game of Go, where it is hard to know whether the next move will help us win or lose. To estimate this reward, SeqGAN applies Monte-Carlo search and rolls out the current policy, as shown in Fig. 17. The generator uses the currently learned policy network to roll out several times to the end of the sentence, and the discriminator’s scores on the completed sentences are averaged to give the estimated reward.

Left: Training discriminator by feeding real and fake data. Right: Evaluation of reward at time step $t$.
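The roll-out idea can be sketched with toy components. Everything below is a stand-in: `toy_policy` samples tokens uniformly and `toy_discriminator` just returns the fraction of tokens equal to 0, whereas SeqGAN uses the learned generator and a learned discriminator in their place.

```python
import numpy as np

def rollout_reward(prefix, sample_next, discriminator, seq_len, n_rollouts, rng):
    """Estimate the reward of a partial sequence by averaging D over completions."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:
            seq.append(sample_next(seq, rng))   # complete the sequence with the current policy
        total += discriminator(seq)             # D only scores finished sequences
    return total / n_rollouts

def toy_policy(seq, rng):
    return int(rng.integers(0, 3))              # uniform over a 3-token vocabulary

def toy_discriminator(seq):
    return sum(1 for w in seq if w == 0) / len(seq)

rng = np.random.default_rng(0)
r_good = rollout_reward([0, 0, 0], toy_policy, toy_discriminator, 6, 200, rng)
r_bad = rollout_reward([1, 2, 1], toy_policy, toy_discriminator, 6, 200, rng)
print(r_good, r_bad)
```

Even though neither prefix is a finished sequence, the prefix that this toy discriminator favours ends up with the higher estimated reward, which is the signal the generator needs at intermediate steps.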

A key difference in training strategy between SeqGAN and standard GAN is that SeqGAN requires pre-training the generator on the target corpus before adversarial training. Similar to GAN, the performance of the generator is highly sensitive to the training strategies of the generator and discriminator. We might get worse performance than MLE if we fail to orchestrate their training well. Fig. 18 illustrates different training strategies (i.e., how frequently the generator and discriminator update their parameters) on the synthetic dataset. It shows the negative log-likelihood of SeqGAN when it is first pre-trained on the target corpus and then trained adversarially for the rest of the training process.

Performance of different training strategies. The vertical line marks the beginning of adversarial training. $k$ is the number of roll-outs.

Several works based on a modified reward were proposed after SeqGAN. Che et al. propose MaliGAN [28], which extends [29] to design a renormalized MLE objective. They prove that this new objective can provide a better training signal even if the discriminator is less than optimal (i.e., $D$ ranges from 0.5 to $p_{data}/(p_{data} + p_{G})$). In short, MaliGAN uses $r = D(\tilde{s}) / (1 - D(\tilde{s}))$ as the reward instead of the binary score.
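A tiny NumPy sketch shows how this reward reshapes the discriminator output. The clipping constant `eps` and the batch normalization of the weights at the end are my own assumptions for illustration.

```python
import numpy as np

def mali_reward(d_scores, eps=1e-8):
    """Reward r = D / (1 - D) for discriminator outputs D in (0, 1)."""
    d = np.clip(np.asarray(d_scores, dtype=float), eps, 1.0 - eps)
    return d / (1.0 - d)

d_scores = np.array([0.1, 0.5, 0.9])     # hypothetical discriminator outputs
rewards = mali_reward(d_scores)
weights = rewards / rewards.sum()        # normalized over the batch (an assumption)
print(rewards, weights)
```

A sample the discriminator is unsure about (D = 0.5) gets reward 1, while confident "real" predictions are rewarded much more strongly than with the raw score.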

Lin et al. propose another modification of the reward function by training an adversarial ranker instead of a discriminator (RankGAN) [30]. Similar to the SeqGAN framework, the goal of the generator is to produce sentences that the ranker ranks higher than real ones, while the ranker tries to learn to rank the generated sentences lower than the real ones.

Given an input sentence $s_i$, a reference sentence $s_u$ from a reference set $U$ of real sentences, and a comparison set $C$, the ranking score $R_{\phi}(s_i|U, C)$ is calculated by:

\begin{align} \alpha(s_i|s_u) = \frac{s_i \cdot s_u}{\|s_{i}\|\|s_{u}\|}\textrm{, }P(s_i|s_u, C) = \frac{\textrm{exp}(\gamma\alpha(s_i|s_u))}{\sum_{s^{\prime} \in C^{\prime}} \textrm{exp}(\gamma\alpha(s^{\prime}|s_u))} \end{align} \begin{align} R_{\phi}(s_i|U, C) = \mathbb{E}_{s_u \in U}P(s_i|s_u,C) \end{align} Here $C^{\prime} = C \cup \{s_i\}$.

The final score of the input sentence $s_i$ is the expectation over the whole reference set, which is constructed by randomly sampling real sentences during learning. This simple modification of the loss improves on SeqGAN in various tasks, as shown in Fig. 19. Performance is measured by the BLEU score [31], which measures the n-gram overlap between the generated sample and the ground truth. RankGAN performs better on synthetic data and achieves higher BLEU scores on Chinese poem generation [32] and COCO caption [33] generation. Samples generated by RankGAN are also preferred by human evaluators over those of SeqGAN.

Left: Performance comparison of RankGAN and other baselines on synthetic data. The green vertical line marks the start of adversarial training. Top-right: Performance comparison on Chinese poem generation. Bottom-right: Performance comparison on COCO caption generation.
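The ranking score defined earlier (cosine similarity, then a softmax over the comparison set, then an average over references) can be sketched as follows. The 2-dimensional vectors are toy stand-ins for the sentence features that the ranker would compute, and `gamma=1.0` is an arbitrary choice.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_score(s, references, comparison, gamma=1.0):
    """Average over references u of the softmax probability assigned to s."""
    candidates = [s] + list(comparison)     # the sentence being scored plus the comparison set
    scores = []
    for u in references:
        logits = np.array([gamma * cosine(c, u) for c in candidates])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        scores.append(probs[0])             # probability mass on s itself
    return float(np.mean(scores))

u = np.array([1.0, 0.0])                    # toy feature of a real reference sentence
comparison = [np.array([0.0, 1.0])]         # toy comparison set
good = rank_score(np.array([1.0, 0.1]), [u], comparison)
bad = rank_score(np.array([-1.0, 0.2]), [u], comparison)
print(good, bad)
```

A sentence whose features point in the same direction as the reference receives a higher score than one pointing away from it, so the score is relative to the other sentences in the comparison set rather than an absolute real/fake judgment.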

Although samples generated by RankGAN have higher BLEU scores and are preferred by humans, two challenges remain. First, a scalar score might not be informative enough to guide the generator, as it cannot represent the intermediate structure of the text during generation. Second, the estimated intermediate reward tends to be noisy and sparse, especially in long text generation, where the generator receives a reward only when the entire sentence is finished.

To mitigate these issues, Guo et al. propose LeakGAN [35], which combines feature matching and hierarchical reinforcement learning [34]. A bird’s-eye view of LeakGAN is illustrated in Fig. 20:

Overview of LeakGAN. The discriminator and the generator are in the upper and lower dotted boxes. The generator contains a Manager and a Worker module. The Manager forwards information computed from the discriminator to help the Worker take actions.

Given a generator $G_{\theta}$ (lower dotted box) and a discriminator $D_{\phi}$ (upper dotted box), the generator is broken into a Manager ($\mathcal{M}$) and a Worker ($\mathcal{W}$) module. To generate the next word $x_{t+1}$, the current sentence $S_{t}$ is first fed into the feature extractor $\mathcal{F}$ inside $D_{\phi}$. The extracted feature is then leaked to the Manager in the generator, which computes a goal vector that is projected into the action space as $w_{t}$ by a linear projection $\psi$. The Worker computes the probability distribution $O_{t}$ over the next action given the previous word $x_{t}$, then adjusts it by an element-wise product with $w_{t}$ to obtain the final distribution.

In plain language, the Worker is like a miscreant who forges money, and the discriminator is like the police who detect it. The Manager acts as a spy who leaks the information the police used to identify fake money during period $t$ to the miscreant whenever the miscreant attempts to forge money in the next period.

We can see an improvement on the synthetic dataset and increased BLEU scores on text generation with text lengths ranging from short to long, as shown in Fig. 21.

Left: Performance of LeakGAN compared with other baselines on the synthetic dataset. Right: From top to bottom: BLEU score of LeakGAN compared with different baselines on Chinese Poem (short), COCO-caption (medium), EMNLP2017 WMT (long) dataset.

One interesting result in LeakGAN is that we can explore whether the generator exploits the information leaked from the discriminator. This can be visualized by projecting the extracted features of real data and of the text generated during the period from $t = 0$ to $t = T$, where $T$ denotes the time step at which the sentence is completed. The authors call this process the feature trace. Fig. 22 shows the result. The features extracted from the generated text become closer and closer to those of the real data as generation proceeds, while other baselines fail to match them.

Comparison of feature trace of LeakGAN to other baselines. Text generated by LeakGAN is close to real data when the generation process is completed.

All the methods above attempt to leverage more information from the discriminator to generate better-quality text. However, another critical issue, mode collapse, is ignored in the previous literature. In a recent survey [36], Lu et al. observe that all the methods above suffer from serious mode collapse as measured by the self-BLEU score, which is calculated by taking each generated sample in turn as the hypothesis, computing its BLEU score against the other generated samples, and averaging the results. A higher self-BLEU score means the generated samples are more similar to each other, i.e., less diverse.
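To illustrate the idea behind self-BLEU, here is a simplified stand-in written with the standard library only. It uses clipped n-gram precision for a single n, without the brevity penalty or the geometric mean over several n that full BLEU has, so its numbers are not comparable to the self-BLEU scores reported in [36].

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, refs, n):
    """Modified n-gram precision of hyp against several references."""
    hyp_counts = ngrams(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return matched / sum(hyp_counts.values())

def self_overlap(samples, n=2):
    """Average precision of each sample against all the other samples."""
    scores = [clipped_precision(s, samples[:i] + samples[i + 1:], n)
              for i, s in enumerate(samples)]
    return sum(scores) / len(scores)

collapsed = [["a", "dog", "runs"]] * 3
diverse = [["a", "dog", "runs"], ["the", "cat", "sleeps"], ["birds", "fly", "south"]]
print(self_overlap(collapsed), self_overlap(diverse))
```

A generator that always emits the same sentence gets the maximum score, while a set of samples that share no bigrams gets zero, which is why a high value signals mode collapse.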

Fig. 23 shows the BLEU and self-BLEU scores of the above methods on EMNLP2017 WMT. All the methods discussed above suffer from serious mode collapse except MaskGAN [37]. However, MaskGAN achieves the lowest BLEU score on the same dataset. This raises an important question:

How do we properly evaluate the quality of a text generation model?

It seems that a higher automatic score (i.e., BLEU in our discussion) is not necessarily a good measure of quality. This will be discussed further in the evaluation part. Before diving into that, I would like to introduce MaskGAN to close this section.

Left: BLEU score of all baselines on EMNLP2017 WMT. Right: Self-BLEU score of all baselines on EMNLP2017 WMT.

The idea of MaskGAN is to learn to generate text by filling in blanks. Instead of generating text unconditionally like the previous methods, MaskGAN uses a Seq2Seq architecture that encodes the masked text and generates text to fill in the blanks. The discriminator uses the same architecture. Fig. 24 shows the high-level picture of MaskGAN.

MaskGAN overview: The encoder encodes the masked text input, and the generator tries to learn to fill in the blank by reward from the discriminator. The correct answer of the example above is a,b,c,d,e.

The goal of the generator is to predict each missing token given the masked sentence and what it has filled in so far. Unlike prior works, the discriminator here takes the generated text and the masked sentence as input and outputs a scalar at every time step to judge whether each token is real or fake. An advantage of this design is that the generator is punished mainly for the erroneous token that turns the whole sentence from real to fake; the tokens generated before it are not punished. The masked input is given to the discriminator to help it deal with the following case illustrated in the paper: if the generator outputs “the director director guided the series”, it is ambiguous which “director” is fake, because “the director expertly guided the series” and “the associate director guided the series” are both valid sentences. The masked input tells the discriminator the context of the real sentence.

Formally, given an input sequence $\boldsymbol{x} = (x_1, \cdots, x_T)$ and a binary mask of the same length $\boldsymbol{m} = (m_1, \cdots, m_T)$, the generator $G_{\theta}$ and the discriminator $D_{\phi}$ are \begin{align} G_{\theta}(x_{t}) = P(\hat{x}_{t} | \hat{x}_1, \cdots, \hat{x}_{t-1}, \boldsymbol{m}(\boldsymbol{x})) \end{align} \begin{align} D_{\phi}(\tilde{x}_{t}|\tilde{x}_{0:T}) = P(\tilde{x}_{t} = x^{real}_{t} | \tilde{x}_{0:T}, \boldsymbol{m}(\boldsymbol{x}))\textrm{ .} \end{align} The reward is defined as the logarithm of the discriminator’s prediction, \begin{align} r_{t} = \textrm{log}D_{\phi}(\tilde{x}_{t}|\tilde{x}_{0:T})\textrm{ .} \end{align} The training strategy is similar to that of previous methods. They first pre-train MaskGAN by MLE, then apply REINFORCE with a baseline to train the generator. The discriminator is trained in the same way as in a conventional GAN.

Fig. 25 shows the human study results of MaskGAN on the IMDB review and PTB datasets. For each row, workers are asked which of the two models they prefer, or whether they prefer neither, with respect to Grammaticality, Topicality, and Overall quality. The authors sample 100 sentences from each model and ask 3 workers to give their preference. We can see that the human raters prefer MaskGAN on all metrics.

Human preference scores from paired comparisons on the IMDB review dataset (left) and the PTB dataset (right).


In addition to BLEU, many metrics have been proposed to better match human judgment [38,39,40]. However, recent studies [41,42] point out that existing metrics are biased and correlate poorly with human judgment. Chaganty et al. [42] show that the cost of debiasing these automatic metrics is about the same as that of conducting a full human evaluation.
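One common way to quantify how well an automatic metric agrees with human judgment is a correlation coefficient between the two sets of scores. The sketch below uses the Pearson coefficient purely for simplicity; it is not necessarily the statistic reported in [41,42], and the numbers are toy values made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy numbers, made up for illustration only (not taken from any paper).
metric_scores = [0.31, 0.42, 0.28, 0.55, 0.47]  # automatic metric per output
human_ratings = [3.0, 2.5, 4.0, 3.5, 2.0]       # human rating per output
print(round(pearson(metric_scores, human_ratings), 3))
```

A value close to 1 would mean the metric ranks outputs much like humans do, while a value near 0 would mean it carries little information about human preference.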

So, back to our question: “How do we properly evaluate the quality of a text generation model?” To the best of my knowledge, we don’t know.

The evaluation of natural language generation in different settings (e.g., dialog response generation, image captioning, summarization, etc.) remains an open research problem, and each setting has its own challenges to overcome. For the text generation models discussed in this post, a specific concern, in addition to text quality, is whether the model suffers from mode collapse (a lack of diversity). A naive way to measure diversity is to count unique n-grams. However, improving such an n-gram metric does not necessarily enhance diversity, as discussed in MaskGAN. The diversity of text generation is a less addressed issue in the recent literature, although a few very recent works have started to explore it [43]. Moreover, text generation has no existing metric analogous to the Inception Score [11] or the Fréchet Inception Distance [44] used for conventional GANs, which reflect both the diversity and the quality of generated images. This might suggest a direction to explore.
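The naive unique n-gram count mentioned above can be sketched in a few lines. The function names and the toy sentences are assumptions for illustration, and reporting the ratio of unique to total n-grams is one common variant rather than a definition taken from this post.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list (empty if the list is shorter than n)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(sentences, n):
    """Return (unique n-gram count, total n-gram count, unique/total ratio)."""
    counts = Counter(g for s in sentences for g in ngrams(s, n))
    total = sum(counts.values())
    unique = len(counts)
    return unique, total, (unique / total if total else 0.0)

# Toy generated samples: two of the three sentences are identical.
samples = [
    "the movie was great".split(),
    "the movie was great".split(),
    "the plot was thin".split(),
]
unique, total, ratio = distinct_n(samples, 2)
print(unique, total)  # 6 unique bigrams out of 9
```

Because the count only looks at surface n-grams, it can rise without the samples becoming more varied in meaning, which is in line with the caveat above.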

It might sound disappointing that we have barely any idea which metrics to use to evaluate text generation models accurately. Current approaches usually resort to a combination of human judgment and automatic metrics, and the former plays the more important role. However, it is easier to say how NOT to evaluate text generation models:

Do not claim that any model is superior to others based on automatic metrics alone. Many aspects of language cannot be reflected by them.

Final Word

Phew! We made it to the end. Although it might take some time to digest, I hope this overview gives you a bird’s-eye view of recent developments in text generation models.

In summary, generating natural language is a holy grail of text generation research. Current approaches have not yet fully captured the nuances, details, and semantics of natural language, and the problem compounds when we generate longer text. Better metrics and carefully designed human studies are also necessary for evaluating the progress of text generation models. This post only summarizes three lines of text generation research. There are other interesting methods that do not belong to the categories defined in this post. If you are not overwhelmed yet, I found these two papers exciting. You can check them here and here.

Finally, since I am not a senior researcher in this field, I might have made mistakes, so any comments or suggestions are welcome :)


[1]: Tero Karras, Timo Aila, Samuli Laine and Jaakko Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In ICLR, 2018.
[2]: Xun Huang, Ming-Yu Liu, Serge Belongie and Jan Kautz. Multimodal Unsupervised Image-to-Image Translation. In ECCV, 2018.
[3]: Helena Sarin. prettyInGAN. In 1st Computer Vision for Fashion, Art and Design, ECCV, 2018.
[4]: Matt J. Kusner and José Miguel Hernández-Lobato. GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution. In arXiv:1611.04051, 2016.
[5]: Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin and Aaron Courville. Improved Training of Wasserstein GANs. In NIPS, 2017.
[6a]: Martin Arjovsky, Soumith Chintala and Léon Bottou. Wasserstein GAN. In arXiv:1701.07875v3, 2017.
[6b]: Martin Arjovsky, Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. In ICLR, 2017.
[7]: Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn and Tony Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In arXiv:1312.3005, 2013.
[8]: R Devon Hjelm, Athul Paul Jacob, Adam Trischler, Gerry Che, Kyunghyun Cho and Yoshua Bengio. Boundary Seeking GANs. In ICLR, 2018.
[9]: Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj and Yu Cheng. Sobolev GAN. In ICLR, 2018.
[10]: Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin. Adversarial Feature Matching for Text Generation. In ICML, 2017.
[11]: Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen. Improved Techniques for Training GANs. In arXiv:1606.03498, 2016.
[12]: Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf and Alexander Smola. A Kernel Two-Sample Test. In JMLR, 2012.
[13]: Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
[14]: Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio. Generating Sentences from a Continuous Space. In CoNLL, 2016.
[15]: Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, Taylor Berg-Kirkpatrick. Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. In ICML, 2017.
[16]: Fisher Yu, Vladlen Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. In ICLR, 2016.
[17]: Stanislau Semeniuta, Aliaksei Severyn, Erhardt Barth. A Hybrid Convolutional Variational Autoencoder for Text Generation. In EMNLP, 2017.
[18]: Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural Machine Translation in Linear Time. In arXiv:1610.10099, 2016.
[19]: Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, Alexander M. Rush. Semi-Amortized Variational Autoencoders. In ICML, 2018.
[20]: Adji B. Dieng, Yoon Kim, Alexander M. Rush, David M. Blei. Avoiding Latent Variable Collapse with Generative Skip Models. In ICML workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
[21]: Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, Percy Liang. Generating Sentences by Editing Prototypes. In ACL, 2018.
[22]: Jiacheng Xu and Greg Durrett. Spherical Latent Spaces for Stable Variational Autoencoders. In EMNLP, 2018.
[23]: Alec Radford, Luke Metz, Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In ICLR, 2016.
[24]: Jake (Junbo) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun. Adversarially Regularized Autoencoders. In ICML, 2018.
[25]: Philip Bachman, Doina Precup. Data Generation as Sequential Decision Making. In NIPS, 2015.
[26]: Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio. An Actor-Critic Algorithm for Sequence Prediction. In ICLR, 2017.
[27]: Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In AAAI, 2017.
[28]: Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, Yoshua Bengio. Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. In arXiv:1702.07983v1, 2017.
[29]: Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans. Reward Augmented Maximum Likelihood for Neural Structured Prediction. In NIPS, 2016.
[30]: Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, Ming-Ting Sun. Adversarial Ranking for Language Generation. In NIPS, 2017.
[31]: Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL, 2002.
[32]: Xingxing Zhang and Mirella Lapata. Chinese Poetry Generation with Recurrent Neural Networks. In EMNLP, 2014.
[33]: Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server. In arXiv:1504.00325v2, 2015.
[34]: Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. In arXiv:1703.01161v2, 2017.
[35]: Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Long Text Generation via Adversarial Training with Leaked Information. In AAAI, 2018.
[36]: Sidi Lu, Yaoming Zhu, Weinan Zhang, Jun Wang, Yong Yu. Neural Text Generation: Past, Present and Beyond. In arXiv:1803.07133v1, 2018.
[37]: William Fedus, Ian Goodfellow, Andrew M. Dai. MaskGAN: Better Text Generation via Filling in the______. In ICLR, 2018.
[38]: Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation, 2005.
[39]: Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh. CIDEr: Consensus-based Image Description Evaluation. In CVPR, 2015.
[40]: Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould. SPICE: Semantic Propositional Image Caption Evaluation. In ECCV, 2016.
[41]: Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser. Why We Need New Evaluation Metrics for NLG. In EMNLP, 2017.
[42]: Arun Tejasvi Chaganty, Stephen Mussman, Percy Liang. The Price of Debiasing Automatic Metrics in Natural Language Evaluation. In ACL, 2018.
[43]: Jingjing Xu, Xuancheng Ren, Junyang Lin, Xu Sun. DP-GAN: Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text. In EMNLP, 2018.
[44]: Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NIPS, 2017.
