Adversarially Regularized Autoencoders for Generating Discrete Structures

Junbo (Jake) Zhao
Department of Computer Science
New York University
jakezhao@cs.nyu.edu

Kelly Zhang

Yoon Kim
School of Engineering and Applied Sciences
Harvard University
yoonkim@seas.harvard.edu

Alexander M. Rush
School of Engineering and Applied Sciences
Harvard University
srush@seas.harvard.edu

Yann LeCun
Department of Computer Science
New York University
yann@cs.nyu.edu

Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Abstract

Generative adversarial networks are an effective approach for learning rich latent representations of continuous data, but have proven difficult to apply directly to discrete structures such as text sequences or discretized images. Ideally, such structures could instead be encoded into a continuous code space, sidestepping this problem, but it is difficult to learn an appropriate general-purpose encoder. In this work, we consider a simple approach for handling these two challenges jointly, employing a discrete structure autoencoder with a code space regularized by generative adversarial training. The model learns a smooth, regularized code space while still being able to model the underlying data, and can be used as a discrete GAN with the ability to generate coherent discrete outputs from continuous samples. We demonstrate how key properties of the data are captured in the model's latent space, and evaluate the model itself on the tasks of discrete image generation, text generation, and semi-supervised learning.
1 Introduction

Recent work on generative adversarial networks (GANs) [9] and other deep latent variable models has shown significant progress in learning smooth latent variable representations of complex, high-dimensional continuous data such as images [1, 2, 25, 37]. These latent representations facilitate the ability to apply smooth transformations and interpolations in latent space in order to produce complex modifications of generated outputs, while still remaining on the data manifold. Unfortunately, learning similar latent representations of discrete structures, such as text sequences or discretized images, remains a challenging problem. Applying GANs directly to this task produces discrete output from the generator, which then requires clever approaches for backpropagation. Furthermore, this issue is compounded in cases where the generative model is recurrent, e.g. in sequence modeling. Researchers have circumvented some of these issues by using policy gradient
methods [5, 12, 36] or with the Gumbel-Softmax distribution [17]. However, neither approach can yet produce robust latent representations directly from samples, particularly for discrete structures.

An alternative approach is to instead encode discrete structures in a continuous code space to circumvent this problem altogether. As this space is continuous, traditional GAN training can be directly applied to learn a latent representation of the code space itself. Samples from the GAN can then be decoded to generate discrete outputs. While in theory this technique can be applied directly, in practice learning general-purpose autoencoders is in itself a difficult problem.

In this work, we propose a simple extension of this technique by jointly training a code-space GAN and a discrete structure autoencoder, which we call an adversarially regularized autoencoder (ARAE). The approach allows us to use a general-purpose GAN architecture that generates continuous code representations, while at the same time deploying task-specific autoencoder architectures, like a recurrent neural network for text, to produce and decode from these latent representations.

The ARAE approach can be used both as a generative model and as a way to obtain an encoding of the input. First, it learns a GAN with a Gaussian latent space that can be sampled to produce discrete structures. This model can be compared directly with existing generative models. Second, it learns an adversarially regularized encoder that can produce useful code space representations from discrete structures, without requiring an explicit code-space prior. We can compare this method to other specialized autoencoders such as denoising and variational autoencoders.

Our experiments test ARAE on two different discrete domains: discretized images and text sequences. We show that this approach successfully learns latent representations for both tasks, as the model is able to generate coherent samples, ignore or fix corrupted inputs, and produce predictable changes in the outputs when performing manipulations in the latent space. We find that we are able to perform consistent sentence manipulations by moving around in the latent space via offset vectors. A similar property was observed in image GANs [25] and word representations [23]. Finally, experiments on a semi-supervised learning task for natural language inference provide quantitative evidence that this approach improves upon continuous representations learned by autoencoders. Code is available at https://github.com/jakezhaojb/ARAE.
2 Related Work

GANs for Discrete Structures. The success of GANs on images has led many researchers to consider applying GANs to discrete data such as text. Policy gradient methods are a natural way to deal with the resulting non-differentiable generator objective when training directly in discrete space [8, 34]. When trained on text data, however, such methods often require pre-training/co-training with a maximum likelihood (i.e. language modeling) objective [5, 36, 18]. This precludes there being a latent encoding of the sentence, and is also a potential disadvantage of existing language models (which can otherwise generate locally-coherent samples). Another direction of work has been through reparameterizing the categorical distribution with the Gumbel-Softmax trick [14, 19]; while initial experiments were encouraging on a synthetic task [17], scaling them to work on natural language is a challenging open problem. There has also been a number of recent, related approaches that work directly with the soft outputs from a generator [10, 28, 30, 24]. For example, Shen et al. [30] train with an adversarial loss for unaligned style transfer between text by having the discriminator act on the RNN hidden states and using the soft outputs at each step as input to an RNN generator.

Variational Autoencoders. Ideally, autoencoders would learn useful coded feature representations of their inputs. However, in practice simple autoencoders often learn a degenerate identity mapping where the latent code space is free of any structure. One way, among others, to regularize the code space is through having an explicit prior on the code space and using a variational approximation to the posterior, leading to a family of models called variational autoencoders (VAE) [16, 27]. Unfortunately, VAEs for text can be challenging to train: for example, if the training procedure is not carefully tuned with techniques like word dropout and KL annealing [4], the decoder simply becomes a language model and ignores the latent code (although there have been some recent successes with convolutional models [29, 35]). A possible reason for the difficulty in training VAEs is the strictness of the prior (usually a spherical Gaussian) and/or the parameterization of the posterior. There has been some work on making the prior/posterior more flexible through explicit parameterization [26, 15, 6]. One notable technique is adversarial autoencoders (AAE) [21], which attempt to imbue the model with a more flexible prior implicitly through adversarial training. In AAEs, the discriminator is trained to distinguish between samples from a fixed prior distribution and the input encoding, thereby pushing the code distribution to match the prior. Our approach has a similar motivation, but notably we do not sample from a fixed prior distribution; our "prior" is instead parameterized through a generator. Nonetheless, this view (which has been observed by various researchers [32, 22, 20]) provides an interesting connection between VAEs and GANs.
3 Background

3.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a class of parameterized implicit generative models [9]. The method approximates drawing samples from a true distribution c ~ P_r by instead employing a latent variable z and a parameterized deterministic generator function c_hat = g_theta(z) to produce generated samples c_hat ~ P_g. The aim is to characterize the complex data manifold described by the unknown P_r within the latent space Z. GAN training utilizes two separate models: a generator g_theta(z) maps a latent vector from some easy-to-sample source distribution to a value, and a critic/discriminator f_w(c) aims to distinguish real data from fake samples generated by g_theta. The generator is trained to fool the critic, and the critic to separate out real from generated.

In this work, we utilize the recently-proposed Wasserstein GAN (WGAN) [1]. WGAN replaces the Jensen-Shannon divergence in standard GANs with the Earth-Mover (Wasserstein-1) distance. WGAN training uses the following min-max optimization over generator parameters theta and critic parameters w,

    \min_\theta \max_{w \in \mathcal{W}} \; \mathbb{E}_{c \sim \mathbb{P}_r}[f_w(c)] - \mathbb{E}_{\hat{c} \sim \mathbb{P}_g}[f_w(\hat{c})]    (1)

where f_w : C -> R denotes the critic function, c_hat is obtained from the generator, c_hat = g_theta(z), and P_r and P_g are the real (true) and fake (generated) distributions respectively. Notably the critic f_w is restricted to a set W of 1-Lipschitz functions, which can be shown to make this term correspond to the Wasserstein-1 distance. We follow [1] and use a naive implementation to approximately enforce this property by weight-clipping, i.e. restricting the weights to w in [-eps, eps]^d. Throughout this work we only use g_theta and f_w as fully-connected networks, namely MLPs.
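To make Equation 1 and the weight-clipping constraint concrete, here is a minimal PyTorch-style sketch of one critic step and one generator step on code vectors; the layer sizes, learning rates, and the RMSprop choice are illustrative assumptions, not the settings used in this paper (those are listed in the appendix).

    import torch
    import torch.nn as nn

    # Illustrative MLP critic f_w and generator g_theta (all sizes are assumptions).
    critic = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
    generator = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 100))
    opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
    opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

    real_c = torch.randn(64, 100)            # stand-in for real samples c ~ P_r

    # Critic step: maximize E[f_w(c)] - E[f_w(c_hat)], i.e. minimize its negation.
    z = torch.randn(64, 32)
    fake_c = generator(z).detach()
    loss_c = -(critic(real_c).mean() - critic(fake_c).mean())
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    for p in critic.parameters():            # enforce w in [-eps, eps] by clipping
        p.data.clamp_(-0.05, 0.05)

    # Generator step: maximize E[f_w(g_theta(z))] by minimizing its negation.
    z = torch.randn(64, 32)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()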
3.2 Discrete Structure Autoencoders

An autoencoder is any model trained to map an input to a code space and then back to the original form. Ideally the code represents important abstracted features of the original input (although this is difficult to formalize), instead of learning to simply copy any given input. We are interested in probabilistic autoencoders for discrete structures. Define X = V^n to be a discrete set of structures where V is a vocabulary of symbols. For instance, for binarized images V = {0, 1} and n is the number of pixels, or for sentences V = {1, ..., #words} and n is the sentence length. A discrete structure autoencoder consists of two parameterized functions: a deterministic encoder function enc_phi : X -> C with parameters phi, and a decoder distribution p_psi(x | c) with parameters psi that gives a distribution over structures X. The model is trained with a cross-entropy reconstruction loss, where we learn parameters to minimize the negative log-likelihood of reconstruction:

    \mathcal{L}_{\text{AE}}(\phi, \psi) = -\log p_\psi(x \mid \text{enc}_\phi(x))    (2)

Computing this for arbitrary sets is intractable, so the choice of p_psi is important and problem-specific. Finally, it is often useful to use the decoder to produce a point estimate from X. We call this x_hat = argmax_x p_psi(x | enc_phi(x)). When x_hat = x the autoencoder is said to copy the input, or perfectly reconstruct x.
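When the decoder factorizes over positions, as in both instantiations of Section 5, Equation 2 reduces to a per-position cross-entropy and the point estimate x_hat is an independent argmax at each position. The sketch below assumes such a factorized decoder; the vocabulary size, dimensions, and the MLP stand-ins for enc_phi and p_psi are placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, n, d = 1000, 10, 128                        # assumed vocabulary size, length, code dimension
    encoder = nn.Sequential(nn.Embedding(V, d), nn.Flatten(), nn.Linear(n * d, d))
    decoder = nn.Linear(d, n * V)                  # placeholder decoder producing per-position logits

    x = torch.randint(0, V, (8, n))                # a toy batch of discrete structures
    c = encoder(x)                                 # c = enc_phi(x)
    logits = decoder(c).view(8, n, V)              # p_psi(x | c), factorized over positions

    # Equation 2: negative log-likelihood of reconstructing x from its own code.
    loss = F.cross_entropy(logits.reshape(-1, V), x.reshape(-1))

    # Point estimate x_hat = argmax_x p_psi(x | enc_phi(x)); copying means x_hat == x.
    x_hat = logits.argmax(dim=-1)
    perfectly_reconstructed = (x_hat == x).all(dim=1)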
4 Model

An adversarially regularized autoencoder (ARAE) combines a discrete autoencoder with a code-space GAN. Our model employs a discrete autoencoder to learn continuous codes based on discrete inputs, and a WGAN to learn an implicit probabilistic model over these codes. The aim is to exploit the GAN's ability to learn the latent structure of code data, while using an autoencoder to abstract away the encoding and generation of discrete structure to support GAN training.

[Figure 1: The model used as an autoencoder (ARAE), where a structure x is encoded and decoded to produce x_hat, and as a GAN (ARAE-GAN), where a sample z is used to generate a code vector which is similarly decoded to x_tilde.]

The main difference with WGANs, as described above, is that we no longer have access to observed data samples for the GAN. Instead we have access to discrete structures x ~ P_r, where P_r is the distribution of interest. (Working with this space directly would require backpropagating through non-differentiable operations and is the basis for policy gradient methods for GAN training.) We handle this issue by integrating an encoder into the procedure, which first maps x to a continuous code c = enc_phi(x), i.e. using the code vector for each observed structure defined by enc_phi.

The full model has a three-part objective. We minimize reconstruction error in the AE while employing adversarial training on its code space:

    \min_{\phi, \psi} \; \mathcal{L}_{\text{AE}}(\phi, \psi)    (3)

    \min_{\phi} \max_{w \in \mathcal{W}} \; \mathcal{L}_{\text{WGAN-cri}}(w, \phi) = \min_{\phi} \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(\text{enc}_\phi(x))] - \mathbb{E}_{\hat{c} \sim \mathbb{P}_g}[f_w(\hat{c})]    (4)

    \min_{\theta} \; \mathcal{L}_{\text{WGAN-gen}}(\theta) = \min_{\theta} \; -\mathbb{E}_{\hat{c} \sim \mathbb{P}_g}[f_w(\hat{c})]    (5)

where P_r is the real distribution in the input space. We minimize the three objectives jointly in this work. Our model is visually depicted in Figure 1. The algorithm used for training is shown in Algorithm 1. We use block coordinate descent to optimize the AE, critic and generator in turn. Notably, with this change we now receive gradients through the encoder from the adversarial loss. This gradient will allow the encoder to help the generator produce samples in the support of the true data learned by the WGAN critic. Theoretically, the effect of such a term should decrease (and eventually diminish) as the GAN converges to a Nash equilibrium.
5 Architectures

We consider two different instantiations of ARAEs, one for discrete images and the other for text sequences. For both models we use the same WGAN architecture but substitute in different autoencoder architectures. The generator architecture uses a low-dimensional z with a Gaussian prior p(z) = N(0, I), and maps it to c. Both the critic f_w and the generator g_theta are parameterized as feed-forward MLPs. The structure of the deterministic encoder enc_phi and probabilistic decoder p_psi is specialized for the domain.

Image Model. Our first model uses a fully-connected neural network to encode binarized images. Here X = {0, 1}^n where n is the image size. The encoder used is a feed-forward MLP network mapping from {0, 1}^n to R^m, enc_phi(x) = MLP(x; phi) = c. The decoder predicts each pixel in x as a parameterized logistic regression, p_psi(x | c) = prod_{j=1}^n sigma(h)_j^{x_j} (1 - sigma(h)_j)^{1 - x_j}, where h = MLP(c; psi).
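A minimal sketch of this image autoencoder, using the 784-800-400-100 encoder and 100-400-800-1000-784 decoder sizes listed in the appendix; batch normalization, the unit-ball code normalization, and the additive code noise described there are omitted, and the binary cross-entropy below is simply the negative log of the per-pixel logistic likelihood above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n = 28 * 28                                        # binarized MNIST image size
    encoder = nn.Sequential(nn.Linear(n, 800), nn.ReLU(),
                            nn.Linear(800, 400), nn.ReLU(),
                            nn.Linear(400, 100))
    decoder = nn.Sequential(nn.Linear(100, 400), nn.ReLU(),
                            nn.Linear(400, 800), nn.ReLU(),
                            nn.Linear(800, 1000), nn.ReLU(),
                            nn.Linear(1000, n))

    x = torch.bernoulli(torch.full((16, n), 0.5))      # toy batch of binarized images
    c = encoder(x)                                     # enc_phi(x) = MLP(x; phi)
    h = decoder(c)                                     # h = MLP(c; psi), per-pixel logits

    # -log p_psi(x | c) with each pixel a logistic regression on h.
    recon_loss = F.binary_cross_entropy_with_logits(h, x)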
Algorithm 1  ARAE Training Procedure

    for number of training iterations do
        Train the autoencoder:
            Sample a batch {x^(i)}_{i=1}^m ~ P_r from the training data.
            Compute the latent representations c^(i) = enc_phi(x^(i)).
            Compute the autoencoder loss L_AE = -(1/m) sum_i log p_psi(x^(i) | c^(i)), backpropagate gradients, and update the decoder (psi) and the encoder (phi).
        Train the critic (for k steps):
            Positive sample phase:
                Compute the adversarial loss on the real samples, -(1/m) sum_i f_w(c^(i)), backpropagate gradients, and update the critic (w) and the encoder (phi).
            Negative sample phase:
                Sample a batch of random noise {z^(i)}_{i=1}^m ~ N(0, I).
                Generate code representations c_hat^(i) = g_theta(z^(i)) by passing z^(i) through the generator.
                Compute the adversarial loss, (1/m) sum_i f_w(c_hat^(i)), backpropagate gradients, and update the critic (w).
            Clip the weights of the critic w to [-eps, eps]^d.
        Train the generator:
            Sample a batch of random noise {z^(i)}_{i=1}^m ~ N(0, I).
            Generate code representations c_hat^(i) = g_theta(z^(i)) by passing z^(i) through the generator.
            Compute the generator loss -(1/m) sum_i f_w(c_hat^(i)), and backpropagate gradients through the critic into the generator to update the generator (theta).
    end for
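Below is a minimal PyTorch-style sketch of one pass through Algorithm 1. The component networks are small stand-ins (an MLP encoder instead of the task-specific architectures of Section 5), the optimizers and sizes are assumptions, and details from the appendix such as code normalization and rescaling of the critic gradient into the encoder are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, n, d, m, eps = 1000, 15, 100, 64, 0.05       # assumed sizes and clipping factor

    embed   = nn.Embedding(V, d)
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(n * d, d))                  # enc_phi (placeholder)
    decoder = nn.Linear(d, n * V)                                               # p_psi (placeholder)
    critic  = nn.Sequential(nn.Linear(d, 100), nn.ReLU(), nn.Linear(100, 1))    # f_w
    gen     = nn.Sequential(nn.Linear(32, 100), nn.ReLU(), nn.Linear(100, d))   # g_theta

    ae_params  = list(embed.parameters()) + list(encoder.parameters()) + list(decoder.parameters())
    enc_params = list(embed.parameters()) + list(encoder.parameters())
    opt_ae  = torch.optim.SGD(ae_params, lr=1.0)
    opt_enc = torch.optim.SGD(enc_params, lr=1e-4)
    opt_cri = torch.optim.Adam(critic.parameters(), lr=5e-5)
    opt_gen = torch.optim.Adam(gen.parameters(), lr=5e-5)

    x = torch.randint(0, V, (m, n))                 # toy batch standing in for x ~ P_r

    # 1) Train the autoencoder: minimize -(1/m) sum_i log p_psi(x_i | enc_phi(x_i)).
    c = encoder(embed(x))
    logits = decoder(c).view(m, n, V)
    loss_ae = F.cross_entropy(logits.reshape(-1, V), x.reshape(-1))
    opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()

    # 2) Train the critic (and the encoder) on real codes vs. generated codes.
    for _ in range(10):                             # k critic steps (k is a hyperparameter)
        c_real = encoder(embed(x))
        c_fake = gen(torch.randn(m, 32)).detach()
        loss_cri = -(critic(c_real).mean() - critic(c_fake).mean())
        opt_cri.zero_grad(); opt_enc.zero_grad()
        loss_cri.backward()                         # gradients also flow into enc_phi
        opt_cri.step(); opt_enc.step()
        for p in critic.parameters():               # clip w to [-eps, eps]
            p.data.clamp_(-eps, eps)

    # 3) Train the generator to fool the critic on generated codes.
    loss_gen = -critic(gen(torch.randn(m, 32))).mean()
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()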
Text Model. Our second model is devised for text. Here X = V^n, where n is the sentence length and V is the vocabulary of the underlying language (typically |V| is on the order of 10k or more). Following usual practice we use a recurrent neural network (RNN) as both the text encoder and decoder. Define an RNN as a parameterized recurrent function h_j = RNN(x_j, h_{j-1}; phi) for j = 1 ... n (with h_0 = 0) that maps a discrete input structure x to hidden vectors h_1 ... h_n. For the encoder, we define enc_phi(x) = h_n = c, the last hidden state in this recurrence. The decoder is defined in a similar way, with parameters psi. For prediction we combine c with h_j to produce a distribution over V at each time step, p_psi(x | c) = prod_{j=1}^n softmax(W[h_j; c] + b)_{x_j}, where W and b are parameters (part of psi). Finding the most likely sequence x_hat under this distribution is intractable, but we can approximate it using greedy search or beam search. In our experiments we use an LSTM architecture [13] for both the encoder and decoder, and train with teacher-forcing.
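A minimal sketch of this text autoencoder, assuming one-layer LSTMs with 300 hidden units, a <bos> token at index 0, and teacher-forcing; the unit-ball normalization and code noise from the appendix are omitted, and all names and dimensions are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, d = 11000, 300                                  # assumed vocabulary and hidden size
    embed   = nn.Embedding(V, d)
    enc_rnn = nn.LSTM(d, d, batch_first=True)          # encoder LSTM
    dec_rnn = nn.LSTM(d + d, d, batch_first=True)      # decoder input: [word embedding; code c]
    proj    = nn.Linear(d + d, V)                      # softmax(W [h_j; c] + b)

    def encode(x):
        _, (h_n, _) = enc_rnn(embed(x))
        return h_n[-1]                                 # enc_phi(x) = last hidden state c

    def decode_logits(prev_words, c):
        # Teacher-forcing: feed the gold previous word and concatenate c at every step.
        emb = embed(prev_words)
        c_rep = c.unsqueeze(1).expand(-1, prev_words.size(1), -1)
        h, _ = dec_rnn(torch.cat([emb, c_rep], dim=-1))
        return proj(torch.cat([h, c_rep], dim=-1))     # per-step distribution over V

    x = torch.randint(1, V, (8, 15))                   # toy batch of sentences
    c = encode(x)
    bos = torch.zeros(x.size(0), 1, dtype=torch.long)  # assumed <bos> index 0
    logits = decode_logits(torch.cat([bos, x[:, :-1]], dim=1), c)
    loss = F.cross_entropy(logits.reshape(-1, V), x.reshape(-1))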
Semi-Supervised Model. Our model is trained in an unsupervised manner as a combination of an autoencoder and a GAN. As an extension we also consider the use case where the code vector is additionally used as input to a supervised classification task. As in the standard semi-supervised setup, we assume that our data consist of a small set of labeled data {x_i, y_i} and a large set of unlabeled data {x_j}. We can set up a standard supervised classification loss function using the code vectors from the encoder and a new set of parameters u:

    \mathcal{L}_{\text{NLL}}(\phi, u) = -\log p_u(y \mid \text{enc}_\phi(x))    (6)

We then extend our multi-task loss function to include this objective.
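A sketch of how the classification loss in Equation 6 can be attached to the encoder for the SNLI setting. How the premise and hypothesis code vectors are combined, the linear classifier, and the loss weight are assumptions rather than the paper's exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 300
    classifier = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 3))  # parameters u

    def nll_class_loss(c_premise, c_hypothesis, y):
        # Equation 6: -log p_u(y | enc_phi(x)), here applied to a pair of SNLI code vectors.
        logits = classifier(torch.cat([c_premise, c_hypothesis], dim=-1))
        return F.cross_entropy(logits, y)

    # Toy usage: in practice the code vectors come from enc_phi on labeled sentence pairs.
    c_p, c_h = torch.randn(8, d), torch.randn(8, d)
    y = torch.randint(0, 3, (8,))                       # entailment / contradiction / neutral
    loss_total = 1.0 * nll_class_loss(c_p, c_h, y)      # added to the multi-task ARAE objective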
6 Methods and Data

We consider two different settings for testing the ARAE: (1) images, utilizing the binarized version of MNIST, and (2) text, using the Stanford Natural Language Inference (SNLI) corpus [3]. This corpus provides a useful testbed as it comprises sentences with relatively simple structure. The corpus is additionally annotated for pairwise sentence classification, which allows us to experiment with semi-supervised learning in a controlled setting. For this task the model is presented with two sentences, a premise and a hypothesis, and has to predict their relationship: entailment, contradiction, or neutral. For training, we used a subset of the corpus consisting of sentences of less than 15 words, although preliminary results suggest this approach works up to 30 words.
[Figure 2: Left: produced by an AE. Middle: produced by an ARAE. The arrangement of the left and middle figures is: (i) top blocks are the input to the AE, clean and noised; (ii) bottom blocks are the corresponding reconstructions. Right: results of the ARAE. The top block consists of the reconstruction of the real MNIST samples; the middle blocks are the output of the decoder taking the hidden codes generated by the GAN; the bottom blocks are the sample interpolation results, constructed by linearly interpolating in the latent space and then decoding back to the pixels.]
We consider several different empirical tasks to test the performance of the model both as an autoencoder (ARAE), by using the encoder aspect, and as a latent-variable model (ARAE-GAN), by sampling z's (the two are trained identically). Experiments include: (1) code space structure: does the model preserve natural inputs x ~ P_r while not preserving noised inputs x_tilde? (2) semi-supervised learning: does the performance of a supervised model improve when it is additionally trained as an autoencoder? (3) sample generation: how well does a simple model do when trained on generated samples? (4) interpolation and arithmetic: how easily can we manipulate vectors in Z to smoothly control the generated text samples x_hat?

For these experiments we compare to a standard AE, trained without the code-space GAN component, as well as a standard language model. We also attempted to train VAEs on the text dataset but found that it was unable to learn meaningful latent representations despite tuning the latent dimension size, KL annealing, and word dropout. Refer to the appendix for a detailed description of the hyperparameters, model architecture, and training regime.
7 Experiments

7.1 Code Space Structure

As the code space we use by definition does not have the capacity to represent the entire discrete input space, ideally the autoencoder would learn to maintain valid representations only for real inputs, which roughly exist along a low-dimensional manifold determined by the space of natural images or natural language sentences. This property is difficult to maintain in standard autoencoders, which often learn a partial identity mapping, but ideally should be improved by code space regularization. We test this property by passing two sets of samples through ARAE, one of true held-out samples and the other of explicitly-noised examples existing off this manifold. Figure 2 (left) shows these examples and their reconstruction from the discretized MNIST, where the noised examples come from adding noise to the original image. For images we observe that a regular AE simply copies inputs, regardless of whether the input is on the data manifold. ARAE, on the other hand, learns not to reproduce the noised samples.

Table 1 (right) shows similar experiments for text, where we add noise by permuting k words in each sentence. Again we observe that the ARAE is able to map a noised sample back onto a coherent sentence. Table 1 (left) shows empirical results for these experiments. We obtain the reconstruction error (i.e. negative log-likelihood) of the original (non-noised) sentence under the decoder, utilizing the noised code. We find that when k = 0 (i.e. no swaps), the regular AE better reconstructs the input (as expected). However, as we increase the number of swaps and push the input further away from the data manifold, the ARAE is more likely to produce the original sentence. We note that unlike denoising autoencoders, which require a domain-specific noising function [11, 33], the ARAE is not explicitly trained to denoise an input, but learns to do so as a byproduct of adversarial regularization.

[Table 1: Left: reconstruction error (negative log-likelihood) of the original sentence from a corrupted sentence; here k is the number of swaps performed on the original sentence. Right: samples generated from AE and ARAE where the input is noised by swapping words.

    k    AE     ARAE
    1    4.51   4.07
    2    6.61   5.3_
    3    9.14   6.86

    Original: A woman wearing sunglasses.        Original: They have been swimming.
    Noised:   A woman sunglasses wearing.        Noised:   been have they swimming.
    Original: Pets galloping down the street.    Original: The child is sleeping.
    Noised:   Pets down the galloping street.    Noised:   child the is sleeping.]

[Table 2: Left: semi-supervised accuracy on the natural language inference (SNLI) task, using 22.2% (medium), 10.8% (small), and 5.25% (tiny) of the supervised labels of the full SNLI training set (the rest is used for unlabeled AE training). Right: perplexity (lower is better) of language models trained on the real data and on synthetic samples from a GAN/AE/LM.

    Real data    27.4]
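The k-swap noising used in Table 1 can be sketched as k random swaps of word positions; the source does not specify whether the swaps are adjacent or arbitrary, so the arbitrary-position version below is an assumption.

    import random

    def noise_by_swaps(sentence, k):
        """Return a copy of the token list with k random pairs of positions swapped."""
        tokens = list(sentence)
        for _ in range(k):
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return tokens

    print(noise_by_swaps("the child is sleeping .".split(), k=2))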
7.2 Semi-Supervised Training

Next we utilize ARAE for semi-supervised training on a natural language inference task, shown in Table 2 (left). We experiment with using 22.2%, 10.8% and 5.25% of the original labeled training data, and use the rest of the training set for unlabeled AE training. The labeled set is randomly picked. The full SNLI training set contains 543k sentence pairs, and we use supervised sets of 120k, 59k and 28k sentence pairs respectively for the three settings. As a baseline we use an AE trained on the additional data, similar to the setting explored in [7]. For ARAE we use the subset of unsupervised data of length < 15, which roughly includes 655k single sentences (due to the length restriction, this is a subset of the 715k sentences that were used for AE training). As observed by Dai and Le [7], training on unlabeled data with an AE objective improves upon a model just trained on labeled data. Training with adversarial regularization provides further gains.
[Figure 3: Text samples generated from ARAE-GAN, a simple AE, and a baseline LM trained on the same data. To generate from an AE we fit a multivariate Gaussian to the learned code space and generate code vectors from this Gaussian.]
[Figure 4: Sample interpolations from the ARAE-GAN, constructed by linearly interpolating in the latent space and decoding to the output space. Word changes are highlighted in black.]
7.3 Sample Generation

A common test for a GAN's ability to generate realistic samples that cover the original data space is to train a simple model on the samples from the GAN itself. Acknowledging the pitfalls of such quantitative evaluations [31], for text GANs we can do this by producing a large set of sampled sentences and training a simple language model over the generations. For these experiments we generate 100k samples from (i) ARAE-GAN, (ii) an AE, (iii) a RNN LM trained on the same data, and (iv) the real training set. To "sample" from an AE we fit a multivariate Gaussian to the code space (of the training data) after training the AE, generate code vectors from this Gaussian, and decode back into sentence space. All models are of the same size to allow for fair comparison.
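A sketch of the AE "sampling" baseline just described: fit a multivariate Gaussian to the training-set code vectors and draw new codes from it before decoding. The dimensions are placeholders, the jitter term is a numerical-stability assumption, and the decoding step is left as a placeholder comment.

    import torch

    codes = torch.randn(10000, 300)                      # stand-in for enc_phi(x) over the training set

    mean = codes.mean(dim=0)
    cov = torch.cov(codes.T) + 1e-4 * torch.eye(codes.size(1))   # empirical covariance + small jitter
    gaussian = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)

    sampled_codes = gaussian.sample((100,))              # "samples" from the AE code space
    # sentences = [decode(c) for c in sampled_codes]     # decode back with p_psi (placeholder)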
Samples from the models are shown in Figure 3. We subsequently train a standard RNN language model on the generated data and evaluate perplexity on held-out real data. The language model is of the same size as the decoder of the ARAE. As can be seen from Table 2, training on real data (understandably) outperforms training on generated data by a large margin. Surprisingly, however, we find that a language model trained on ARAE-GAN data performs slightly better than one trained on LM-generated/AE-generated data.
7.4 Interpolation and Vector Arithmetic

A widely observed property of GANs (and VAEs) is that the Gaussian prior p(z) induces the ability to smoothly interpolate between outputs by exploiting the structure of the latent space. While language models may provide a better estimate of the underlying probability space, constructing this style of interpolation would require combinatorial search, which makes this a useful feature of text GANs. We experiment with this property by sampling two points z_0 and z_1 from p(z) and constructing intermediary points z_lambda = lambda z_1 + (1 - lambda) z_0. For each we generate the argmax output x_hat_lambda. The samples are shown in Figure 4 for text and in Figure 2 (right, bottom) for MNIST. While it is difficult to assess the "accuracy" of these interpolations, we generally qualitatively observe smooth changes in the output sentences/images as we move from one latent point to another.

Another intriguing property of image GANs is the ability to move in the latent space via offset vectors (similar to the case with word vectors [23]). For example, Radford et al. [25] observe that when the mean latent vector for "men with glasses" is subtracted from the mean latent vector for "men without glasses" and applied to an image of a "woman without glasses", the resulting image is that of a "woman with glasses". We experiment to see if a similar property holds for sentences. We generate 1 million sentences from the ARAE-GAN and parse the sentences to obtain the main verb, subject, and modifier. Then, for a given sentence, to change the main verb we subtract the mean latent vector (t) for other sentences with the same main verb (in the first example in Figure 5 this would correspond to sentences that had "sleeping" as the main verb) and add the mean latent vector for sentences that have the desired transformation (with the running example this would be sentences whose main verb was "walking"). We do the same to transform the subject and the modifier. We decode back into sentence space with the transformed latent vector via sampling from p_psi(x | g(z + t)).
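The offset-vector manipulation can be sketched as follows, assuming we have kept the latent vector z of each generated sentence together with its parsed main verb; the grouping structure and names are illustrative.

    import torch

    def transform_latent(z, z_by_verb, source_verb, target_verb):
        """Move z by the difference of attribute means: z - mean(source) + mean(target)."""
        offset = z_by_verb[target_verb].mean(dim=0) - z_by_verb[source_verb].mean(dim=0)
        return z + offset

    # Toy usage: z_by_verb maps a main verb to the stacked latent vectors of sentences using it.
    z_by_verb = {"sleeping": torch.randn(500, 100), "walking": torch.randn(500, 100)}
    z_new = transform_latent(torch.randn(100), z_by_verb, "sleeping", "walking")
    # The transformed code g_theta(z_new) is then decoded by sampling from p_psi(x | g_theta(z_new)).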
Some examples of successful transformations are shown in Figure 5 (right). Quantitative evaluation of the success of the vector transformations is given in Figure 5 (left). For each original vector z we sample 100 sentences from p_psi(x | g(z + t)) over the transformed new latent vector and consider it a match if any of the sentences demonstrate the desired transformation. Match % is the proportion of original vectors that yield a match post transformation. As we ideally want the generated samples to only differ in the specified transformation, we also calculate the average word precision against the original sentence (Prec) for any match.
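A sketch of the two metrics as described: Match % asks whether any of the 100 decoded samples exhibits the desired transformation, and Prec measures word overlap with the original sentence for a matching sample. The exact overlap definition below is an assumption about how the average word precision is computed.

    def word_precision(candidate, original):
        """Fraction of candidate tokens that also appear in the original sentence."""
        original_set = set(original)
        return sum(tok in original_set for tok in candidate) / max(len(candidate), 1)

    def evaluate_transform(decoded_samples, original, has_transformation):
        """decoded_samples: 100 token lists; has_transformation: predicate for the desired change."""
        matches = [s for s in decoded_samples if has_transformation(s)]
        matched = len(matches) > 0
        prec = word_precision(matches[0], original) if matched else None
        return matched, prec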
[Figure 5: Left: quantitative evaluation of transformations. Match % refers to the % of samples where at least one of the 100 decoder samples had the desired transformation in the output, while Prec. measures the average precision of the output against the original sentence. Right: examples (out of 100 decoder samples per sentence) where the offset vectors produced successful transformations of the original sentence. See Section 7.4 for methodology.

    Transform    Match %    Prec
    walking      85         79.5
    man          92         80.2
    two          86         74.1
    dog          88         77.0
    several      70         67.0]
8 Conclusion

We present adversarially regularized autoencoders as a simple approach for training a discrete structure autoencoder jointly with a code-space generative adversarial network. The model learns an improved autoencoder, as demonstrated by semi-supervised experiments and analysis of the manifold structure for text and images. It also learns a useful generative model for text that exhibits a robust latent space, as demonstrated by natural interpolations and vector arithmetic. We note, however, that (as has been frequently observed when training GANs) our model seemed to be quite sensitive to hyperparameters. Finally, while many useful models for text generation already exist, text GANs provide a qualitatively different approach influenced by the underlying latent variable structure. We envision that such a framework could be extended to a conditional setting, combined with other existing decoding schemes, or used to provide a more interpretable model of language.
Acknowledgments

We thank Sam Wiseman, Kyunghyun Cho, Sam Bowman, Joan Bruna, Yacine Jernite, Martín Arjovsky, Mikael Henaff and Michael Mathieu for fruitful discussions. Yoon Kim is sponsored by a SYSTRAN research award. We also thank the NVIDIA Corporation for the donation of a Titan X Pascal GPU that was used for this research.
References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
[2] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary Equilibrium Generative Adversarial Networks. arXiv:1703.10717, 2017.
[3] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of EMNLP, 2015.
[4] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In Proceedings of CoNLL, 2016.
[5] Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. arXiv:1702.07983, 2017.
[6] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder. In Proceedings of ICLR, 2017.
[7] Andrew M. Dai and Quoc V. Le. Semi-supervised Sequence Learning. In Proceedings of NIPS, 2015.
[8] Peter Glynn. Likelihood Ratio Gradient Estimation: An Overview. In Proceedings of the Winter Simulation Conference, 1987.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Proceedings of NIPS, 2014.
[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved Training of Wasserstein GANs. arXiv:1704.00028, 2017.
[11] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL, 2016.
[12] R Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundary-Seeking Generative Adversarial Networks. arXiv:1702.08431, 2017.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
[14] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR, 2017.
[15] Diederik P. Kingma, Tim Salimans, and Max Welling. Improving Variational Inference with Inverse Autoregressive Flow. arXiv:1606.04934, 2016.
[16] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proceedings of ICLR, 2014.
[17] Matt Kusner and José Miguel Hernández-Lobato. GANs for Sequences of Discrete Elements with the Gumbel-Softmax Distribution. arXiv:1611.04051, 2016.
[18] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial Learning for Neural Dialogue Generation. arXiv:1701.06547, 2017.
[19] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR, 2017.
[20] Alireza Makhzani and Brendan Frey. PixelGAN Autoencoders. arXiv:1706.00531, 2017.
[21] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial Autoencoders. arXiv:1511.05644, 2015.
[22] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. arXiv:1701.04722, 2017.
[23] Tomas Mikolov, Scott Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL, 2013.
[24] Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. Language Generation with Recurrent Generative Adversarial Networks without Pre-training. arXiv:1706.01399, 2017.
[25] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Proceedings of ICLR, 2016.
[26] Danilo J. Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. In Proceedings of ICML, 2015.
[27] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of ICML, 2014.
[28] Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial Generation of Natural Language. arXiv:1705.10929, 2017.
[29] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A Hybrid Convolutional Variational Autoencoder for Text Generation. arXiv:1702.02390, 2017.
[30] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style Transfer from Non-Parallel Text by Cross-Alignment. arXiv:1705.09655, 2017.
[31] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A Note on the Evaluation of Generative Models. In Proceedings of ICLR, 2016.
[32] Dustin Tran, Rajesh Ranganath, and David M. Blei. Deep and Hierarchical Implicit Models. arXiv:1702.08896, 2017.
[33] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of ICML, 2008.
[34] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 1992.
[35] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. In Proceedings of ICML, 2017.
[36] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In Proceedings of AAAI, 2017.
[37] Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-Based Generative Adversarial Network. CoRR, abs/1609.03126, 2016.
Appendix: Experiment Details

MNIST experiments

- The encoder is a three-layer MLP, 784-800-400-100. The output of the encoder is normalized onto a unit ball (having l2 norm 1), denoted as c in R^100, before being forwarded further. Except for the output layer, batch normalization and ReLU are used following the linear layers.
- We also add additive Gaussian noise to c, which is then fed into the decoder. The standard deviation of that noise is initialized to 0.4 and then exponentially decayed to 0 (sketched after this list).
- The decoder is a four-layer MLP, 100-400-800-1000-784. Except for the output layer, batch normalization and LeakyReLU (scale = 0.2) are used following the linear layers.
- The autoencoder is optimized by Adam, with learning rate 5e-04.
- The GAN employs an MLP generator, with structure 32-64-100-150-100. The noise vector is z in R^32.
- The GAN employs an MLP critic, with structure 100-100-60-20-1. The clipping factor is eps = 0.05. The critic is trained with 10 iterations in each GAN loop.
- The GAN is optimized by Adam, with learning rate 5e-04 on the generator and 5e-05 on the critic.
- When updating the encoder, we multiply the critic gradient by 0.2 before backpropagating to the encoder.
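The unit-ball normalization and decaying additive code noise from the list above can be sketched as follows; the multiplicative decay schedule and its rate are assumptions (the text states only the initial value and that the noise decays toward 0).

    import torch
    import torch.nn.functional as F

    def regularize_code(c, step, sigma0=0.4, decay=0.995):
        """Project the code onto the unit l2 ball, then add decaying Gaussian noise."""
        c = F.normalize(c, p=2, dim=-1)                  # enforce ||c||_2 = 1
        sigma = sigma0 * (decay ** step)                 # assumed exponential schedule
        return c + sigma * torch.randn_like(c)

    noisy_c = regularize_code(torch.randn(16, 100), step=1000)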
Text experiments

The architecture we used for the text generation task is described below:

- The encoder is a one-layer LSTM with 300 hidden units. The output of the encoder is normalized onto a unit ball (having l2 norm 1), denoted as c, before being forwarded further.
- We add Gaussian noise to c before feeding it into the decoder. The standard deviation of that noise is initialized to 0.2 and then exponentially decayed every 100 iterations by a factor of 0.995.
- The decoder is a one-layer LSTM with 300 hidden units.
- The decoding process at each time step takes the top-layer LSTM hidden state and concatenates it with the hidden code c before feeding them into the output (i.e. vocabulary projection) and the softmax layer.
- The word embedding is of size 300.
- We adopt gradient clipping on the encoder/decoder, with max gradient norm = 1.
- The encoder/decoder is optimized by vanilla SGD with learning rate 1.
- The GAN employs an MLP generator, with structure 100-300-300, batch normalization, and ReLU nonlinearity. The noise vector is z in R^100.
- The GAN employs an MLP critic, with structure 300-300-1, batch normalization, and LeakyReLU (scale = 0.2) nonlinearity. The clipping factor is eps = 0.01. The critic is trained with 5 iterations in each GAN loop.¹
- The GAN is optimized by Adam, with learning rate 5e-05 on the generator and 1e-05 on the critic.
- When we update the encoder, we normalize both sources of gradients, i.e. from the critic and the decoder, based on their norms. After that, a weight factor of 0.01 is imposed on the critic-backpropagated gradient.
- We increment the number of GAN training loops by 1 (it is initially set to 1) at the beginning of epoch #2, epoch #4 and epoch #6, respectively.
- We train for a total of 6 epochs.

¹The GAN training loop refers to how many times we train the GAN in each entire training loop (one training loop contains training the autoencoder for one loop, and training the GAN for one or several).
Semi-supervised experiments

The architecture we used for the semi-supervised learning task is described below:

- The encoder is a three-layer LSTM with hidden state size 300. The output of the encoder is normalized onto a unit ball (having l2 norm 1), denoted as c in R^300, before being forwarded further.
- The decoder is a one-layer LSTM with hidden state size 300.
- The decoding process at each time step takes the top-layer LSTM hidden state and concatenates it with the hidden code c before forwarding them to a word transition matrix.
- The initial decoder hidden state is initialized from c with a linear transformation.
- The word embedding is of size 300.
- We adopt gradient clipping on both LSTMs, with a maximum allowed gradient norm of 1.
- The encoder/decoder is optimized by vanilla SGD, learning rate 1.
- The GAN employs an MLP generator, with structure 100-150-300-500, batch normalization, and ReLU nonlinearity. The noise vector is z in R^100.
- The GAN employs an MLP critic, with structure 500-500-150-80-20-1, batch normalization, and LeakyReLU (scale = 0.2) nonlinearity. The clipping factor is eps = 0.02. The critic is trained with 10 iterations in each GAN loop.
- The GAN is optimized by Adam, with learning rate 5e-05 on the generator and 1e-05 on the critic.
- When we update the encoder, we normalize both sources of gradients, i.e. from the critic and the decoder, based on their norms. After that, we multiply the critic gradients by 0.01 before backpropagating to the encoder.

Note we use the same architecture for all three experiment settings (i.e. label set portion: Medium (22.2%), Small (10.8%), Tiny (5.25%)). The baseline models under comparison (Supervised Encoder, Semi-Supervised AE) use the same setting as described above.