Adversarially Regularized Autoencoders for Generating Discrete Structures

Junbo (Jake) Zhao
Department of Computer Science
New York University
jakezhao@cs.nyu.edu

Kelly Zhang

Yoon Kim
School of Engineering and Applied Sciences
Harvard University
yoonkim@seas.harvard.edu

Alexander M. Rush
School of Engineering and Applied Sciences
Harvard University
srush@seas.harvard.edu

Yann LeCun
Department of Computer Science
New York University
yann@cs.nyu.edu

Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Abstract

Generative adversarial networks are an effective approach for learning rich latent representations of continuous data, but have proven difficult to apply directly to discrete structures such as text sequences or discretized images. Ideally, such structures could instead be encoded into a continuous code space, sidestepping this problem, but it is difficult to learn an appropriate general-purpose encoder. In this work, we consider a simple approach for handling these two challenges jointly, employing a discrete structure autoencoder with a code space regularized by generative adversarial training. The model learns a smooth, regularized code space while still being able to model the underlying data, and can be used as a discrete GAN with the ability to generate coherent discrete outputs from continuous samples. We demonstrate how key properties of the data are captured in the model's latent space, and evaluate the model itself on the tasks of discrete image generation, text generation, and semi-supervised learning.
1 Introduction

Recent work on generative adversarial networks (GANs) [9] and other deep latent variable models has shown significant progress in learning smooth latent variable representations of complex, high-dimensional continuous data such as images [1, 2, 25, 37]. These latent representations facilitate the ability to apply smooth transformations and interpolations in latent space in order to produce complex modifications of generated outputs, while still remaining on the data manifold. Unfortunately, learning similar latent representations of discrete structures, such as text sequences or discretized images, remains a challenging problem. Applying GANs directly to this task produces discrete output from the generator, which then requires clever approaches for backpropagation. Furthermore, this issue is compounded in cases where the generative model is recurrent, e.g. in sequence modeling. Researchers have circumvented some of these issues by using policy gradient
methods [5, 12, 36] or with the Gumbel-Softmax distribution [17]. However, neither approach can yet produce robust latent representations directly from samples, particularly for discrete structures.

An alternative approach is to instead encode discrete structures in a continuous code space to circumvent this problem altogether. As this space is continuous, traditional GAN training can be directly applied to learn a latent representation of the code space itself. Samples from the GAN can then be decoded to generate discrete outputs. While in theory this technique can be applied directly, in practice learning general-purpose autoencoders is in itself a difficult problem.

In this work, we propose a simple extension of this technique by jointly training a code-space GAN and a discrete structure autoencoder, which we call an adversarially regularized autoencoder (ARAE). The approach allows us to use a general-purpose GAN architecture that generates continuous code representations, while at the same time deploying task-specific autoencoder architectures, like a recurrent neural network for text, to produce and decode from these latent representations.

The ARAE approach can be used both as a generative model and as a way to obtain an encoding of the input. First, it learns a GAN with a Gaussian latent space that can be sampled to produce discrete structures. This model can be compared directly with existing generative models. Second, it learns an adversarially regularized encoder that can produce useful code space representations from discrete structures, without requiring an explicit code-space prior. We can compare this method to other specialized autoencoders such as denoising and variational autoencoders.

Our experiments test ARAE on two different discrete domains: discretized images and text sequences. We show that this approach successfully learns latent representations for both tasks, as the model is able to generate coherent samples, ignore or fix corrupted inputs, and produce predictable changes in the outputs when performing manipulations in the latent space. We find that we are able to perform consistent sentence manipulations by moving around in the latent space via offset vectors. A similar property was observed in image GANs [25] and word representations [23]. Finally, experiments on a semi-supervised learning task for natural language inference provide quantitative evidence that this approach improves upon continuous representations learned by autoencoders. Code is available at https://github.com/jakezhaojb/ARAE.
2 Related Work

GANs for Discrete Structures. The success of GANs on images has led many researchers to consider applying GANs to discrete data such as text. Policy gradient methods are a natural way to deal with the resulting non-differentiable generator objective when training directly in discrete space [8, 34]. When trained on text data, however, such methods often require pre-training/co-training with a maximum likelihood (i.e. language modeling) objective [5, 36, 18]. This precludes there being a latent encoding of the sentence, and is also a potential disadvantage of existing language models (which can otherwise generate locally-coherent samples). Another direction of work has been through reparameterizing the categorical distribution with the Gumbel-Softmax trick [14, 19]; while initial experiments were encouraging on a synthetic task [17], scaling them to work on natural language is a challenging open problem. There has also been a number of recent, related approaches that work directly with the soft outputs from a generator [10, 28, 30, 24]. For example, Shen et al. [30] train with an adversarial loss for unaligned style transfer between text by having the discriminator act on the RNN hidden states and using the soft outputs at each step as input to an RNN generator.

Variational Autoencoders. Ideally, autoencoders would learn useful coded feature representations of their inputs. However, in practice simple autoencoders often learn a degenerate identity mapping where the latent code space is free of any structure. One way, among others, to regularize the code space is through having an explicit prior on the code space and using a variational approximation to the posterior, leading to a family of models called variational autoencoders (VAE) [16, 27]. Unfortunately, VAEs for text can be challenging to train: for example, if the training procedure is not carefully tuned with techniques like word dropout and KL annealing [4], the decoder simply becomes a language model and ignores the latent code (although there have been some recent successes with convolutional models [29, 35]). A possible reason for the difficulty in training VAEs is the strictness of the prior (usually a spherical Gaussian) and/or the parameterization of the posterior. There has been some work on making the prior/posterior more flexible through explicit parameterization [26, 15, 6]. One notable technique is adversarial autoencoders (AAE) [21], which attempt to imbue the model with a more flexible prior implicitly through adversarial training. In AAEs, the discriminator is trained to distinguish between samples from a fixed prior distribution and the input encoding, thereby pushing the code distribution to match the prior. Our approach has a similar motivation, but notably we do not sample from a fixed prior distribution; our "prior" is instead parameterized through a generator. Nonetheless, this view (which has been observed by various researchers [32, 22, 20]) provides an interesting connection between VAEs and GANs.
3 Background

3.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a class of parameterized implicit generative models [9]. The method approximates drawing samples from a true distribution c ~ P_r by instead employing a latent variable z and a parameterized deterministic generator function c_hat = g_theta(z) to produce generated samples c_hat ~ P_g. The aim is to characterize the complex data manifold described by the unknown P_r within the latent space Z. GAN training utilizes two separate models: a generator g_theta(z) maps a latent vector from some easy-to-sample source distribution to a value, and a critic/discriminator f_w(c) aims to distinguish real data from fake samples generated by g_theta. The generator is trained to fool the critic, and the critic to separate out real from generated.

In this work, we utilize the recently-proposed Wasserstein GAN (WGAN) [1]. WGAN replaces the Jensen-Shannon divergence in standard GANs with the Earth-Mover (Wasserstein-1) distance. WGAN training uses the following min-max optimization over generator parameters theta and critic parameters w,

    \min_\theta \max_{w \in \mathcal{W}} \; \mathbb{E}_{c \sim \mathbb{P}_r}[f_w(c)] - \mathbb{E}_{\hat{c} \sim \mathbb{P}_g}[f_w(\hat{c})]    (1)

where f_w : C -> R denotes the critic function, c_hat is obtained from the generator, c_hat = g_theta(z), and P_r and P_g are the real (true) and fake (generated) distributions respectively. Notably the critic f_w is restricted to a set W of 1-Lipschitz functions, which can be shown to make this term correspond to the Wasserstein-1 distance. We follow [1] and use a naive implementation to approximately enforce this property by weight-clipping, i.e. restricting the weights to w in [-eps, eps]^d. Throughout this work we only use g_theta and f_w as fully-connected networks, namely MLPs.
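To make Equation 1 and the weight-clipping constraint concrete, here is a minimal PyTorch-style sketch of one critic step and one generator step on code vectors; the layer sizes, learning rates, and the RMSprop choice are illustrative assumptions, not the settings used in this paper (those are listed in the appendix).

    import torch
    import torch.nn as nn

    # Illustrative MLP critic f_w and generator g_theta (all sizes are assumptions).
    critic = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
    generator = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 100))
    opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
    opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

    real_c = torch.randn(64, 100)            # stand-in for real samples c ~ P_r

    # Critic step: maximize E[f_w(c)] - E[f_w(c_hat)], i.e. minimize its negation.
    z = torch.randn(64, 32)
    fake_c = generator(z).detach()
    loss_c = -(critic(real_c).mean() - critic(fake_c).mean())
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    for p in critic.parameters():            # enforce w in [-eps, eps] by clipping
        p.data.clamp_(-0.05, 0.05)

    # Generator step: maximize E[f_w(g_theta(z))] by minimizing its negation.
    z = torch.randn(64, 32)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()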
3.2 Discrete Structure Autoencoders

An autoencoder is any model trained to map an input to a code space and then back to the original form. Ideally the code represents important abstracted features of the original input (although this is difficult to formalize), instead of learning to simply copy any given input. We are interested in probabilistic autoencoders for discrete structures. Define X = V^n to be a discrete set of structures where V is a vocabulary of symbols. For instance, for binarized images V = {0, 1} and n is the number of pixels, or for sentences V = {1, ..., #words} and n is the sentence length. A discrete structure autoencoder consists of two parameterized functions: a deterministic encoder function enc_phi : X -> C with parameters phi, and a decoder distribution p_psi(x | c) with parameters psi that gives a distribution over structures X. The model is trained with a cross-entropy reconstruction loss, where we learn parameters to minimize the negative log-likelihood of reconstruction:

    \mathcal{L}_{\text{AE}}(\phi, \psi) = -\log p_\psi(x \mid \text{enc}_\phi(x))    (2)

Computing this for arbitrary sets is intractable, so the choice of p_psi is important and problem-specific. Finally, it is often useful to use the decoder to produce a point estimate from X. We call this x_hat = argmax_x p_psi(x | enc_phi(x)). When x_hat = x the autoencoder is said to copy the input, or perfectly reconstruct x.
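When the decoder factorizes over positions, as in both instantiations of Section 5, Equation 2 reduces to a per-position cross-entropy and the point estimate x_hat is an independent argmax at each position. The sketch below assumes such a factorized decoder; the vocabulary size, dimensions, and the MLP stand-ins for enc_phi and p_psi are placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, n, d = 1000, 10, 128                        # assumed vocabulary size, length, code dimension
    encoder = nn.Sequential(nn.Embedding(V, d), nn.Flatten(), nn.Linear(n * d, d))
    decoder = nn.Linear(d, n * V)                  # placeholder decoder producing per-position logits

    x = torch.randint(0, V, (8, n))                # a toy batch of discrete structures
    c = encoder(x)                                 # c = enc_phi(x)
    logits = decoder(c).view(8, n, V)              # p_psi(x | c), factorized over positions

    # Equation 2: negative log-likelihood of reconstructing x from its own code.
    loss = F.cross_entropy(logits.reshape(-1, V), x.reshape(-1))

    # Point estimate x_hat = argmax_x p_psi(x | enc_phi(x)); copying means x_hat == x.
    x_hat = logits.argmax(dim=-1)
    perfectly_reconstructed = (x_hat == x).all(dim=1)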
4 Model

An adversarially regularized autoencoder (ARAE) combines a discrete autoencoder with a code-space GAN. Our model employs a discrete autoencoder to learn continuous codes based on discrete inputs, and a WGAN to learn an implicit probabilistic model over these codes. The aim is to exploit the GAN's ability to learn the latent structure of code data, while using an autoencoder to abstract away the encoding and generation of discrete structure to support GAN training.

[Figure 1: The model used as an autoencoder (ARAE), where a structure x is encoded and decoded to produce x_hat, and as a GAN (ARAE-GAN), where a sample z is used to generate a code vector which is similarly decoded to x_tilde.]

The main difference with WGANs, as described above, is that we no longer have access to observed data samples for the GAN. Instead we have access to discrete structures x ~ P_r, where P_r is the distribution of interest. (Working with this space directly would require backpropagating through non-differentiable operations and is the basis for policy gradient methods for GAN training.) We handle this issue by integrating an encoder into the procedure, which first maps x to a continuous code c = enc_phi(x), i.e. using the code vector for each observed structure defined by enc_phi.

The full model has a three-part objective. We minimize reconstruction error in the AE while employing adversarial training on its code space:

    \min_{\phi, \psi} \; \mathcal{L}_{\text{AE}}(\phi, \psi)    (3)

    \min_{\phi} \max_{w \in \mathcal{W}} \; \mathcal{L}_{\text{WGAN-cri}}(w, \phi) = \min_{\phi} \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(\text{enc}_\phi(x))] - \mathbb{E}_{\hat{c} \sim \mathbb{P}_g}[f_w(\hat{c})]    (4)

    \min_{\theta} \; \mathcal{L}_{\text{WGAN-gen}}(\theta) = \min_{\theta} \; -\mathbb{E}_{\hat{c} \sim \mathbb{P}_g}[f_w(\hat{c})]    (5)

where P_r is the real distribution in the input space. We minimize the three objectives jointly in this work. Our model is visually depicted in Figure 1. The algorithm used for training is shown in Algorithm 1. We use block coordinate descent to optimize the AE, critic and generator in turn. Notably, with this change we now receive gradients through the encoder from the adversarial loss. This gradient will allow the encoder to help the generator produce samples in the support of the true data learned by the WGAN critic. Theoretically, the effect of such a term should decrease (and eventually diminish) as the GAN converges to a Nash equilibrium.
5 Architectures

We consider two different instantiations of ARAEs, one for discrete images and the other for text sequences. For both models we use the same WGAN architecture but substitute in different autoencoder architectures. The generator architecture uses a low-dimensional z with a Gaussian prior p(z) = N(0, I), and maps it to c. Both the critic f_w and the generator g_theta are parameterized as feed-forward MLPs. The structure of the deterministic encoder enc_phi and probabilistic decoder p_psi is specialized for the domain.

Image Model. Our first model uses a fully-connected neural network to encode binarized images. Here X = {0, 1}^n where n is the image size. The encoder used is a feed-forward MLP network mapping from {0, 1}^n to R^m, enc_phi(x) = MLP(x; phi) = c. The decoder predicts each pixel in x as a parameterized logistic regression, p_psi(x | c) = prod_{j=1}^n sigma(h)_j^{x_j} (1 - sigma(h)_j)^{1 - x_j}, where h = MLP(c; psi).
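A minimal sketch of this image autoencoder, using the 784-800-400-100 encoder and 100-400-800-1000-784 decoder sizes listed in the appendix; batch normalization, the unit-ball code normalization, and the additive code noise described there are omitted, and the binary cross-entropy below is simply the negative log of the per-pixel logistic likelihood above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    n = 28 * 28                                        # binarized MNIST image size
    encoder = nn.Sequential(nn.Linear(n, 800), nn.ReLU(),
                            nn.Linear(800, 400), nn.ReLU(),
                            nn.Linear(400, 100))
    decoder = nn.Sequential(nn.Linear(100, 400), nn.ReLU(),
                            nn.Linear(400, 800), nn.ReLU(),
                            nn.Linear(800, 1000), nn.ReLU(),
                            nn.Linear(1000, n))

    x = torch.bernoulli(torch.full((16, n), 0.5))      # toy batch of binarized images
    c = encoder(x)                                     # enc_phi(x) = MLP(x; phi)
    h = decoder(c)                                     # h = MLP(c; psi), per-pixel logits

    # -log p_psi(x | c) with each pixel a logistic regression on h.
    recon_loss = F.binary_cross_entropy_with_logits(h, x)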
Algorithm 1  ARAE Training Procedure

    for number of training iterations do
        Train the autoencoder:
            Sample a batch {x^(i)}_{i=1}^m ~ P_r from the training data.
            Compute the latent representations c^(i) = enc_phi(x^(i)).
            Compute the autoencoder loss L_AE = -(1/m) sum_i log p_psi(x^(i) | c^(i)), backpropagate gradients, and update the decoder (psi) and the encoder (phi).
        Train the critic (for k steps):
            Positive sample phase:
                Compute the adversarial loss on the real samples, -(1/m) sum_i f_w(c^(i)), backpropagate gradients, and update the critic (w) and the encoder (phi).
            Negative sample phase:
                Sample a batch of random noise {z^(i)}_{i=1}^m ~ N(0, I).
                Generate code representations c_hat^(i) = g_theta(z^(i)) by passing z^(i) through the generator.
                Compute the adversarial loss, (1/m) sum_i f_w(c_hat^(i)), backpropagate gradients, and update the critic (w).
            Clip the weights of the critic w to [-eps, eps]^d.
        Train the generator:
            Sample a batch of random noise {z^(i)}_{i=1}^m ~ N(0, I).
            Generate code representations c_hat^(i) = g_theta(z^(i)) by passing z^(i) through the generator.
            Compute the generator loss -(1/m) sum_i f_w(c_hat^(i)), and backpropagate gradients through the critic into the generator to update the generator (theta).
    end for
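Below is a minimal PyTorch-style sketch of one pass through Algorithm 1. The component networks are small stand-ins (an MLP encoder instead of the task-specific architectures of Section 5), the optimizers and sizes are assumptions, and details from the appendix such as code normalization and rescaling of the critic gradient into the encoder are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, n, d, m, eps = 1000, 15, 100, 64, 0.05       # assumed sizes and clipping factor

    embed   = nn.Embedding(V, d)
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(n * d, d))                  # enc_phi (placeholder)
    decoder = nn.Linear(d, n * V)                                               # p_psi (placeholder)
    critic  = nn.Sequential(nn.Linear(d, 100), nn.ReLU(), nn.Linear(100, 1))    # f_w
    gen     = nn.Sequential(nn.Linear(32, 100), nn.ReLU(), nn.Linear(100, d))   # g_theta

    ae_params  = list(embed.parameters()) + list(encoder.parameters()) + list(decoder.parameters())
    enc_params = list(embed.parameters()) + list(encoder.parameters())
    opt_ae  = torch.optim.SGD(ae_params, lr=1.0)
    opt_enc = torch.optim.SGD(enc_params, lr=1e-4)
    opt_cri = torch.optim.Adam(critic.parameters(), lr=5e-5)
    opt_gen = torch.optim.Adam(gen.parameters(), lr=5e-5)

    x = torch.randint(0, V, (m, n))                 # toy batch standing in for x ~ P_r

    # 1) Train the autoencoder: minimize -(1/m) sum_i log p_psi(x_i | enc_phi(x_i)).
    c = encoder(embed(x))
    logits = decoder(c).view(m, n, V)
    loss_ae = F.cross_entropy(logits.reshape(-1, V), x.reshape(-1))
    opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()

    # 2) Train the critic (and the encoder) on real codes vs. generated codes.
    for _ in range(10):                             # k critic steps (k is a hyperparameter)
        c_real = encoder(embed(x))
        c_fake = gen(torch.randn(m, 32)).detach()
        loss_cri = -(critic(c_real).mean() - critic(c_fake).mean())
        opt_cri.zero_grad(); opt_enc.zero_grad()
        loss_cri.backward()                         # gradients also flow into enc_phi
        opt_cri.step(); opt_enc.step()
        for p in critic.parameters():               # clip w to [-eps, eps]
            p.data.clamp_(-eps, eps)

    # 3) Train the generator to fool the critic on generated codes.
    loss_gen = -critic(gen(torch.randn(m, 32))).mean()
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()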
Text Model. Our second model is devised for text. Here X = V^n, where n is the sentence length and V is the vocabulary of the underlying language (typically |V| is on the order of 10k or more). Following usual practice we use a recurrent neural network (RNN) as both the text encoder and decoder. Define an RNN as a parameterized recurrent function h_j = RNN(x_j, h_{j-1}; phi) for j = 1 ... n (with h_0 = 0) that maps a discrete input structure x to hidden vectors h_1 ... h_n. For the encoder, we define enc_phi(x) = h_n = c, the last hidden state in this recurrence. The decoder is defined in a similar way, with parameters psi. For prediction we combine c with h_j to produce a distribution over V at each time step, p_psi(x | c) = prod_{j=1}^n softmax(W[h_j; c] + b)_{x_j}, where W and b are parameters (part of psi). Finding the most likely sequence x_hat under this distribution is intractable, but we can approximate it using greedy search or beam search. In our experiments we use an LSTM architecture [13] for both the encoder and decoder, and train with teacher-forcing.
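A minimal sketch of this text autoencoder, assuming one-layer LSTMs with 300 hidden units, a <bos> token at index 0, and teacher-forcing; the unit-ball normalization and code noise from the appendix are omitted, and all names and dimensions are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, d = 11000, 300                                  # assumed vocabulary and hidden size
    embed   = nn.Embedding(V, d)
    enc_rnn = nn.LSTM(d, d, batch_first=True)          # encoder LSTM
    dec_rnn = nn.LSTM(d + d, d, batch_first=True)      # decoder input: [word embedding; code c]
    proj    = nn.Linear(d + d, V)                      # softmax(W [h_j; c] + b)

    def encode(x):
        _, (h_n, _) = enc_rnn(embed(x))
        return h_n[-1]                                 # enc_phi(x) = last hidden state c

    def decode_logits(prev_words, c):
        # Teacher-forcing: feed the gold previous word and concatenate c at every step.
        emb = embed(prev_words)
        c_rep = c.unsqueeze(1).expand(-1, prev_words.size(1), -1)
        h, _ = dec_rnn(torch.cat([emb, c_rep], dim=-1))
        return proj(torch.cat([h, c_rep], dim=-1))     # per-step distribution over V

    x = torch.randint(1, V, (8, 15))                   # toy batch of sentences
    c = encode(x)
    bos = torch.zeros(x.size(0), 1, dtype=torch.long)  # assumed <bos> index 0
    logits = decode_logits(torch.cat([bos, x[:, :-1]], dim=1), c)
    loss = F.cross_entropy(logits.reshape(-1, V), x.reshape(-1))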
Semi-Supervised Model. Our model is trained in an unsupervised manner as a combination of an autoencoder and a GAN. As an extension we also consider the use case where the code vector is additionally used as input to a supervised classification task. As in the standard semi-supervised setup, we assume that our data consist of a small set of labeled data {x_i, y_i} and a large set of unlabeled data {x_j}. We can set up a standard supervised classification loss function using the code vectors from the encoder and a new set of parameters u:

    \mathcal{L}_{\text{NLL}}(\phi, u) = -\log p_u(y \mid \text{enc}_\phi(x))    (6)

We then extend our multi-task loss function to include this objective.
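A sketch of how the classification loss in Equation 6 can be attached to the encoder for the SNLI setting. How the premise and hypothesis code vectors are combined, the linear classifier, and the loss weight are assumptions rather than the paper's exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 300
    classifier = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 3))  # parameters u

    def nll_class_loss(c_premise, c_hypothesis, y):
        # Equation 6: -log p_u(y | enc_phi(x)), here applied to a pair of SNLI code vectors.
        logits = classifier(torch.cat([c_premise, c_hypothesis], dim=-1))
        return F.cross_entropy(logits, y)

    # Toy usage: in practice the code vectors come from enc_phi on labeled sentence pairs.
    c_p, c_h = torch.randn(8, d), torch.randn(8, d)
    y = torch.randint(0, 3, (8,))                       # entailment / contradiction / neutral
    loss_total = 1.0 * nll_class_loss(c_p, c_h, y)      # added to the multi-task ARAE objective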
6 Methods and Data

We consider two different settings for testing the ARAE: (1) images, utilizing the binarized version of MNIST, and (2) text, using the Stanford Natural Language Inference (SNLI) corpus [3]. This corpus provides a useful testbed as it comprises sentences with relatively simple structure. The corpus is additionally annotated for pairwise sentence classification, which allows us to experiment with semi-supervised learning in a controlled setting. For this task the model is presented with two sentences, a premise and a hypothesis, and has to predict their relationship: entailment, contradiction, or neutral. For training, we used a subset of the corpus consisting of sentences of less than 15 words, although preliminary results suggest this approach works up to 30 words.
[Figure 2: Left: produced by an AE. Middle: produced by an ARAE. The arrangement of the left and middle figures is: (i) top blocks are the input to the AE, clean and noised; (ii) bottom blocks are the corresponding reconstructions. Right: results of the ARAE. The top block consists of the reconstruction of the real MNIST samples; the middle blocks are the output of the decoder taking the hidden codes generated by the GAN; the bottom blocks are the sample interpolation results, constructed by linearly interpolating in the latent space and then decoding back to the pixels.]
We consider several different empirical tasks to test the performance of the model both as an autoencoder (ARAE), by using the encoder aspect, and as a latent-variable model (ARAE-GAN), by sampling z's (the two are trained identically). Experiments include: (1) code space structure: does the model preserve natural inputs x ~ P_r while not preserving noised inputs x_tilde? (2) semi-supervised learning: does the performance of a supervised model improve when it is additionally trained as an autoencoder? (3) sample generation: how well does a simple model do when trained on generated samples? (4) interpolation and arithmetic: how easily can we manipulate vectors in Z to smoothly control the generated text samples x_hat?

For these experiments we compare to a standard AE, trained without the code-space GAN component, as well as a standard language model. We also attempted to train VAEs on the text dataset but found that it was unable to learn meaningful latent representations despite tuning the latent dimension size, KL annealing, and word dropout. Refer to the appendix for a detailed description of the hyperparameters, model architecture, and training regime.
7 Experiments

7.1 Code Space Structure

As the code space we use by definition does not have the capacity to represent the entire discrete input space, ideally the autoencoder would learn to maintain valid representations only for real inputs, which roughly exist along a low-dimensional manifold determined by the space of natural images or natural language sentences. This property is difficult to maintain in standard autoencoders, which often learn a partial identity mapping, but ideally should be improved by code space regularization. We test this property by passing two sets of samples through ARAE, one of true held-out samples and the other of explicitly-noised examples existing off this manifold. Figure 2 (left) shows these examples and their reconstruction from the discretized MNIST, where the noised examples come from adding noise to the original image. For images we observe that a regular AE simply copies inputs, regardless of whether the input is on the data manifold. ARAE, on the other hand, learns not to reproduce the noised samples.

Table 1 (right) shows similar experiments for text, where we add noise by permuting k words in each sentence. Again we observe that the ARAE is able to map a noised sample back onto a coherent sentence. Table 1 (left) shows empirical results for these experiments. We obtain the reconstruction error (i.e. negative log-likelihood) of the original (non-noised) sentence under the decoder, utilizing the noised code. We find that when k = 0 (i.e. no swaps), the regular AE better reconstructs the input (as expected). However, as we increase the number of swaps and push the input further away from the data manifold, the ARAE is more likely to produce the original sentence. We note that unlike denoising autoencoders, which require a domain-specific noising function [11, 33], the ARAE is not explicitly trained to denoise an input, but learns to do so as a byproduct of adversarial regularization.

[Table 1: Left: reconstruction error (negative log-likelihood) of the original sentence from a corrupted sentence; here k is the number of swaps performed on the original sentence. Right: samples generated from AE and ARAE where the input is noised by swapping words.

    k    AE     ARAE
    1    4.51   4.07
    2    6.61   5.3_
    3    9.14   6.86

    Original: A woman wearing sunglasses.        Original: They have been swimming.
    Noised:   A woman sunglasses wearing.        Noised:   been have they swimming.
    Original: Pets galloping down the street.    Original: The child is sleeping.
    Noised:   Pets down the galloping street.    Noised:   child the is sleeping.]

[Table 2: Left: semi-supervised accuracy on the natural language inference (SNLI) task, using 22.2% (medium), 10.8% (small), and 5.25% (tiny) of the supervised labels of the full SNLI training set (the rest is used for unlabeled AE training). Right: perplexity (lower is better) of language models trained on the real data and on synthetic samples from a GAN/AE/LM.

    Real data    27.4]
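The k-swap noising used in Table 1 can be sketched as k random swaps of word positions; the source does not specify whether the swaps are adjacent or arbitrary, so the arbitrary-position version below is an assumption.

    import random

    def noise_by_swaps(sentence, k):
        """Return a copy of the token list with k random pairs of positions swapped."""
        tokens = list(sentence)
        for _ in range(k):
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return tokens

    print(noise_by_swaps("the child is sleeping .".split(), k=2))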
7.2 Semi-Supervised Training

Next we utilize ARAE for semi-supervised training on a natural language inference task, shown in Table 2 (left). We experiment with using 22.2%, 10.8% and 5.25% of the original labeled training data, and use the rest of the training set for unlabeled AE training. The labeled set is randomly picked. The full SNLI training set contains 543k sentence pairs, and we use supervised sets of 120k, 59k and 28k sentence pairs respectively for the three settings. As a baseline we use an AE trained on the additional data, similar to the setting explored in [7]. For ARAE we use the subset of unsupervised data of length < 15, which roughly includes 655k single sentences (due to the length restriction, this is a subset of the 715k sentences that were used for AE training). As observed by Dai and Le [7], training on unlabeled data with an AE objective improves upon a model just trained on labeled data. Training with adversarial regularization provides further gains.
[Figure 3: Text samples generated from ARAE-GAN, a simple AE, and a baseline LM trained on the same data. To generate from an AE we fit a multivariate Gaussian to the learned code space and generate code vectors from this Gaussian.]
[Figure 4: Sample interpolations from the ARAE-GAN, constructed by linearly interpolating in the latent space and decoding to the output space. Word changes are highlighted in black.]
7.3 Sample Generation

A common test for a GAN's ability to generate realistic samples that cover the original data space is to train a simple model on the samples from the GAN itself. Acknowledging the pitfalls of such quantitative evaluations [31], for text GANs we can do this by producing a large set of sampled sentences and training a simple language model over the generations. For these experiments we generate 100k samples from (i) ARAE-GAN, (ii) an AE, (iii) a RNN LM trained on the same data, and (iv) the real training set. To "sample" from an AE we fit a multivariate Gaussian to the code space (of the training data) after training the AE, generate code vectors from this Gaussian, and decode back into sentence space. All models are of the same size to allow for fair comparison.
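A sketch of the AE "sampling" baseline just described: fit a multivariate Gaussian to the training-set code vectors and draw new codes from it before decoding. The dimensions are placeholders, the jitter term is a numerical-stability assumption, and the decoding step is left as a placeholder comment.

    import torch

    codes = torch.randn(10000, 300)                      # stand-in for enc_phi(x) over the training set

    mean = codes.mean(dim=0)
    cov = torch.cov(codes.T) + 1e-4 * torch.eye(codes.size(1))   # empirical covariance + small jitter
    gaussian = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)

    sampled_codes = gaussian.sample((100,))              # "samples" from the AE code space
    # sentences = [decode(c) for c in sampled_codes]     # decode back with p_psi (placeholder)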
Samples from the models are shown in Figure 3. We subsequently train a standard RNN language model on the generated data and evaluate perplexity on held-out real data. The language model is of the same size as the decoder of the ARAE. As can be seen from Table 2, training on real data (understandably) outperforms training on generated data by a large margin. Surprisingly, however, we find that a language model trained on ARAE-GAN data performs slightly better than one trained on LM-generated/AE-generated data.
7.4 Interpolation and Vector Arithmetic

A widely observed property of GANs (and VAEs) is that the Gaussian prior p(z) induces the ability to smoothly interpolate between outputs by exploiting the structure of the latent space. While language models may provide a better estimate of the underlying probability space, constructing this style of interpolation would require combinatorial search, which makes this a useful feature of text GANs. We experiment with this property by sampling two points z_0 and z_1 from p(z) and constructing intermediary points z_lambda = lambda z_1 + (1 - lambda) z_0. For each we generate the argmax output x_hat_lambda. The samples are shown in Figure 4 for text and in Figure 2 (right, bottom) for MNIST. While it is difficult to assess the "accuracy" of these interpolations, we generally qualitatively observe smooth changes in the output sentences/images as we move from one latent point to another.

Another intriguing property of image GANs is the ability to move in the latent space via offset vectors (similar to the case with word vectors [23]). For example, Radford et al. [25] observe that when the mean latent vector for "men with glasses" is subtracted from the mean latent vector for "men without glasses" and applied to an image of a "woman without glasses", the resulting image is that of a "woman with glasses". We experiment to see if a similar property holds for sentences. We generate 1 million sentences from the ARAE-GAN and parse the sentences to obtain the main verb, subject, and modifier. Then, for a given sentence, to change the main verb we subtract the mean latent vector (t) for other sentences with the same main verb (in the first example in Figure 5 this would correspond to sentences that had "sleeping" as the main verb) and add the mean latent vector for sentences that have the desired transformation (with the running example this would be sentences whose main verb was "walking"). We do the same to transform the subject and the modifier. We decode back into sentence space with the transformed latent vector via sampling from p_psi(x | g(z + t)).
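The offset-vector manipulation can be sketched as follows, assuming we have kept the latent vector z of each generated sentence together with its parsed main verb; the grouping structure and names are illustrative.

    import torch

    def transform_latent(z, z_by_verb, source_verb, target_verb):
        """Move z by the difference of attribute means: z - mean(source) + mean(target)."""
        offset = z_by_verb[target_verb].mean(dim=0) - z_by_verb[source_verb].mean(dim=0)
        return z + offset

    # Toy usage: z_by_verb maps a main verb to the stacked latent vectors of sentences using it.
    z_by_verb = {"sleeping": torch.randn(500, 100), "walking": torch.randn(500, 100)}
    z_new = transform_latent(torch.randn(100), z_by_verb, "sleeping", "walking")
    # The transformed code g_theta(z_new) is then decoded by sampling from p_psi(x | g_theta(z_new)).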
Some examples of successful transformations are shown in Figure 5 (right). Quantitative evaluation of the success of the vector transformations is given in Figure 5 (left). For each original vector z we sample 100 sentences from p_psi(x | g(z + t)) over the transformed new latent vector and consider it a match if any of the sentences demonstrate the desired transformation. Match % is the proportion of original vectors that yield a match post transformation. As we ideally want the generated samples to only differ in the specified transformation, we also calculate the average word precision against the original sentence (Prec) for any match.
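A sketch of the two metrics as described: Match % asks whether any of the 100 decoded samples exhibits the desired transformation, and Prec measures word overlap with the original sentence for a matching sample. The exact overlap definition below is an assumption about how the average word precision is computed.

    def word_precision(candidate, original):
        """Fraction of candidate tokens that also appear in the original sentence."""
        original_set = set(original)
        return sum(tok in original_set for tok in candidate) / max(len(candidate), 1)

    def evaluate_transform(decoded_samples, original, has_transformation):
        """decoded_samples: 100 token lists; has_transformation: predicate for the desired change."""
        matches = [s for s in decoded_samples if has_transformation(s)]
        matched = len(matches) > 0
        prec = word_precision(matches[0], original) if matched else None
        return matched, prec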
[Figure 5: Left: quantitative evaluation of transformations. Match % refers to the % of samples where at least one of the 100 decoder samples had the desired transformation in the output, while Prec. measures the average precision of the output against the original sentence. Right: examples (out of 100 decoder samples per sentence) where the offset vectors produced successful transformations of the original sentence. See Section 7.4 for methodology.

    Transform    Match %    Prec
    walking      85         79.5
    man          92         80.2
    two          86         74.1
    dog          88         77.0
    several      70         67.0]
8 Conclusion

We present adversarially regularized autoencoders as a simple approach for training a discrete structure autoencoder jointly with a code-space generative adversarial network. The model learns an improved autoencoder, as demonstrated by semi-supervised experiments and analysis of the manifold structure for text and images. It also learns a useful generative model for text that exhibits a robust latent space, as demonstrated by natural interpolations and vector arithmetic. We note, however, that (as has been frequently observed when training GANs) our model seemed to be quite sensitive to hyperparameters. Finally, while many useful models for text generation already exist, text GANs provide a qualitatively different approach influenced by the underlying latent variable structure. We envision that such a framework could be extended to a conditional setting, combined with other existing decoding schemes, or used to provide a more interpretable model of language.
Acknowledgments

We thank Sam Wiseman, Kyunghyun Cho, Sam Bowman, Joan Bruna, Yacine Jernite, Martín Arjovsky, Mikael Henaff and Michael Mathieu for fruitful discussions. Yoon Kim is sponsored by a SYSTRAN research award. We also thank the NVIDIA Corporation for the donation of a Titan X Pascal GPU that was used for this research.
References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
[2] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: Boundary Equilibrium Generative Adversarial Networks. arXiv:1703.10717, 2017.
[3] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of EMNLP, 2015.
[4] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In Proceedings of CoNLL, 2016.
[5] Tong Che, Yanran Li, Ruixiang Zhang, R Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. arXiv:1702.07983, 2017.
[6] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder. In Proceedings of ICLR, 2017.
[7] Andrew M. Dai and Quoc V. Le. Semi-supervised Sequence Learning. In Proceedings of NIPS, 2015.
[8] Peter Glynn. Likelihood Ratio Gradient Estimation: An Overview. In Proceedings of the Winter Simulation Conference, 1987.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Proceedings of NIPS, 2014.
[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved Training of Wasserstein GANs. arXiv:1704.00028, 2017.
[11] Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL, 2016.
[12] R Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundary-Seeking Generative Adversarial Networks. arXiv:1702.08431, 2017.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
[14] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. In Proceedings of ICLR, 2017.
[15] Diederik P. Kingma, Tim Salimans, and Max Welling. Improving Variational Inference with Inverse Autoregressive Flow. arXiv:1606.04934, 2016.
[16] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proceedings of ICLR, 2014.
[17] Matt Kusner and José Miguel Hernández-Lobato. GANs for Sequences of Discrete Elements with the Gumbel-Softmax Distribution. arXiv:1611.04051, 2016.
[18] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial Learning for Neural Dialogue Generation. arXiv:1701.06547, 2017.
[19] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proceedings of ICLR, 2017.
[20] Alireza Makhzani and Brendan Frey. PixelGAN Autoencoders. arXiv:1706.00531, 2017.
[21] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial Autoencoders. arXiv:1511.05644, 2015.
[22] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. arXiv:1701.04722, 2017.
[23] Tomas Mikolov, Scott Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL, 2013.
[24] Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. Language Generation with Recurrent Generative Adversarial Networks without Pre-training. arXiv:1706.01399, 2017.
[25] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Proceedings of ICLR, 2016.
[26] Danilo J. Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. In Proceedings of ICML, 2015.
[27] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of ICML, 2014.
[28] Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial Generation of Natural Language. arXiv:1705.10929, 2017.
[29] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A Hybrid Convolutional Variational Autoencoder for Text Generation. arXiv:1702.02390, 2017.
[30] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style Transfer from Non-Parallel Text by Cross-Alignment. arXiv:1705.09655, 2017.
[31] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A Note on the Evaluation of Generative Models. In Proceedings of ICLR, 2016.
[32] Dustin Tran, Rajesh Ranganath, and David M. Blei. Deep and Hierarchical Implicit Models. arXiv:1702.08896, 2017.
[33] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of ICML, 2008.
[34] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8, 1992.
[35] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved Variational Autoencoders for Text Modeling using Dilated Convolutions. In Proceedings of ICML, 2017.
[36] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In Proceedings of AAAI, 2017.
[37] Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-Based Generative Adversarial Network. CoRR, abs/1609.03126, 2016.
Appendix: Experiment Details

MNIST experiments

- The encoder is a three-layer MLP, 784-800-400-100. The output of the encoder is normalized onto a unit ball (having l2 norm 1), denoted as c in R^100, before being forwarded further. Except for the output layer, batch normalization and ReLU are used following the linear layers.
- We also add additive Gaussian noise to c, which is then fed into the decoder. The standard deviation of that noise is initialized to 0.4 and then exponentially decayed to 0 (sketched after this list).
- The decoder is a four-layer MLP, 100-400-800-1000-784. Except for the output layer, batch normalization and LeakyReLU (scale = 0.2) are used following the linear layers.
- The autoencoder is optimized by Adam, with learning rate 5e-04.
- The GAN employs an MLP generator, with structure 32-64-100-150-100. The noise vector is z in R^32.
- The GAN employs an MLP critic, with structure 100-100-60-20-1. The clipping factor is eps = 0.05. The critic is trained with 10 iterations in each GAN loop.
- The GAN is optimized by Adam, with learning rate 5e-04 on the generator and 5e-05 on the critic.
- When updating the encoder, we multiply the critic gradient by 0.2 before backpropagating to the encoder.
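The unit-ball normalization and decaying additive code noise from the list above can be sketched as follows; the multiplicative decay schedule and its rate are assumptions (the text states only the initial value and that the noise decays toward 0).

    import torch
    import torch.nn.functional as F

    def regularize_code(c, step, sigma0=0.4, decay=0.995):
        """Project the code onto the unit l2 ball, then add decaying Gaussian noise."""
        c = F.normalize(c, p=2, dim=-1)                  # enforce ||c||_2 = 1
        sigma = sigma0 * (decay ** step)                 # assumed exponential schedule
        return c + sigma * torch.randn_like(c)

    noisy_c = regularize_code(torch.randn(16, 100), step=1000)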
Text experiments

The architecture we used for the text generation task is described below:

- The encoder is a one-layer LSTM with 300 hidden units. The output of the encoder is normalized onto a unit ball (having l2 norm 1), denoted as c, before being forwarded further.
- We add Gaussian noise to c before feeding it into the decoder. The standard deviation of that noise is initialized to 0.2 and then exponentially decayed every 100 iterations by a factor of 0.995.
- The decoder is a one-layer LSTM with 300 hidden units.
- The decoding process at each time step takes the top-layer LSTM hidden state and concatenates it with the hidden code c before feeding them into the output (i.e. vocabulary projection) and the softmax layer.
- The word embedding is of size 300.
- We adopt gradient clipping on the encoder/decoder, with max gradient norm = 1.
- The encoder/decoder is optimized by vanilla SGD with learning rate 1.
- The GAN employs an MLP generator, with structure 100-300-300, batch normalization, and ReLU nonlinearity. The noise vector is z in R^100.
- The GAN employs an MLP critic, with structure 300-300-1, batch normalization, and LeakyReLU (scale = 0.2) nonlinearity. The clipping factor is eps = 0.01. The critic is trained with 5 iterations in each GAN loop.¹
- The GAN is optimized by Adam, with learning rate 5e-05 on the generator and 1e-05 on the critic.
- When we update the encoder, we normalize both sources of gradients, i.e. from the critic and the decoder, based on their norms. After that, a weight factor of 0.01 is imposed on the critic-backpropagated gradient.
- We increment the number of GAN training loops by 1 (it is initially set to 1) at the beginning of epoch #2, epoch #4 and epoch #6, respectively.
- We train for a total of 6 epochs.

¹The GAN training loop refers to how many times we train the GAN in each entire training loop (one training loop contains training the autoencoder for one loop, and training the GAN for one or several).
Semi-supervised experiments

The architecture we used for the semi-supervised learning task is described below:

- The encoder is a three-layer LSTM with hidden state size 300. The output of the encoder is normalized onto a unit ball (having l2 norm 1), denoted as c in R^300, before being forwarded further.
- The decoder is a one-layer LSTM with hidden state size 300.
- The decoding process at each time step takes the top-layer LSTM hidden state and concatenates it with the hidden code c before forwarding them to a word transition matrix.
- The initial decoder hidden state is initialized from c with a linear transformation.
- The word embedding is of size 300.
- We adopt gradient clipping on both LSTMs, with a maximum allowed gradient norm of 1.
- The encoder/decoder is optimized by vanilla SGD, learning rate 1.
- The GAN employs an MLP generator, with structure 100-150-300-500, batch normalization, and ReLU nonlinearity. The noise vector is z in R^100.
- The GAN employs an MLP critic, with structure 500-500-150-80-20-1, batch normalization, and LeakyReLU (scale = 0.2) nonlinearity. The clipping factor is eps = 0.02. The critic is trained with 10 iterations in each GAN loop.
- The GAN is optimized by Adam, with learning rate 5e-05 on the generator and 1e-05 on the critic.
- When we update the encoder, we normalize both sources of gradients, i.e. from the critic and the decoder, based on their norms. After that, we multiply the critic gradients by 0.01 before backpropagating to the encoder.

Note we use the same architecture for all three experiment settings (i.e. label set portion: Medium (22.2%), Small (10.8%), Tiny (5.25%)). The baseline models under comparison (Supervised Encoder, Semi-Supervised AE) use the same setting as described above.