200609

stri.destride 2020. 6. 9. 22:23

DiscreTalk : Text-to-Speech as a Machine Translation Problem

: GAN-based VQ-VAE + NMT-Transformer

1) GAN-based VQ-VAE

- loss: uses the VQ-VAE loss, but the encoder and decoder structure is taken from MelGAN. There are K discriminators. So the training loss consists of two spectrum losses, a codebook loss, a commitment loss, and an adversarial loss.
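To make the codebook and commitment terms concrete, here is a minimal pure-Python sketch of the quantization step and the two VQ-VAE loss values. It is not the paper's implementation: the toy codebook, the encoder outputs, the helper names, and `beta=0.25` are all assumptions for illustration, and the spectrum and adversarial terms are left out.

```python
# Sketch of the vector-quantization step and the two VQ-VAE loss terms.
# Forward values only: in a real framework a stop-gradient decides which
# side (codebook or encoder) each term actually updates.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(z_e, codebook):
    """Return (index, code vector) of the codebook entry nearest to z_e."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]

def vq_loss_terms(encoder_outputs, codebook, beta=0.25):
    """Mean codebook loss and commitment loss over a sequence of encoder outputs.

    beta is an assumed weight, not a value taken from the notes.
    """
    total = 0.0
    indices = []
    for z_e in encoder_outputs:
        index, e = quantize(z_e, codebook)
        indices.append(index)
        total += squared_distance(z_e, e)
    mean_sq = total / len(encoder_outputs)
    codebook_loss = mean_sq            # ||sg[z_e] - e||^2, pulls codes toward encoder outputs
    commitment_loss = beta * mean_sq   # beta * ||z_e - sg[e]||^2, keeps the encoder near its code
    return indices, codebook_loss, commitment_loss

# Hypothetical toy data: 3 codebook entries, 3 encoder frames, 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
encoder_outputs = [[0.1, -0.1], [0.9, 1.2], [-1.0, 0.5]]
indices, codebook_loss, commitment_loss = vq_loss_terms(encoder_outputs, codebook)
```

The returned `indices` are the discrete symbols that the NMT-Transformer side would then treat as its target "language".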

They also tried applying techniques long used in the ASR field, such as beam search, to the NMT model.
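As a reminder of what beam search does when decoding a sequence of discrete symbols, here is a small generic sketch. The probability table, the token names, and the `beam_search` helper are hypothetical and only show the idea; this is not the decoder used in the paper.

```python
import math

def beam_search(step_log_probs, beam_size, max_len, eos):
    """Generic beam search.

    step_log_probs(prefix) must return a dict mapping each next token
    to its log-probability given the prefix (a tuple of tokens).
    Returns the best (sequence, total log-probability) pair found.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that reached max_len without EOS
    return max(finished, key=lambda c: c[1])

# Hypothetical toy model in which the greedy first choice is not the best path.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "x": 0.1},
}

def toy_model(prefix):
    probs = TABLE.get(prefix, {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}

greedy_seq, greedy_score = beam_search(toy_model, beam_size=1, max_len=3, eos="</s>")
beam_seq, beam_score = beam_search(toy_model, beam_size=2, max_len=3, eos="</s>")
```

With a beam of 1 the search commits to "a" and ends with probability 0.3, while a beam of 2 keeps "b" alive and finds the 0.36 path.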

Adversarial Auto-encoders for Speech Based Emotion Recognition

AE + D = AAE

Classifying with the code extracted from the AE gives results not much different from classifying with the original 1500-dimensional features. For the discriminator's fake input, they sampled from a GMM.
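A rough sketch of how the discriminator's two kinds of input could be put together: codes from the encoder on one side, points sampled from a GMM on the other. The mixture parameters, the 2-dimensional code size, and the function names are assumptions for illustration. Each item is tagged by its source ("encoder" or "prior") rather than real/fake, since that naming is just a convention.

```python
import random

def sample_gmm(n, means, stds, weights, rng):
    """Draw n points from a diagonal-covariance Gaussian mixture."""
    samples = []
    for _ in range(n):
        # Pick a mixture component, then sample each dimension independently.
        k = rng.choices(range(len(means)), weights=weights)[0]
        samples.append([rng.gauss(m, s) for m, s in zip(means[k], stds[k])])
    return samples

def discriminator_batch(encoder_codes, means, stds, weights, rng):
    """Pair encoder codes with an equal number of GMM samples, tagged by source."""
    prior_codes = sample_gmm(len(encoder_codes), means, stds, weights, rng)
    batch = [(code, "encoder") for code in encoder_codes]
    batch += [(code, "prior") for code in prior_codes]
    rng.shuffle(batch)
    return batch

# Hypothetical setup: a 2-component mixture over a 2-dimensional latent code.
rng = random.Random(0)
means = [[-2.0, 0.0], [2.0, 0.0]]
stds = [[0.5, 0.5], [0.5, 0.5]]
weights = [0.5, 0.5]
encoder_codes = [[0.3, -0.1], [1.8, 0.2], [-2.1, 0.4], [0.0, 0.0]]
batch = discriminator_batch(encoder_codes, means, stds, weights, rng)
```

The discriminator is trained to tell the two sources apart, and the encoder is trained to fool it, which pushes the code distribution toward the GMM.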

Running SER on data generated from real points together with data generated from synthetic points gives the best performance. But honestly, isn't that obvious?

What stood out to me is that the fake latent codes were sampled from an arbitrary distribution, and yet the results still come out fine.