ML

200618

stri.destride 2020. 6. 23. 12:49

On Enhancing Speech Emotion Recognition using Generative Adversarial Networks

Previous work: Adversarial auto-encoders for speech based emotion recognition

motivation: improve speech emotion recognition (SER) performance using a GAN structure.

approach: using GAN

DB: IEMOCAP, MSP-IMPROV; features: 1582-dim vectors from the openSMILE toolkit

Background: AAE, GAN

1) a vanilla GAN alone does not converge, so an AAE is used to compress the information

: input of D: real samples and the output of G; input of G: a sample from a prior distribution

: goal: generate 1582-d feature vectors from a 2-d prior p_z
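The vanilla-GAN setup above can be sketched minimally: G maps a 2-d latent sample z ~ p_z to a 1582-d openSMILE-style feature vector, and D scores feature vectors as real or generated. This is an illustrative sketch only; the hidden size, weight scales, and activations are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, FEAT_DIM, HID = 2, 1582, 64   # 2-d p_z -> 1582-d features (HID is assumed)

# Randomly initialized toy weights; a real GAN would train these adversarially.
Wg1 = rng.standard_normal((Z_DIM, HID)) * 0.1
Wg2 = rng.standard_normal((HID, FEAT_DIM)) * 0.1
Wd1 = rng.standard_normal((FEAT_DIM, HID)) * 0.1
Wd2 = rng.standard_normal((HID, 1)) * 0.1

def generator(z):
    # z: (batch, 2) -> synthetic feature vectors: (batch, 1582)
    return np.tanh(z @ Wg1) @ Wg2

def discriminator(x):
    # x: (batch, 1582) -> probability of being a real sample: (batch, 1)
    h = np.tanh(x @ Wd1)
    return 1.0 / (1.0 + np.exp(-(h @ Wd2)))

z = rng.standard_normal((4, Z_DIM))  # batch of samples from p_z
fake = generator(z)                  # input of D: output of G (and real samples)
score = discriminator(fake)
```

The point of the sketch is the dimensionality gap: mapping 2-d noise directly to a 1582-d feature space is exactly what makes the vanilla GAN hard to converge, motivating the AAE compression.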

2) conditional GAN: uses label information so the AAE is not needed, and applies several tricks to achieve convergence

: input of D: real sample + class label and the output of G; input of G: a sample from a mixture distribution plus the mixture id

initialization: generator initialized as the AAE decoder; learning rate ratio l_d/l_g = 10; 5 generator updates per discriminator update
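The conditioning and the convergence tricks above can be sketched as follows: D sees a feature vector concatenated with a one-hot class label, G's latent sample comes with a mixture id, the discriminator learning rate is 10x the generator's, and 5 generator steps alternate with 1 discriminator step. The class count and base learning rate are assumed values for illustration.

```python
import numpy as np

N_CLASSES = 4      # assumed number of emotion classes
FEAT_DIM = 1582

def one_hot(idx, n=N_CLASSES):
    v = np.zeros(n)
    v[idx] = 1.0
    return v

def d_input(features, label_idx):
    # Discriminator input: (real or generated) sample concatenated with its class label
    return np.concatenate([features, one_hot(label_idx)])

def g_input(z, mixture_id, n_components=N_CLASSES):
    # Generator input: latent sample from one mixture component plus the mixture id
    return np.concatenate([z, one_hot(mixture_id, n_components)])

lr_g = 1e-4               # assumed base generator learning rate
lr_d = 10 * lr_g          # learning-rate ratio l_d / l_g = 10
G_STEPS_PER_D = 5         # 5 generator updates per discriminator update

# Alternating update schedule: one D step, then 5 G steps, repeated.
schedule = ["D" if step % (G_STEPS_PER_D + 1) == 0 else "G" for step in range(12)]
```

Training D less often (and faster, via the larger learning rate) is a common way to keep the discriminator from overpowering the generator early on.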

evaluation

: 1) in-domain evaluation w/ synthetic samples as the training set, with and without real data

- using only synthetic samples from the vanilla GAN, using real samples only, and using the combination: the classifier is an SVM; all three perform similarly, but the combination (the third setting) is best, and this also holds when the feature vectors are compressed into code vectors.
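The first evaluation setting can be sketched as below: train an SVM on (a) real samples only versus (c) real + synthetic combined, and compare accuracy on held-out real data. The "synthetic" data here is just noisy copies of the class means, a stand-in for GAN output; the two-class toy setup and noise level are assumptions, not the paper's data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
FEAT_DIM, N_PER_CLASS = 1582, 20
means = rng.standard_normal((2, FEAT_DIM))  # two toy "emotion" class centers

def sample(klass, n, noise=1.0):
    # Gaussian samples around a class mean (placeholder for real/GAN features)
    return means[klass] + noise * rng.standard_normal((n, FEAT_DIM))

X_real = np.vstack([sample(0, N_PER_CLASS), sample(1, N_PER_CLASS)])
y_real = np.array([0] * N_PER_CLASS + [1] * N_PER_CLASS)
X_syn = np.vstack([sample(0, N_PER_CLASS), sample(1, N_PER_CLASS)])  # "GAN" output
y_syn = y_real.copy()
X_test = np.vstack([sample(0, 10), sample(1, 10)])
y_test = np.array([0] * 10 + [1] * 10)

def accuracy(X_tr, y_tr):
    # SVM classifier, as in the note; linear kernel is an assumption
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    return clf.score(X_test, y_test)

acc_real = accuracy(X_real, y_real)
acc_comb = accuracy(np.vstack([X_real, X_syn]), np.hstack([y_real, y_syn]))
```

In this toy version the combination simply enlarges the training set; the paper's finding is that augmenting real data with GAN samples gives the best (if similar) performance.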

: 2) in-domain evaluation w/ synthetic samples as test set

- compressed (code vectors) as the training set vs. generated samples as the test set: the compressed features as the training set give better performance, which the authors attribute to the lower dimension making estimation easier for the SVM

: 3) cross-corpus evaluation w/ combination of real and synthetic data

 

Reference blog: http://blog.naver.com/PostView.nhn?blogId=kkes0220&logNo=221471523028&categoryNo=90&parentCategoryNo=0&viewDate=&currentPage=1&postListTopCurrentPage=1&from=postView