On Combining Global and Localized Self-Supervised Models of Speech

Sri Harsha Dumpala, Chandramouli S. Sastry, Rudolf Uher, Sageev Oore

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)

Abstract

Self-supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks. In this work, we study the application of speech embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT to four speech-classification tasks: sentiment classification, command detection, emotion classification, and depression detection. We distinguish between and discuss self-supervised training tasks that induce localized and global features of speech based on their temporal granularity: noting that self-supervised representation learning frameworks based on the masked language-modeling objective, such as wav2vec-2.0 and HuBERT, induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited to tasks such as depression detection and emotion classification, while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL, a popular model for learning global representations. Finally, we propose and confirm empirically that combining the global and localized representations of speech yields better performance across a range of downstream tasks than either of the individual embedding methods.
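The paper itself does not include code; the following is a minimal, hypothetical sketch of the general idea of combining localized and global speech embeddings, assuming a frozen wav2vec-2.0 model (here facebook/wav2vec2-base, an assumed checkpoint) for frame-level features. The paper's SimSiam-trained global encoder is not reproduced; a randomly initialized linear layer stands in for it purely to illustrate the concatenation step, and the pooling choices are illustrative rather than the authors' method.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SAMPLE_RATE = 16_000

# Frozen localized encoder (frame-level embeddings).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
localized_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Placeholder for the paper's SimSiam-trained global encoder (not reproduced
# here): a random projection of pooled features, used only to show shapes.
global_encoder = torch.nn.Linear(768, 256)

@torch.no_grad()
def embed(waveform: torch.Tensor) -> torch.Tensor:
    """Concatenate pooled localized (wav2vec-2.0) features with a stand-in
    global utterance embedding, as input to a downstream classifier."""
    inputs = extractor(waveform.numpy(), sampling_rate=SAMPLE_RATE,
                       return_tensors="pt")
    frames = localized_encoder(**inputs).last_hidden_state  # (1, T, 768)
    localized = frames.mean(dim=1)                          # (1, 768) pooled
    global_emb = global_encoder(frames.mean(dim=1))         # (1, 256) placeholder
    return torch.cat([localized, global_emb], dim=-1)       # (1, 1024) combined

# Dummy usage: two seconds of silence.
print(embed(torch.zeros(2 * SAMPLE_RATE)).shape)  # torch.Size([1, 1024])
```

In the paper, the combined vector would feed a task-specific classifier (e.g., for depression detection or command detection); the sketch above only shows how the two embedding streams could be joined.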

Original language: English
Pages (from-to): 3593-3597
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
DOI: 10.21437/Interspeech.2022-11174
Publication status: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: Sep 18, 2022 - Sep 22, 2022

Bibliographical note

Publisher Copyright:
Copyright © 2022 ISCA.

ASJC Scopus Subject Areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation


Cite this

Dumpala, S. H., Sastry, C. S., Uher, R., & Oore, S. (2022). On Combining Global and Localized Self-Supervised Models of Speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022-September, 3593-3597. https://doi.org/10.21437/Interspeech.2022-11174