On Combining Global and Localized Self-Supervised Models of Speech

Sri Harsha Dumpala, Chandramouli S. Sastry, Rudolf Uher, Sageev Oore

Research output: Contribution to journal > Conference article > peer-review

4 Citations (Scopus)

Abstract

Self-supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks. In this work, we study the application of speech embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT to four speech-classification tasks: sentiment classification, command detection, emotion classification, and depression detection. We distinguish between self-supervised training tasks that induce localized features of speech and those that induce global features, based on their temporal granularity. Noting that self-supervised representation learning frameworks based on the masked language-modeling objective, such as wav2vec-2.0 and HuBERT, induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited for tasks such as depression detection and emotion classification, while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL, a popular model for learning global representations. Finally, we propose and confirm empirically that combining the global and localized representations of speech yields better performance across a range of downstream tasks than any of the individual embedding methods.
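
The abstract does not say how the global and localized representations are combined. As a minimal sketch of one possibility, assumed here purely for illustration and not taken from the paper, frame-level (localized) embeddings can be mean-pooled over time, and the result concatenated with an utterance-level (global) embedding after L2-normalizing each part. The function names `mean_pool`, `l2_normalize`, and `combine_embeddings` are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level (localized) embeddings over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(
    local_frames: Sequence[Sequence[float]],
    global_vec: Sequence[float],
) -> List[float]:
    """Concatenate a pooled localized embedding with a global embedding.

    Assumption (not from the paper): simple concatenation after
    normalizing each part so that neither dominates by scale.
    """
    pooled = mean_pool(local_frames)
    return l2_normalize(pooled) + l2_normalize(global_vec)


if __name__ == "__main__":
    # Toy inputs: two 2-dimensional frames and one 3-dimensional global vector.
    frames = [[1.0, 0.0], [3.0, 0.0]]
    utterance = [0.0, 3.0, 4.0]
    print(combine_embeddings(frames, utterance))
```

The combined vector would then be passed to a downstream classifier. In practice the localized frames would come from a model such as wav2vec-2.0 or HuBERT and the global vector from a SimSiam-style encoder, but the extraction step is outside the scope of this sketch.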

Original language: English
Pages (from-to): 3593-3597
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
Publication status: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: Sept 18 2022 to Sept 22 2022

Bibliographical note

Publisher Copyright:
Copyright © 2022 ISCA.

ASJC Scopus Subject Areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
