On Combining Global and Localized Self-Supervised Models of Speech

Sri Harsha Dumpala; Chandramouli S. Sastry; Rudolf Uher; Sageev Oore

doi:10.21437/Interspeech.2022-11174

Medicine

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)

Abstract

Self-supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks. In this work, we study the application of speech embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT to four speech-classification tasks: sentiment classification, command detection, emotion classification and depression detection. We distinguish between and discuss self-supervised training tasks that induce localized and global features of speech based on their temporal granularity. Noting that self-supervised representation learning frameworks based on the masked language-modeling objective, such as wav2vec-2.0 and HuBERT, induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited for tasks such as depression detection and emotion classification, while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL, a popular model for learning global representations. Finally, we propose and confirm empirically that combining the global and localized representations of speech yields better performance across a range of downstream tasks than any of the individual embedding methods.

Original language	English
Pages (from-to)	3593-3597
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2022-September
DOIs	https://doi.org/10.21437/Interspeech.2022-11174
Publication status	Published - 2022
Event	23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Republic of Korea
Duration	Sept 18 2022 → Sept 22 2022

Bibliographical note

Publisher Copyright:
Copyright © 2022 ISCA.

ASJC Scopus Subject Areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modelling and Simulation

Cite this

@article{7e25adcb558f4b27a4c76fbe003d1a33,

title = "On Combining Global and Localized Self-Supervised Models of Speech",

abstract = "Self-supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks. In this work, we study the application of speech embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT to four speech-classification tasks: sentiment classification, command detection, emotion classification and depression detection. We distinguish between and discuss self-supervised training tasks that induce localized and global features of speech based on their temporal granularity. Noting that self-supervised representation learning frameworks based on the masked language-modeling objective, such as wav2vec-2.0 and HuBERT, induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited for tasks such as depression detection and emotion classification, while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL, a popular model for learning global representations. Finally, we propose and confirm empirically that combining the global and localized representations of speech yields better performance across a range of downstream tasks than any of the individual embedding methods.",

author = "Dumpala, {Sri Harsha} and Sastry, {Chandramouli S.} and Uher, Rudolf and Oore, Sageev",

note = "Publisher Copyright: Copyright {\textcopyright} 2022 ISCA.; 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 ; Conference date: 18-09-2022 Through 22-09-2022",

year = "2022",

doi = "10.21437/Interspeech.2022-11174",

language = "English",

volume = "2022-September",

pages = "3593--3597",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - On Combining Global and Localized Self-Supervised Models of Speech

AU - Dumpala, Sri Harsha

AU - Sastry, Chandramouli S.

AU - Uher, Rudolf

AU - Oore, Sageev

PY - 2022

Y1 - 2022

N2 - Self-supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks. In this work, we study the application of speech embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT to four speech-classification tasks: sentiment classification, command detection, emotion classification and depression detection. We distinguish between and discuss self-supervised training tasks that induce localized and global features of speech based on their temporal granularity. Noting that self-supervised representation learning frameworks based on the masked language-modeling objective, such as wav2vec-2.0 and HuBERT, induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited for tasks such as depression detection and emotion classification, while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL, a popular model for learning global representations. Finally, we propose and confirm empirically that combining the global and localized representations of speech yields better performance across a range of downstream tasks than any of the individual embedding methods.

UR - http://www.scopus.com/inward/record.url?scp=85140064547&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85140064547&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2022-11174

DO - 10.21437/Interspeech.2022-11174

M3 - Conference article

AN - SCOPUS:85140064547

SN - 2308-457X

VL - 2022-September

SP - 3593

EP - 3597

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022

Y2 - 18 September 2022 through 22 September 2022

ER -
