On Combining Global and Localized Self-Supervised Models of Speech

Sri Harsha Dumpala, Chandramouli S. Sastry, Rudolf Uher, Sageev Oore

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)

Abstract

Self-supervised learning involves learning general-purpose representations that can be useful in a variety of downstream tasks. In this work, we study the application of speech embeddings derived from popular self-supervised learning frameworks such as wav2vec-2.0 and HuBERT to four speech-classification tasks: sentiment classification, command detection, emotion classification, and depression detection. We distinguish between and discuss self-supervised training tasks that induce localized and global features of speech based on their temporal granularity: noting that self-supervised representation learning frameworks based on the masked language-modeling objective, such as wav2vec-2.0 and HuBERT, induce localized embeddings, we define a self-supervised learning framework based on SimSiam for learning global features of speech. Through our evaluations, we find that these global representations are better suited to tasks such as depression detection and emotion classification, while the localized embeddings of speech can be very useful in tasks such as speech-command detection; we also find that our proposed model outperforms TRILL, a popular model for learning global representations. Finally, we propose and confirm empirically that combining the global and localized representations of speech yields better performance across a range of downstream tasks than either of the individual embedding methods.
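The paper itself does not include code; the following is a minimal, hypothetical sketch of the general idea of combining localized and global speech embeddings, assuming a frozen wav2vec-2.0 model (here facebook/wav2vec2-base, an assumed checkpoint) for frame-level features. The paper's SimSiam-trained global encoder is not reproduced; a randomly initialized linear layer stands in for it purely to illustrate the concatenation step, and the pooling choices are illustrative rather than the authors' method.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SAMPLE_RATE = 16_000

# Frozen localized encoder (frame-level embeddings).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
localized_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Placeholder for the paper's SimSiam-trained global encoder (not reproduced
# here): a random projection of pooled features, used only to show shapes.
global_encoder = torch.nn.Linear(768, 256)

@torch.no_grad()
def embed(waveform: torch.Tensor) -> torch.Tensor:
    """Concatenate pooled localized (wav2vec-2.0) features with a stand-in
    global utterance embedding, as input to a downstream classifier."""
    inputs = extractor(waveform.numpy(), sampling_rate=SAMPLE_RATE,
                       return_tensors="pt")
    frames = localized_encoder(**inputs).last_hidden_state  # (1, T, 768)
    localized = frames.mean(dim=1)                          # (1, 768) pooled
    global_emb = global_encoder(frames.mean(dim=1))         # (1, 256) placeholder
    return torch.cat([localized, global_emb], dim=-1)       # (1, 1024) combined

# Dummy usage: two seconds of silence.
print(embed(torch.zeros(2 * SAMPLE_RATE)).shape)  # torch.Size([1, 1024])
```

In the paper, the combined vector would feed a task-specific classifier (e.g., for depression detection or command detection); the sketch above only shows how the two embedding streams could be joined.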

Original language: English
Pages (from-to): 3593-3597
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2022-September
DOI: 10.21437/Interspeech.2022-11174
Publication status: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: Sep 18, 2022 - Sep 22, 2022

Bibliographical note

Publisher Copyright:
Copyright © 2022 ISCA.

ASJC Scopus Subject Areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation


Cite this

Dumpala, S. H., Sastry, C. S., Uher, R., & Oore, S. (2022). On Combining Global and Localized Self-Supervised Models of Speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022-September, 3593-3597. https://doi.org/10.21437/Interspeech.2022-11174