Deep learning for animal audio classification - a foundational approach

Melendez-Rios, Rafael

dc.contributor.advisor	Corrada Bravo, Carlos
dc.contributor.author	Melendez-Rios, Rafael
dc.date.accessioned	2022-07-28T14:32:18Z
dc.date.available	2022-07-28T14:32:18Z
dc.date.issued	2022-06-10
dc.identifier.uri	https://hdl.handle.net/11721/2877
dc.description.abstract	Despite impressive successes in computer vision, deep learning has made more limited progress for audio data, mostly in voice recognition and speech processing. Furthermore, most such work has depended on spectrograms and the subsequent adoption of standard computer vision practices, treating spectrograms as images and adopting model architectures and augmentation techniques that were principally developed for image data. In this work we aim to lay the first stepping stones toward a foundation for deep learning on audio data, and in particular for animal sound classification under scarce data conditions. To do so, we address four basic questions, using raw input audio data and convolutional neural network models: 1) how to determine hyper-parameters such as filter size and sequence; 2) which data augmentation techniques work best; 3) how models trained on raw input compare to spectrogram-trained models; 4) how model, augmentation, and input choices affect computer memory and time requirements. We corroborate common assumptions that more data and deeper models improve performance, and validate the common practice of increasing filter size with depth. We also find that increasing maxpool filter sizes boosts performance. Our augmentation experiments suggest that all audio data is invariant to time stretching, while only some data types are invariant to pitch shift, notably human speech. Noise injection (random perturbation and Gaussian noise), however, did not improve performance. We conclude that models trained on raw inputs can achieve comparable or better results than spectrogram-trained models when data is structured and simple, while spectrogram models still outperform raw input models when data is complex or noisy/irregular. Additionally, using spectrograms allows for shallower models and shorter training time.	en_US
dc.description.sponsorship	This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562.	en_US
dc.language.iso	en_US	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	Convolutional neural network	en_US
dc.subject	Sound classification	en_US
dc.subject	Raw inputs	en_US
dc.subject.lcsh	Animal sounds	en_US
dc.subject.lcsh	Deep learning (Machine learning)	en_US
dc.subject.lcsh	Neural networks (Computer science)	en_US
dc.subject.lcsh	Random noise theory	en_US
dc.title	Deep learning for animal audio classification - a foundational approach	en_US
dc.type	Thesis	en_US
dc.rights.holder	© 2022 Rafael Melendez-Rios	en_US
dc.contributor.committee	Marcano Velazquez, Mariano
dc.contributor.committee	Megret Laboye, Remi
dc.contributor.campus	University of Puerto Rico, Río Piedras Campus	en_US
dc.description.note	Defense day was June 10, 2022. Corrections finalized on July 18, 2022	en_US
dc.description.graduationSemester	Summer (3rd Semester)	en_US
dc.description.graduationYear	2022	en_US
thesis.degree.discipline	Maths	en_US
thesis.degree.level	M.S.	en_US
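
Illustrative sketch (not taken from the thesis): the abstract names Gaussian noise injection on raw audio as one of the augmentations tested, and reports that it did not improve performance. To make concrete what such an augmentation does, the snippet below adds zero-mean Gaussian noise to a raw waveform at a chosen signal-to-noise ratio. The function name add_gaussian_noise, the snr_db parameter, and the synthetic 440 Hz test tone are assumptions introduced here for illustration; the thesis may define and parameterize its noise differently.

```python
import math
import random


def add_gaussian_noise(waveform, snr_db, rng=None):
    """Return a copy of `waveform` with zero-mean Gaussian noise added.

    `snr_db` is a hypothetical parameter: the target signal-to-noise
    ratio in decibels. It is an assumption, not a setting from the thesis.
    """
    rng = rng or random.Random()
    if not waveform:
        return []
    # Mean power of the clean signal.
    signal_power = sum(x * x for x in waveform) / len(waveform)
    # Noise power for the requested SNR: SNR_dB = 10 * log10(Ps / Pn).
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    # Add an independent Gaussian sample to every audio sample.
    return [x + rng.gauss(0.0, sigma) for x in waveform]


# One second of a 440 Hz tone at 8 kHz, standing in for raw input audio.
sample_rate = 8000
clean = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
noisy = add_gaussian_noise(clean, snr_db=20, rng=random.Random(0))
```

In an augmentation setting, a function like this would typically be applied to each training clip with a freshly drawn noise sample, leaving the class label unchanged.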

Files in this item

Name: UPRRP_MATE_MELENDEZRIOS_2022.pdf
Size: 2.140Mb
Format: PDF
Description: thesis for Masters program in ...

This item appears in the following Collection(s)

Theses & Dissertations
Tesinas, Tesis y Disertaciones

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States