dc.contributor.advisor: Corrada Bravo, Carlos
dc.contributor.author: Melendez-Rios, Rafael
dc.date.accessioned: 2022-07-28T14:32:18Z
dc.date.available: 2022-07-28T14:32:18Z
dc.date.issued: 2022-06-10
dc.identifier.uri: https://hdl.handle.net/11721/2877
dc.description.abstract: Despite impressive successes in computer vision, deep learning has made more limited progress on audio data, mostly in voice recognition and speech processing. Furthermore, most such work has depended on spectrograms and on the subsequent adoption of standard computer vision practices: spectrograms are treated as images, and the model architectures and augmentation techniques applied to them were developed principally for image data. In our work we aim to lay the first stepping stones towards a foundation for deep learning on audio data, and in particular for animal sound classification under scarce-data conditions. To do so, we address four basic concerns, using raw input audio data and convolutional neural network models: 1) how to determine hyper-parameters such as filter size and sequence; 2) which data augmentation techniques work best; 3) how models trained on raw input compare to spectrogram-trained models; 4) how model, augmentation, and input choices affect computer memory and time requirements. We corroborate the common assumptions that more data and deeper models improve performance, and we validate the common practice of increasing filter size with depth. We also find that increasing maxpool filter sizes boosts performance. Our augmentation experiments suggest that all audio data is invariant to time stretching, while only some data types, notably human speech, are invariant to pitch shift. Noise injection (random perturbation and Gaussian noise), however, did not improve performance. We conclude that models trained on raw inputs can achieve comparable or better results than spectrogram-trained models when data is structured and simple, while spectrogram models still outperform raw-input models when data is complex or noisy/irregular. Additionally, using spectrograms allows for shallower models and shorter training time. [en_US]
dc.description.sponsorship: This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. [en_US]
dc.language.iso: en_US [en_US]
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/ [*]
dc.subject: Convolutional neural network [en_US]
dc.subject: Sound classification [en_US]
dc.subject: Raw inputs [en_US]
dc.subject.lcsh: Animal sounds [en_US]
dc.subject.lcsh: Deep learning (Machine learning) [en_US]
dc.subject.lcsh: Neural networks (Computer science) [en_US]
dc.subject.lcsh: Random noise theory [en_US]
dc.title: Deep learning for animal audio classification - a foundational approach [en_US]
dc.type: Thesis [en_US]
dc.rights.holder: © 2022 Rafael Melendez-Rios [en_US]
dc.contributor.committee: Marcano Velazquez, Mariano
dc.contributor.committee: Megret Laboye, Remi
dc.contributor.campus: University of Puerto Rico, Río Piedras Campus [en_US]
dc.description.note: Defense day was June 10, 2022. Corrections finalized on July 18, 2022 [en_US]
dc.description.graduationSemester: Summer (3rd Semester) [en_US]
dc.description.graduationYear: 2022 [en_US]
thesis.degree.discipline: Maths [en_US]
thesis.degree.level: M.S. [en_US]


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States