dc.contributor.advisor | Corrada Bravo, Carlos | |
dc.contributor.author | Melendez-Rios, Rafael | |
dc.date.accessioned | 2022-07-28T14:32:18Z | |
dc.date.available | 2022-07-28T14:32:18Z | |
dc.date.issued | 2022-06-10 | |
dc.identifier.uri | https://hdl.handle.net/11721/2877 | |
dc.description.abstract | Despite impressive successes in computer vision, deep learning has made more limited progress on audio data, mostly in voice recognition and speech processing. Furthermore, most such work has depended on spectrograms, treating them as images and adopting model architectures and augmentation techniques developed principally for image data. In this work we aim to lay the first stepping stones toward a foundation for deep learning on audio data, in particular animal sound classification under scarce-data conditions. To do so, we address four basic questions, using raw audio inputs and convolutional neural network models: 1) how to determine hyper-parameters such as filter size and sequence; 2) which data augmentation techniques work best; 3) how models trained on raw inputs compare to spectrogram-trained models; 4) how model, augmentation, and input choices affect memory and training-time requirements. We corroborate the common assumptions that more data and deeper models improve performance, and validate the common practice of increasing filter size with depth. We also discover that increasing max-pool filter sizes boosts performance. Our augmentation experiments suggest that all audio data is invariant to time stretching, while only some data types, notably human speech, are invariant to pitch shift. However, noise injection (random perturbation and Gaussian noise) did not improve performance. We conclude that models trained on raw inputs can achieve comparable or better results than spectrogram-trained models when the data is structured and simple, while spectrogram models still outperform raw-input models when the data is complex, noisy, or irregular. Additionally, using spectrograms allows for shallower models and shorter training times. | en_US |
dc.description.sponsorship | This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. | en_US |
dc.language.iso | en_US | en_US |
dc.rights | Attribution-NonCommercial-NoDerivs 3.0 United States | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/us/ | * |
dc.subject | Convolutional neural network | en_US |
dc.subject | Sound classification | en_US |
dc.subject | Raw inputs | en_US |
dc.subject.lcsh | Animal sounds | en_US |
dc.subject.lcsh | Deep learning (Machine learning) | en_US |
dc.subject.lcsh | Neural networks (Computer science) | en_US |
dc.subject.lcsh | Random noise theory | en_US |
dc.title | Deep learning for animal audio classification - a foundational approach | en_US |
dc.type | Thesis | en_US |
dc.rights.holder | © 2022 Rafael Melendez-Rios | en_US |
dc.contributor.committee | Marcano Velazquez, Mariano | |
dc.contributor.committee | Megret Laboye, Remi | |
dc.contributor.campus | University of Puerto Rico, Río Piedras Campus | en_US |
dc.description.note | The defense was held on June 10, 2022.
Corrections were finalized on July 18, 2022. | en_US |
dc.description.graduationSemester | Summer (3rd Semester) | en_US |
dc.description.graduationYear | 2022 | en_US |
thesis.degree.discipline | Mathematics | en_US |
thesis.degree.level | M.S. | en_US |