Deep learning for animal audio classification - a foundational approach
Advisor: Corrada Bravo, Carlos
Despite impressive successes in computer vision, deep learning has made more limited progress on audio data, mostly in voice recognition and speech processing. Furthermore, most such work has depended on spectrograms and the subsequent adoption of standard computer-vision practices, treating spectrograms as images and adopting model architectures and augmentation techniques developed principally for image data. In our work we aim to lay the first stepping stones toward a foundation for deep learning on audio data, and in particular for animal sound classification under scarce data conditions. To do so, we address four basic concerns, using raw audio input and convolutional neural network models: 1) how to determine hyperparameters such as filter size and sequence; 2) which data augmentation techniques work best; 3) how models trained on raw input compare to spectrogram-trained models; 4) how model, augmentation, and input choices affect memory and time requirements. We corroborate the common assumptions that more data and deeper models improve performance, and validate the common practice of increasing filter size with depth. We also find that increasing max-pooling filter sizes boosts performance. Our augmentation experiments suggest that all audio data are invariant to time stretching, while only some data types, notably human speech, are invariant to pitch shifting. Noise injection (random perturbation and Gaussian noise), however, did not improve performance. We conclude that models trained on raw input can achieve comparable or better results than spectrogram-trained models when the data are structured and simple, while spectrogram models still outperform raw-input models when the data are complex, noisy, or irregular. Additionally, using spectrograms allows for shallower models and shorter training times.
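The augmentations discussed above can be illustrated with a minimal sketch. The thesis does not specify its implementation, so the functions below are hypothetical stand-ins: time stretching via linear interpolation of the waveform, and Gaussian noise injection at a target signal-to-noise ratio, using only NumPy.

```python
import numpy as np

def time_stretch(signal, rate):
    """Stretch a 1-D waveform by `rate` via linear interpolation.
    rate < 1 lengthens the signal; rate > 1 shortens it.
    A simplified stand-in for the time-stretching augmentation."""
    n_out = int(len(signal) / rate)
    old_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

def add_gaussian_noise(signal, snr_db=20.0, rng=None):
    """Inject Gaussian noise scaled to a target SNR in decibels."""
    rng = np.random.default_rng() if rng is None else rng
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

# Example: augment a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
slow = time_stretch(tone, rate=0.5)          # twice as long
noisy = add_gaussian_noise(tone, snr_db=20.0)
```

Note that simple waveform interpolation also shifts pitch; production augmentation pipelines typically use phase-vocoder-based methods to stretch time and shift pitch independently.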