In this blog, Roman Shrestha, a researcher at Intelligent Voice, looks at an important but often overlooked machine learning challenge: how to identify different bird songs in the wild. An effective automatic wildlife monitoring system based on bird bioacoustics, which can support the manual classification done by an ornithologist or an expert birder, can be pivotal for protecting the environment and certain endangered species. The work is of more than academic interest, with applications ranging from tracking climate change to identifying the locations where videos of child abuse were shot. Roman’s unique approach shows a significant improvement over the state of the art.

Each of the 10,000 bird species worldwide has peculiar acoustic and visual traits that distinguish it from the others. Birds are also well known for their instinctive ability to respond promptly to changes in their environment, which provides reliable insight into its ecological state. Given their gift of flight, small size and propensity to lodge in trees and bushes, tracking them visually can be an onerous task. Hence, the majority of non-invasive automatic wildlife monitoring systems rely on avian bioacoustics.

In modern machine learning, classifying birds ‘in the wild’ is still considered an esoteric challenge, owing to the convoluted patterns present in bird songs, the background noise, and the complications that arise when numerous bird species are present in a common setting. To overcome these challenges, we have implemented a novel Faster Region-Based Convolutional Neural Network (R-CNN) bird audio diarization system that performs object detection in the spectral domain to recognize bird-specific spectral patterns in spectrograms.

Spectrograms show distinct visual patterns that reflect the energy of avian vocalisations, and these patterns differ for every bird species. The Faster R-CNN model can learn to detect these patterns as objects in the spectral domain, thereby providing an important insight into which bird sang when.

How does Bird Audio Diarization work?

Diarization, commonly known as the “who spoke when?” problem, is the process of partitioning an input audio stream into homogeneous segments according to speaker identity.

In bird audio diarization, the system accepts an input audio stream and recognizes all the bird species present in the audio, along with the precise timestamps at which they occur in the recording, as illustrated in the figure below. This cannot be achieved with a traditional classification approach.
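To make the “which bird sang when” output concrete, here is a minimal sketch of how detections on a spectrogram could be turned into timed segments. It assumes the horizontal axis of the spectrogram image maps linearly to time. The `Detection` class, the `boxes_to_segments` helper, the species names, the 0.5 score threshold and all numbers are hypothetical and are not taken from the original system.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    species: str   # predicted label for the box (hypothetical names below)
    x_min: float   # left edge of the box, in spectrogram pixels
    x_max: float   # right edge of the box, in spectrogram pixels
    score: float   # detector confidence in [0, 1]


def boxes_to_segments(detections, image_width, clip_seconds, min_score=0.5):
    """Map box edges on the time axis to (species, start_s, end_s) tuples.

    Assumes pixel column 0 is time 0 and column image_width is clip_seconds.
    """
    segments = [
        (
            d.species,
            d.x_min * clip_seconds / image_width,
            d.x_max * clip_seconds / image_width,
        )
        for d in detections
        if d.score >= min_score  # drop low-confidence boxes
    ]
    # Order the segments by start time, as a diarization output would be read.
    return sorted(segments, key=lambda segment: segment[1])


detections = [
    Detection("species_b", 300, 400, 0.92),
    Detection("species_a", 0, 200, 0.88),
    Detection("species_c", 100, 150, 0.30),  # below the threshold, discarded
]
segments = boxes_to_segments(detections, image_width=400, clip_seconds=2.0)
for species, start, end in segments:
    print(f"{species}: {start:.2f}s to {end:.2f}s")
```

A plain classifier would only return a label for the whole clip; the start and end times here come entirely from the horizontal extent of each bounding box.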

The Faster R-CNN architecture

The Faster Region-Based Convolutional Neural Network (R-CNN) is a specialized Convolutional Neural Network (CNN) architecture for object detection, presented by Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun in 2015, that can perform highly accurate and speedy object detection. Faster R-CNN is the preferred architecture because it can be easily customized and trained with custom data, and because it performs better than generic region proposal methods like Selective Search and EdgeBoxes. The four major components of this architecture are discussed below.

CNN Feature Extractor: The CNN Feature extractor extracts fixed length features from the image.

Region Proposal Network (RPN): Based on the features extracted, the RPN generates bounding boxes to locate the object of interest (spectral patterns in our case) with various scales and aspect ratios.

The Classifier: The Faster R-CNN classifier is responsible for classifying the objects within the Regions of Interest proposed by the RPN.

Scoring: Each detected object is given a confidence score between 0 and 1, along with a bounding box that locates it within the image.
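The “various scales and aspect ratios” used by the RPN can be illustrated with a small anchor generator. This is a sketch of the general idea only, assuming each anchor has area `scale**2` and a height-to-width ratio equal to the aspect ratio. The function name `make_anchors` and the scale and ratio values are illustrative and are not the settings used in this work.

```python
import math


def make_anchors(center_x, center_y, scales, aspect_ratios):
    """Return (x_min, y_min, x_max, y_max) anchors centred on one point.

    Each anchor has area scale**2 and height / width == aspect ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((
                center_x - width / 2,
                center_y - height / 2,
                center_x + width / 2,
                center_y + height / 2,
            ))
    return anchors


# Illustrative values: two scales and three aspect ratios give six anchors.
anchors = make_anchors(100, 100, scales=(32, 64), aspect_ratios=(0.5, 1.0, 2.0))
for box in anchors:
    print(tuple(round(value, 1) for value in box))
```

On a spectrogram, wide anchors suit long sustained calls while tall anchors suit short broadband ones, which is one reason a mix of aspect ratios is useful.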

Data Acquisition, Pre-processing, Model Training and Testing

The Bird Songs from Europe corpus, a subset of the Xeno-canto database containing well-labelled intrinsic audio recordings of the 50 most common European bird species, was used for training and evaluating the Faster R-CNN model.

Initially, the raw audio input was downsampled from 16 kHz to 8 kHz and the mp3 files were converted to wav. Then, the audio files were segmented into uniform 2-second chunks, and audio augmentation was performed with 50% overlap, followed by randomly merging audio segments from different species to simulate the presence of multiple bird species. A Pydub-based bird audio detector was run on the recordings to automatically label the segments containing the bird species within the audio. Spectrograms were then generated from the audio segments and partitioned for training (80%), validation (10%), and testing (10%).
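The chunking and partitioning arithmetic can be sketched in a few lines. This example assumes that the 50% overlap means consecutive 2-second chunks start 1 second apart, and that any trailing audio shorter than one chunk is dropped; both are assumptions, as are the helper names `chunk_bounds` and `split_counts` and the example sizes.

```python
def chunk_bounds(total_seconds, chunk_seconds=2.0, overlap=0.5):
    """Return (start, end) times of fixed-length chunks with fractional overlap.

    Trailing audio shorter than one full chunk is dropped (an assumption).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = chunk_seconds * (1.0 - overlap)  # time between chunk starts
    bounds = []
    index = 0
    while index * hop + chunk_seconds <= total_seconds:
        start = index * hop
        bounds.append((start, start + chunk_seconds))
        index += 1
    return bounds


def split_counts(n_items, train_pct=80, valid_pct=10):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    n_train = n_items * train_pct // 100
    n_valid = n_items * valid_pct // 100
    return n_train, n_valid, n_items - n_train - n_valid


bounds = chunk_bounds(5.0)    # chunk boundaries for a 5-second recording
counts = split_counts(1000)   # partition sizes for 1000 spectrograms
print(bounds)
print(counts)
```

With these defaults, a 5-second recording yields four chunks instead of the two that non-overlapping segmentation would give, which is where the augmentation effect comes from.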

A Faster R-CNN model with a ResNet50 Feature Pyramid Network (FPN) backbone, pre-trained on the COCO dataset and equipped with a Region Proposal Generator, was used as the starting point for training. The spectrograms, along with labels highlighting the corresponding annotations, were provided for training the model. For effective transfer learning, the latest version of Fastai makes use of several fit-one-cycle iterations to fine-tune modules with pre-trained weights more efficiently. Hence, the Fastai library was used together with functionality from the IceVision package.

Using an NVIDIA GeForce GTX 1080 Ti GPU, the total time to train the model was 8 days and 12 hours, with an average training time of 3 hours and 13 minutes per epoch. The model was trained for a total of 60 epochs. During the first 5 epochs, the ResNet50 FPN backbone was frozen and only the model head was trained. In the remaining 55 epochs, all the layers were trained and their parameters adjusted accordingly.

The weights of the model instance with the minimum validation loss were saved. During inference, this trained Faster R-CNN classifier [7] was used to generate predictions for the spectrograms from the unseen test set. The figure below displays a sample prediction for an audio input provided to the bird audio diarization system: the ground truth values were correctly predicted and accurate bounding boxes were generated.

Benchmarking Results

On the test set, the trained Faster R-CNN model achieved a Diarization Error Rate (DER) of 21.81, a Jaccard Error Rate (JER) of 20.94, and F1, precision and recall values of 0.85, 0.83 and 0.87 respectively. Three models that used a similar number of species from comparable datasets were used as baselines for validating the performance of our model.
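As a quick consistency check, the F1 score is the harmonic mean of precision and recall, and the values above agree with each other. The sketch below also shows the standard definition of DER as the sum of missed, false-alarm and confused time divided by the total reference time. The durations passed to `diarization_error_rate` are made-up numbers for illustration only, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def diarization_error_rate(missed, false_alarm, confusion, reference_total):
    """Standard DER: erroneous time divided by total reference time.

    All arguments are durations in seconds; the result is a fraction.
    """
    return (missed + false_alarm + confusion) / reference_total


# Precision and recall as reported in the text.
f1 = f1_score(0.83, 0.87)
print(f"F1 = {f1:.2f}")

# Hypothetical durations, NOT results from this work.
der = diarization_error_rate(
    missed=1.0, false_alarm=0.5, confusion=0.5, reference_total=10.0
)
print(f"DER = {der:.2f}")
```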

Silla Jr. and Kaestner approached acoustic bird species classification with 48 species, using another subset of the Xeno-canto database and the Global Model Naive Bayes (GMNB) algorithm, and reported an F1 score of 0.50.

Incze et al. fine-tuned a pre-trained MobileNet-based CNN architecture to classify bird species from another subset of the Xeno-canto database. Even though the approach worked well for fewer species, the accuracy dropped to 20% when classifying 50 species.

F. Lima performed transfer learning on the pre-trained VGG16 CNN architecture and achieved a bird audio classification accuracy of 73.5% on the evaluation set of the same Bird Songs from Europe dataset, which consists of 50 classes.

Table 1 outlines the performance of these three approaches against the performance achieved in this work, based on an evaluation of bird species obtained from the Xeno-canto database. From the obtained results, it can be observed that the Faster R-CNN model outperforms standard classification approaches and has the potential to cope with the challenges associated with automated biodiversity monitoring in the wild.


A huge amount of research has been invested in building a fully functional, automated, non-invasive biodiversity monitoring system. However, bird songs can easily be occluded by environmental noise and by other simultaneously vocalising species, which can seriously impact the accuracy of these systems. Compared to traditional classification approaches, bird audio diarization is able to separate intrinsic avian vocalisations into homogeneous segments according to their species and to determine the length of their songs, while also identifying multiple simultaneously vocalising species in an ecosystem.


R. Shrestha, C. Glackin, J. Wall and N. Cannings, “Bird Audio Diarization with Faster R-CNN,” 30th International Conference on Artificial Neural Networks (ICANN), 14–17 Sep 2021, Springer. [Online]

F. Lima, “Bird songs from Europe (Xeno-canto),” 2020. [Online]. Available:

S. Ren, K. He, et al., “Faster R-CNN: Towards real-time object detection with region proposal networks,” Adv Neural Inf Process Syst (NeurIPS), 2015.

Howard, S. Gugger, “Fastai: A layered API for Deep Learning”, Information, vol. 11, no. 2, p. 108, 2020

Vazquez, F. Hassainia, “Icevision: An agnostic object detection framework,” Github, 2020. [Online]. Available:

N. Silla Jr., C.A.A. Kaestner, “Hierarchical classification of bird species using their audio recorded songs,” IEEE Int Conf Systems, Man, and Cybernetics, 2013

Incze, H. Jancso, et al., “Bird sound recognition using a Convolutional Neural Network,” IEEE Int Symp Intelligent Systems and Informatics (SISY), 2018

F. Lima, “Audio classification in R,” poissonisfish, 2020. [Online]. Available:
