Chir-PCA: Signal Processing and the Power of the Covariance Matrix

AUGUST 25th 2023

In a year dominated by trillion-parameter generative models, many practitioners have forgotten the effectiveness of low-cost statistical analysis techniques.

Principal component analysis (PCA) is one of these powerful yet deceptively simple techniques. It takes a set of data in a high-dimensional space and compresses it into as few dimensions as possible while retaining as much information (variance) as it can. Successful applications of PCA greatly reduce our problem space and compute time, and often let us visualize complex data in 3 or fewer dimensions.

The goal of this 48-hour project was to show how versatile PCA is, and that it can even be used to cluster lengthy bird song recordings in 3 dimensions. Today, dimensionality reduction might be done with diffusion maps or autoencoders, but we will show how useful PCA remains even when our original input space is complex enough to warrant deep learning.

The BirdCLEF project

BirdCLEF is a collection of over 16,000 audio recordings of bird calls and songs recorded in the wild in Kenya, belonging to 264 different species (4 of which are used in our study).

The recordings we used were sampled at 32,000 Hz and varied in length from 4 s to upwards of 2 minutes (120 s). This means that the unprocessed feature space in the time domain varied from 128 thousand to over 3.84 million features.
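The feature counts above follow directly from duration times sampling rate. A minimal sketch of that arithmetic, assuming mono audio (the function name is illustrative, not from the project code):

```python
SAMPLE_RATE_HZ = 32_000  # sampling rate stated in the text


def n_time_samples(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw time-domain samples (features) in a mono recording."""
    return int(round(duration_s * sample_rate_hz))


shortest = n_time_samples(4)    # 128_000 features for a 4 s clip
longest = n_time_samples(120)   # 3_840_000 features for a 2 minute clip
print(shortest, longest)
```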

Hearing the Song

Working in millions of dimensions

  • Bird calls generally range from 1 kHz to 10 kHz. Considering the gradual roll-off of our bandpass filter, setting the low and high bounds to 2,500 Hz and 6,000 Hz allows most of the bird song to be captured. Focusing our efforts on the signal within this range also enhances the effectiveness of the song extraction step.

  • A short portion at the start of the audio is used to calculate the noise floor. Using this reference and a uniform (moving-average) kernel, we compute the decibel level across the rest of the signal and cut out portions containing only background noise. This gating process avoids contaminating our PCA model with environmental noise and decreases our audio length and file size, significantly lowering the compute time of the final step.

  • Because the recordings vary in length, each recording's FFT has a different frequency resolution. We aggregate the power of neighboring frequencies into 200 fixed frequency bins. This ensures that the FFTs across all bird songs are in the same feature space, which simplifies our projection to 3D space.
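The three steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the project's actual code: the band-pass here is a brick-wall FFT mask standing in for the gradual roll-off filter mentioned above, and the function names, the 0.5 s noise reference, the 50 ms window, the 6 dB margin and the bin range (0 Hz to Nyquist) are all hypothetical choices.

```python
import numpy as np


def bandpass_fft(x, sample_rate, f_lo=2500.0, f_hi=6000.0):
    """Brick-wall band-pass (stand-in filter): zero FFT coefficients outside [f_lo, f_hi]."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def noise_gate(x, sample_rate, noise_seconds=0.5, window_seconds=0.05, margin_db=6.0):
    """Keep only samples whose smoothed level exceeds the noise floor by margin_db."""
    window = max(1, int(window_seconds * sample_rate))
    kernel = np.ones(window) / window                 # uniform (moving-average) kernel
    power = np.convolve(x ** 2, kernel, mode="same")  # smoothed power per sample
    level_db = 10.0 * np.log10(power + 1e-12)         # small epsilon avoids log(0)
    n_ref = max(1, int(noise_seconds * sample_rate))
    noise_floor_db = level_db[:n_ref].mean()          # reference from the opening segment
    return x[level_db > noise_floor_db + margin_db]


def binned_power(x, sample_rate, n_bins=200, f_lo=0.0, f_hi=None):
    """Sum FFT power into n_bins equal-width frequency bins, whatever len(x) is."""
    f_hi = sample_rate / 2 if f_hi is None else f_hi
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    edges = np.linspace(f_lo, f_hi, n_bins + 1)
    binned, _ = np.histogram(freqs, bins=edges, weights=power)
    return binned


# Synthetic stand-in for a recording: 1 s of background noise, then a 4 kHz "call".
rng = np.random.default_rng(0)
sr = 32_000
t = np.arange(sr) / sr
quiet = 0.01 * rng.standard_normal(sr)
song = np.sin(2 * np.pi * 4000 * t) + 0.01 * rng.standard_normal(sr)
recording = np.concatenate([quiet, song])

gated = noise_gate(bandpass_fft(recording, sr), sr)
features = binned_power(gated, sr)
print(len(recording), len(gated), features.shape)
```

Whatever the length of the gated signal, `features` always has 200 entries, which is what lets recordings of different durations share one feature space.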

The Bird Songs

*Symbols and colors below each bird image match the PCA plot below.*

All photos are under a Creative Commons license.

Seeing the Song

Down to 3 Dimensions

I provide the detailed mathematical formulation for PCA here, but the beauty of the algorithm is that it boils down to one or two basic steps depending on your approach:

SVD (Singular Value Decomposition) Approach:

  • Calculate the Singular Value Decomposition of our mean-centered FFT data matrix. With frequency bins as rows and recordings as columns, the left singular vectors are the principal components, and the explained variance is proportional to the squared singular values.
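A minimal sketch of the SVD route, assuming the data matrix is laid out with one frequency bin per row and one recording per column (with the opposite layout, the right singular vectors would play this role instead). The function name and the random stand-in data are illustrative only.

```python
import numpy as np


def pca_svd(X, n_components=3):
    """PCA via SVD. X has shape (n_features, n_samples): one column per recording."""
    n_samples = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)          # center each feature (row)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = s ** 2 / (n_samples - 1)   # variance scales with s squared
    components = U[:, :n_components]                # left singular vectors
    return components, explained_variance


# Random stand-in for 40 recordings described by 200 frequency bins.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 40))
components, explained_variance = pca_svd(X)
print(components.shape, explained_variance[:3])
```

`np.linalg.svd` returns singular values in descending order, so the first columns of `U` are already the directions of greatest variance.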

EVD (Eigenvalue Decomposition) Approach:

  • Calculate the covariance matrix of our transposed FFT data.

  • Compute the eigenvalues and eigenvectors of this covariance matrix. The eigenvectors are the principal components, and the explained variance in each PC direction will be given by the eigenvalues.
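The two EVD steps, plus the projection onto the leading components, might look like the sketch below. It assumes the FFT data is stored with one recording per row, which is why it is transposed before being passed to `np.cov` (which expects variables in rows). The function name and the random stand-in data are illustrative only.

```python
import numpy as np


def pca_evd(X, n_components=3):
    """PCA via eigendecomposition. X has shape (n_samples, n_features)."""
    cov = np.cov(X.T)                         # (n_features, n_features) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]    # principal components as columns
    projected = (X - X.mean(axis=0)) @ components
    return projected, components, eigvals


# Random stand-in for 40 recordings described by 200 frequency bins.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 200))
projected, components, eigvals = pca_evd(X)
retained = eigvals[:3].sum() / eigvals.sum()  # fraction of variance kept in 3D
print(projected.shape, retained)
```

The variance of each projected coordinate equals the corresponding eigenvalue, so `retained` is a direct measure of how much information survives the reduction to 3 dimensions.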

Projecting onto our PCA subspace shows an imperfect but useful clustering of bird songs of the same species. The retention of information is incredible when you note that we have kept merely 3 out of 200 principal components and have vastly decreased our feature space from the original hundreds of thousands to millions of dimensions.

It is important to note that this feat would not have been possible without rigorous treatment of the data beforehand, illustrating the importance of careful feature engineering based on domain knowledge, and feeding your models clean data.
