Minimal: November 2008

Wednesday, 12 November 2008

Comparing 2 Audio (*.wav) Files

Posted by Juri at 11:44 Labels: audio, Fourier

We already know the structure of a wave file, so we are able to read it byte by byte (actually we need to read the data chunk sample by sample). The next step is to get "fingerprints" from our files.

Fingerprint

A fingerprint can uniquely and compactly represent an audio file. It consists of several points where the energy in the audio spectral density has a local maximum. How the spectral density varies in time can be shown by a spectrogram. The most common format is a graph with two geometric dimensions: the horizontal axis represents time and the vertical axis represents frequency. A third dimension, the amplitude of a particular frequency at a particular time, is represented by the intensity or colour of each point in the image, e.g., the brighter the shade, the more energy is contained in that time-frequency point. The only thing we need is the amount of acoustic energy at a given frequency and time.

A spectrogram of a 30-second excerpt of the Pet Shop Boys song "West End Girls", starting at 1:00

Spectrograms can be obtained with the Short-Time Fourier Transform (STFT). The audio samples are grouped into analysis time windows (preferably overlapping) of equal length N, with w_i denoting the i-th window. For each window the Fourier transform is calculated, giving a complex vector v_i = STFT(w_i) of the same length as the window. Because the input to the Fourier transform is in this case always a vector of real numbers, the output complex vectors obey the symmetry:

v_i[q] == conj(v_i[N-q]) , where q = 1, ... ,N-1 (indices counted from zero)

So the complete information is contained in the first N/2 + 1 components of the complex vector v_i, that is, in the indices 0, ... ,N/2.
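The symmetry of a real-input transform can be checked numerically. The following is only a minimal sketch: it uses a naive O(N^2) transform written with the standard library, the names dft and is_conjugate_symmetric are made up for this illustration, and the indices are counted from zero.

```python
import cmath

def dft(samples):
    """Naive O(N^2) discrete Fourier transform of a list of numbers."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]

def is_conjugate_symmetric(spectrum, tol=1e-9):
    """Check v[q] == conj(v[N-q]) for q = 1..N-1 (zero-based indices)."""
    n = len(spectrum)
    return all(
        abs(spectrum[q] - spectrum[n - q].conjugate()) < tol
        for q in range(1, n)
    )

# An arbitrary real-valued window of 8 samples (illustrative data only).
window = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.25, -0.5]
spectrum = dft(window)
symmetric = is_conjugate_symmetric(spectrum)
```

For real audio a fast transform would of course be used instead of this naive one; the sketch is only meant to show why the upper half of the vector carries no new information.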

The Fourier transform decomposes the signal given by the samples inside each input window in terms of sine waves of discrete frequencies. These frequencies are integer multiples of the fundamental frequency which is determined by the window length N and the sampling rate S of the waveform representation. The frequency F_k for a particular index k in the complex vector may be calculated by using the following formula:

F_k = k * S / N , where k = 0, ... ,N/2

The first frequency F₀ is always zero. If we have a sampling rate of 44100 Hz and a window length of 1024 samples, the base frequency F₁ is about 43.0664 Hz and the maximum frequency F₅₁₂ is the Nyquist frequency, 22050 Hz.
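The formula is easy to try out. Here is a small sketch with the numbers from the paragraph above; the helper names bin_frequency and nearest_bin are invented for this example, and nearest_bin is simply the formula solved for k and rounded.

```python
def bin_frequency(k, sample_rate, window_length):
    """Frequency in Hz of index k: F_k = k * S / N."""
    return k * sample_rate / window_length

def nearest_bin(frequency, sample_rate, window_length):
    """Index k whose frequency F_k is closest to the given frequency."""
    return round(frequency * window_length / sample_rate)

S = 44100  # sampling rate in Hz
N = 1024   # window length in samples

f0 = bin_frequency(0, S, N)          # always zero
f1 = bin_frequency(1, S, N)          # base frequency S / N
f_max = bin_frequency(N // 2, S, N)  # half the sampling rate
a440_bin = nearest_bin(440.0, S, N)  # index closest to the concert pitch A
```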

Then we need the absolute value of each component of v_i to get a measure of how strongly a discrete frequency F_k is present in the decomposition of the i-th window of the audio file. This data can then be used for plotting the spectrogram. Once the spectrogram is available, the fingerprint points are chosen to be the points that are local maxima within a region of fixed size surrounding the point. A larger region size leads to fewer but possibly more significant points. The resulting features are saved as pairs of integer numbers (i, k), with i being the window index and k being the frequency index.
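The whole procedure can be sketched in a few lines. This is only an illustration under some assumptions of mine: the transform is a naive one (fine for tiny windows, far too slow for real audio), a point counts as a maximum only if it is strictly larger than every neighbour, and the region is a rectangle given by radius_i windows and radius_k frequency indices. The function names are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(window):
    """Magnitudes |v[k]| for k = 0..N/2, computed with a naive transform."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, window_length, hop):
    """One row of magnitudes per window; windows overlap when hop < window_length."""
    return [
        magnitude_spectrum(samples[start:start + window_length])
        for start in range(0, len(samples) - window_length + 1, hop)
    ]

def fingerprint(spec, radius_i=1, radius_k=1):
    """Return the (i, k) pairs that are strict maxima of their surrounding region."""
    points = []
    for i, row in enumerate(spec):
        for k, value in enumerate(row):
            neighbours = [
                spec[a][b]
                for a in range(max(0, i - radius_i), min(len(spec), i + radius_i + 1))
                for b in range(max(0, k - radius_k), min(len(row), k + radius_k + 1))
                if (a, b) != (i, k)
            ]
            if all(value > other for other in neighbours):
                points.append((i, k))
    return points

# A pure tone that sits exactly on frequency index 2 of a 16-sample window.
N = 16
tone = [math.sin(2 * math.pi * 2 * t / N) for t in range(48)]
spec = spectrogram(tone, window_length=N, hop=N // 2)
strongest = max(range(len(spec[0])), key=lambda k: spec[0][k])

# A hand-made grid of magnitudes (rows are windows, columns are frequencies).
toy = [[0, 1, 0, 0],
       [1, 5, 1, 0],
       [0, 1, 0, 2]]
peaks = fingerprint(toy)
```

Increasing radius_i and radius_k corresponds to the larger region size mentioned above: fewer points survive, because each one has to beat more neighbours.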

Friday, 7 November 2008

Structure Of *.wav Audio Files

Posted by Juri at 10:43 Labels: wav

The structure of a wave file is very simple. It can be divided into 3 parts (chunks).

First chunk: The first 4 bytes should be "RIFF". Then come 4 bytes that give the size of the rest of the file (the file size minus these first 8 bytes). Then comes "WAVE".

Second chunk: It starts with "fmt ". Then come 4 bytes showing the length of the "fmt " chunk. Then come the audio format, number of channels, sample rate, byte rate, block align and bits per sample.

Third chunk: It is the audio data itself. As always, the first 4 bytes are the name of the chunk ("data") and the 4 bytes after that are the length of the chunk in bytes. After that come the samples themselves. With 16-bit audio, each sample takes 2 * number_of_channels bytes (2 bytes per channel).

There can also be other chunks between the first and the third, but they are not widely used. If you are interested, there is a very good article about the "Wave file format" on The Sonic Spot.
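The layout described above can be read with Python's struct module. This is a rough sketch, not a complete reader: parse_wav is a name made up for this example, it assumes the plain 16-byte PCM "fmt " layout, and it simply skips any chunk it does not know. The tiny file it parses is built in memory just for the demonstration.

```python
import struct

def parse_wav(data):
    """Parse the RIFF header and walk the chunks of a wave file held in bytes."""
    riff_id, riff_size, wave_id = struct.unpack_from("<4sI4s", data, 0)
    if riff_id != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    info = {"riff_size": riff_size}
    offset = 12
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, offset)
        body = offset + 8
        if chunk_id == b"fmt ":
            (info["audio_format"], info["channels"], info["sample_rate"],
             info["byte_rate"], info["block_align"],
             info["bits_per_sample"]) = struct.unpack_from("<HHIIHH", data, body)
        elif chunk_id == b"data":
            info["data"] = data[body:body + chunk_size]
        # Unknown chunks are skipped; chunks are padded to an even length.
        offset = body + chunk_size + (chunk_size & 1)
    return info

# Build a tiny 16-bit stereo file in memory: two sample frames, 44100 Hz.
samples = struct.pack("<4h", 1, -1, 2, -2)
fmt = struct.pack("<HHIIHH", 1, 2, 44100, 44100 * 4, 4, 16)
body = (b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"data" + struct.pack("<I", len(samples)) + samples)
wav_bytes = b"RIFF" + struct.pack("<I", len(body)) + body

info = parse_wav(wav_bytes)
```

The "<" in the format strings tells struct that the values are stored with the least significant byte first.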

Here is an example. The first chunk is shown in purple. Its size field reads 0x(00 00 08 24) = 0x824 = 2084, which is the size of the file minus the first 8 bytes.

Multi-byte values are stored little-endian, so the bytes should be read in reverse order: the first byte read is the least significant one!

You can check for yourself: left channel sample #5 = 0xE734 and the right one is 0xA623.
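The byte order can also be checked in code. In this small sketch I assume that the bytes appear in the file in the order 24 08 00 00 for the size field and 34 E7 23 A6 for sample #5, and that the samples are signed 16-bit numbers, as is usual for 16-bit PCM.

```python
import struct

# Size field of the first chunk, in the order the bytes appear in the file.
size_field = bytes([0x24, 0x08, 0x00, 0x00])
riff_size = int.from_bytes(size_field, "little")

# Sample #5: two bytes for the left channel, then two for the right channel.
frame = bytes([0x34, 0xE7, 0x23, 0xA6])
left, right = struct.unpack("<hh", frame)  # "<" = little-endian, "h" = signed 16-bit

# The same values as unsigned hexadecimal numbers, for comparison with the dump.
left_hex = format(left & 0xFFFF, "04X")
right_hex = format(right & 0xFFFF, "04X")
```

Both samples come out negative, because the top bit is set in 0xE734 and in 0xA623.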

Continued there...

Going to English

Posted by Juri at 10:42 Labels: blog

I decided to continue writing here in English, so that I can be read by a much wider audience. I hope my English is as good as my native language. We will see if it works. :)