The video shows frequencies (a spectrum), with lower frequencies at the bottom.
The blue part is a preview: seeing frequencies only at the instant they are played
feels too "sudden" to perceive.
Frequencies are normalized, so if you see the voice part "take over" while engine/music/other sounds fade away,
that's because the voice is the most prominent component and normalization makes the other sounds fade.
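Here is a minimal sketch of that kind of per-frame normalization (my illustration in NumPy, not the author's actual code): each time frame of a magnitude spectrogram is divided by its own peak, so a dominant band such as voice visually pushes the weaker bands down.

```python
import numpy as np

def normalize_frames(spectrogram):
    """spectrogram: 2-D array of non-negative magnitudes, shape (n_bands, n_frames)."""
    peaks = spectrogram.max(axis=0, keepdims=True)  # per-frame maximum
    peaks[peaks == 0] = 1.0                         # leave silent frames untouched
    return spectrogram / peaks                      # the loudest band in each frame becomes 1.0
```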
I can probably write it in about 30 lines of quite simple Python (no dependencies) – it's that simple.
It's so simple that sometimes I think it could be found by brute-forcing the simplest algorithms over an input.
But that's just an author's bias, of course: I had to go through a couple of hundred non-working ones over months
to find this one.
It's extremely lightweight – I can produce a spectrum using less than 10 * N * F operations overall.
It's highly likely that I have invented the most efficient/lightweight spectrum algorithm ever!
It is very lightweight, although the current edition is only comparable to FFT.
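To make the bound concrete, here is a quick back-of-the-envelope calculation. Reading N as the number of samples and F as the number of frequency bands is my interpretation, not something stated explicitly, and the example values are illustrative.

```python
# Plugging illustrative numbers into the stated bound of < 10 * N * F operations.
N = 44_100           # one second of audio at 44.1 kHz (example value)
F = 100              # number of frequency bands (example value)
print(10 * N * F)    # 44_100_000 -> a few tens of millions of operations per second of audio
```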
It's also conceptually the simplest. I think I could convey the idea in five sentences
to anyone who has passed Code Jam's round 1 (or maybe round 2).
An A4 sheet of information would be enough for a 14-year-old (probably one a bit inclined toward math).
More on comparison with FFT
The Fourier transform works perfectly on a periodic signal.
To apply it to sound, a 'sliding window' trick is used.
Moreover, since an arbitrary window doesn't match up at its ends, an additional 'smoothing' (a window function) must be applied.
Pick a window that is too wide, and you're likely to lose high frequencies.
Pick one that is too narrow, and you can't analyze low frequencies.
All this suggests that there's a better way.
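For reference, this is roughly what the conventional sliding-window FFT (STFT) pipeline described above looks like; the window length, stride and Hann taper are typical textbook choices, not anything from the author's method.

```python
import numpy as np

def stft_magnitude(signal, window=400, stride=160):
    """Conventional sliding-window FFT: returns a (window // 2 + 1, n_frames) magnitude spectrogram."""
    taper = np.hanning(window)                         # the 'smoothing' that hides mismatched window ends
    frames = []
    for start in range(0, len(signal) - window + 1, stride):
        chunk = signal[start:start + window] * taper   # cut out one window and smooth it
        frames.append(np.abs(np.fft.rfft(chunk)))      # magnitudes of its FFT
    return np.stack(frames, axis=1)
```

Widening `window` improves low-frequency resolution at the cost of time resolution, which is exactly the trade-off mentioned above.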
Please compare the analysis of a modem handshake sound: FFT
vs mine.
Or this FFT vs mine.
Or a violin sound decomposed.
Facebook's wav2letter uses a 25 ms sliding window and a 10 ms stride.
I simply don't need this.
While technically an FFT can be computed in O(N·log N) at best, given that the sliding window's length is fixed,
processing the whole signal takes O(N) time.
In practice, however, my method would need 10–100 times fewer operations to build a spectrogram.
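A quick sketch of why the fixed-window case is O(N): the FFT cost per window is a constant, so the total scales with the number of windows. The window/stride values and the ~5·W·log2(W) per-frame estimate below are rough assumptions, only meant to show the shape of the argument.

```python
import math

window, stride = 400, 160                        # assumed window/stride, in samples
ops_per_frame = 5 * window * math.log2(window)   # rough cost of one real FFT of length `window`
ops_per_sample = ops_per_frame / stride          # constant per input sample -> O(N) overall
print(round(ops_per_frame), round(ops_per_sample))  # roughly 17000 ops per frame, ~108 per sample
```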
The algorithm could greatly improve the quality of smart speakers and online speech-to-text services.
It could save a lot of battery on phones (especially for hot-phrase detection)
and even enable speech recognition on watches and earbuds!
It's certainly useful for all kinds of signal processing; for example, real-time software-defined radios could push their limits further.
Can I try it?
No. I used to publish a web service, but it didn't receive much attention. E-mail me if you're interested.