Coffee Coder

Shubham Jain's Weblog

Building a Whistle Detector Using WebAudio API


There are lots of things to get excited about in HTML5, and the one that caught my curiosity was the HTML5 Audio / Video API. I was overwhelmed with ideas for practical applications like face-detection login or inline dictation, but I chose to start with something small – a whistle detector. Although not wholly accurate, it works quite well in practice. I used M. Nilsson’s research paper, “Human Whistle Detection and Frequency Estimation”, to implement this. It took me a while to understand exactly what the paper describes with its mathematical notation, but luckily I wandered to the right place to get the right idea.

For the first part, I will try to explain the Successive Mean Quantization Transform (SMQT), which prepares the audio data for further processing.

Successive Mean Quantization Transform

A transformation in mathematics is an operation that maps one set to another. SMQT is such a transform, used to remove the bias or gain that results from differences between sensors (microphones) and other factors. In SMQT, we take the mean of the data set, split the set into two halves around that mean, and recursively do the same on each half. Data values above the mean are assigned “1” and the rest are assigned “0”. The recursion is carried out to a pre-defined depth, at the end of which we have a binary tree of 1s and 0s. Sounds confusing? Let's take an example set:

X = [89, 78, 63, 202, 90, 45, 112, 79, 95, 87, 90, 78, 54, 34, 66, 32].

Mean(X) = 80.875

The values above the mean are assigned “1” while those below are assigned “0”. So it becomes [1 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0]. Let this procedure be called U(X). Data values corresponding to “0” propagate to the left of the binary tree while those corresponding to “1” propagate to the right. So we have a tree which looks like,

Continue this process recursively till you reach a depth of L. ( Note: L = 8 in our application. )

After this, you weight each level by multiplying its bits by 2^(cur_level – 1) and add it up towards the top of the tree. So, if you have a tree which looks like,

Multiply D, E, F, G by 2^2, which gives [4 0 4 0], [0 0 4 0], [0 0 4 4], [4 0 0 0], and so on. Let's call this procedure of weighting individual arrays W(X). After we are done weighting, we add each node's subtrees to it. For example, B = W(B) + (W(D) . W(E)). We now have audio data that is free of bias and gain. (Gist).
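The recursive splitting described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the Gist: the function name smqt is made up here, and it assumes the common SMQT convention in which the first split contributes the most significant bit (weight 2^(depth – level)). Flip the exponent if you number the levels the other way round.

```python
def smqt(values, depth):
    """Return one integer code in [0, 2**depth - 1] for each input value."""
    codes = [0] * len(values)

    def split(indices, level):
        # Stop once the requested depth is used up or the branch is empty.
        if level > depth or not indices:
            return
        mean = sum(values[i] for i in indices) / len(indices)
        # Assumption: the first split carries the most significant bit.
        weight = 2 ** (depth - level)
        below, above = [], []
        for i in indices:
            if values[i] > mean:
                codes[i] += weight  # bit "1": value propagates right
                above.append(i)
            else:
                below.append(i)     # bit "0": value propagates left
        split(below, level + 1)
        split(above, level + 1)

    split(list(range(len(values))), 1)
    return codes


X = [89, 78, 63, 202, 90, 45, 112, 79, 95, 87, 90, 78, 54, 34, 66, 32]
print(smqt(X, 1))  # first-level bits, i.e. U(X)
print(smqt(X, 8))  # full 8-level transform
```

With a depth of 1 the output is just the bit pattern U(X) from the example; with a depth of 8 every value ends up as an integer between 0 and 255.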

Normalization and FFT

For the purpose of this detector we will use chunks of 512 elements, for which we will calculate the SMQT to a max depth of 8. After we have taken the SMQT of the audio data, we normalize the result so that its values fall within the range [-1, 1]: we divide the values by 2^(L – 1) and subtract 1 from the result.
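As a quick sketch of that normalization step (the function name normalize is only illustrative), assuming the SMQT output is a list of integers between 0 and 2^L – 1:

```python
def normalize(codes, depth=8):
    """Map SMQT codes from [0, 2**depth - 1] to roughly [-1, 1)."""
    scale = 2 ** (depth - 1)  # 128 when depth is 8
    return [c / scale - 1 for c in codes]


print(normalize([0, 128, 255]))
```

The smallest code maps to -1, the midpoint 128 maps to 0, and the largest code lands just below 1.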

Taking the Fast Fourier Transform of the normalized data will give us an array of N = 256 elements. Let this FFT be denoted by F(T). The point to note here is that, because we are using 256 elements to represent a range of 0 – 22 kHz, each element represents about 86 Hz of frequency (22,050 / 256). To detect a whistle, we will need to extract two feature vectors.
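To see how FFT elements map to frequencies, here is a small sketch. It assumes a sample rate of 44.1 kHz, which is not stated in this post but is the usual default for audio input; the helper names are made up for illustration.

```python
SAMPLE_RATE = 44100  # Hz; an assumption, check your audio context
CHUNK_SIZE = 512     # samples per chunk, as used in this post


def bin_width(sample_rate=SAMPLE_RATE, chunk_size=CHUNK_SIZE):
    """Frequency span covered by one FFT element, in Hz."""
    return sample_rate / chunk_size


def freq_to_bin(freq_hz, sample_rate=SAMPLE_RATE, chunk_size=CHUNK_SIZE):
    """Index of the FFT element closest to the given frequency."""
    return round(freq_hz / bin_width(sample_rate, chunk_size))


print(bin_width())        # Hz per element
print(freq_to_bin(500))   # lower edge of the whistle band
print(freq_to_bin(5000))  # upper edge of the whistle band
```

Under these assumptions the 500 – 5000 Hz whistle band covers roughly elements 6 to 58 of the 256-element array.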

Calculation of feature vectors

A human whistle generally falls in the range of 500 – 5000 Hz. Want to try? Take a look at FFTExplorer. Our estimation of a whistle involves calculating two feature vectors (or values, in simple terms). In the first step, we find the results of a band-pass and a band-stop filter applied to F(T) over the frequency range 500 – 5000 Hz, called pbp(t) and pbs(t). The way I have implemented the filters is pretty basic (and wrong), though: I have attenuated amplitudes to a fixed value, but filters are generally much more complex than that.

The two feature vectors aim at finding spikes in our frequency range, which strongly suggest the presence of a whistle. They result from the following requirements:

  • The largest value in pbp(t) should typically be larger than the mean of pbs(t) in the presence of a whistle.

  • In the presence of a whistle, pbp(t) typically has a few very dominant values.
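The first requirement can be illustrated with a rough sketch. This is not the equation from the paper: it simply slices the spectrum into in-band and out-of-band parts (instead of attenuating to a fixed value) and compares the in-band peak with the out-of-band mean. The function names and the bin limits 6 and 58 are assumptions made for this example.

```python
def split_bands(spectrum, low_bin, high_bin):
    """Split magnitudes into the whistle band and everything outside it."""
    in_band = spectrum[low_bin:high_bin + 1]
    out_band = spectrum[:low_bin] + spectrum[high_bin + 1:]
    return in_band, out_band


def peak_to_mean_ratio(spectrum, low_bin, high_bin):
    """Largest in-band magnitude divided by the mean out-of-band magnitude."""
    in_band, out_band = split_bands(spectrum, low_bin, high_bin)
    if not in_band or not out_band:
        raise ValueError("band limits leave one side empty")
    mean_out = sum(out_band) / len(out_band)
    if mean_out == 0:
        return float("inf")
    return max(in_band) / mean_out


# Flat spectrum with a single strong spike inside the whistle band.
spectrum = [1.0] * 256
spectrum[20] = 50.0
print(peak_to_mean_ratio(spectrum, 6, 58))
```

A large ratio means a narrow spike dominates the whistle band, which is what both requirements are looking for.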

First feature vector

For the first feature vector we will use the following equation,

The value must be greater than 25.

Second feature vector

The second one is a bit tricky. First we will find a new array by,

Next, we calculate two vectors, given by

The theory behind this is to detect peaks by comparing both the vectors. For measuring the similarity, we will exploit Jensen Difference, given by,

The value of J(v, v') must be around 0.44.
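For reference, the Jensen difference is commonly defined as the entropy of the averaged distribution minus the average of the two entropies. The sketch below uses that textbook definition with natural logarithms and normalizes both vectors to sum to 1; whether the paper uses the same logarithm base and normalization is an assumption, so the 0.44 threshold may not carry over directly.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a distribution that sums to 1."""
    return -sum(x * math.log(x) for x in p if x > 0)


def to_distribution(v):
    """Scale non-negative values so that they sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def jensen_difference(v, w):
    """H((p + q) / 2) - (H(p) + H(q)) / 2 for the normalized inputs."""
    p, q = to_distribution(v), to_distribution(w)
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mid) - (entropy(p) + entropy(q)) / 2


print(jensen_difference([1, 2, 3], [1, 2, 3]))  # identical vectors
print(jensen_difference([1, 0], [0, 1]))        # completely different vectors
```

Identical vectors give 0, and the value grows as the two vectors diverge, up to ln 2 with this choice of logarithm.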

Note: The threshold values are only meant as general guidance. If you find other values more suitable, use them to suit your needs.

Threshold positives and accuracy

The problem of false positives will still persist for various kinds of noises and sounds. To be more precise, we can count the number of positive samples within X samples and compare it with our threshold. If it exceeds the threshold positives, then it probably is a whistle. For our project, the chosen threshold is 6, which can be increased for further accuracy.
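A sliding window is one simple way to count positives over the last X samples. In this sketch the class name and the window size of 10 are hypothetical; only the threshold of 6 comes from the text, and "at least 6" is used as the trigger condition.

```python
from collections import deque


class WhistleVote:
    """Counts positive chunks among the most recent `window` chunks."""

    def __init__(self, window=10, threshold=6):
        self.recent = deque(maxlen=window)  # oldest results drop out automatically
        self.threshold = threshold

    def update(self, is_positive):
        """Record one chunk result and report whether a whistle is likely."""
        self.recent.append(bool(is_positive))
        # Design choice: trigger on "at least threshold" positives.
        return sum(self.recent) >= self.threshold


vote = WhistleVote()
results = [vote.update(True) for _ in range(6)]
print(results)
```

Feeding each chunk's detection result into update() smooths out isolated false positives, since a single noisy chunk never reaches the threshold on its own.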

The whistle detector is quite accurate even under acceptable levels of noise, but the accuracy will decrease as the threshold is lowered. However, increasing it to a much higher value may fail to detect even a long whistle. So it must stay in a moderate range to be accurate enough for a practical application.

Why Desktop Apps Still Make Sense


So you think desktop apps will die a slow death? The demise of desktop apps has been professed by many people, Jeff Atwood and Patrick McKenzie to name a few. With the colossal jump in our web technologies, both performance- and capability-wise, making possible things that couldn’t have been done a few years back, the idea is getting even more traction. When you see demos like this, you are tempted to think: if web browsers can address the performance issue, what would hold the web back from being used for everything from games to essential business software?

So why do I think desktop apps still make sense?

  • Passiveness of income: While it may be true that web apps fare great when it comes to potential revenue, cross-platform compatibility, and reach, it is impossible to avoid working actively on them. Have you ever heard of someone pulling in revenue from an age-old SaaS app without putting in any work? On the other side, I still get sales from one of my stupid scripts that I made a long time ago and never even marketed. This may be a generalization; it is possible that someone earns a passive income from a web app without adding anything, but it is hard to think that the developer can avoid tasks like marketing, server monitoring, dealing with quirky consumer issues, or scaling.

  • Less overhead in selling: Selling a web app means integrating it with an API to accept payments, offering an X-day trial, sending emails reminding users that the trial period is ending, and charging cards on a recurring basis. If you happen to use PayPal and their API, be prepared to pull your hair out doing this. In contrast, selling desktop apps is much easier, with many services available for selling downloads – Gumroad, Softpedia, CNET.

  • Fewer obligations to deal with issues: With a web app, you are expected to address any issue that pops up immediately, and with the many peculiarities of CSS and HTML, it is reasonable to expect that in “some browser” on “some device” the text is overflowing outside its container. If you are dealing with a desktop app, you have plenty of time till the next release is due (except for security issues, of course).

So it boils down to this: desktop apps are about writing code and shipping, whereas web apps require you to always be there. You probably won’t lose a customer over some little CSS issue, but the way your mind works, you won’t be able to stop yourself from addressing it immediately, and that is where desktop apps have the upper hand – you can prioritize which things to add / fix / remove in the next release. One example of this has been documented by Joel Spolsky.

As Excel 5 was nearing completion, I started working on the Excel 6 spec with a colleague, Eric Michelman. We sat down to go through the list of “Excel 6” features that had been cut from the Excel 5 schedule. We were absolutely shocked to see that the list of cut features was the shoddiest list of features you could imagine. Not one of those features was worth doing. I don’t think a single one of them was ever done, even in the next three releases. The process of culling features to fit a schedule was the best thing we could have done. If we hadn’t done this, Excel 5 would have taken twice as long and included 50% useless crap features — Painless Software Schedules

Whether or not desktop apps will cease, web apps certainly aren’t a de-facto choice when it comes to making a product.

Stop Using Captchas That Can Be Broken With Two Lines of Code


The de-facto bot prevention technique is sprawled everywhere on the web, but I am surprised how much people overestimate the difficulty of breaking a captcha. Here are some that I encountered.

It doesn’t take rocket science to convert them to text. In fact, the only two free tools needed for this purpose are Tesseract and ImageMagick.

Convert command
convert captcha.jpg -threshold 5% a.jpg

Adjust the threshold value to get a binary image with no noise.

tesseract -l eng a.jpg text

This will create a “text.txt” file with the captcha text in it.
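If you want to run the two steps over a batch of test images, they can be wrapped in a small script. This is only a sketch: the function names are made up, it assumes both tools are on your PATH, and it passes the language option after the file arguments, which is the order shown in Tesseract's usage text.

```python
import subprocess


def build_commands(captcha_path, cleaned_path="a.jpg",
                   output_base="text", threshold="5%"):
    """Return the argument lists for the ImageMagick and Tesseract steps."""
    convert_cmd = ["convert", captcha_path, "-threshold", threshold, cleaned_path]
    tesseract_cmd = ["tesseract", cleaned_path, output_base, "-l", "eng"]
    return convert_cmd, tesseract_cmd


def read_captcha(captcha_path):
    """Binarize the image, run OCR on it, and return the recognized text."""
    convert_cmd, tesseract_cmd = build_commands(captcha_path)
    subprocess.run(convert_cmd, check=True)
    subprocess.run(tesseract_cmd, check=True)
    # Tesseract appends ".txt" to the output base name.
    with open("text.txt", encoding="utf-8") as fh:
        return fh.read().strip()


if __name__ == "__main__":
    print(build_commands("captcha.jpg"))
```

Keeping the command construction separate from the execution makes it easy to try different threshold values on the same image.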

I ran a test on the effectiveness of these commands and the success rate was nearly 9 / 10, clearly implying that weak or, better said, made-from-scratch captcha implementations are as good as having none at all.

Making Earphone Presses Useful With PyAudio and VLC HTTP API


Note: I am not an audio expert or even close to one. This post may show amateur attempts to do something very trivial. Link to the GitHub repository.

Ever had one of those moments when you are super excited about accomplishing a challenge, having put something useful on the table, only to realize it is not even close to the greatness you imagined; to put it bitterly, futile? This weekend I built something to detect earphone button presses and control VLC media player with it, but it was not so useful after all.

Earphone Presses

"Samsung Earphones"

(Link to original image)

I own a pair of Samsung earphones. Intrigued by how the buttons used to switch / pause tracks on smartphones work, I plugged them into my combo jack, opened Audacity, pressed a button, and the result was:

Wave form for button press

“Great! Awesome find!” exclaimed my mind. So how can we make this into something useful?

Reading MP3 ID3 Tags in Native PHP


This week I went crazy about file formats. I tried to understand the specifications of many popular formats like MP3, FLV, and PDF. It's amazing to see that no matter how complex these technologies are, or the algorithms they use to store media efficiently, at the lower level it is just a clever arrangement of bits that makes sense. With a bit of experimentation and hacking around the MP3 format (a hex editor is an invaluable tool for this), I was able to read ID3 tags in PHP without using any extension. The source has been put on GitHub.

Binary File Reader

The native method for reading a binary file is unpack(). The problem with it was that it can’t handle variable-length chunks, and I found it tough to understand the format of the packing codes. Unluckily, I realized quite late (damn!) that I could have created the reader more efficiently by using the unpack() function. (Gist)
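To show what format-code based unpacking looks like, here is a sketch in Python, whose struct module plays a role similar to PHP's unpack(). It reads the simple ID3v1 tag, a fixed 128-byte block at the end of the file, chosen only because it is short; the reader on GitHub may well target other ID3 versions, and the function names here are illustrative.

```python
import os
import struct

# ID3v1 layout: "TAG", title, artist, album, year, comment, genre = 128 bytes.
ID3V1 = struct.Struct("3s30s30s30s4s30sB")


def parse_id3v1(block):
    """Decode a 128-byte ID3v1 block; return None if the marker is missing."""
    if len(block) != ID3V1.size:
        raise ValueError("an ID3v1 tag is exactly 128 bytes long")
    tag, title, artist, album, year, comment, genre = ID3V1.unpack(block)
    if tag != b"TAG":
        return None

    def text(raw):
        # Fields are padded with NUL bytes (or spaces) up to their fixed width.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {"title": text(title), "artist": text(artist), "album": text(album),
            "year": text(year), "comment": text(comment), "genre": genre}


def read_id3v1(path):
    """Read the last 128 bytes of a file and parse them as an ID3v1 tag."""
    with open(path, "rb") as fh:
        fh.seek(-ID3V1.size, os.SEEK_END)
        return parse_id3v1(fh.read(ID3V1.size))
```

A single format string describes the whole fixed-width layout, which is the same idea as the packing codes that unpack() uses in PHP.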

Why You Should Never Freelance on Freelancing Sites.


Back in the day, I used to be the crazy money-minded programmer writing kLOCs of crap with no concern for code quality, for projects I often found on freelancing sites, oDesk and the like, which seem to be quite popular among employers looking for cheap third-world-country coders. But honestly, if you want to be a better programmer, never log in to them. Why?