a playful generator of the normal distribution

thanks to Annie Spratt

In my previous posts (1) and (2) I presented several animated visualizations showing how the normal distribution arises from a recursive or iterative process that interprets the Central Limit Theorem.

But this is not the only way to arrive at the normal distribution. In this post I’m going to talk about the derivation that I like the most, which inspired another simple, interactive application.

Imagine that you have to hit a target. You aim at its center, but since you aren’t a crack shot your throws will deviate from it to some extent.

visualizing the uncertainty in a football league table

Visualizing uncertainty is a trending topic in the data visualization community. It acknowledges the fact that data is often subject to many sources of error: sampling error, measurement error, processing error, etc.

A football league table typically doesn’t suffer from these problems. Nevertheless, it makes sense to discuss how uncertainty works here: it is simply a concept that has to be applied in a specific way.

In Italy we often say that the championship is uncertain (“il campionato è incerto”), meaning that it is hard-fought and, ultimately, engaging.

In other words, it’s uncertain when the ranking table changes or could change at every matchweek. Changes or could change: this is the key phrase that I’m going to expand upon.

visualizing the central limit theorem /2


Several sources (starting with Wikipedia) state that the Galton box is a (visual) demonstration of the Central Limit Theorem. This claim bothers me a little, because that result is only incidental.

Indeed, the Galton box simulates the outcomes of a binomial variable by dropping several balls across an interleaved grid of pegs, showing that when the number of balls becomes large their arrangement at the bottom yields a very good approximation of the binomial distribution. In other words, it represents an empirical proof of the Law of Large Numbers.
On the other hand, my previous visualization shows that, as the number of trials increases in a sequence of theoretical binomial distributions, the normal distribution comes into view very quickly. This is the real sense of the Central Limit Theorem, in its simplest version, and the reason why the Galton machine works.
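
The empirical side of this distinction can be sketched in a few lines of Python. This is only an illustration, not the code behind the visualizations: the function names, the 10 peg levels, the ball counts and the fixed seed are all assumptions. It drops balls through a simulated Galton box and measures how far the resulting shares are from the exact binomial probabilities.

```python
import random
from math import comb

def binomial_pmf(m, p=0.5):
    """Exact probabilities of 0..m successes in m trials."""
    return [comb(m, k) * p**k * (1 - p)**(m - k) for k in range(m + 1)]

def galton_box(n_balls, m_levels, p=0.5, seed=0):
    """Drop n_balls through m_levels of pegs; return the share of balls per bin."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    bins = [0] * (m_levels + 1)
    for _ in range(n_balls):
        # Each peg sends the ball to the right with probability p.
        rights = sum(rng.random() < p for _ in range(m_levels))
        bins[rights] += 1
    return [count / n_balls for count in bins]

def max_gap(empirical, exact):
    """Largest absolute difference between two distributions on the same bins."""
    return max(abs(e - x) for e, x in zip(empirical, exact))

exact = binomial_pmf(10)
for n in (100, 10_000, 100_000):
    print(n, round(max_gap(galton_box(n, 10), exact), 4))
```

With the number of peg levels held fixed, the printed gap tends to shrink as the number of balls grows, but the limit it approaches is the binomial distribution with 10 trials, not the normal one.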

Let me rephrase the above distinction formally: if Fm(Xn) is the sample distribution of a collection of n binomial variables, each counting the number of successful events with constant probability p in m trials, then the following holds, with some abuse of notation:
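
A reconstruction of the two relations from the definitions above, with B(m, p) for the binomial distribution and N for the normal one (both symbols are assumed here, not taken from the original):

$$
F_m(X_n) \xrightarrow[n \to \infty]{} B(m, p), \qquad B(m, p) \xrightarrow[m \to \infty]{} N\big(mp,\ mp(1-p)\big)
$$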

The first relation describes the Law of Large Numbers; the second one the Central Limit Theorem.

If you think about them statically, the Galton box points out that for a large value of n, even a moderate value of m is enough to obtain a sample distribution very close to the normal one.

But here I insist that the key word for really appreciating the meaning of the two laws is convergence. From this point of view, in my opinion there is a misunderstanding about the Galton box: it illustrates the former law but is credited with the latter. As the number of balls increases, their distribution gets closer and closer to the binomial distribution, whose approximation to the normal one is good but cannot get any better, because the number of peg levels doesn’t change.
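
The point about the fixed number of peg levels can be checked numerically. The sketch below is an illustration under assumptions: it measures the distance between the exact binomial distribution and the normal one with the same mean and variance as the largest gap between their cumulative distributions at the integer points, which is just one possible choice of measure, and the function names are hypothetical.

```python
from math import comb, erf, sqrt

def binomial_cdf(m, p=0.5):
    """Cumulative probabilities P(X <= k) for k = 0..m."""
    total, cdf = 0.0, []
    for k in range(m + 1):
        total += comb(m, k) * p**k * (1 - p)**(m - k)
        cdf.append(total)
    return cdf

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal variable."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

def gap_to_normal(m, p=0.5):
    """Largest |binomial CDF - normal CDF| over k = 0..m, same mean and variance."""
    mean, sd = m * p, sqrt(m * p * (1 - p))
    return max(abs(b - normal_cdf(k, mean, sd))
               for k, b in enumerate(binomial_cdf(m, p)))

for m in (10, 40, 160, 640):
    print(m, round(gap_to_normal(m), 4))
```

For a given m this gap is a fixed number, so adding balls to a box with m peg levels cannot reduce it; only increasing m does.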

Is it possible to imagine a different mechanism to highlight the distinction between the two convergence laws?

visualizing the central limit theorem

I found several animations of the Central Limit Theorem on the web. Most of them are implementations of the Galton box that show how close the binomial distribution is to the normal distribution.

While sketching out the flow chart for a different kind of program, I realized that the same concept can be illustrated in another way, focusing on the analytical aspect of the approximation process (charting the exact probability distribution) instead of the empirical one (sampling from the exact probability distribution) as in the Galton box.

To describe it, consider the classic example of a fair coin flipped repeatedly. In every toss the coin has a 50% probability of landing heads and a 50% probability of landing tails. Now let us reason about the number of heads after each toss.
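
The reasoning about the number of heads after each toss can be sketched as a simple recursion: after one more toss, each count either stays the same (tails) or goes up by one (heads). The Python below is a minimal sketch of that idea, not the flow chart mentioned above; the function name and the choice of 4 tosses are assumptions.

```python
def heads_distribution(tosses, p=0.5):
    """Exact distribution of the number of heads after each toss.

    Entry t of the returned list holds the probabilities of 0..t heads
    after t tosses.
    """
    history = [[1.0]]  # before any toss: 0 heads with certainty
    for _ in range(tosses):
        prev = history[-1]
        nxt = [0.0] * (len(prev) + 1)
        for heads, prob in enumerate(prev):
            nxt[heads] += prob * (1 - p)   # tails: the count is unchanged
            nxt[heads + 1] += prob * p     # heads: the count goes up by one
        history.append(nxt)
    return history

for t, dist in enumerate(heads_distribution(4)):
    print(t, dist)
```

Each row is the exact binomial distribution for that number of tosses, so charting the rows one after another shows the analytical side of the approximation without any sampling.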

launch of wordpress spam analytics

As I recently wrote, blog spam comments can be a cool source of data to analyze, even if only for fun. At first I was attracted to the content of the comments, but then an article on marketpress suggested some other interesting directions for exploring the data and gave me the idea of presenting the numerical results in interactive charts.

a comment reply by email solution for a self-hosted WordPress blog

For my self-hosted WordPress blog I don’t delegate comment management to platforms like Disqus or wordpress.com. I have nothing against them in principle; on the contrary, I acknowledge that they offer a useful and convenient service. I simply belong to what is (I presume) the minority of people who nowadays prefer their own solution to a third-party one.

That said, I would have very much liked to have the reply-to-comment-by-email feature that is available to wordpress.com users. It is very convenient (and on some occasions it is the only way) to continue a discussion from within the mail client without needing to open the blog site.

visualizing Simpson’s paradox

Time changes things and ideas. Lately I was thinking that the chart I described three years ago for a two-way table might not be as impressive as I had expected. Nevertheless, I kept wondering whether in some cases it could be useful to exploit its two simple properties: different bar widths, and the equivalence of their areas with the underlying cluster area, to show the contribution of each row (or column) to the overall total and the contribution of each value to the row (or column) total.

Eventually I figured out that the visualization of Simpson’s paradox is one of these instances. The Wikipedia page shows three well-known graphical representations of Simpson’s paradox: a correlation scatterplot for the continuous case, and two diagrams corresponding to its vector and physical interpretations for the discrete case.
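
To see why bars with different widths suit the discrete case, here is a small Python sketch with purely hypothetical counts (the options "A" and "B", the subgroups and all the numbers are invented for illustration). It assumes one reading of the chart: each bar's width is the subgroup's share of trials and its height is the subgroup's success rate, so the areas add up to the overall rate.

```python
# Hypothetical counts: (successes, trials) per option and subgroup.
counts = {
    "A": {"easy": (9, 10), "hard": (30, 90)},
    "B": {"easy": (80, 90), "hard": (3, 10)},
}

def rate(option, group):
    """Success rate of one option within one subgroup."""
    successes, trials = counts[option][group]
    return successes / trials

def overall_rate(option):
    """Success rate of one option over all subgroups together."""
    successes = sum(s for s, _ in counts[option].values())
    trials = sum(t for _, t in counts[option].values())
    return successes / trials

def bars(option):
    """One (group, width, height) triple per subgroup.

    width  = the subgroup's share of the option's trials
    height = the success rate within the subgroup
    """
    total = sum(t for _, t in counts[option].values())
    return [(g, t / total, s / t) for g, (s, t) in counts[option].items()]

for option in counts:
    print(option, bars(option), round(overall_rate(option), 3))
```

With these counts, A has the higher rate in both subgroups while B has the higher overall rate: the heights favor A, but the widths put most of A's weight on its low bar.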

a heat bubble matrix chart for two-way tables


A fundamental concept I recalled in the article I wrote, shamefully, a very long time ago is that different charts correspond to different analysis needs and reproduce different views of a two-way table. While at the time I discussed some charts that translate the table by horizontal or vertical sections, that is, by rows or columns, here I’m going to introduce a chart belonging to the group of symmetrical representations: those that do not change when the table is transposed and hence treat rows and columns on the same level.

from table to chart in WordPress

After discovering HighCharts, I was captivated by its elegance and simplicity. So I started experimenting, and then decided to build a little WordPress plugin that lets anyone automatically create charts from tables of numerical data in blog posts.

Eventually I gave birth to a plugin which, with a monstrous effort of imagination, I called Table2Chart.

use google spreadsheet as a proxy

Notice: This article may be outdated due to the availability of new Google services. It is kept for the uniqueness of its main idea.

Applications of Google spreadsheet functions for external data are virtually endless, and my previous article on how =IMPORTXML() can help to automate the process of collecting web data is just one example. Since imagination has no limits, I want to show that another possibility is to build… a proxy. Yes, a proxy inside a Google spreadsheet. Let’s see.

According to Google’s help, the function =IMPORTDATA() retrieves information from a CSV or TSV file, but in practice it can be used for any web page. So, if I am prevented from visiting a certain web site, say www.example.com, in theory I could open a spreadsheet, get the source code of its web page with =IMPORTDATA("www.example.com"), copy and paste the returned string into a Notepad window, save it as an HTML file and finally open it with the browser. If I need another page, I have to repeat all these steps, which is clearly a bit clumsy.