All Things Data Science: De Moivre's equation and the solar panels of Lo-Reninge

A few weeks ago I saw an innocent little article on solar panels in the Flemish quality newspaper 'De Standaard', entitled "Niemand maakt meer zonne-energie dan inwoners Lo-Reninge", which roughly translates to "no one produces more solar energy than the inhabitants of Lo-Reninge". The article reports on the production of solar energy by individual households, typically by small installations on rooftops. The Flemish authorities support solar energy by subsidizing households that install solar panels. An important part of the subsidies is handled by issuing so-called 'green certificates' (or renewable energy certificates) per fixed amount of kilowatt-hours produced. See here for more details on solar power in Belgium. De Standaard, citing data from the Flemish Regulator of the Electricity and Gas Market (VREG), reported the number of these certificates issued in 2012 relative to the number of inhabitants per municipality. Lo-Reninge came out as number one.

If you're not from Belgium, you've probably never heard of Lo-Reninge. Well, I live in Belgium, there are only slightly more than 300 municipalities in Flanders, and I had never heard of Lo-Reninge either. That is exactly what caught my attention. Here's the top 10 reported by the newspaper:

Municipality	Certificates per inhabitant
Lo-Reninge	0.39210
Peer	0.34080
Opglabbeek	0.34051
Bocholt	0.32443
Zuienkerke	0.32198
Vleteren	0.31637
Balen	0.31401
Nieuwerkerken (Limburg)	0.31060
Wellen	0.30895
Alveringem	0.30774

In fact, in this top 10, only Balen has more than 20,000 inhabitants. This gives the impression that very small municipalities have more solar panels than larger ones. In general there is indeed a relationship between the size of a municipality and the production of solar energy with small installations, but that relationship is more subtle than this table seems to suggest. The problem is that this real effect is confounded with what Howard Wainer calls "the most dangerous equation", also known as de Moivre's equation. But before we come to that, let me give you an artificial example that demonstrates the effect more easily.

Suppose that the journalist of 'De Standaard' visited each and every municipality in Flanders and asked every inhabitant to flip a coin. Similar to the solar panel case, where the count of certificates per municipality was divided by the number of inhabitants, he could now report the number of heads divided by the number of inhabitants. Most of us would expect these figures to be close to 0.50, and we would be right. But there's more to this than meets the eye. What would happen if he had reported the top 10 of municipalities with the highest proportion of heads?

I did this through simulation, and here's my top 10:

Zuienkerke
Herne
Oudenburg
Baarle-Hertog
Veurne
Ledegem
Herenthout
De Haan
Kortessem
Vleteren

No Lo-Reninge this time, but beer drinkers might remember that our number 10, Vleteren, of Westvleteren Brewery fame, was number 6 in the solar panel competition. Also, our number 1, Zuienkerke, was number 5 in the De Standaard article. While Lo-Reninge did not appear in our coin flipping top 10 this time, it most certainly would if we repeated the exercise a few times. I simulated 1000 repeats of the coin flipping competition and Lo-Reninge ended up 164 times in the top 10 of municipalities with the highest proportion of 'heads'. It also ended up 155 times in the worst 10 (i.e. those with the lowest proportion of 'heads'). Is there a magical correlation between having solar panels and coin flipping? Of course not. If you know something about statistics, Flemish geography or beer, you will by now have realized that some of the municipalities involved are so small that they are more likely than others to show extreme numbers. Take for instance Herstappe, Flanders' smallest municipality, with fewer than 100 inhabitants. In my 1000 repeats of the coin flipping competition, Herstappe ended up 462 times in the top 10 and 416 times in the bottom 10. So in only 12.2% of the repeats did Herstappe end up in neither the top 10 of the most 'heads' nor the top 10 of the most 'tails' (out of about 300 municipalities in Flanders). What we see here is de Moivre's equation at work:

$\sigma_{\overline x}= {\sigma \over\sqrt{n} },$

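in which $\sigma_{\overline x}$ is the standard error of the mean, $\sigma$ is the standard deviation of the individual observations and $n$ is the sample size. For a fair coin $\sigma = 0.5$, so the standard error of the proportion of heads shrinks with the square root of the number of inhabitants.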
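The repeated coin flipping competition described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the numbers quoted in this post: the population sizes are made up (300 hypothetical towns plus one hypothetical `tiny` municipality of 90 inhabitants), the function names are my own, and the number of repeats is kept at 100 so that it runs quickly. The `n` fair coin flips of a municipality are simulated by counting the 1-bits in `n` random bits.

```python
import random


def flip_heads(n, rng):
    """Number of heads in n fair coin flips (the 1-bits among n random bits)."""
    return bin(rng.getrandbits(n)).count("1")


def count_extremes(populations, repeats=100, k=10, seed=1):
    """Count how often each municipality ends up in the top k and bottom k."""
    rng = random.Random(seed)
    top = dict.fromkeys(populations, 0)
    bottom = dict.fromkeys(populations, 0)
    for _ in range(repeats):
        # Proportion of heads per municipality for this repeat
        share = {name: flip_heads(n, rng) / n for name, n in populations.items()}
        ranked = sorted(share, key=share.get)  # lowest proportion first
        for name in ranked[-k:]:
            top[name] += 1
        for name in ranked[:k]:
            bottom[name] += 1
    return top, bottom


# Hypothetical population sizes, NOT the real Flemish figures
populations = {f"town_{i:03d}": 500 + 50 * i for i in range(300)}
populations["tiny"] = 90

top, bottom = count_extremes(populations)
print("tiny:    top", top["tiny"], "bottom", bottom["tiny"])
print("largest: top", top["town_299"], "bottom", bottom["town_299"])
```

With sizes like these, the smallest municipality should show up in one of the two extreme lists in most repeats, while the largest one should hardly ever appear in either, even though every inhabitant flips exactly the same fair coin.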

We can further illustrate this by plotting the proportion of heads per municipality on the y-axis and the number of inhabitants on the x-axis for a typical run of such an experiment.

The mean is indicated by the red line. As expected, the outcomes of the municipalities are all scattered around the overall mean (which is close to the theoretically expected 0.50), but there is much more dispersion around that mean in the small municipalities than there is in the larger ones. We can use de Moivre's equation above to calculate the standard error as a function of $n$ and then plot the mean plus and minus 2 times that standard error. This results in the blue lines. As you can see, most outcomes of our experiment fall within what you could reasonably expect.
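As a rough sketch of how such limits can be computed, the snippet below evaluates the mean plus and minus 2 standard errors for a given number of inhabitants and checks how many simulated municipalities fall inside. The function name `band`, the seed and the population sizes are assumptions made for this illustration only; the real population figures are not used.

```python
import math
import random


def band(n, p=0.5, width=2.0):
    """Lower and upper limit p -/+ width * sigma / sqrt(n) for a proportion."""
    sigma = math.sqrt(p * (1 - p))  # standard deviation of a single coin flip
    se = sigma / math.sqrt(n)       # de Moivre's equation
    return p - width * se, p + width * se


rng = random.Random(42)
# Hypothetical population sizes, NOT the real Flemish figures
sizes = [rng.randint(100, 50_000) for _ in range(300)]

inside = 0
for n in sizes:
    heads = bin(rng.getrandbits(n)).count("1")  # n fair coin flips
    low, high = band(n)
    inside += low <= heads / n <= high

print(f"{inside / len(sizes):.0%} of the municipalities fall inside the band")
```

For 100 inhabitants the band runs from roughly 0.40 to 0.60, for 10,000 inhabitants from roughly 0.49 to 0.51, which is the funnel shape visible in the plot.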

The solar panel case is a bit more complicated. The graph below plots the number of certificates issued in 2012 per municipality relative to the number of inhabitants. It's difficult to see from the plot right away, but the LOESS regression, summarized by the green line, indicates that there is indeed a relationship between the population size of a municipality and the (relative) usage of solar panels: the larger a municipality (in number of inhabitants), the lower the number of certificates per inhabitant.

But you can also see that there is a clear relationship between the variability of the ratios and the number of inhabitants of a municipality. For those of you who care, the dot on the far right (no pun intended) is Antwerp and the dot in the middle (at around 250,000) is Ghent.

From a data journalism point of view I don't think you can blame the journalist here. To start with, the solar panel case is less clear-cut than the coin flipping example because there really is a relationship between the number of inhabitants and the usage of solar panels. But also, I know a few statisticians who don't always fully realize what the consequences of de Moivre's equation are. And, in my younger days, I got fooled a few times myself. Finally, there are some well-known cases where the consequences of the misunderstanding were much more serious. Howard Wainer lists some interesting cases in the article that I referred to earlier. Here, I'll just pick one of those examples. Based, among other things, on 'the observation that among high-performing schools, there is an unrepresentatively large proportion of smaller schools', there was, by the end of the last century, a growing movement to support smaller schools. The Bill and Melinda Gates Foundation, for instance, was offering grants to education projects that supported smaller schools. Howard Wainer and Harris Zwerling showed that the smaller schools were overrepresented not only in the high-performing group but also in the low-performing group, which is consistent with what we would expect from de Moivre's equation. Taking this into consideration, they found that, overall, students at bigger schools do better. In the meantime, the Gates Foundation has announced that it is 'moving away from its emphasis on converting large high schools into smaller ones'.

Just two more things regarding the solar panel case. To start with, I noticed that either VREG or the journalist used the population figures of the 1st of January 2011, and not those of the 31st of December of that same year. The latter would have made more sense, since the certificates issued in 2012 were considered. Secondly, in my effort to replicate the calculations of VREG or the journalist, I could not help but notice that my top 10 matched the table published in De Standaard up to 4 or 5 decimals. The only exception was that I have Kinrooi listed as number two with a ratio of 0.34318, while, according to the map in De Standaard, the number of certificates relative to the number of inhabitants of Kinrooi is a meager 0.1133. A quick check reveals that the number of inhabitants of neighboring Maaseik was used instead of that of Kinrooi.

To conclude, I wouldn't go as far as saying that De Standaard found a rather convoluted way to track down Flanders' smallest municipalities, but the least you could say is that the top 10 presented is probably strongly influenced by the number of inhabitants rather than being a pure reflection of the usage of solar panels.


Sunday, September 15, 2013
