On Friday 25 October 2013, "De Standaard" ran an article under the headline "Vluchtelingen moeten het doen met beloftes" ("Refugees have to make do with promises"). The article itself is fine: it deals with the problem of refugees in Europe, which has risen high on the European agenda because of the disaster off Lampedusa. The chart accompanying the article, however, is not exactly a bull's-eye.

The problem with this chart is that it uses the areas of circles to compare proportions, and that is particularly hard to do. Take the United Kingdom, for example. About half (14,600) of the 28,200 asylum applications are approved. The area of the red circle is therefore also about half that of the blue circle. I checked the calculation, and it holds up quite well, but the average reader will probably not immediately think of that ratio. But fine, the numbers themselves are neatly printed alongside, so even if the chart does not work well visually, you still have the numbers.

Worse is that this chart does not really support the accompanying story. The point is that the southern countries feel they have to bear the largest share of the refugee burden, but that the figures nuance this picture. And indeed, at the bottom of the list we find Malta, Greece and Spain. The article rightly points out that the figures should be viewed in the light of each country's population, but the figures in the chart are not given in relative terms. Italy is used as the example, but unfortunately it sits in the upper, better, half of the chart. The article further mentions that France also belongs to the ad hoc coalition. But that country, too, is in the upper half of the chart.

Incidentally, I think the journalist should have indicated why some countries are included in the chart and others are not.

This chart can surely be done better. I took the figures and added Eurostat's population figures for 2012. I then expressed the number of asylum applications per million inhabitants. Because areas of circles are hard to interpret, I opt here for a simple bar chart.
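The per-million normalisation can be sketched in a few lines of Python. This is only an illustration: the application counts are the United Kingdom figures quoted above, but the population is a rounded value assumed here, not the exact Eurostat number behind the reworked chart, and treating every application that was not granted as rejected is also an assumption.

```python
def per_million(count, population):
    """Express a count per million inhabitants."""
    return count / population * 1_000_000

# Figures quoted in the post for the United Kingdom.
uk_applications = 28_200
uk_granted = 14_600

# ASSUMPTION: rounded 2012 population, for illustration only.
uk_population = 63_500_000

granted_rate = per_million(uk_granted, uk_population)
# ASSUMPTION: every application that was not granted counts as rejected.
rejected_rate = per_million(uk_applications - uk_granted, uk_population)

print(f"granted per million inhabitants:  {granted_rate:.1f}")
print(f"rejected per million inhabitants: {rejected_rate:.1f}")
```

Stacking the two rates per country gives one bar per country, which is what makes both the approval ratio and the relative burden visible at once.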

The chart is ordered from the highest relative number of granted asylum applications to the lowest (i.e. the green part of the bar). The rejected asylum applications are shown in red. This way the ratio of approved to rejected applications per country stands out, and it is immediately clear which countries approve relatively many asylum applications (relative to their population) and which do not. The information that you do not see here, and that was present in the chart in De Standaard, is the absolute numbers. That is a drawback, but on the other hand the absolute numbers were not really the subject of the article.

In the reworked chart you can see that the ad hoc coalition dangles at the very bottom of the chart; only Malta probably has a point, and it sits at the very top of this chart. I think the journalist could have told a stronger story had he or she chosen a better graphical representation.

Apart from that, you can also see that, if you leave Malta out of consideration, the northern countries grant relatively more asylum than the southern countries. Note also that Belgium is closer to the Scandinavian countries than to the southern countries. You can also clearly see that among the four countries in the tail, France and Greece receive many more applications than Spain and Italy.

In the light of the latter, I would still like to point out that the figures accompanying the article mean that Spain, for example, would have approved only 600 asylum applications in 2012. In principle that is possible, e.g. if there were an asylum freeze in that country, but that there would have been only 2,600 applications in that year seems hard to believe, certainly when you know that elsewhere in the same day's paper there is mention of hundreds of boat refugees for that day alone. Admittedly, this concerns seven hundred refugees picked up in five separate rescue operations in Italy and not in Spain, but it seems more likely to me that the actual number of asylum seekers in Spain, and probably also in Italy, is much higher than what you might think on the basis of these administrative data. There are probably other channels than this form of naturalisation application for staying in Spain and Italy, but I leave that to the migration specialists.

# All Things Data Science

Istvan Hajnal's musings on Data Science, Big Data, Market Research and Data journalism

## Saturday, October 26, 2013

## Wednesday, October 23, 2013

### Managing Data Scientists

With the rise of the 'Data Scientist', a lot has been said about the definition, role, qualifications and skills of the Data Scientist, and how to hire them. A somewhat neglected topic is how to manage data scientists. Indeed, data scientists, by their very nature, are hard to manage.

They love to resolve problems, but those problems are not always the business problems you want them to tackle. They are ace players, but they're not always the best team players and some of them can sometimes have difficulty in dealing with (higher) management. They can have bright ideas, but they often lose interest when it comes to implementing those ideas in a profit-making activity. They will find clever solutions for you, but they don't always excel in making sure that a structured process is in place, let alone the administrative follow-up that comes with it. Some of them were hired as 'rock-stars' and have developed an ego that goes with that...

On the other hand, they are (sometimes) the 'heroes' of the company, so you need to deal with it; it comes with the territory, as they say. Also, very often you can't apply the usual bag of tricks that 'ordinary managers' can use, simply because these tricks don't always work with them.

If your data scientists are all well behaved in this respect, this blog post is not for you. If you have experienced the issues I described above, read on!

One of the things I picked up early on as a manager was that a good manager should help his people rather than command them. Often I found myself doing things that my reports were asking me to do rather than doing what my manager was asking me to do. Mind you, I would take the general strategy and direction from my manager or people above her/him, but to make it happen I often found it more useful to listen to what people who were closer to reality were saying. I would help them become more efficient in achieving my goals. And my goals were generally the goals of my boss. I've always tried to avoid micro-management and over-reliance on procedures. But I will admit that in some cases I did micromanage and I did emphasize procedures. The thing is that I only did that when a certain unit was in trouble, not when it was successfully achieving its goals.

Another thing I noticed is that data scientists, but also statisticians and some top coders, often have difficulties in accepting orders from managers who don't have technical skills themselves. This does not mean that they would publicly disobey, but rather they would use some technical excuse to do whatever they wanted to do, knowing very well that the manager didn't have the technical knowledge to challenge them. Coming from an IT and statistics background gave me (just enough) credibility to be taken seriously, and that gave me a head start compared to other managers.

But nonetheless, I had my share of problems managing data scientists.

When I was working for a large market research company a few years ago, I had to work with a lot of statisticians and the like. Some of them were direct reports, some of them indirect and sometimes, horror oh horror, we were acting in a matrix organization. I believe I had some credit with them because I was able to speak the same (technical) language as they did. But still I had difficulties in making sure standard procedures and administrative follow-up were done correctly. Now there are two opposite ways to react in such a situation. Either you put all your energy into making sure the administrative procedures are followed, or you let go of any administrative follow-up completely. The former will make it very hard for you to get your ace players on board, because they generally hate this stuff, and the latter might cause problems with higher management, might create chaos and is seldom sustainable. So, as with most things in life, the truth is somewhere in the middle. But how do you prioritize?

When I tried to explain my vision on these things I found it useful to use the following schema:

This rule has helped me focus on the priorities by not trying to force successful people and groups into a very rigid, process-driven structure, but on the other hand it was also a warning for those people and groups that they could only get away with it as long as they were successful. This rule also took some of the fear out of my teams that were in trouble. If they were in trouble but they followed the normal procedures, there was no reason for fear. On the contrary, I would help them in resolving the problem. I'm sure this might have led to some situations that you might call micro-management, but at least it was micro-management applied to dysfunctional groups, and it would leave the successful ones doing whatever they were doing. Essentially there's nothing new in this rule, and I guess you can't apply it to all situations or in all industries.

But for me, it worked.

## Thursday, October 17, 2013

### A small experiment with Twitter's language detection algorithm

Some time ago I captured quite a lot of geo-located tweets for a spatial statistics project I'm doing. The tweets I collected were all confined to Belgium. One of the things I looked at was the language of the tweets. As you might know, Belgium officially has three languages: Dutch, French and German. Of course, when you analyze a large set of tweets, you can't manually determine the language; on the other hand, blindly relying on Twitter's language detection algorithm doesn't feel good either.

That's why I set up a little experiment to assess to what extent Twitter's language detection algorithm can be trusted, in the context of my geo-location project. I stress this because I don't have the ambition to make overall judgments on how Twitter takes care of language detection.

First, let's look at the languages, as determined by the Twitter language detection algorithm, of the 150,000 or so tweets I collected. The bar chart below shows the frequency of each of the languages.

I'm not sure if this chart is readable enough, so let me guide you through it. The green bars are the 3 official languages of Belgium: Dutch, French and German. French and Dutch take the top positions; German is in seventh position. Based on population figures you would expect more Dutch posts than French posts, while this data shows the opposite. There can be many good reasons why this happens. To start with the obvious, the Twitter population is not the general population, and hence the distribution of languages can be different as well. Another obvious reason is that tweets can also come from foreigners, tourists for instance. While the sample is large (about 150,000 tweets), I need to rely on Twitter to provide a good sample of all tweets, and I'm not too sure about that. Also, it might be possible that Dutch-speaking Belgians tweet more in English than their French-speaking counterparts. And finally, it is possible that the Twitter detection algorithm is more successful in detecting some languages than others.

The fact that English (the blue bar) comes in third will not come as a surprise. Turkish is fourth (the top red bar), which can be explained by the relatively large immigrant population coming from Turkey. The other languages, such as Spanish and Portuguese (the remaining red bars), decrease quite rapidly in terms of frequency. But notice that the scale of the chart is somewhat deceiving, in that the lower-ranked languages such as Thai and Chinese, which are barely visible in the chart, still represent 40 and 20 tweets respectively. Overall this looks like another example of a power law, where we see that a few languages are responsible for the vast majority of tweets, while a large number of languages are used in the remaining tweets.

You will have noticed that the fifth most important language, the orange bar, is "Undecided"; these are the tweets where the Twitter detection algorithm was not able to detect which language was used. Two other cases stand out (purple bars): in positions 9 and 10 are Indonesian and Tagalog. Tagalog is an Austronesian language spoken in the Philippines. In a blog post on the Twitter languages of London, Ed Manley (@EdThink) noticed that Tagalog came in seventh place in London. He writes:

> One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language. On further investigation, I found that many of these classifications included just uses of English terms such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’. I don’t know much about Tagalog but it sounds like a fun language.

Here are the first eight Tagalog-tagged tweets in my dataset:

- @xxx hahaha!!!
- @xxx hahaha
- @xxx das ni goe eh?
- @xxx hahaha
- SUMBARIE !
- Swedish couple named their kid "Brfxxccxxmnpcccclllmmnprxvclmnckssqlbb11116." The name is pronounced "Albin.
- #LRT hahahahahaha le salaud
- hahah

*(My thoughts go to the poor researchers in the Philippines who must face quite a challenge when they analyze Twitter data. On the other hand, they now have yet another good reason not to touch Twitter data ;-)*

Back to the experiment. I took a simple random sample of 100 tweets and asked 4 coders (including myself) to determine in what language a tweet was expressed. I gave the coders only minimal instructions in an attempt not to influence them too much. I did provide them with a very simple 'coding scheme', based on the most common languages (Dutch, French, or English, and a category for both the cases where the coder was not able to determine the language used and all other languages). Now, this might sound like a trivial exercise, but a tweet like "I'm at Comme Chez Soi in Brussel" can be seen as English, French or Dutch, depending on how you interpret the instructions.

This results in a data matrix consisting of 100 rows and 5 columns (i.e. the language assessments of Twitter and the 4 coders). A data scientist will immediately start to think about how to analyze this (small) dataset. There are many ways of doing that. Let's first start with the obvious, i.e. comparing the Twitter outcome with one of the coders. You can easily represent that in a frequency table:

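| Twitter (rows) / coder 1 (columns) | EN | FR | NL | WN |
| --- | --- | --- | --- | --- |
| EN | 14 | 2 | 1 | 2 |
| FR | 3 | 34 | 0 | 3 |
| NL | 1 | 0 | 24 | 0 |
| WN | 5 | 5 | 1 | 5 |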

The rows represent the language of a tweet according to Twitter (EN=English, FR=French, NL=Dutch and WN=Don't know or another language). The columns represent the language according to the first coder. We now have different options. Some folks do a chi-square test on this table, but this is not without problems. To start with, testing the hypothesis of independence is not necessarily relevant for assessing the agreement between two coders, and we can get into trouble with zero or near-zero cells and marginals. Either way, here are the results for such a test:

X-squared = 136.6476, df = 9, p-value < 2.2e-16

As the $p$-value is smaller than the usual 0.05, we would reject the null hypothesis and thus accept that the two coders are not independent and hence somehow 'related'. Again, this seems to be a rather weak requirement given the coding task at hand. Also, $\chi^2$ is sensitive to sample size, so simply increasing the number of tweets would eventually lead to significance if we hadn't reached it at $n=100$.
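Both points can be checked with a small sketch. The post's own numbers come from R; the plain-Python version below is only an illustration (the function name `chi_square` is made up here). It recomputes the Pearson statistic from the frequency table and then multiplies every cell by ten, a hypothetical larger sample with exactly the same proportions, to show how the statistic grows with $n$.

```python
# Frequency table from the post: rows = Twitter, columns = first coder,
# categories in the order EN, FR, NL, WN.
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]

def chi_square(tab):
    """Pearson chi-square statistic for a two-way table of counts."""
    n = sum(sum(row) for row in tab)
    row_totals = [sum(row) for row in tab]
    col_totals = [sum(col) for col in zip(*tab)]
    stat = 0.0
    for i, row in enumerate(tab):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

chi2 = chi_square(table)
df = (len(table) - 1) * (len(table[0]) - 1)

# HYPOTHETICAL: the same proportions with ten times as many tweets.
bigger = [[10 * cell for cell in row] for row in table]
chi2_bigger = chi_square(bigger)

print(f"chi-square = {chi2:.4f}, df = {df}")
print(f"chi-square with 10x the counts = {chi2_bigger:.4f}")
```

The agreement between Twitter and the coder is identical in both tables, yet the statistic is ten times larger in the second one, which is why the raw $\chi^2$ value says little about the strength of agreement.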

One of the alternatives for that is to normalize the $\chi^2$-statistic somehow. There are many ways to do that; one approach is to divide by the sample size $n$ and the number of categories (minus 1). This is called *Cramér's V*:

$$r_V=\sqrt{{\chi^2 \over n \times \min[R-1, C-1]}},$$

where $C$ is the number of columns and $R$ is the number of rows.
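As a quick check of this formula, here is a minimal Python sketch that plugs in the $\chi^2$ value reported earlier for the $4 \times 4$ table with $n=100$. The function name `cramers_v` is just an illustrative choice.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: chi-square normalised by n and min(R - 1, C - 1)."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Values from the post: chi-square of the 4x4 table, 100 tweets.
v = cramers_v(136.6476, n=100, n_rows=4, n_cols=4)
print(f"Cramér's V = {v:.4f}")
```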

*Cramér's V* is often used in statistics to measure the association between two categorical variables. If there is no association at all it becomes 0, and perfect association leads to 1. In this example $R=C=4$ because we consider 4 language categories, which then results in $r_V=0.6749016$.

Sometimes simpler or at least more obvious approaches are used, such as taking the proportion of the items for which the two coders agreed. If we assume that both coders have used the same number of categories $G=R=C$, we can formalize this with:

$$r_{pca}= {\sum_{i=1}^G f_{ii}\over n}$$.

In the example this results in $r_{pca}=0.77$. So in more than three quarters of tweets, Twitter and the first coder agree on the language.
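The proportion of agreement is simply the diagonal of the frequency table divided by the total, as this small Python sketch shows (the function name is illustrative):

```python
# Frequency table from the post: rows = Twitter, columns = first coder,
# categories in the order EN, FR, NL, WN.
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]

def proportion_agreement(tab):
    """Share of items on the diagonal, i.e. where both coders agree."""
    n = sum(sum(row) for row in tab)
    agreed = sum(tab[i][i] for i in range(len(tab)))
    return agreed / n

r_pca = proportion_agreement(table)
print(f"r_pca = {r_pca:.2f}")
```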

The drawback here is that we don't account for chance agreement.

*Cohen's* $\kappa$ is an alternative that does. Such a correction is generally made by subtracting the statistic's expected value from the original statistic and dividing by the maximum value of that statistic minus the expected value. In the case of *Cohen's* $\kappa$ this results in:

$$r_\kappa={r_{pca} - E(r_{pca})\over 1-E(r_{pca})},$$

with

$$E(r_{pca})={\sum_{i=1}^G{f_{i.}\times f_{.i}\over n}\over n},$$

in which $f_{i.}$ and $f_{.i}$ are the marginal frequencies of category $i$ for the rows and the columns respectively. Calculating this for our example yields $r_\kappa=0.6766484$.
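The chance correction can be followed step by step in a short Python sketch (an illustration only; the post's own calculations were done in R). Expected agreement is built from the row and column totals of the same category, and the observed agreement is then rescaled.

```python
# Frequency table from the post: rows = Twitter, columns = first coder,
# categories in the order EN, FR, NL, WN.
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]

def cohens_kappa(tab):
    """Cohen's kappa for a square table of two coders' judgments."""
    n = sum(sum(row) for row in tab)
    row_totals = [sum(row) for row in tab]
    col_totals = [sum(col) for col in zip(*tab)]
    observed = sum(tab[i][i] for i in range(len(tab))) / n
    # Chance agreement: product of the two marginals of the same category.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa(table)
print(f"kappa = {kappa:.7f}")
```

For this table the expected agreement is $2887/10000 = 0.2887$, so roughly 29 of the 77 agreements could have arisen by chance alone.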

Yet another interesting alternative is the family of approaches that consider the ${n \choose 2}$ pairs of judgments rather than the $n$ judgments directly. This approach is popular in the cluster analysis and psychometrics literature, with indexes such as the *Rand Index* and all sorts of variations on that index, such as the *Hubert and Arabie Adjusted Rand Index*. Recently I stumbled on a very interesting article, "*On the Equivalence of Cohen’s Kappa and the Hubert-Arabie Adjusted Rand Index*" in Journal of Classification by Matthijs J. Warrens, that I recommend very strongly.

But one of the issues that is tackled less often in the literature is the fact that in this type of situation we often have more than one coder or judge. The classical approach is then to calculate all pairwise combinations and take a decision from there.
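To make the pair-based idea concrete, here is a hedged Python sketch that computes the Rand Index and the Hubert-Arabie Adjusted Rand Index for Twitter versus the first coder, directly from the frequency table. These two values are not reported in the post; the function name and the choice to work from the table rather than from the raw judgments are assumptions made for the illustration.

```python
from math import comb

# Frequency table from the post: rows = Twitter, columns = first coder.
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]

def rand_indices(tab):
    """Return (Rand Index, Hubert-Arabie Adjusted Rand Index)."""
    n = sum(sum(row) for row in tab)
    pairs = comb(n, 2)
    # Pairs of tweets that received the same label from both sources.
    same_both = sum(comb(cell, 2) for row in tab for cell in row)
    # Pairs given the same label by the row source / the column source.
    same_rows = sum(comb(sum(row), 2) for row in tab)
    same_cols = sum(comb(sum(col), 2) for col in zip(*tab))
    # Pairs that both sources keep apart (inclusion-exclusion).
    apart_both = pairs - same_rows - same_cols + same_both
    rand = (same_both + apart_both) / pairs
    expected = same_rows * same_cols / pairs
    adjusted = (same_both - expected) / (0.5 * (same_rows + same_cols) - expected)
    return rand, adjusted

rand, adjusted = rand_indices(table)
print(f"Rand Index = {rand:.4f}, Adjusted Rand Index = {adjusted:.4f}")
```

The unadjusted index counts every pair on which the two sources agree, whether they put the two tweets together or keep them apart; the adjusted version subtracts the agreement expected from the marginal totals, in the same spirit as the chance correction in $\kappa$.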

Incidentally, there are a few areas in research where multiple coders are often used, e.g. in qualitative research. Indeed, qualitative research has a long tradition of handling situations where 'subjectivity' can play an important role. Very often this is done, amongst others, by using multiple coders. The literature on the methodology is quite separate from the mainstream statistical literature, but nonetheless there are some interesting things to learn from that field. One of the popular indices in qualitative research is *Krippendorff's* $\alpha$.

In Content Analysis, *reliability data* refers to a situation in which independent coders assign a value from a set of instructed values to a common set of units of analysis. This overall reliability or agreement is expressed as:

$$\alpha=1-{D\over E(D)},$$

in which $D$ is a disagreement measure and $E(D)$ its expectation, and the details of the calculation would lead us too far. A simple example is available on the wikipedia page.

The index can be used for any number of coders, it deals with missing data, and can handle different levels of measurement such as binary, nominal, ordinal, interval, and so on. It claims to 'adjusts itself to small sample sizes of the reliability data'. It is not clear to me where and to what extent these claims are proven. Nonetheless in practice this index is used to have one single coefficient that allows to compare reliabilities 'across any numbers of coders and values, different metrics, and unequal sample sizes'.
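For readers who want to see the mechanics, below is a minimal Python sketch of $\alpha$ for nominal data without missing values, built on the usual coincidence-matrix formulation. It is a sketch under stated assumptions, not the implementation used for this post. Because the individual judgments of all coders are not listed here, the example rebuilds only the Twitter and first-coder judgments from the frequency table, so its result applies to those two coders only and differs from any coefficient computed over all five.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Alpha for nominal labels; `units` holds one tuple of labels per unit.

    ASSUMPTION: no missing values, every unit coded by at least two coders.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit coded once carries no agreement information
        # Every ordered pair of judgments within a unit, weighted 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    return 1 - observed / expected

# Rebuild the (Twitter, first coder) judgments from the frequency table.
categories = ["EN", "FR", "NL", "WN"]
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]
units = [
    (categories[i], categories[j])
    for i, row in enumerate(table)
    for j, count in enumerate(row)
    for _ in range(count)
]

alpha = krippendorff_alpha_nominal(units)
print(f"alpha (Twitter vs. first coder) = {alpha:.4f}")
```

With more coders, each unit's tuple simply gets longer; the weighting by $1/(m-1)$ keeps every judgment counted once regardless of how many coders rated the unit.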

I used the *irr* library in the R language to calculate *Krippendorff's* $\alpha$ for all 5 coders, which resulted in $0.796$, just below the threshold of $0.8$ commonly used in the social sciences. So we can't claim that all coders, including Twitter, agreed completely on the language detection task; on the other hand, we are not too far off what would be considered good.

There were 84 tweets on which all 4 human coders agreed. In 71 of those 83, Twitter came up with the same language as the human coders. That's about 85%. That's not excellent, but it's not bad either.

Let's take a look at a few examples where all 4 human coders agreed, but Twitter didn't:

1. Deze shit is hard
2. @xxxx Merci belle sœur
3. @xxxx de domste is soms ook de snelste
4. Just posted a photo @ Fontein Jubelpark / Fontaine Parc du Cinquantenaire
5. Mddrrr j'ziar ..!!
6. @xxxx ADORABLE!
7. OGBU EH! Samba don wound Tiki Taka. The Champs are back!
8. I'm at Proxy Delhaize (Sint-Gillis / Saint-Gilles, Brussels)

Examples 1, 4 and 8 seem intrinsically hard because there is no single correct answer, so we can't hold those against Twitter. Examples 2, 3 and 6 seem to be very straightforward cases that Twitter didn't capture. Example 5 was catalogued as French by Twitter, while the human coders put it in the rest/Don't know category.

All in all I believe that the number of obvious mistakes is not too high, although that assessment, of course, depends on the type of application. I can very well imagine that for some applications this is not good enough.

__Based on all the different indices, interpretations and examples, my conclusion is that for my spatial statistics project, the Twitter language detection algorithm is not perfect, but good enough. I will use the language suggestion, but only after regrouping and after making sure that Tagalog and the like are recoded towards 'undecided'.__

## Sunday, September 15, 2013

### De Moivre's equation and the solar panels of Lo-Reninge

A few weeks ago I saw an innocent little article on solar panels in the Flemish quality newspaper '*De Standaard*', entitled "*Niemand maakt meer zonne-energie dan inwoners Lo-Reninge*", which roughly translates to "*no one produces more solar energy than the inhabitants of Lo-Reninge*". The article reports on the production of solar energy by individual households, typically produced by small installations on rooftops. The Flemish authorities support solar energy by subsidizing households that install solar panels. An important part of the subsidies is handled by issuing so-called 'Green certificates' (or renewable energy certificates) per fixed amount of kilowatt-hours (kWh) produced. See here for more details on solar power in Belgium. De Standaard, citing data from the Flemish Regulator of the Electricity and Gas market (VREG), reported on the number of these certificates issued in 2012 relative to the number of inhabitants per municipality. Lo-Reninge came out as number one.

If you're not from Belgium, you've probably never heard of Lo-Reninge. Well, I live in Belgium, there are only slightly more than 300 municipalities in Flanders, and I had never heard of Lo-Reninge either. And it's exactly that which caught my attention. Here's the top 10 reported by the newspaper:

Municipality | Certificates per inhabitant |
---|---|
Lo-Reninge | 0.39210 |
Peer | 0.34080 |
Opglabbeek | 0.34051 |
Bocholt | 0.32443 |
Zuienkerke | 0.32198 |
Vleteren | 0.31637 |
Balen | 0.31401 |
Nieuwerkerken (Limburg) | 0.31060 |
Wellen | 0.30895 |
Alveringem | 0.30774 |

In fact, in this top 10, only Balen has more than 20,000 inhabitants. This gives the impression that very small municipalities have more solar panels than larger ones. In general there is indeed a relationship between the size of a municipality and the production of solar energy with small installations, but that relationship is more subtle than this table seems to suggest. The problem here is that this real effect is confounded with what Howard Wainer calls "The most dangerous equation", or "de Moivre's equation". But before we come to that, let me give you an artificial example that demonstrates the effect more easily.

Suppose that the journalist of '*De Standaard*' visited each and every municipality in Flanders and asked every inhabitant to flip a coin. Similar to the solar panel case, where the count of certificates per municipality was divided by the number of inhabitants, he could now report the number of heads divided by the number of inhabitants. Most of us would expect these figures to be close to 0.50, and we would be right. But there's more to this than meets the eye. What would happen if he had reported the top 10 of municipalities with the highest proportion of heads?

I did this through simulation, and here's my top 10:

1. Zuienkerke
2. Herne
3. Oudenburg
4. Baarle-Hertog
5. Veurne
6. Ledegem
7. Herenthout
8. De Haan
9. Kortessem
10. Vleteren

No Lo-Reninge this time, but beer drinkers might remember that our number 10, Vleteren, of Westvleteren Brewery fame, was number 6 in the solar panel competition. Also, our number 1, Zuienkerke, was number 5 in the *De Standaard* article. While Lo-Reninge did not appear in our coin flipping top 10 this time, it most certainly would if we repeated the exercise a few times. I simulated 1000 repeats of the coin flipping competition and Lo-Reninge ended up 164 times in the top 10 of municipalities with the highest proportion of 'heads'. It also ended up 155 times in the worst 10 (i.e. those with the lowest proportion of 'heads').

Is there a magical correlation between having solar panels and coin flipping? Of course not. If you know something about statistics, Flemish geography or beer, you will by now have realized that some of the municipalities involved are so small that they are more likely than others to show extreme numbers. Take for instance Herstappe, Flanders' smallest municipality, with fewer than 100 inhabitants. In my 1000 repeats of the coin flipping competition, Herstappe ended up 462 times in the top 10 and 416 times in the bottom 10. So in only 12.2% of the repeats did Herstappe end up in neither the top 10 of the most 'heads' nor the top 10 of the most 'tails' (out of about 300 municipalities in Flanders). What we see here is de Moivre's equation at work:
$\sigma_{\overline x}= {\sigma \over\sqrt{n} },$

in which $\sigma_{\overline x}$ is the standard error of the mean, $\sigma$ is the standard deviation of the sample and $n$ is the sample size.
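The coin flipping competition is easy to reproduce in miniature. The Python sketch below is not the simulation I ran: the population sizes are hypothetical (100 made-up municipalities, log-spaced between 100 and 50,000 inhabitants, rather than the real Flemish figures), the function name `simulate_extremes` is invented for this example, and the number of repeats is kept at 200. It counts how often each municipality ends up in the top 10 or the bottom 10.

```python
import random


def simulate_extremes(populations, repeats=200, top=10, seed=1):
    """Count how often each municipality lands in the top or bottom `top`
    when every inhabitant flips a fair coin once per repeat."""
    rng = random.Random(seed)
    hits = [0] * len(populations)
    for _ in range(repeats):
        # Heads among n fair flips = number of 1-bits in n random bits.
        props = [bin(rng.getrandbits(n)).count("1") / n for n in populations]
        # Indices sorted from lowest to highest proportion of heads.
        order = sorted(range(len(populations)), key=props.__getitem__)
        for i in order[:top] + order[-top:]:
            hits[i] += 1
    return hits


# Hypothetical population sizes, NOT the real figures for Flanders.
populations = [round(100 * 500 ** (i / 99)) for i in range(100)]
hits = simulate_extremes(populations)
print("smallest municipality in an extreme 10:", hits[0], "of 200 repeats")
print("largest municipality in an extreme 10: ", hits[-1], "of 200 repeats")
```

Every coin is fair, so any difference between the two counts comes purely from the $\sqrt{n}$ in the denominator of the equation above.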

We can further illustrate this by plotting the proportion of heads per municipality on the y-axis and the number of inhabitants on the x-axis for a typical run of such an experiment.

The mean is indicated by the red line. As expected, the outcomes of the municipalities are all scattered around the overall mean (and close to the theoretically expected 0.50), but there is much more dispersion around that mean in the small municipalities than there is in the larger ones. We can use de Moivre's equation above to calculate the standard error as a function of $n$ and then plot the usual plus and minus 2 times that standard error around the mean. This results in the blue lines. As you can see, most outcomes of our experiment fall within what you could reasonably expect.
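To get a feel for how quickly such limits narrow, here is a small Python sketch of the calculation for a fair coin, where $\sigma = \sqrt{0.5 \times 0.5} = 0.5$. The function name `two_se_band` and the three population sizes are chosen for illustration only.

```python
import math


def two_se_band(n, p=0.5, k=2.0):
    """Return (lower, upper) = p -/+ k * sigma / sqrt(n) for a proportion,
    where sigma = sqrt(p * (1 - p)) is the standard deviation of one flip."""
    se = math.sqrt(p * (1 - p)) / math.sqrt(n)
    return p - k * se, p + k * se


# Illustrative population sizes, not taken from the actual data.
for n in (100, 10_000, 500_000):
    lower, upper = two_se_band(n)
    print(f"n = {n:>7}: {lower:.4f} to {upper:.4f}")
```

With 100 inhabitants anything between 0.40 and 0.60 is unremarkable, while with 10,000 inhabitants the band shrinks to 0.49 to 0.51.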

The solar panel case is a bit more complicated. The graph below shows the number of certificates issued in 2012 per municipality relative to the number of inhabitants. It's difficult to see from the plot right away, but the LOESS regression, summarized by the green line, indicates that there is indeed a relationship between the population size of a municipality and the (relative) usage of solar panels: the larger a municipality (in number of inhabitants), the lower the number of certificates per inhabitant.

But you can also see that there is a clear relationship between the variability of the ratios and the number of inhabitants of a municipality. For those of you who care, the dot on the far right (no pun intended) is Antwerp and the dot in the middle (at around 250,000) is Ghent.

From a data journalism point of view I don't think you can blame the journalist here. To start with, the solar panel case is less clear-cut than the coin flipping example because there really is a relationship between the number of inhabitants and the usage of solar panels. But also, I know a few statisticians who don't always fully realize what the consequences of de Moivre's equation are. And, in my younger days, I got fooled a few times myself.

Finally, there are some well-known cases where the consequences of the misunderstanding were much more serious. Howard Wainer lists some interesting cases in the article that I referred to earlier. Here, I'll just pick one of those examples. Based, among other things, on 'the observation that among high-performing schools, there is an unrepresentatively large proportion of smaller schools', there was, by the end of the last century, a growing movement to support smaller schools. The Bill and Melinda Gates Foundation, for instance, was offering grants to education projects that supported smaller schools. Howard Wainer and Harris Zwerling showed that the smaller schools were overrepresented not only in the high-performing group, but also in the low-performing group, which is consistent with what we would expect from de Moivre's equation. Taking this into consideration, they found that, overall, students at bigger schools do better. In the meantime, the Gates Foundation has announced that it was 'moving away from its emphasis on converting large high schools into smaller ones'.

Just two more things regarding the solar panel case. To start with, I noticed that either VREG or the journalist has used the population figures of the 1st of January 2011, and not those of the 31st of December of that same year. The latter would have made more sense, since the certificates issued in 2012 were considered. Secondly, in my effort to replicate the calculations of VREG or the journalist, I could not help but notice that my top 10 matched the table published in De Standaard up to 4 or 5 decimals. The only exception was that I have Kinrooi listed as number two with a ratio of 0.34318, while, according to the map in De Standaard, the number of certificates relative to the number of inhabitants of Kinrooi is a meager 0.1133. A quick check reveals that the number of inhabitants of neighboring Maaseik was used instead of that of Kinrooi.

To conclude, I wouldn't go as far as saying that *De Standaard* found a rather convoluted way to track Flanders' smallest municipalities, but the least you could say is that the top 10 presented is probably strongly influenced by the number of inhabitants rather than being a pure reflection of usage of solar panels.

## Friday, July 26, 2013

### An introduction to probability theory with Elvis Costello

Last week I released a paper entitled "The Generalized $S^3$-problem. A probabilistic view on Elvis Costello's Spectacular Spinning Songbook". You can find the pdf here.

The paper is a bit of a parody of statistical papers, so it shouldn't be taken too seriously. But at the same time it gives a very gentle introduction to some concepts of probability theory (Laplace, independence, the birthday paradox, ...).

Enjoy!

## Wednesday, July 10, 2013

### Are partygoers in Belgium using more cocaine?

Last week the Belgian newspaper De Morgen ran an article on drug use amongst Belgian partygoers. The headline of the article was "Partygoers use less cannabis and more cocaine" ("Minder cannabis, meer cocaïne bij feestvierders"). The graph that accompanied the article looked like this:

While this is in Dutch, the language of drugs is universal, so I'm sure you will have no difficulty in understanding what it says. There are a couple of remarks to make on this graph:

- While there are small grey bars between the 3 groups, Alcohol/Cannabis, Xtc/Cocaine and LSD/GHB/Ketamine, I was initially fooled into thinking they were all using the same Y-axis. They're not, so you need to be careful to take scale into account.
- Secondly, at first glance there seems to be a drop in cannabis use, but the increase in cocaine that was mentioned in the title is less clear-cut (no pun intended).
- Thirdly, alcohol use seems to decline as well, although this is difficult to judge without significance tests.
- Fourthly, just to prove my nerdiness, the label for LSD in 2009 is missing.
- Finally, what the hell happened in 2007? All types of drugs studied increased compared to the 2005 study. OK, I celebrated my 40th birthday in 2007, but surely that can't explain all of this ;-)

Before I discuss the details here, let me first say that, as far as I can judge, the journalist did an OK job in writing up the article based on the press release, so this time I'm not blaming the journalist. Secondly, the study was carried out by the VAD, a non-profit association for alcohol and other drug problems that has a decent reputation for carrying out these types of studies. Thirdly, the VAD publishes papers on its methodological approach (see here for a detailed methodological note on this particular study) and VAD collaborators publish in scientific journals (see for instance here). And finally, from a methodological perspective, studying drug-related issues is notoriously difficult.

Nonetheless, there are some things that look strange to me. First, although I can't prove it, I don't __believe__ the 2007 figures are correct. I stress the word believe here because without further information I can't judge it. I scanned through the reports and could not find much explanation for 2007, although the fact that the 2007 figures were odd was acknowledged in several places in the report (which is good). The only explanation I could find was at some point in the report where they admitted that the 2007 figures were influenced by the 'special' group of respondents they had in 2007.

This observation, together with the scale issue mentioned above, led me to make a similar chart, but using the same scale throughout and with the year 2007 interpolated from 2005 and 2009:

In this graphical representation the variation between the years is put in a different perspective. Notice that:

- We assume the 2007 data was indeed flawed.
- We slightly understate the actual variation between years by interpolating the 2007 figures.

Based on a visual inspection of the new graph I would be inclined to say that cannabis use has dropped, but cocaine has remained more or less at the same level.

Of course that is just visual inspection. According to the methodological note of VAD the sample sizes were between 600 and 700. In 2012 the sample size was 618.

I've learned from Dries Benoit that, from a Bayesian perspective you need to be careful with classical confidence intervals in cases like this (see here for a Dutch blog post on this subject), but nonetheless, a 95% confidence interval for the proportion of cocaine use amongst partygoers in 2012 would be between

$$0.136 - 1.96 \sqrt{{0.136\times0.864 \over 618}}=0.10897$$

and

$$0.136 + 1.96 \sqrt{{0.136\times0.864 \over 618}}=0.16303$$.

All of the previously observed proportions, except in the odd year of 2007, were within that confidence interval, so I would be more inclined to say that over the last 10 years cocaine use amongst partygoers has remained at the same level, which is exactly the opposite of the title of the De Morgen article.
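The interval above is the classical normal-approximation (Wald) interval for a proportion, and it is easy to reproduce. The Python sketch below uses the figures quoted in this post (a proportion of 0.136 and a sample size of 618); the function name `wald_interval` is just a label chosen for this example.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion:
    p_hat -/+ z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


lower, upper = wald_interval(0.136, 618)
print(f"{lower:.5f} to {upper:.5f}")  # 0.10897 to 0.16303
```

Any earlier proportion can then be checked against the interval with a simple `lower <= p <= upper`.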

## Saturday, June 22, 2013

### Visualization errors don't bother "De Morgen"

On Wednesday 19 June 2013, De Morgen published an article under the headline "*Crisis deert superrijken niet*" ("Crisis doesn't bother the super-rich"). One of the two graphs accompanying the article deserves a closer look. Here is the graph in question:

To make the text a little more readable for this blog, I have adapted the graph slightly:

Do note that you have to go by the length proportions in the first graph.

The first thing that stands out is that the lengths of the two smallest bars are not in proportion to the blue numbers (that is, the frequencies). For the highest frequency there is still an excuse, because a so-called scale break is shown there (i.e. the interruption halfway up the bar with the highest frequency). As the graph stands, a scale break should also have been placed on the bar for 1,068,500, but since the height of the first bar is arbitrary relative to the frequency it represents, two scale breaks in a graph with three bars would be completely absurd.

There is, however, (presumably) another problem with this graph. It seems very unlikely that the labelling in this graph is correct. If we read the graph from left to right (and thus from the smallest to the largest frequency), the labels are, respectively, "More than 30 million dollars", "1 to 5 million dollars" and "5 to 30 million dollars". This goes against the way wealth is distributed in the Western world. Roughly speaking, there are fewer people in the higher income groups. There is no reason why this would be different when three subgroups are distinguished within the rich group (rich, very rich and super-rich). In theory it could happen, depending on the categories used, but I would be very surprised if CapGemini and RBC Wealth Management had worked in such a counter-intuitive way. I therefore strongly suspect that the labels of the two highest frequencies were simply swapped in De Morgen.

What else can you do, besides correcting the labels, to improve the graph? Well, if this were not for a newspaper but for a more scientific publication, you would probably present the frequency on a logarithmic scale:

This graph shows clearly that the frequency drops by a factor of 10 when you move to a higher wealth category. But for a newspaper a logarithmic scale does not seem appropriate to me. Most readers only glance at such a graph very briefly, and quite a few readers are probably not (or no longer) familiar with this kind of representation. My preference would therefore go to the simplest representation, namely:

This representation still shows that the frequency drops by roughly a factor of 10 when you move to a higher wealth category, and it doesn't even use a logarithmic scale. The drawback is that the last bar is so flat that you can do little with it visually, which, admittedly, can matter for a newspaper.

__Conclusion:__ *I realize that I sometimes nitpick when it comes to numbers in newspapers, but this graph in De Morgen has no added value whatsoever. Worse, because of the (presumed) error in the labelling of the graph, the added value is negative. I have the impression that at De Morgen graphs literally serve as page filler, and that the way they are presented is determined only by the size and position of the space to be filled, rather than by the best possible representation.*

*I would therefore advise De Morgen to put together a small group within the newsroom that can support the author of an article and the graphics department in presenting figures correctly. They need not even be statisticians. There are plenty of political scientists, sociologists, communication scientists and psychologists who gained sufficient experience with this during their studies. I am sure there are a few young talents in the De Morgen newsroom who could easily take on this task. You probably won't become "The Guardian" right away, but you can avoid the most embarrassing cases.*