Today has started like many of my days: with an email from someone declaring a truth on the back of some numbers that they have assessed. I use the word "assessed" here loosely, to indicate that they took a number at face value and did nothing to understand it beyond basic a/b = c maths.
In this case they have undertaken a basic analysis of counting the volume of different <somethings>. They've put each of these <somethings> into a category, then, by the joy of Excel, they've calculated the percentage of those <somethings> that meet the criteria they want to assess.
I’m talking about <somethings> because I don’t want to call out the particular numbers – for the rest of the post I’ll use foodstuffs to illustrate my challenges.
Creating a percentage is all well and good, isn’t it? It’s straightforward maths after all, what could be clearer? Except, what they have inadvertently done is create a meme that contains a mistruth that will take many cycles to rectify. As the old saying goes “a lie can travel halfway around the world before the truth can get its boots on.”
These mistruths take many different forms, their perniciousness coming from their ability to hide undetected within the details of the calculations. They lie there hidden in plain sight making fools of everyone willing to accept them.
In many cases the numbers would be obviously wrong if people knew where to look. Looking requires a certain level of scepticism, but it pays off in the long run. Here are a few of the places where my own scepticism leads me when people present me with numbers.
Using apples as an example, perhaps the simplest way of classifying them is as either red or green. The challenge with this classification system is that there are apples which are neither wholly green nor wholly red. How should they be classified?
There are several ways that you could go, and this is where the purpose of the classification is important to understanding the validity of the information being portrayed.
If your aim was to show that red apples were more popular than green ones, you could classify every apple with any red colouring as red. This wouldn't be untrue; it would just be stretching the definition of red.
Another way would be to create a classification for the ones in the middle; let's call it pigmented. Even then you run into the same problem: how pigmented does something need to be to fall into this classification?
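To make the threshold problem concrete, here's a minimal sketch. The apples, their "redness" fractions, and the cut-off values are all invented for illustration:

```python
# Hypothetical apples, each scored by the fraction of its skin that is red.
apples = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.1, 0.05]

def classify(redness, threshold):
    """Classify an apple as 'red' or 'green' using a single cut-off."""
    return "red" if redness >= threshold else "green"

# The same fruit, two thresholds, two very different headlines.
for threshold in (0.5, 0.8):
    reds = sum(1 for a in apples if classify(a, threshold) == "red")
    print(f"threshold {threshold}: {100 * reds / len(apples):.0f}% of apples are 'red'")
```

Neither answer is wrong; the percentage is simply downstream of a classification decision that rarely gets reported.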
Our motives for the classifications that we choose are complex, sometimes known and often unknown.
Within the UK, based on a European regulation, many adverts that make a statistical claim need to justify that claim. As such it's common to see, at the bottom of the screen for goods like cosmetics, something like "XX of YY customers agreed that blah".
What's interesting about these claims is how often the YY in this claim is tiny. Huge brands that sell to millions make claims on the basis of a few hundred participants at most. There are many times when the sample is fewer than a hundred.
As the number of participants falls, the influence of each one increases massively.
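The arithmetic behind that is simple enough to sketch (the sample sizes here are invented): each respondent is worth 100/n percentage points of the headline figure.

```python
def points_per_person(sample_size):
    """Percentage points of the headline figure that one respondent controls."""
    return 100 / sample_size

# One person changing their answer moves the claim by this much:
for n in (2000, 200, 50):
    print(f"n={n:>4}: one respondent is worth {points_per_person(n):.2f} points")
```

At n=50, just two people having an off day shifts an "87% agreed" claim to 83%.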
Often the number of participants is a strange figure, which makes me suspicious that they've only surveyed the number of people required to justify the claim they are making.
(The claim is often completely subjective. It’s not that big an influence on me to know that 87% of people said that their skin was more luminous.)
We do the same in business. We try to base decisions with long-term consequences on the tiniest samples. "We've succeeded in doing this for one customer, so it will be brilliant for all of our other customers." It's a stretch for anti-ageing cream; it's no less a stretch for our latest product.
Most samples of data require some level of cleansing. The world is full of data, and most of it is littered with inconsistencies. It's therefore necessary to clean the data up, and the easiest way of doing that is to exclude the bits that are outliers. The alternative approach is to only count the things that fit our criteria.
People don't like to see "other" as a classification; it's messy and raises questions. Far better to just exclude them.
The problem with excluding some of the data is that it makes the other numbers appear larger when our old friend the percentage is used.
Let’s take types of nuts as an example. If we have a bag of mixed nuts and we separate them out into the various types we may come up with a sample a bit like this:
If we include all of the types above in the scoring then the following is true:
| Nuts | Volume | Percentage of Total |
|------|--------|---------------------|
As we all know, a peanut is not actually a nut; it's a legume. It may be present in the bag of mixed nuts, but in our data we can justifiably decide that it's erroneous. That exclusion has a significant impact on the other numbers:
| Nuts | Volume | Percentage of Nuts |
|------|--------|--------------------|
I can now claim that a third of the nuts in the bag were walnuts, can't I? But I cannot claim that walnuts were a third of the bag of mixed nuts.
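The effect is easy to reproduce. The counts below are invented, but they show how excluding one category inflates every remaining percentage:

```python
# Invented counts for a hypothetical bag of mixed nuts.
bag = {"walnuts": 10, "almonds": 10, "cashews": 10, "peanuts": 15}

def percentages(counts):
    """Express each category as a percentage of the total of the counts given."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}

of_bag = percentages(bag)                                        # peanuts included
of_nuts = percentages({k: v for k, v in bag.items() if k != "peanuts"})

print(f"walnuts as a share of the whole bag: {of_bag['walnuts']:.1f}%")   # ~22%
print(f"walnuts as a share of 'nuts' only:   {of_nuts['walnuts']:.1f}%")  # ~33%
```

Nothing was recounted; the walnuts' share grew by half simply because the denominator shrank.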
The description we give can be very important.
We need to be constantly alert to the quality of the data that we use. Some data is better than others.
Personally I find people’s attitude to certain sources of data a mystery.
There are millions, perhaps billions, of pounds spent each year on creating new and better ways of counting things. Many of these systems will count things that are already being counted. The justification for these new systems is regularly a lack of trust in the old system. What fascinates me is the preference to start counting again from scratch rather than to regain trust in the old system. Often the lack of trust is based on the flimsiest of reasoning and an underestimation of the complexities of counting things.
I work in IT, and one of the things we do is count the number of systems, servers and the like, that people have. We do this counting across thousands of customers and hundreds of thousands of systems. This environment is not static; every day hundreds of people are adding or removing systems. What's more, a system doesn't simply go from being there to being not there: it has various states in its lifecycle. At the beginning it needs to be commissioned, and at the end it needs to be tracked through various stages.
Some of the systems are counted automatically; they tell us they are there on a regular basis. Other systems are counted manually: they don't have the ability to announce their presence, so the person working on them is supposed to tell us when they have been added or taken away. Every time you add a human into the process the level of accuracy reduces, but some data is better than no data, isn't it?
The best that we can hope for in this dataset is that it is broadly correct and most of the time broadly correct is all we need. That’s enough quality for us to make the decisions that we need to make.
Broadly correct is fine for us because we understand the fuzzy parts; we know the bits to trust and the parts to trust less. Where it gets tricky is when people start making claims about these numbers in a way that doesn't reflect that fuzziness. We tend to round things up or down to the nearest ten thousand, because that's where we are confident. That's the level of leeway that we give ourselves. Others declare exact numbers and in so doing give a misleading perspective on the data.
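One way to keep the declared precision honest is to round to the level at which you actually trust the data. A tiny sketch, using an invented count:

```python
def round_to(n, nearest):
    """Round n to the nearest multiple of `nearest`."""
    return nearest * round(n / nearest)

exact_count = 142_387  # an invented, spuriously precise system count
honest = round_to(exact_count, 10_000)
print(f"we track roughly {honest:,} systems")  # "roughly 140,000" is defensible
```

Quoting 142,387 implies we could defend every last digit; rounding states the confidence we actually have.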
Most of the time we collect data to help us to make decisions. One of the ways in which we guide our decisions is by drawing straight lines.
One of the core skills of humans is pattern matching. We look at items on a graph and cannot help but see a trend. Most of the time the trend that we see is a straight line; sometimes we see a curve. In this age of Covid many of us have looked at charts and wished to see those early signs of a wave slowing down and the curve heading downwards once more.
The problem we have is that our need to see lines is so strong that we really struggle when things aren't a line; we really dislike charts that are just a scattering of dots. The reality, though, is that many of the things we look at are random: they are that scattering of dots, without any clear line.
Beware of seeing lines where they don’t exist.
It’s all about context
Numbers don't stand on their own; they exist within a framework of time and place. They are influenced by the way that we create them. We like to make numbers neat and tidy, even when they aren't. Every number is an interpretation by the person who created it. The things that we exclude say as much about the data as the things that we include.
Without understanding the context in which a set of numbers has been created, we can't derive any true meaning from them.
The problem that I see, so often, is that the context is hidden and opaque.
It falls upon those of us who produce numbers to make sure that we explain their meaning illuminated by the context in which they were created.
The problem with memes is that they often hide that context, that’s one of the reasons why they are difficult to stop.
Anyway, I’m off to delve deep into an Excel spreadsheet to work out whether we should include the peanuts, or not.
Header Image: This is Small Water which is tucked between Harter Fell and Mardale Ill Bell on a glorious day in the hills. Alongside it runs the Nan Bield Pass which links together the remote communities of Mardale and Kentdale which would, otherwise, be a very long walk around. I have no idea why it’s called Nan Bield Pass, or whether Nan Bield was a person or is describing a feature.