If I take the bus downtown today, what are the chances I’ll run into my friend (let’s call her Alice)? Suppose I’m on the bus for an hour. According to the local metro agency, approximately 31 people use a single bus per hour of operation, so assume that during my hour on the bus I encounter 30 random people from my city (31 riders, minus me). My city has a total population of around 300,000.
We can calculate this probability in the following way. The probability that I run into Alice on the bus is the probability that Alice is included in a random sample of 30 people drawn from a population of 300,000. Some simpler examples make the pattern clear. With a sample size of 1, we are just picking one random person, so the probability is 1/300,000. With a sample size of 300,000, Alice is guaranteed to be included (the probability is 1). With a sample of half the population, 150,000, there is a 50% chance that Alice is included. In general, the probability of any individual being included in a sample of size n is n/300,000. So the probability of Alice being on the bus is 30/300,000, which is 1/10,000 or 0.0001.
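The n/300,000 reasoning is easy to sanity-check with a quick simulation (a minimal sketch; the population and sample sizes follow the numbers above, and Alice is arbitrarily labeled person 0):

```python
import random

POPULATION = 300_000  # city population
SAMPLE = 30           # riders I encounter in an hour
TRIALS = 500_000      # number of simulated bus trips

# Count how often person 0 ("Alice") lands in a random sample of 30.
hits = sum(0 in random.sample(range(POPULATION), SAMPLE) for _ in range(TRIALS))
rate = hits / TRIALS

print(rate)  # hovers near 30/300,000 = 0.0001
```

Over enough trials, the observed inclusion rate settles near the analytic answer of 0.0001, matching the sample-size-over-population argument.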
Even though we calculated this probability, it is only an estimate. This, I think, is where many people misinterpret probabilities. Probability estimates can, in some cases, rest on highly sophisticated mathematical models with a lot of complicated calculations. Nevertheless, we are not actually calculating the probability of the event, and our answer will never be strictly accurate. In my bus example, many of the assumptions are plainly wrong: bus riders are not a random selection from the entire city’s population. Some people ride the bus every day, some ride only occasionally, and some never ride at all. If Alice always drives to get where she’s going, then the probability of running into her on the bus is zero. Similarly, if Alice rides the bus every day at 5 pm but I’m riding at noon, I will also never run into her. On the other hand, if Alice regularly takes random bus trips downtown, then the probability is far greater than 1/10,000.
This is part of why, I think, people are sometimes surprised when something apparently likely doesn’t happen or when something apparently unlikely does happen (see also: Why do unlikely things happen?). It may even be that, after a large number of trials, the average of the observed outcomes doesn’t converge on the expected value. For example, if I repeatedly take the bus and record whether I run into Alice, it may turn out that on average I see her once out of every 40 bus trips, which is far from the expected 1 out of 10,000.
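The gap between the naive estimate and what I might actually observe can be illustrated with a toy model. Here the 1-in-40 per-trip chance is a made-up number standing in for Alice’s actual bus habits, which the random-sample calculation knows nothing about:

```python
import random

NAIVE_P = 30 / 300_000  # the random-sample estimate: 0.0001
ALICE_P = 1 / 40        # hypothetical: Alice's true per-trip chance of being aboard
TRIPS = 100_000         # simulated bus trips

# Each trip, Alice is on the bus with her true (habit-driven) probability.
meetings = sum(random.random() < ALICE_P for _ in range(TRIPS))
observed = meetings / TRIPS

print(observed)  # settles near 1/40 = 0.025, nowhere near the naive 0.0001
```

The long-run average tracks the true process, not the model, so no number of trials will rescue an estimate built on bad assumptions.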
Compare that to something like rolling a die. When figuring out the probabilities we do make a lot of simplifying assumptions, ignoring things like air resistance, how the die is thrown, and so on. However, the nature of dice-rolling is such that all of these factors “come out in the wash,” so to speak. By that I mean that because dice are almost perfectly symmetrical and the process of rolling is sufficiently random-like, after a large number of rolls the observed average does converge on the expected value. For example, the chance of rolling a 3 is 1 in 6, and on average 1 out of every 6 rolls is indeed a 3.
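This convergence is easy to see concretely by simulating a fair die (a minimal sketch, treating each roll as an independent uniform draw from 1 to 6):

```python
import random

ROLLS = 600_000  # number of simulated die rolls

# Count how often a fair six-sided die comes up 3.
threes = sum(random.randint(1, 6) == 3 for _ in range(ROLLS))
frequency = threes / ROLLS

print(frequency)  # approaches 1/6 ≈ 0.1667 as ROLLS grows
```

Unlike the bus case, the modeling assumptions here (symmetry, independence) closely match reality, so the observed frequency reliably homes in on 1/6.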
Arguably, the only “true” probabilities (in the sense of being truly random) occur in math and maybe quantum mechanics. Everything else should be interpreted statistically, in terms of our lack of knowledge about a situation. We could hypothetically predict the outcome of a dice roll if we had precise enough information about the situation. When discussing things related to human behavior, like taking a bus, the situation is vastly more complicated and information about it is difficult to obtain. When estimating a probability, then, we must make assumptions to account for things we don’t know, and therefore our confidence in the estimate must depend on what assumptions we make. It is this confidence that I think most laypeople are unaware of.
Consider a statistic like “one in three high school students goes on to drop out of college” (this is made up for illustration purposes). One can easily imagine an auditorium full of high school students at an assembly with the speaker saying, “Look to your left, then look to your right. One of the three of you will drop out of college.” This is such an abuse of statistics! The statistic can be literally true and completely accurate, and yet that doesn’t make it a good estimate of any individual student’s probability. What does the same statistic look like for students from that particular state? Or at that particular school? Does it vary by gender, race, or ethnicity? Is GPA a predictor of whether a student will go on to drop out of college? Applying broad statistical trends to individuals is about as reasonable as the 1/10,000 probability I calculated above, which is to say not reasonable at all. In the absence of any other information, it can give some idea of how likely something is, but only with an extremely low level of confidence.
