This summary of Noise: A Flaw in Human Judgment highlights how random and noisy our decisions can be. In another post, I summarise the suggested ways to reduce noise.
Buy Noise at: Amazon | Kobo (affiliate links)
Key Takeaways from Noise
- There is noise in lots of areas, and usually more than you think. Areas covered in the book include medicine, sentencing, asylum decisions, punitive damage awards, insurance underwriting, forecasts, hiring, fingerprinting.
- Noise audits are a way to identify the amount and type of noise in a system. In a noise audit, you get multiple people to evaluate the same case and you then compare their judgments. A noise audit may sometimes reveal deficiencies in skill or training.
- Even “one-off” decisions can be noisy. A different person could have made a different decision, or the same person on a different day could have made a different decision. Noise is present if some irrelevant aspect influenced your decision. While we can’t measure noise in one-off decisions (unless it’s a one-off prediction maybe), the methods to reduce noise in recurrent decisions could also be useful for one-off decisions. The authors suggest that we should think of singular judgments as recurrent judgments that we make only once. [I think the better way to think is to accept that cognitive biases can result in worse decisions, so decision hygiene to help prevent those errors can help in one-off decisions too.]
- We don’t want noise in judgments. Diversity in tastes is fine. Variability in a competitive situation in which the best judgments triumph is also fine. In competitive situations, the variation is just the first step; there is a second step of selection. But judgments can matter, and
- A common source of error is psychological bias.
- A psychological bias can create a statistical bias if many people share that same bias (e.g. racism in judiciary).
- Psychological biases create noise if judges are biased in different ways or to different extents.
- Noise in different directions doesn’t just cancel out. If we aggregated a bunch of noisy judgments about the same case, the errors would tend to cancel out. But noisy judgments in different cases add up rather than cancel out. If one insurance underwriter overprices and another underprices, the insurance company has made 2 errors, not 0.
- Objective ignorance is the idea that many things on which the future depends simply cannot be known. Most people grossly underestimate objective ignorance. I’ve summarised the parts about objective ignorance here.
- I have put into separate summaries the chapters on how to reduce noise and common objections to reducing noise, as well as how to make better decisions more generally.
- The book also includes three appendices: one on how to conduct a noise audit, a decision observer checklist, and one on how to correct predictions.
Detailed Summary of Noise
Types of noise
System noise is the unwanted variability in judgments of the same case by different individuals or teams. System noise consists of level noise and pattern noise.
Level noise
Level noise is the variability between different judges’ average judgments. These tend to be somewhat stable between-person differences (e.g. some are consistently more lenient, others consistently underestimate).
[I think that casually, we often use the term “bias” to describe level noise. For example, when we think someone is a harsh marker, we might say they have a negative bias.]
Pattern noise
Pattern noise is a result of the features of a particular case that different people react to differently. For example, if you are generally more lenient, there might be something about a particular case that makes you harsher; or even more lenient than average. Pattern noise arises because individuals are unique and idiosyncratic, so will respond to different cases in different ways.
The authors break down pattern noise further into occasion noise and stable pattern noise.
Occasion Noise is the variability in judgments of the same case by the same person on a different occasion. Things such as mood, stress, or the order in which cases are evaluated can cause this variability. Studies generally find that occasion noise is smaller than pattern noise or level noise (i.e. you are more similar to yourself than you are to other people). Examples of occasion noise include:
- In one study, experimenters asked software developers on two separate days to estimate the amount of time required to complete a task. The hours estimated by the same person differed by 71% on average.
- Physicians are more likely to prescribe opioids at the end of a long day.
- Doctors are significantly more likely to order cancer screenings early in the morning than late in the afternoon. The authors suggest this may be because doctors run behind schedule in the afternoons and skip decisions about preventive health measures.
Stable Pattern Noise is the remaining pattern noise.
The authors find that stable pattern noise is usually more significant than level noise and occasion noise. This is important because level noise is the easiest type to observe without a noise audit. Level noise is also relatively easy to address (e.g. universities scaling grades). So the fact that most noise is not level noise suggests that we are underestimating noise and not correcting for it enough.
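To make the level/pattern split concrete, here is a minimal sketch of how it could be computed from a noise audit. The judgment table is made up for illustration, and the function name `decompose_noise` is my own, not the book's. It assumes each judge rated each case exactly once, so occasion noise cannot be separated from stable pattern noise here.

```python
from math import sqrt
from statistics import fmean

# Hypothetical sentences in years: one row per judge, one column per case.
judgments = [
    [2.0, 5.0, 9.0],
    [4.0, 6.0, 14.0],
    [1.0, 7.0, 7.0],
]

def decompose_noise(matrix):
    """Return (level noise, pattern noise, system noise) as standard deviations."""
    n_judges, n_cases = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(n_judges) for j in range(n_cases)]
    grand_mean = fmean(matrix[i][j] for i, j in cells)
    judge_means = [fmean(row) for row in matrix]
    case_means = [fmean(row[j] for row in matrix) for j in range(n_cases)]

    # Level noise: how much the judges' average judgments differ.
    level_var = fmean((m - grand_mean) ** 2 for m in judge_means)
    # Pattern noise: what is left after removing each judge's average
    # severity and each case's average judgment.
    pattern_var = fmean(
        (matrix[i][j] - judge_means[i] - case_means[j] + grand_mean) ** 2
        for i, j in cells
    )
    # System noise: spread of judgments around each case's mean.
    system_var = fmean((matrix[i][j] - case_means[j]) ** 2 for i, j in cells)
    return sqrt(level_var), sqrt(pattern_var), sqrt(system_var)

level, pattern, system = decompose_noise(judgments)
print(f"level={level:.2f} pattern={pattern:.2f} system={system:.2f}")
```

With population variances, the squared system noise equals the squared level noise plus the squared pattern noise, which matches the sum-of-squares relationship in the book.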
Measuring noise
- In predictive or verifiable judgments (e.g. the weight of a cow), we will eventually find out the true value. To measure noise or error for such judgments, you can compare it to the true value or outcome.
- Some judgments are non-verifiable (or at least not verifiable for a long time), or are evaluative (e.g. the “fair” sentence for an offender). To measure noise for these judgments, use the mean of judgments as the true value. This is not always a correct assumption because there could be (statistical) bias in the judgments.
Error equations
- Error (i.e. deviation from the “true” value) = [statistical] bias + noise.
- Statistical bias is a systematic deviation from the true value (e.g. left skew, right skew). It is the average error. We can only measure statistical bias if we know what the true value is.
- Noise is the remaining or residual error.
Gauss proposed the mean squared error (MSE) measure to work out how much individual errors contribute to overall error. MSE is the average of the squares of individual errors.
- Squaring gives large errors much greater weight than small ones, and the direction of errors does not matter (and they do not cancel out).
- MSE does not reflect the real world costs of judgment errors, which may be asymmetric (e.g. imprisoning an innocent person is arguably worse than letting a guilty person go free)
- Overall Error (MSE) = Bias^2 + Noise^2
- System Noise^2 = Level Noise^2 + Pattern Noise^2
- Pattern Error = Stable Pattern Error + Occasion Error
- Pattern Noise^2 = Stable Pattern Noise^2 + Occasion Noise^2
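As a quick check of the first equation, here is a small worked example. The true weight and the estimates are invented numbers, and `error_decomposition` is a name I made up. Bias is taken as the mean error and noise as the (population) standard deviation of the errors.

```python
from math import isclose
from statistics import fmean, pstdev

def error_decomposition(estimates, true_value):
    """Return (bias, noise, mse) for a set of judgments of one true value."""
    errors = [estimate - true_value for estimate in estimates]
    bias = fmean(errors)                      # average error
    noise = pstdev(errors)                    # spread of the errors
    mse = fmean(err ** 2 for err in errors)   # mean of squared errors
    return bias, noise, mse

# Hypothetical guesses of a cow's weight (kg) against a hypothetical true weight.
bias, noise, mse = error_decomposition(
    [520.0, 540.0, 510.0, 550.0, 530.0], true_value=500.0
)
print(bias, round(noise, 2), mse)  # 30.0 14.14 1100.0
assert isclose(mse, bias ** 2 + noise ** 2)
```

Here bias squared is 900 and noise squared is 200, which add up to the MSE of 1100. Halving either the bias or the noise would reduce overall error, which is why the authors treat noise as just as worth fixing as bias.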
Evaluating the process
You can also assess the quality of judgments by looking at the process used to make a judgment (particularly for non-verifiable ones), and seeing how that process performs over time.
For example, you can’t say that predicting Clinton had a 60% chance to win was an incorrect judgment simply because Trump ended up winning. What you can do, instead, is to apply the process you used to a bunch of other predictions and see how that process performs over time.
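One way to see how a forecasting process performs over time is to score its whole track record. The sketch below uses the Brier score, a common scoring rule for probability forecasts; choosing it here is my assumption, not something this summary attributes to the authors. The track record is invented.

```python
from statistics import fmean

def brier_score(forecasts):
    """Mean squared gap between stated probability and outcome (0 is perfect)."""
    return fmean((p - outcome) ** 2 for p, outcome in forecasts)

def hit_rate(forecasts, stated_probability):
    """Share of events that happened among forecasts made at one probability."""
    return fmean(o for p, o in forecasts if p == stated_probability)

# Hypothetical track record: (stated probability, 1 if the event happened else 0).
track_record = [
    (0.6, 1), (0.6, 0), (0.6, 1), (0.6, 1), (0.6, 0),
    (0.9, 1), (0.9, 1), (0.1, 0),
]

print(brier_score(track_record))    # lower is better; always saying 50% scores 0.25
print(hit_rate(track_record, 0.6))  # a well-calibrated process lands near 0.6
```

In this made-up record, two of the five 60% forecasts "missed", yet the process looks well calibrated. A single miss like the 2016 election says very little on its own.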
Almost all models and rules outperform humans because humans are noisy
Rules, algorithms and formulas consistently perform better than humans when making predictions. The authors argue this is not because of their superior insight but because of their noiselessness.
- Meehl found that very simple mechanical models applying the same rule/weights to all cases tend to outperform human judgments. (Because they are less noisy than humans.)
- Goldberg found that models of particular human judges, which have a 0.80 correlation (PC 79%) with the judges’ actual decisions, usually outperform the judges themselves because the models are less noisy. A model also does not reproduce all the subtle rules a human judge applies (those not captured in the model), but that subtlety is often not valid.
- Yu and Kuncel found that even random linear models tended to outperform human judges. The authors suggest this is because of the massive amount of noise in human judgment.
Racial bias – models vs humans
The authors note that racial bias is a risk with AI models in principle, even if they don’t explicitly use racial data. This is because a model could aggregate predictors that are highly correlated with race (e.g. ZIP code) or its training data could be biased. However, the authors point out that humans have racial bias too. In fact, some AI models are less racially biased than humans. Some examples:
- Bail decisions. By setting the risk threshold to achieve the same crime rate as the human judges’ decisions, the algorithm jailed 41% fewer people of colour.
- Recruitment. Resumes that had been selected by a machine-learning algorithm were 14% more likely than those selected by humans to receive a job offer after interviews. The algorithm group was also more diverse in race, gender and other metrics. The algorithm was more likely to select “non-traditional” candidates who did not go to an elite school, who did not have prior work experience, and who did not have a referral. Human beings tended to favour resumes that checked all the boxes, while the algorithm weighted each predictor.
Simple or “frugal” models are surprisingly good
- Dawes found that an equal weight formula (called an improper linear model), where predictors are equally weighted, was about as accurate as “proper” regression models and far superior to human judgment. The reason for this is that multiple regression minimises error in the original data (i.e. it “overfits” the data). That overfitting can cause mistakes in out-of-sample data because it will consider some factors as relevant when they really aren’t, or over- or under-weight factors. The predictors used for Dawes’ improper linear model do have to be correlated with the outcome though – i.e. actual predictors.
- In real life, predictors are usually correlated with each other. Combining two correlated predictors is not much more predictive than having one predictor itself. This supports the use of “frugal” models that use fewer predictors.
- For example, in 2020 Jung, Concannon, Shroff, Goel and Goldstein used a frugal model to predict flight risk for bail decisions. The model used just two predictors – the defendant’s age and number of past court dates missed. It performed as well as statistical models with many more variables (and much better than human judges).
- Frugal models are good because they are transparent and easy to apply. And this comes at relatively little cost in accuracy.
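To show how little machinery an equal-weight model needs, here is a minimal sketch. The data, the function names, and the direction of each predictor (younger and more missed dates meaning higher risk) are all assumptions of mine for illustration, not the actual model from the bail study. Each predictor is standardised so that no single one dominates, then the standardised values are simply added.

```python
from statistics import fmean, pstdev

def z_scores(values):
    """Standardise a column to mean 0 and standard deviation 1."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]

def equal_weight_scores(predictors, signs):
    """Sum standardised predictors with equal weight.

    predictors: list of columns, one per predictor.
    signs: +1 if higher values should raise the score, -1 if they should lower it.
    """
    standardised = [z_scores(column) for column in predictors]
    n_cases = len(predictors[0])
    return [
        sum(sign * column[i] for sign, column in zip(signs, standardised))
        for i in range(n_cases)
    ]

# Hypothetical defendants: age and number of past court dates missed.
ages = [19, 45, 30, 52, 23]
missed_dates = [3, 0, 1, 0, 2]

risk = equal_weight_scores([ages, missed_dates], signs=[-1, +1])
ranking = sorted(range(len(risk)), key=lambda i: risk[i], reverse=True)
print(ranking)  # defendant indices from highest to lowest score
```

The only judgment call left to a human is which predictors to include and in which direction they point. No weights are fitted, so there is nothing to overfit.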
AI/machine-learning models
AI models just find patterns. They are not magical. Machine-learning models find significant signal in a combination of variables that might otherwise be missed.
One of the advantages of machine-learning models over simple linear models is that they are able to discover “broken legs”. The “broken-leg principle” is the idea that if you know someone has a broken leg, you know they are not going dancing that night even if the model predicts a high chance that they would. In that case, you have decisive information that the model has not taken into account, so you should override it.
But you should not override the model simply because you disagree with its prediction, because you would just be introducing noise.
Groups tend to amplify noise
This is due to social influence and informational cascades (based on who speaks up first). Examples:
- Depending on whether a song randomly appeared on a “popular” list on a website, it could be a spectacular success or failure.
- Lev Muchnik found that a single (random) upvote made a user 32% more likely to give something another upvote. Over time, a single random upvote increased the mean rating of comments by 25%.
- Kahneman, Sunstein and Schkade conducted a study using case vignettes and over 500 mock juries. Statistical juries (using jurors’ independent judgments) turned out to be much more moderate and less noisy than deliberating juries (when jurors talked to each other). Deliberating juries were more polarised/extreme. In fact, they found that 27% of juries chose an award as high as, or higher than, their most severe member.
But note that aggregating independent judgments is a powerful way to reduce noise.
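A small simulation illustrates why averaging helps. It assumes each judge's error is independent, unbiased and normally distributed, which is a simplification of mine rather than a claim from the book. The function name `noise_of_average` and the parameters are hypothetical.

```python
import random
from statistics import fmean, pstdev

def noise_of_average(n_judges, trials=20_000, sd=10.0, seed=0):
    """Standard deviation of the average of n independent noisy judgments."""
    rng = random.Random(seed)
    averages = [
        fmean(rng.gauss(0.0, sd) for _ in range(n_judges))
        for _ in range(trials)
    ]
    return pstdev(averages)

for n in (1, 4, 16):
    print(n, round(noise_of_average(n), 2))
```

Under these assumptions, noise shrinks roughly with the square root of the number of judgments, so averaging four judgments about halves it. The key word is independent: once judges talk to each other first, their errors become correlated and much of the benefit is lost.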
Other interesting things
- Marvin Frankel, a US Judge, identified the problem of noise in criminal sentencing back in the 1970s. He campaigned for reforms that led to sentencing guidelines.
- For example, Frankel pointed out that two men without criminal records were convicted for cashing counterfeit checks of around $35-60. One man was sentenced to 15 years, the other was sentenced to 30 days.
- He was ahead of his time, arguing for the use of computers to aid sentencing.
- In their noise audit of insurance underwriters, the median difference was about 55%. The company’s executives had expected about 10%. [I had actually estimated about 50% before I saw the result.]
- Many professionals maintained an illusion of agreement while in fact disagreeing in their professional judgments. [I think this is related to Cass Sunstein’s work about private vs revealed preferences.] One reason is that many organisations prefer consensus and harmony and have systems in place to minimise disagreements.
- Matching fingerprints is not nearly as clear-cut as many people think. The reason is that latent prints left at a crime scene are often very different from exemplar prints collected in a controlled environment. Latent prints are often partial, unclear, smudged, overlap with other prints, or include dirt. Fingerprint experts exercise judgment in determining whether two prints match, don’t match, or are inconclusive.
- Itiel Dror found in several studies that a fair number of examiners change their minds when given biasing information along with the fingerprints. Some change their minds even with no biasing information (i.e. occasion noise). Most are changes to/from inconclusive but still.
- A high-profile case in which fingerprint experts got it wrong was the case of the Madrid bombings involving Brandon Mayfield. That case seemed to be infected with confirmation bias because the first fingerprint expert was highly respected.
- Thankfully, examiners tend to err on the side of caution in their judgments. They know how much worse a false positive ID is compared to a false negative. So if in doubt, they’ll mark as inconclusive. False positives are much rarer than false negatives.
- Official agencies tend to be overly optimistic about their budget forecasts. On average they project unrealistically high economic growth and unrealistically low deficits.
- Medicine is noisy, not just in diagnosis but in treatment recommendations. (Psychiatry is particularly noisy.) Always worth getting a second opinion on anything that requires judgment.
- Field trials for DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) found that even highly trained specialist psychiatrists under study conditions were only able to agree that a patient had depression between 4 and 15% of the time.
- Studies find that performance evaluations are incredibly noisy. The person’s performance accounts for about 20-30% of the variance, with the remaining 70-80% possibly being noise. Kahneman also points out that many feedback questionnaires are very complex, which increases the time required to provide feedback.
My Thoughts
It was a good book and changed the way I thought about noise and error. I was already familiar with some of the examples such as discrepancies in sentencing, and the noise in marking essays. When I marked essays as a tutor, I recognised how hard being consistent was when giving people an “overall” mark. It could be very hard to distinguish between, say, a B+ and a B. So I did what the authors recommended. I broke down the large decision (what grade to give) into a bunch of smaller decisions (up to 3 points for writing style, up to 2 points for identifying a particular issue, etc). I do believe this made my grading less noisy.
Towards the middle, the book talks about various cognitive biases (e.g. base rate neglect, availability bias, etc). This felt unnecessary. It felt like Kahneman wanted to recycle stuff from his earlier book, or put in findings that didn’t make the cut. It was interesting but I thought much of it belonged in a different book.
I think the distinction between statistical bias and cognitive biases could have been made much clearer. The authors usually just used the term “bias”, which was mildly confusing until I realised they were using the term in two different ways. Near the start, the book talked about how it will focus on noise, not [statistical] bias, as an underappreciated source of error. Later on, the book talked about how [cognitive] biases cause noise. It does eventually clarify this but rather late in the book.
I liked the fact that there was practical advice for how to reduce noise/error in decisions. Some people seem disappointed as many of the ideas in the book don’t seem to be revolutionary. Perhaps they were expecting more given the impact Kahneman’s first book made. Moreover, while noise can affect us all if we’re on the wrong side of a “noisy” judgment, the book seems more relevant to organisation leaders who are better placed to fix things. I enjoyed it from a policy perspective.
Buy Noise at: Amazon | Kobo <– These are affiliate links, which means I may earn a small commission if you make a purchase. I’d be grateful if you considered supporting the site in this way!
Have you read this book? What did you think about it? Please share your thoughts in the comments below!