Review of "The Effect of Videogame Violence on Physiological Desensitization to Real-World Violence," by Nicholas L. Carnagey, Craig A. Anderson, and Brad J. Bushman, Journal of Experimental Social Psychology, 2006. Many thanks to Andy Havens for sending us this link.
The authors present the results of an experiment in which 257 college students played video games and then were exposed to films of real-world violence. Heart rate (HR) and skin conductance (SC; sweating) were measured. These involuntary physiological measures are accepted as valid indicators of arousal. In the main experimental condition, subjects were separated into two groups, one that played a violent video game and one that played a non-violent video game. HRs and SCs were measured before the experiment, after video game play, and after the films of real-world violence. The data indicate that for those who played non-violent video games, arousal rose during video game play, and rose again during the violent films. Those who played violent video games experienced an increase in arousal during the video games, and a decrease during the violent films. The authors argue that these patterns are significant, and indicate that those who play violent video games become desensitized to violence.
While the experiment appears to have been competently conducted, the statistical treatments that lead the authors to conclusions of significant desensitization collectively constitute a veritable handbook of quantitative error and deception. This review can only select some of the more egregious failures. Unfortunately, the abysmal quantitative skills evidenced here are in fact all too common in this literature and most of its close relatives. Entire disciplines seem to believe that they are discovering things, when it seems to this reviewer that they are making up their discoveries as they go along.
In particular, the primary failings one sees again and again in this literature are:
1. Failure to honestly and fully report data
2. Failure to distinguish statistical from substantive significance
3. Failure to develop statistics relevant to prior theory
4. Failure to draw careful and appropriate policy inferences
The Carnagey et al. heart rate data will be used to illustrate each of these failings.
1. The authors do not fully and honestly report their heart rate data. Six small numbers are all that is necessary: the authors should simply tell the reader what they found. What were the average heart rates for people who played violent or non-violent games, at the three times in the experiment? The authors bury four of these numbers in parenthesized text on page 5, but some digging reveals: heart rate before game play, 66.4 beats per minute (bpm) and 65.5 bpm for violent and non-violent games, respectively; and heart rate after game play but before the film, 69.3 and 68.4. The authors do not report heart rate after the film, which of course is the most important set of numbers. Instead, the reader is referred to a figure (p. 5). As best one can tell from it, heart rate after the film is about 70.5 bpm for the non-violent group, and about 68.5 for the violent group. Thus, the data are:
Mean HR (bpm)        Before experiment   After game   After film
Violent game         66.4                69.3         ~68.5
Non-violent game     65.5                68.4         ~70.5
(values marked ~ are read off the paper's figure)
The reader also needs the standard deviations of heart rate across the sample. Without them, it is impossible to tell whether these variations in average heart rate are substantively significant (see below). Absent these data from Carnagey et al., we have to rely on outside sources. Average resting heart rates for human beings run between 65 and 70 beats per minute, and a range of 50-100 bpm is considered normal (source; source).
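What would a substantive-significance check look like? A minimal sketch follows, in Python. The standard deviation here is an assumption, since the paper as reviewed reports none; between-subject SDs of resting heart rate on the order of 10 bpm are plausible, and the point is only that an effect size, not a p-value, answers the question "does this difference matter?"

```python
# Effect size (Cohen's d) for the post-film heart rate difference.
# The SD is a hypothetical placeholder; Carnagey et al. do not report
# one in the text reviewed here.
mean_nonviolent = 70.5  # post-film HR, non-violent-game group (read off the figure)
mean_violent = 68.5     # post-film HR, violent-game group (read off the figure)
assumed_sd = 10.0       # assumed between-subject SD, in bpm

cohens_d = (mean_nonviolent - mean_violent) / assumed_sd
print(f"Cohen's d = {cohens_d:.2f}")  # 0.20 -- "small" by Cohen's rule of thumb
```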
Instead of these numbers, the authors report their HR data using a single figure, indeed one that makes use of one of the most frequently denounced practices of statistical charlatans: the vertical axis is not grounded at zero. The range of HR values in the figure runs from 60 to 75 bpm - the entire range falls within the normal range for a human heart. On such a graph, of course, the numbers reported above do look dramatically different. But it is a deception of significant magnitude, well worthy of an 'F' in an introductory econometrics course.
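The distortion is easy to demonstrate. The sketch below plots the same six means twice, once with the axis grounded at zero and once truncated to the 60-75 bpm window the paper uses; this is a reconstruction for illustration, not the paper's actual figure.

```python
import matplotlib.pyplot as plt

# The six group means discussed above (post-film values read off the figure).
times = ["Before experiment", "After game", "After film"]
violent = [66.4, 69.3, 68.5]
nonviolent = [65.5, 68.4, 70.5]

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, title, ylim in [
    (axes[0], "Axis grounded at zero", (0, 100)),
    (axes[1], "Axis truncated (as in the paper)", (60, 75)),
]:
    ax.plot(times, violent, marker="o", label="Violent game")
    ax.plot(times, nonviolent, marker="s", label="Non-violent game")
    ax.set_ylim(*ylim)
    ax.set_title(title)
    ax.set_ylabel("Heart rate (bpm)")
    ax.legend()
plt.tight_layout()
plt.show()
```

On the left, the curves are nearly indistinguishable; on the right, the same numbers appear to diverge sharply.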
2. The authors cannot distinguish (or do not understand) the difference between statistical and substantive significance. An overview of similar follies can be found here. The idea is simply this: a long time ago, statisticians invented a certain kind of test of the relationship between the mean of a variable and its sampling variability. They called the test a "test of statistical significance." If the mean turned out to be large relative to its standard error (roughly, more than twice as large), they said that the mean was "statistically significant." In one of the most unfortunate linguistic twists imaginable, generations of quantitatively clumsy followers have morphed this notion into a general test of whether a variable is large or not. It is unfortunate because statistical significance is just an artifact of the data; it can never tell us whether a number matters or not. In fact, any nonzero difference will become statistically significant if the sample is large enough: as the sample gets larger, the standard error of the mean gets smaller, and presto! statistical significance. But theoretical questions don't change on the basis of whether the data set is big or small. The classic story is of an animal husbandry scholar who found that the hair on the left side of a sheep's back is statistically significantly longer than on the right side. Of course, he had data from thousands and thousands of sheep. Even a tiny difference in average length - 0.0001 inches - would become statistically significant if you measured hair on a million sheep. Yet the test of substantive significance - "is this a meaningful difference in hair length or not?" - is absolutely independent of sample size. And substantive significance is all we should care about (see item 4 below).
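A quick simulation makes the sheep story concrete. This is purely illustrative - nothing here comes from the paper - but it shows how a fixed, trivial difference in means flips from "insignificant" to "significant" as the sample grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trivial_gap = 0.01  # a substantively meaningless difference in means

for n in [50, 5_000, 500_000]:
    left = rng.normal(loc=10.0, scale=1.0, size=n)
    right = rng.normal(loc=10.0 + trivial_gap, scale=1.0, size=n)
    t, p = stats.ttest_ind(left, right)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n:>7,}: p = {p:.4f} ({verdict} at the 5% level)")
```

The gap never changes; only the sample size does, and at the largest sample the test all but guarantees "significance."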
Nonetheless, the bugbear of statistical significance is loose among poorly-grounded fields, among which one must now, on the basis of their acceptance of this paper, sorrowfully include experimental social psychology. It is common among bad statisticians to
a) Drop the word 'statistical' when referring to the significance of a finding
b) Report only statistical significance tests, not substantive significance tests
c) Report all statistically significant findings and ignore all statistically insignificant ones
Carnagey et al. commit at least one of these three errors in every paragraph of their results discussion, and they frequently commit all three. Almost all of their statistical discussion is devoted to F-tests, which are statistical significance tests. On page 5 they assert, for example, that the difference in HR from game to film was "large" for both groups, inserting a parenthetical F-test result as their only support for that assertion. Large. The reader may judge: is a difference in HR from 69.3 to 68.5 bpm "large" by any standard of substantive significance? This reviewer does not think so, especially given that individuals may have HRs anywhere between 50 and 100 bpm and still be considered normal.
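If only F statistics are on offer, a reader can at least convert them into an effect-size measure. The sketch below uses the standard conversion from a one-way ANOVA F to eta-squared, the proportion of variance explained; the F value and degrees of freedom are hypothetical placeholders, not numbers from the paper.

```python
def eta_squared(f_value: float, df_between: int, df_within: int) -> float:
    """Convert a one-way ANOVA F statistic into eta-squared, the
    proportion of variance explained by the group factor."""
    return (df_between * f_value) / (df_between * f_value + df_within)

# Hypothetical inputs for illustration only -- not taken from the paper.
print(f"eta^2 = {eta_squared(f_value=6.0, df_between=1, df_within=255):.3f}")
# ~0.023: a "significant" F can explain only about 2% of the variance.
```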
To report statistical significance tests as tests of substantive significance is more shamefully deceptive than the simple graph cheat identified in (1), but it is of the same color.
3. The authors fail to develop these statistics within the context of a sound theory. Carnagey et al. do spend a long time talking about theory, but interesting things start to happen once they apply their theories to the data. In their "Preliminary Analyses" on page 4, the authors describe what they call "significant" and "insignificant" modifiers to the study's results. One might theorize, for example, that prior exposure to video games might affect how HR responds. Or perhaps being male or female might matter. Family background might make a difference. Since this is a random-assignment study, it is unlikely that these effects will be important. Still, the authors were careful to do a post-hoc assessment of the data along these lines. Where their practices turn shady, however, is in concluding from the "insignificance" of a difference that a given variable can be dropped from the study entirely. It probably does not need saying that the standard of significance here is statistical significance, so this practice uses the old statistical-significance bugbear to alter what is considered theoretically important before the study's primary statistics are even constructed. The proper procedure is to complete all theoretical reasoning prior to data manipulation: if theory suggests that a variable such as sex matters, it should be included in the entire analysis. The correct protocol for studying any effect is to embed it in a regression analysis, so that its effect can be isolated while the effects of other variables are held constant. Again, random assignment is one way to do what regression is supposed to do, namely, hold other factors at bay. But it is even better to do what Carnagey et al. apparently do, which is to approach the post-experimental data using regression as well. The bad practice comes in dropping entire variables from the analysis simply because some aspect of them was statistically insignificant at a prior step. If theory dictates that they matter, they belong in the final regression. To exclude them on the basis of an ad hoc statistical significance test is another terribly bad practice; very likely, the inclusion of all theoretically relevant variables in the analysis would make the reported HR differences even smaller. A sketch of this protocol appears below.
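Here is what that protocol might look like in Python with statsmodels, assuming a tidy data set; every file and column name below is a hypothetical placeholder, not a variable from the paper. The theoretically motivated covariates stay in the final model regardless of how they fared in any preliminary screen.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data set; file and column names are placeholders.
# Columns: hr_after_film (bpm), violent_game (0/1), sex (0/1),
#          prior_exposure (hours/week), hr_baseline (bpm)
df = pd.read_csv("experiment.csv")

# Theory says sex and prior exposure could matter, so they stay in the
# model even if a preliminary screen called them "insignificant".
model = smf.ols(
    "hr_after_film ~ violent_game + sex + prior_exposure + hr_baseline",
    data=df,
).fit()

# Report the isolated effect of the violent-game condition together with
# its confidence interval -- a substantive quantity, not just a p-value.
print(model.params["violent_game"])
print(model.conf_int().loc["violent_game"])
```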
4. The authors do not draw careful and appropriate policy inferences. The policy issue in this line of research is whether violent video games are so bad for us that our use of them should be controlled, whether by governments, our loved ones, or ourselves. Carnagey et al. reveal themselves to be utterly insensate to this question. Rather, they conclude that any measurement of desensitization, so long as it passes a statistical significance test, is worthy of public notice. Returning to the heart rate data: playing a violent video game reduces heart rate during subsequent viewing of violent content by about two beats per minute. This indicates something about arousal. One can debate whether it is "significant"; let us assume it is. Does this arousal effect indicate a significant amount of desensitization to violence in the real world? Carnagey et al. apparently believe so, judging from the title of their paper. Should individuals therefore decrease their exposure to this content? That is indeed the implicit message running through this paper. That conclusion is far from warranted, however. The true policy question is this: would a significant decrease in exposure to violent video games lead to a significant decrease in real-world violence? The statistics in this paper do not support such a conclusion.
Indeed, because of its ham-handed and deceptive treatment of data, this paper probably should not have been published.