Simpson’s Paradox Hides NAEP Gains (Again)

Is the education space mature enough to handle NAEP tests every two or four years? I’m not so sure.

NAEP is the “Nation’s Report Card.” It takes a representative sample of United States students and tests them in reading, math, social studies, science, the arts, etc. There are several versions of NAEP intended to sample different groups of students–the nation as a whole, individual states, or large cities–but its overall goal is to provide citizens a snapshot of how we’re doing as a country.

The United States is a big country, and it takes a long time to move the needle on student achievement scores. Depending on the subject and sample, NAEP releases test results every two or every four years. When those scores come out, they almost always look flat. Once this same “flat” result gets repeated over and over, that starts to seep into our collective consciousness about how American students are doing.

But that’s the wrong way to look at it. From a long-term perspective, the achievement levels of American students are at or near all-time highs. Some groups of students are doing particularly well. The achievement scores of black, Hispanic, and low-income students have increased dramatically.

Because NAEP takes a representative sample, it’s also vulnerable to something called Simpson’s Paradox, a statistical phenomenon in which a trend that holds within every subgroup can disappear, or even reverse, in the combined total as the composition of the group shifts. As the United States population has become more diverse, a representative sample picks up more and more minority students, who tend to score lower overall than white students. That shift in composition tends to make our overall scores appear flat, even as all of the groups that make up the overall score improve markedly.
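A minimal sketch of how this can happen. The numbers below are invented for illustration only, not actual NAEP scores:

```python
# Illustration of Simpson's Paradox with invented scores (not real NAEP data).
# Every subgroup improves, yet the population-weighted overall average falls,
# because the mix of students shifts toward the lower-scoring group.

def overall(scores, weights):
    """Population-weighted average score."""
    return sum(s * w for s, w in zip(scores, weights))

# Year 1: group A scores 290, group B scores 250; A is 80% of students.
year1 = overall([290, 250], [0.80, 0.20])   # about 282

# Year 2: both groups gain 5 points, but B grows to 40% of students.
year2 = overall([295, 255], [0.60, 0.40])   # about 279

print(year1, year2)  # the overall average drops even though every group improved
```

Both groups gained 5 points, yet the overall average fell by about 3, purely because the lower-scoring group became a larger share of the sample.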

Recent NAEP results in history, geography, and civics illustrate this trend once again. Education Week reported that scores were “flat” from 2010 to 2014. That’s mostly true–the scores were all higher than in 2010 but didn’t meet the standard for statistical significance. But scores are up over longer periods of time. Here are the gains since 2001 on geography (* signifies statistically significant):

  • All students: +1
  • White students: +4*
  • Black students: +7*
  • Hispanic students: +9*
  • Students with disabilities: +8*
  • English Language Learners: +7

Here are the gains since 2001 on history:

  • All students: +7*
  • White students: +9*
  • Black students: +11*
  • Hispanic students: +17*
  • Students with disabilities: +15*
  • English Language Learners: +12*

And here are the gains since 1998 on civics (civics has a slightly longer time period of comparable data):

  • All students: +3*
  • White students: +6*
  • Black students: +6
  • Hispanic students: +14*
  • Students with disabilities: +13*
  • English Language Learners: +14*

A few things jump out from these longer-term results. First, overall scores are up a little bit, but particular groups of students are making big gains. One rule of thumb suggests that 10-15 points on the NAEP translates into one grade level. Applying that here, scores for most groups of students have improved by roughly a full grade level over the last 15 years or so. Second, achievement gaps are closing as lower-performing groups are catching up to higher-performing ones. Third, Simpson’s Paradox makes the overall scores look relatively “flat.” Don’t let that mislead you. Although we might wish for faster progress, American achievement scores are rising.
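As a back-of-the-envelope check of that rule of thumb (the 10-15 point range is from the post; using the midpoint of 12.5 points per grade level is my assumption):

```python
# Rough conversion of NAEP point gains into grade levels, using the
# post's rule of thumb of 10-15 points per grade level (midpoint assumed).

def points_to_grade_levels(points, points_per_level=12.5):
    return points / points_per_level

# e.g., Hispanic students' +17 gain in history since 2001:
print(round(points_to_grade_levels(17), 2))  # 1.36 -- roughly a full grade level
```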

8 thoughts on “Simpson’s Paradox Hides NAEP Gains (Again)”

  1. sandy kress

    Chad – thanks for this very helpful analysis. But there are more lessons left lying in the data.

    The gains from 2001 to 2010 are generally far stronger than what came before or what came after. I don’t want to make any sort of causal claim for anything, including the possible impact of accountability at its peak. Let’s just say the two coincide.

    For example, for 8th grade students with disabilities, in geography, gains were 10 points from ’01 to ’10, but there was a loss of 2 points from ’10 to ’14.

    For these same students, in history, gains were 14 points from ’01 to ’10, but the gain was only 2 points from ’10 to ’14.

    For the same students, in civics, the gains were good in both periods, slightly better actually in the latter.

    For black 8th grade students, the gains were only in the range of 3-4 points from the base year in the 90s to 2001. But the gains averaged 10 points in geography and history from ’01 to ’10. The civics numbers were up only slightly through the whole period.

    For Hispanic 8th graders, achievement was flat in the earliest period. Gains were impressive, however, between 2001 and 2010 (6 points in geography, 12 points in history, and 10 points in civics, from 1998). Unlike the other subgroups, gains have continued for Hispanics since 2010.

    You’re right in saying the results are up over the past 20 years, but the improvement did not happen uniformly during that whole period. And the differences are worth noting and studying for lessons to be learned.

  2. Jay P. Greene

    It is not appropriate to explain away the lack of aggregate progress in academic achievement by referencing Simpson’s Paradox and disaggregating results by racial/ethnic group. I explained this misuse of Simpson’s Paradox in a blog post a few years ago. See

    Here is a taste of the argument:

    “the unstated argument behind the use of Simpson’s Paradox to explain the lack of educational progress [is that] minority students are more difficult to educate and we have more of them, so holding steady is really a gain.

    The problem with this is that it only considers one dimension by which students may be more or less difficult to educate — race. And it assumes that race has the same educational implications over time. Unless one believes that minority students are more challenging because they are genetically different [which I do not think Chad believes], we have to think about race/ethnicity differently over time as the host of social and economic factors that race represents changes. Being African-American in 1975 is very different from being African-American in 2008. (Was a black president even imaginable back then?) So, the challenges associated with educating minority students three decades ago were almost certainly different from the challenges today.

    If we want to see whether students are more difficult to educate over time, we’d have to consider more than just how many minority students we have. We’d have to consider a large set of social and economic variables, many of which are associated with race. Greg Forster and I did this in a report for the Manhattan Institute in which we tracked changes in 16 variables that are generally held to be related to the challenges that students bring to school. We found that 10 of those 16 factors have improved, so that we would expect students generally to be less difficult to educate.” See

    1. Chad Aldeman Post author

      Jay, I’m not trying to “explain away” the lack of progress except to note that we’re not comparing apples to apples when we look at the overall totals. Although gaps are closing, disadvantaged student populations continue to score lower than their more-advantaged peers. If we change the ratio of higher-scoring to lower-scoring groups over time, it’s not assuming “genetic differences” when we disaggregate results; it’s just math. Why not just talk about how white students, black students, Hispanic students, low-income students, etc. are doing, rather than trying to draw one big sweeping conclusion by lumping them all together?

      1. Jay P. Greene

        Chad — You ask “Why not just talk about how white students, black students, Hispanic students, low-income students, etc. are doing..” The reason why you shouldn’t do that is that “whiteness,” “blackness,” etc don’t mean the same thing over time — that is, unless you think it is genetic. If you think that race/ethnicity is a proxy for a set of social and cultural factors which are changing over time, then any trend in achievement could be explained by that change in social and cultural factors and not in the effectiveness of schools. It’s only “math” if you think the effect of race/ethnicity on educational achievement is unchanging over time. I don’t think you think that.

        1. Chad Aldeman Post author

          Jay, your argument makes some sense as a moral appeal, but I don’t think it stands up for educational purposes. First, if race/ethnicity is just some cultural construct, why would educational researchers look at the effects of an intervention on white, black, or Hispanic students? (You do this in your work–I assume you don’t think there’s something genetic in whether minority students gain from live theater or field trips?)

          Second, what about disaggregating the scores along lines that are not genetically based, like English Language Learner status, free or reduced-price lunch (FRPL) eligibility, or parents’ educational level? In the scores I cite above, ELLs are lower-scoring than the overall population, but they’re gaining faster than the overall population, and they’re a growing share of all students.

          Third, your argument also makes a lot less sense on shorter timeframes. As in the NAEP civics, history, and geography results I mentioned above, we’re talking about ~15 years, not the 40-year timeframe you used in your first comment. Is your opposition to this sort of disaggregation the same for all time periods?

  3. Chad Aldeman Post author

    Sandy, the NAEP data can get noisy in smaller chunks, so I generally prefer to include as many years of comparable data as we have available (the numbers above include all of the years in which NAEP tested in those subjects and allowed accommodations). Looking at only short-term results or slices of time can be a bit misleading, especially if we’re trying to ascribe particular causes for any changes. For example, when would we start crediting NCLB for affecting NAEP scores–2002, when the law was signed, or 2005, when most of the law took effect? And then how do we explain the big NAEP score jumps that were common between the 1998/1999 tests and the 2001/2002 tests?

    Going forward, when did the Common Core start? How about waivers? They were first issued in the spring of 2012 and took effect during the 2012-13 school year (but not for all states). I think we’re on a slippery slope when we try to eyeball NAEP score changes and ascribe them to particular (federal) policy changes.

  4. Sandy Kress

    Chad – if you read my comment carefully, you’ll notice I did not attribute the gains in the 2000s to NCLB. I didn’t suggest any relationship there – either as causal or correlational.

    The main reason I’m careful about discussing the impact of NCLB is that many of the policies that were fundamental to NCLB had begun to be implemented in the states in the mid-late 1990s. So, it’s very hard to isolate the specific impact of NCLB as separate from the impact of all accountability policy that had gone into effect in the states. As Hanushek and others have written, consequential accountability was a major policy force by the end of the 1990s, with 39 states implementing some significant version of it by 1999. NCLB extended and deepened it in its own ways across the nation. This is why I like to talk about the contribution of consequential accountability, which includes NCLB but which had its beginning some years earlier.

    The point I am making on the basis of observation, research, and data is that something happened in the late 1990s and the early-mid 2000s that is correlational with a wide variety of student achievement gains. The pattern popped up again here, and I noted it: flatness in the 90s, gains in the 2000s, and then relative flatness in this current decade.

    I have a hypothesis about the new flatness, but I believe it needs to be researched just as consequential accountability was tested out by Hanushek and many others throughout the 2000s.

    As to Common Core and the waivers, we’ll have to see. My own belief is that neither, at least yet, has had enough enduring effect to have made any real difference. But the researchers will need to study it to see if that’s right. And I hope they will. Merely noticing that there were some subgroup gains over 20 years in which the pattern was flat-up-flat is of very limited value to policy makers, practitioners, and the public.

    1. Chad Aldeman Post author

      Sandy, I think we’re more or less on the same page (I cite Hanushek and others in a forthcoming paper on accountability). My point was merely that policy discussions can veer into short-term NAEP analyses when it would be better to rely on empirical work (like Hanushek’s) whenever possible. That’s not possible in all cases–I’d love to know more about the causes for periods of flatness and the late 1990s pop–but when we do go to the NAEP data, I generally prefer to include the longest trend data that’s available.
