Predictive validities: figures of merit or veils of deception?

The ETS has recently released new estimates of the validities of the GRE for predicting cumulative graduate GPA. They average in the mid-thirties, twice as high as those previously reported by a number of independent investigators. The first part of this paper shows that this unexpected finding can be traced to a flawed methodology that tends to inflate multiple correlation estimates, especially when the population values are near zero. Second, the issue of upward corrections of validity estimates for restriction of range is taken up; such corrections are shown to depend on assumptions that are rarely met by the data. Finally, it is argued more generally that conventional test theory, couched as it is in terms of correlations and variances, is not only unnecessarily abstract but, more importantly, incomplete, since the practical utility of a test depends not only on its validity but also on base rates and admission quotas. A more direct and conclusive method for gauging the utility of a test involves misclassification rates and dispenses entirely with questionable assumptions and post-hoc “corrections”. Applying this approach to the GRE, it emerges (1) that the GRE discriminates against ethnic and economic minorities, and (2) that it often produces more erroneous decisions than a purely random admissions policy would.


💡 Research Summary

The paper critically examines the surprisingly high predictive validities for the Graduate Record Examination (GRE) reported by the Educational Testing Service (ETS) in its recent study of cumulative graduate GPA. While earlier independent investigations consistently found multiple‑correlation coefficients in the .15‑.20 range for both short‑term (first‑year GPA) and long‑term (cumulative GPA) criteria, the ETS report claims values around .35, essentially double the prior estimates. The authors argue that this discrepancy is not due to a genuine increase in the GRE’s predictive power but stems from a flawed statistical aggregation method used in the ETS analysis.

The ETS report employs what the authors call the “Method of Pooled Department Analysis” (PDA). In PDA, each department or sub‑sample computes its own multiple correlation (R) between the GRE sub‑scores (Verbal, Quantitative) and the criterion (GPA). These department‑level Rs are then averaged, weighted by department enrollment, to produce a single “overall” validity. This contrasts with the conventional approach of aggregating all individual scores into one large sample and calculating a single correlation directly.
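As a minimal sketch of the distinction, assuming NumPy and a list of (predictor matrix, GPA vector) pairs per department (illustrative code, not the ETS's or the authors' implementation):

```python
import numpy as np

def multiple_r(X, y):
    """Multiple correlation R between predictors X (n x p) and criterion y."""
    Z = np.column_stack([np.ones(len(y)), X])     # add an intercept column
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # ordinary least squares
    return np.corrcoef(Z @ beta, y)[0, 1]         # corr(fitted, observed) = R

def pda_estimate(departments):
    """Pooled Department Analysis: enrollment-weighted mean of per-department Rs."""
    rs = [multiple_r(X, y) for X, y in departments]
    ns = [len(y) for _, y in departments]
    return np.average(rs, weights=ns)

def aggregated_estimate(departments):
    """Conventional approach: pool all cases first, then compute a single R."""
    X = np.vstack([X for X, _ in departments])
    y = np.concatenate([y for _, y in departments])
    return multiple_r(X, y)
```

Because each department-level R is nonnegative by construction and is inflated in small samples, averaging them can only push the pooled figure upward; pooling the raw cases first has no such floor.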

To test the impact of PDA, the authors conduct extensive Monte‑Carlo simulations. They generate multivariate normal data with pre‑specified population correlation matrices, ensuring known “true” validities ranging from zero to .40. A total sample size of roughly 1,000 is divided into varying numbers of sub‑samples (NSS) with different sub‑sample sizes (SSS), mimicking the small departmental samples (often < 50) found in the ETS data. For each simulated dataset they compute three estimates: (1) PDA (average of department‑level Rs), (2) the conventional aggregation (agr), and (3) a simple sum of standardized predictors (sum). Bias is defined as population parameter minus the estimate; negative bias indicates over‑estimation.
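A hedged reconstruction of one cell of that design follows, reusing pda_estimate and aggregated_estimate from the sketch above; the Verbal-Quantitative intercorrelation of .5 and the replication count are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_bias(true_r=0.0, n_departments=40, dept_size=25, reps=200):
    """Mean bias of PDA vs. direct aggregation at a known population validity."""
    r_vq = 0.5                                 # assumed Verbal-Quantitative correlation
    r_crit = true_r * np.sqrt((1 + r_vq) / 2)  # chosen so the population multiple R = true_r
    cov = np.array([[1.0,    r_vq,   r_crit],
                    [r_vq,   1.0,    r_crit],
                    [r_crit, r_crit, 1.0   ]])
    pda_vals, agr_vals = [], []
    for _ in range(reps):
        depts = []
        for _ in range(n_departments):         # 40 x 25 = 1,000 cases in total
            d = rng.multivariate_normal(np.zeros(3), cov, size=dept_size)
            depts.append((d[:, :2], d[:, 2]))  # (V and Q scores, GPA)
        pda_vals.append(pda_estimate(depts))
        agr_vals.append(aggregated_estimate(depts))
    # Bias = population value minus estimate; negative values mean over-estimation.
    return true_r - np.mean(pda_vals), true_r - np.mean(agr_vals)

print(simulate_bias(true_r=0.0))  # PDA bias is markedly negative; aggregation's is near zero
```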

The simulation results, presented in Tables 2‑4, reveal a systematic pattern:

  1. PDA Over‑estimates Small Validities – When the true validity is zero, PDA yields an average R of about .27 for sub‑samples of size 25, and .16 for size 50, whereas aggregation produces .04 and .03 respectively. Even when the true validity is modest (.10‑.20), PDA inflates the estimate by .07‑.09 on average.
  2. Bias Inversely Related to Sub‑sample Size – Smaller departmental samples generate larger upward bias. As SSS increases toward 77, the difference between PDA and aggregation diminishes but remains noticeable.
  3. Bias Inversely Related to True Validity – The lower the population correlation, the larger the PDA’s upward bias. This is especially problematic for long‑term criteria like cumulative GPA, where genuine validities are typically low.
  4. Aggregation and Sum Methods Perform Better – The conventional aggregation method consistently shows the smallest bias across all conditions; the sum of standardized predictors is also relatively stable, though slightly less accurate than aggregation in some scenarios.

Beyond the pooling issue, the authors critique the ETS’s correction for restriction of range (R_c). Such corrections assume linear relationships, identical population distributions across restricted and unrestricted groups, and precise knowledge of the degree of restriction—assumptions rarely satisfied in real GRE data. Consequently, the corrected validities may be as misleading as the uncorrected ones.
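For concreteness, the standard univariate correction (Thorndike's Case II for direct selection on the predictor) illustrates what is being assumed; whether its conditions hold for GRE data is precisely what is in dispute. With r the validity observed in the restricted group and u = S/s the ratio of unrestricted to restricted predictor standard deviations:

```latex
% Thorndike Case II correction for direct range restriction;
% requires linearity and homoscedasticity of the criterion-on-predictor
% regression, and an accurate estimate of u from the applicant pool.
R_c = \frac{u\, r}{\sqrt{1 - r^{2} + u^{2} r^{2}}}
```

Since u must be estimated from an applicant population that is rarely observed intact, and since selection in practice is typically indirect rather than on the test score alone, the corrected R_c inherits every one of these assumptions.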

The paper then shifts from statistical estimation to practical decision‑making. It argues that test utility cannot be captured by a correlation coefficient alone. Real‑world admissions involve base rates (the proportion of applicants who would succeed if admitted, irrespective of their test scores), institutional quotas, and the costs of false positives (admitting a student who will perform poorly) versus false negatives (rejecting a student who would have performed well). By converting GRE scores into a binary admission decision and calculating misclassification rates, the authors demonstrate that the GRE discriminates against ethnic and socioeconomic minorities and, paradoxically, can produce more erroneous admissions decisions than a random selection policy.
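A toy sketch of that accounting (all parameters illustrative, not taken from the paper): a latent "would succeed" status is crossed with admit/reject decisions from a score cutoff and from a random policy operating under the same quota.

```python
import numpy as np

rng = np.random.default_rng(1)

def error_rates(base_rate, quota, validity, n=200_000):
    """Misclassification rates of test-based vs. random admission."""
    aptitude = rng.standard_normal(n)            # latent ability to succeed
    noise = rng.standard_normal(n)
    score = validity * aptitude + np.sqrt(1 - validity**2) * noise
    succeeds = aptitude > np.quantile(aptitude, 1 - base_rate)  # would succeed if admitted
    by_test = score > np.quantile(score, 1 - quota)             # admit the top-scoring quota
    by_chance = rng.permutation(by_test)                        # same quota, no information
    err = lambda admitted: np.mean(admitted != succeeds)        # false admits + false rejects
    return err(by_test), err(by_chance)

# High base rate, tight quota: both policies misclassify most applicants,
# because a .20 quota forces rejection of most of the .80 who would succeed.
print(error_rates(base_rate=0.80, quota=0.20, validity=0.20))
```

The point of the exercise is that the error rate is driven largely by the base rate and the quota, quantities a correlation coefficient never sees; it is this kind of accounting, applied to group-specific base rates, that underlies the paper's conclusions.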

In sum, the authors conclude that the inflated GRE validities reported by ETS are an artifact of the PDA pooling method, which magnifies upward bias especially when (a) departmental sub‑samples are small and (b) true validities are modest. The paper calls for abandoning PDA in favor of direct aggregation of all scores, for careful scrutiny of restriction‑of‑range assumptions, and for evaluating test usefulness through misclassification analysis rather than abstract correlations. Such reforms would yield a more honest assessment of the GRE’s predictive power and help avoid policies that unintentionally disadvantage under‑represented groups.

