Do we measure novelty when we analyze unusual combinations of cited references? A validation study of bibliometric novelty indicators based on F1000Prime data


Lee, Walsh, and Wang (2015) - based on Uzzi, Mukherjee, Stringer, and Jones (2013) - and Wang, Veugelers, and Stephan (2017) proposed scores based on cited-references (cited-journals) data that can be used to measure the novelty of papers (referred to in this study as novelty scores U and W). Although previous research has used novelty scores in various empirical analyses, to the best of our knowledge no study published to date has quantitatively tested the convergent validity of novelty scores: do these scores measure what they purport to measure? Using novelty assessments by faculty members (FMs) at F1000Prime for comparison, we tested the convergent validity of the two novelty scores (U and W). FMs’ assessments refer not only to the quality of biomedical papers but also to their characteristics (by assigning certain tags to the papers): for example, are the presented findings or formulated hypotheses novel (tags “new finding” and “hypothesis”)? We used these and other tags to investigate the convergent validity of both novelty scores. Our study reveals different results for the two scores: the results for novelty score U are mostly in agreement with previously formulated expectations. We found, for instance, that for a one-standard-deviation increase in novelty score U, the expected number of assignments of the “new finding” tag increases by 7.47%. The results for novelty score W, however, do not reflect convergent validity with the FMs’ assessments: only the results for some tags are in agreement with the expectations. Thus, based on our results, we propose the use of novelty score U for measuring novelty quantitatively, but question the use of novelty score W.


💡 Research Summary

The paper investigates whether bibliometric novelty indicators based on unusual combinations of cited references truly capture the novelty of scientific papers. Specifically, it validates two scores: the “U” score (derived from Uzzi et al., 2013 and revised by Lee et al., 2015) and the “W” score (proposed by Wang et al., 2017). Both metrics quantify how often a paper cites pairs of journals that rarely appear together in the literature, but they differ in computation: U is the log‑ratio of observed to expected co‑citation frequencies, while W is a standardized z‑score of co‑citation frequencies.
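The pair-based idea behind both scores can be sketched in a few lines of Python. Everything below is illustrative: the journal lists are toy data, and the closed-form expectation stands in for the Monte Carlo network-rewiring baseline that the published indicators actually use.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: each paper is represented by the set of journals it cites.
papers = [
    {"Nature", "Science", "Cell"},
    {"Nature", "Science"},
    {"Nature", "Cell"},
    {"Science", "Cell"},
    {"Nature", "Science", "PLoS ONE"},
]

pair_counts = Counter()     # observed co-citations per unordered journal pair
journal_counts = Counter()  # how often each journal is cited overall
for cited in papers:
    journal_counts.update(cited)
    pair_counts.update(combinations(sorted(cited), 2))

n_pairs = sum(pair_counts.values())
n_cites = sum(journal_counts.values())

def expected(pair):
    # Expected pair frequency if journals were combined at random in
    # proportion to their overall citation shares (a simplified stand-in
    # for the rewired-network baseline used in the literature).
    j1, j2 = pair
    return 2 * n_pairs * (journal_counts[j1] / n_cites) * (journal_counts[j2] / n_cites)

def u_like(pair):
    # U-style score: log-ratio of observed to expected co-citation
    # frequency; low (negative) values mark atypical pairings.
    return math.log(pair_counts[pair] / expected(pair))

print(u_like(("Nature", "Science")))
```

A paper-level score would then aggregate over that paper's journal pairs, for example by taking a tail percentile of its pair-score distribution; a W-style score would instead standardize the observed counts against the baseline as a z-score.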

To assess convergent validity, the authors compare these scores with expert evaluations from F1000Prime, a post‑publication peer‑review platform in the biomedical domain. Faculty Members (FMs) not only rate papers with a 1‑3 star score but also assign qualitative tags such as “new finding”, “hypothesis”, “controversial”, “confirmation”, etc. The tags in bold in the paper (e.g., “new finding”, “hypothesis”, “novel drug target”) are interpreted as indicators of novelty, whereas others are expected to show zero or negative correlations.

The dataset comprises 2,534 papers published between 2012 and 2018 that were selected by F1000Prime. For each paper the authors extracted the list of cited journals, computed U and W, and recorded the number of times each novelty-related tag was assigned. Because tag counts are over-dispersed count data, the authors employed negative binomial regression models, controlling for publication year, journal impact, and article type. The exponentiated regression coefficients, exp(β) − 1, give the percentage change in expected tag counts for a one-standard-deviation increase in the novelty score.
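The choice of a negative binomial model over a plain Poisson model follows from over-dispersion: a Poisson model forces the variance to equal the mean, while tag counts vary far more than that. A quick stdlib check on hypothetical counts illustrates the diagnostic (the numbers are made up; the paper's actual counts are not reproduced here):

```python
from statistics import mean, pvariance

# Hypothetical per-paper counts of one F1000Prime tag (illustrative only):
# most papers receive the tag zero times, a few receive it many times.
tag_counts = [0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 1, 0, 0, 8, 0, 0, 3, 0, 0, 1]

m = mean(tag_counts)       # 1.05
v = pvariance(tag_counts)  # ≈ 4.15
# Variance is roughly 4x the mean; this over-dispersion is what
# motivates a negative binomial rather than a Poisson model.
print(f"mean={m:.2f}, variance={v:.2f}, ratio={v / m:.2f}")
```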

Results for the U score are consistent with theoretical expectations. A one-standard-deviation increase in U is associated with a 7.47% rise in the expected number of “new finding” tags (β = 0.072, p < 0.01). Similar positive effects are observed for “hypothesis” (β ≈ 0.058) and “novel drug target” (β ≈ 0.045). These findings suggest that papers with higher U scores are more likely to be judged by experts as containing novel results or ideas.
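The 7.47% figure follows mechanically from the reported coefficient: negative binomial regression uses a log link, so a coefficient β on a standardized predictor multiplies the expected count by exp(β). A one-line check:

```python
import math

beta = 0.072  # reported coefficient for "new finding" per one-SD increase in U

# With a log link, exp(beta) is the incidence-rate ratio, and
# exp(beta) - 1 is the percentage change in the expected tag count.
pct_change = (math.exp(beta) - 1) * 100
print(f"{pct_change:.2f}%")  # → 7.47%
```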

In contrast, the W score shows weak or absent relationships with most novelty tags. The coefficient for “new finding” is not statistically different from zero, and the only significant positive association appears for the “controversial” tag, which is ambiguous regarding novelty. Moreover, W exhibits a negative relationship with the “confirmation” tag, contrary to the expectation of zero correlation. These patterns indicate that W does not reliably reflect expert judgments of novelty.

The authors discuss why the two metrics behave differently. U directly measures the rarity of journal pairings relative to a random baseline, making it a straightforward proxy for “unusualness”. W, however, standardizes co‑citation frequencies across the entire dataset, rendering it sensitive to field‑specific citation practices and temporal trends; this may dilute its ability to capture genuine novelty. Additionally, the assumption that any rare combination automatically signals innovative content is challenged by the empirical findings for W.

Limitations are acknowledged. F1000Prime covers only a small, non‑random subset of biomedical literature, and the expert tags, while informative, are subjective and may vary across faculty members. The reliance solely on cited journals ignores other knowledge sources (e.g., patents, conference proceedings) that could contribute to novelty. The authors suggest future work to integrate multiple data sources and to test the indicators in other disciplines.

In conclusion, the study provides robust evidence that the U novelty score possesses convergent validity with expert assessments of novelty and can be recommended for quantitative studies of scientific creativity. The W score, however, fails to demonstrate consistent validity and should be refined or replaced before being used for evaluative purposes. The paper underscores the importance of empirically validating bibliometric indicators against peer judgments before they are adopted in research assessment or policy contexts.

