Using evidence quality to restructure incentives in science
Metrics will inevitably bias science, so let’s switch to metrics that reflect what science should be biased by
This is my entry for the Astera 2026 Essay Competition: Identifying Systemic Bottlenecks to Science
Tenure was supposed to free me from the metrics that distort science. No longer would citations and publication counts determine whether I remained employed. Instead, I found that simply wanting to help my trainees succeed brought the pressures of these science metrics back. Taking the time to replicate a study? Not when my graduate student will be looking for a postdoc position soon. Adopting that new, highly rigorous method to better avoid false positive results? Not when my postdoc’s competitors aren’t using it and he needs positive results for a publication before going on the job market.
There is apparently no escaping the metrics of science.
Given that metrics will always bias science (we will always need ways to evaluate scientific contributions), what if we developed metrics that better reflect what science should be biased by?
I propose the development of a multi-factor evidence quality index (EQI). This is a substantial update of my 2022 proposal of the same name. I will first detail this metric, then describe an experiment to test the metric’s ability to positively impact systemic incentives in science.
The new EQI
EQI scores are designed to quantify the contribution of a study to scientific understanding. Accordingly, EQI scores are based on a study’s features and their relationship to its scientific claims. A scientist’s EQI scores are the average of the EQI scores of their studies.
EQI-Mr (methods-reliability) quantifies the likelihood that, based on the methods used by a study, its results would replicate.
EQI-Mv (methods-validity) quantifies the degree to which the study’s methods and logic support the study’s scientific claims. EQI-Mr and EQI-Mv can be averaged into a single EQI-M score.
EQI-E (evidence) quantifies the Bayesian evidence for the study’s scientific claims based on the current study’s results and the results of other studies (the priors). Evidence strength is based on results’ effect sizes and statistical certainty, modulated by their EQI-M scores. EQI-E scores are ultimately the quantities that scientists should be trying to estimate accurately.
EQI-C (contribution) quantifies the contribution of the current study to EQI-E scores, reflecting the amount the EQI-E scores for the study’s scientific claims increased or decreased as a result of the current study. EQI-C incentivizes both novel studies (i.e., contributing to claims with no EQI-E score) as well as replication studies (e.g., contributing certainty to claims with low EQI-E scores).
Seeing a study’s EQI profile rapidly provides clear information regarding 1) the methodological quality of the study (EQI-M), 2) the extent to which the study’s claims are supported by other studies (EQI-E), and 3) how much the study contributes evidence for its scientific claims (EQI-C).
EQI scores range from 0 to 100, are assigned immediately upon publication, and are updated annually.
While these scores are challenging to compute, they are designed to be close to what science is supposed to be optimizing for: finding valid and reliable evidence for (or against) scientific claims.
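To make the shape of these scores concrete, here is a minimal sketch in Python of how a study’s EQI profile and the scientist-level average might be represented. All names, fields, and example values are hypothetical illustrations, not a specification.

```python
from dataclasses import dataclass

@dataclass
class EQIProfile:
    """Hypothetical EQI profile for a single study (all scores 0-100)."""
    mr: float  # EQI-Mr: likelihood the study's results would replicate
    mv: float  # EQI-Mv: how well the methods and logic support the claims
    e: float   # EQI-E: aggregate Bayesian evidence for the study's claims
    c: float   # EQI-C: this study's contribution to EQI-E scores

    @property
    def m(self) -> float:
        """EQI-M: the average of the two methods sub-scores."""
        return (self.mr + self.mv) / 2

def scientist_eqi_m(studies: list[EQIProfile]) -> float:
    """A scientist's EQI-M score: the average over their studies."""
    return sum(s.m for s in studies) / len(studies)

# Example: a large, preregistered RCT with strong methods
study = EQIProfile(mr=90, mv=85, e=75, c=60)
print(f"EQI-M = {study.m:.1f}")  # 87.5
```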
Example EQIs and an evidence quality map (EQM)
EQIs are fundamentally relational, as EQI-E and EQI-C scores are based on relationships between multiple studies and their scientific claims. Accordingly, EQIs are accompanied by evidence quality maps (EQMs) that represent the relations between studies and their claims in knowledge graphs. Unlike standard knowledge graphs, however, EQI scores contextualize relations in terms of scientific rigor, improving the scientific inferences that can be made.
I have chosen an example publication from outside my area of research because it dramatically illustrates some of the deep issues with scientific practices, but also science’s redeeming quality of self-correction. An EQI-informed EQM illustrates this nicely.
When my daughter was born in 2017 I saw conflicting guidance on whether to expose her to peanuts to prevent a peanut allergy or, rather, to avoid exposing her to peanuts to prevent a peanut allergy. Being a scientist, I looked at the immunology literature to find out what was going on. It was challenging, as I am a neuroscientist by training. I was nonetheless able to determine that recent studies had overturned decades of guidance to avoid food allergens during infancy. Instead, it was clear that it is much better to expose infants to potential allergens to help them avoid developing allergies. Not only was the prior research invalidated, but the conclusion was the opposite of what decades of research had suggested!
As shown in this interactive figure of a rough draft EQM, consistent evidence for avoiding allergens was eventually overturned by a later study with an extremely high EQI-M. This study (Du Toit et al. 2015; node S11 in the figure) had a higher EQI-M because it had several high EQI-Mr features (large sample size, preregistration, high subject retention, oral food challenge assessment) and several high EQI-Mv features (randomized controlled trial, double-blind, clear alignment between the study’s results and claims).
The EQM not only highlights this crucial study, but also puts the study in context. We can see that older studies supported the “Delayed introduction of allergenic foods prevents food allergy” claim (node C1 in the figure), while Du Toit et al. (2015) found strong evidence against that claim. Moreover, we can see that studies following Du Toit et al. (2015) also found evidence against the “Delayed introduction” claim. Instead, Du Toit et al. (2015) and these later studies supported the opposite claim: “Early oral peanut introduction prevents peanut allergy” (node C5 in the figure).
Crucially, putting the studies in context within an EQM allows us to compute EQI-E and EQI-C scores. EQI-E scores are computed for each claim node based on all incoming links, with the resulting score quantifying the likelihood that a claim is true given the aggregate evidence across the linked studies. These EQI-E scores are then assigned to each of those studies. Each study’s EQI-C score is computed based on that study’s contribution (with precedence given to earlier studies) to the current EQI-E score of each of that study’s claims. Thus, since Du Toit et al. (2015) had a massive impact on the EQI-E score for the “Delayed introduction” claim, its EQI-C score is very high.
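As a toy illustration of these computations, consider the peanut-allergy example above. The sketch below aggregates per-study evidence into a claim’s EQI-E using a log-odds rule chosen purely for illustration (the real aggregation would be more sophisticated), and derives each study’s EQI-C from how much the claim’s EQI-E moved when that study arrived, in publication order. All numbers are invented.

```python
import math

def eqi_e(studies):
    """Toy EQI-E for one claim, aggregating evidence across linked studies.

    Each study is (direction, strength, eqi_m): direction is +1 if the
    study supports the claim and -1 if it opposes it; strength is a
    log-odds evidence weight derived from effect size and statistical
    certainty; eqi_m (0-100) modulates how much the study counts.
    """
    log_odds = sum(d * s * (m / 100) for d, s, m in studies)
    return 100 / (1 + math.exp(-log_odds))  # map log-odds to 0-100

# Claim C1: "Delayed introduction of allergenic foods prevents food allergy"
# Two early, weaker supporting studies, then Du Toit et al. (2015; S11),
# a high-EQI-M study that opposed the claim. Values are invented.
history = [(+1, 0.5, 40), (+1, 0.4, 35), (-1, 3.0, 95)]

# EQI-C = how much the claim's EQI-E moved when each study arrived,
# processed in publication order (precedence to earlier studies).
score = eqi_e([])
for i in range(len(history)):
    new_score = eqi_e(history[: i + 1])
    print(f"study {i + 1}: EQI-E {score:.1f} -> {new_score:.1f}, "
          f"EQI-C contribution ~{abs(new_score - score):.1f}")
    score = new_score
```

In this toy run, the third, high-EQI-M study swings the claim’s EQI-E from roughly 58 down to roughly 8, so it earns by far the largest EQI-C, mirroring the role of Du Toit et al. (2015) in the EQM.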
Note that the business of calculating EQI scores is very much the business of understanding the contributions of studies to cumulative scientific progress. This is what makes EQI a promising metric for quantifying and tracking the contributions of scientists to scientific progress.
Why EQIs now? EQI scalability and systemic incentives
Developing a robust EQI ecosystem is timely because, until the advent of modern large language models (LLMs), EQI estimation would not have been scalable. Rather than having humans manually identify key details of each study (e.g., what claims each study makes) and compute EQI scores, we can use LLMs, machine learning, and carefully engineered algorithms to automate the process. The plausibility of EQI/EQM scalability is illustrated by the fact that I used an LLM to autonomously (i.e., based on a high-level description of EQI/EQM principles) create the rough-draft interactive EQM figure. As for known issues with LLMs, the many inferential constraints imposed by the EQI estimation process are likely to substantially reduce hallucinations. Even if these inferential constraints are not enough, the use of EQMs provides a shared representational substrate – a map – that facilitates error correction by algorithms, scientists, and other LLMs.
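For concreteness, a first stage of such an automated pipeline might look like the following sketch. The function names and the structured-extraction prompt are hypothetical, and call_llm stands in for whichever LLM API is used; the point is that extracting into a constrained, checkable structure is what enables downstream error correction.

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for any LLM API call; the provider is an implementation detail."""
    raise NotImplementedError

def extract_eqm_entries(study_text: str) -> dict:
    """Hypothetical first stage of an automated EQI/EQM pipeline:
    ask an LLM for a constrained, structured summary of one study."""
    prompt = (
        "Return JSON describing this study: its scientific claims, the "
        "direction of evidence for each claim (supports/opposes), and the "
        "methods features relevant to EQI-Mr (e.g., sample size, "
        "preregistration, retention) and EQI-Mv (e.g., randomization, "
        "blinding, claim-result alignment).\n\n" + study_text
    )
    entries = json.loads(call_llm(prompt))
    # Downstream stages would validate these entries against the shared
    # EQM (claim-node matching, cross-study consistency checks), giving
    # algorithms, scientists, and other LLMs a substrate for correction.
    return entries
```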
Now that EQIs are technically scalable, is there likely to be demand that drives that scaling? Yes – EQIs are also timely because they align with broader systemic incentives, beyond the need for better metrics for evaluating scientists.
For instance, there is a strong need for better strategies to understand the scientific literature. When reading the scientific literature, it is difficult to determine 1) the methodological quality of a given study, 2) how much the study contributes evidence for scientific claims, and 3) whether the study’s claims are supported by other studies. All of these problems are severely amplified when a study is outside one’s very specific area of expertise. These are the exact issues addressed by EQI scores (EQI-M, EQI-C, and EQI-E, respectively). Notably, EQI scores can assist with understanding the contribution of studies to scientific understanding before, during, and after peer review.
Thinking beyond single studies, there are millions of new studies published every year, and EQIs and EQMs are designed to allow rapid assessment of not just single studies but entire collections of studies making related claims.
If, as I hypothesize, these strong needs drive strong demand for EQIs, then I expect EQIs to scale across multiple scientific fields. Once EQI is a known study metric for a given field, it will be natural to also use it to quantify the contributions of scientists.
Metrics that resist gaming
It has been said that “When a measure becomes a target, it ceases to be a good measure” [1]. For example, teachers may “teach to the test” rather than ensuring students actually understand the subject matter. Similarly, software developers may write bloated, inefficient code to meet a “productivity” quota measured in lines of code written. This kind of gaming of science metrics like citation count and number of publications has led to major distortions in science [2]. This was dramatically illustrated by Smaldino and McElreath (2016), whose simulation study showed that optimizing for today’s science metrics leads to worse scientific methodology over time.
EQIs are designed to strongly resist this kind of gaming. Indeed, by aligning EQIs closely enough with the target outcome (good science), creative attempts to maximize EQIs are likely to yield better outcomes for science. For instance, EQI-M’s inputs include features like preregistration and adequate sample size, such that “gaming” EQI-M is precisely the desired behavior. Further, EQI-E depends on cross-study convergence that individual authors typically cannot control.
Realistically, however, there are likely unforeseen ways to game EQI scores that would not be good for science. Because of this, EQI scores will be updated annually. This will allow retroactive modification of EQI scores, countering gaming not only in future studies but also in past ones. This updating also has the benefits of: 1) keeping EQI-E scores up to date as new studies change the evidence regarding each claim, and 2) keeping EQI-M scores accurate with respect to current best practices in each scientific field. These features incentivize replication studies (e.g., to increase EQI-E for one’s past studies) and method validation studies (e.g., to increase EQI-M for one’s past studies).
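Continuing the toy model from earlier, the annual update could amount to re-scoring every claim from scratch, which refreshes EQI-E and retroactively revises every study’s EQI-C in one pass. Again, this is a sketch under invented assumptions, reusing the illustrative eqi_e function above.

```python
def annual_update(eqm: dict) -> dict:
    """Toy annual re-scoring pass over an EQM.

    `eqm` maps each claim to its chronologically ordered studies (the
    same tuples used by the eqi_e sketch above). Recomputing each claim
    from scratch keeps EQI-E current as new evidence arrives and
    retroactively revises EQI-C, for past as well as future studies.
    """
    updated = {}
    for claim, studies in eqm.items():
        score, contributions = eqi_e([]), []
        for i in range(len(studies)):
            new_score = eqi_e(studies[: i + 1])
            contributions.append(abs(new_score - score))  # per-study EQI-C
            score = new_score
        updated[claim] = {"eqi_e": score, "eqi_c": contributions}
    return updated
```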
Testing the impact of EQIs on science’s incentive structure
As an initial step toward developing and scaling EQIs, I propose an experiment to test the efficacy of EQIs in altering science’s incentive structure. The focus of this experiment is on tenure-track science faculty hiring, since this process has an outsized impact on trainees vying for these jobs, as well as on those who pass this selection process.
In this initial experiment, a cohort of current scientists who have served on faculty hiring committees will be recruited. Using a randomized controlled trial design, participants will receive CVs of mock applicants either with EQI scores (group 1) or without them (group 2). The same mock applicants will be included for both groups, with standard science metrics (number of publications, citation counts, list of publications) included for both. Participants will be asked to rank all applicants and indicate which applicants should be hired. A survey will be used to assess their reasons for choosing each top-ranked applicant.
Hypothesis: EQI scores shift scientists’ perception of scientific contributions away from standard science metrics (e.g., number of publications, h-index) and toward each candidate’s contribution to valid and reliable evidence for scientific claims. This will be tested by comparing the two groups, with mock applicants with higher EQI scores expected to rank better in the EQI group. The stated reasons for the EQI-based rankings are expected to reflect scientific processes (e.g., which methods were used, what theoretical claim was addressed), whereas the reasons for the non-EQI-based rankings are expected to reflect more surface properties (e.g., how popular a study is).
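One plausible way to test the ranking prediction, sketched below with invented data, is a simple nonparametric comparison of the ranks a high-EQI mock applicant receives in the two groups; the actual analysis plan would of course be preregistered.

```python
from scipy.stats import mannwhitneyu

# Ranks assigned to one high-EQI mock applicant by each participant
# (lower rank = ranked better). All values are invented for illustration.
ranks_with_eqi = [1, 2, 1, 3, 2, 1, 2, 1]      # group 1: CVs include EQI scores
ranks_without_eqi = [4, 3, 5, 2, 4, 3, 5, 4]   # group 2: standard metrics only

# One-sided test: does the high-EQI applicant rank better when EQI is shown?
stat, p = mannwhitneyu(ranks_with_eqi, ranks_without_eqi, alternative="less")
print(f"Mann-Whitney U = {stat}, p = {p:.4f}")
```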
There is another possibility, however: scientists may already be making EQI-correlated inferences (e.g., based on the factors driving EQI scores). In that case, the primary hypothesis would be falsified because the non-EQI-based rankings would already reflect EQI scores. This would suggest that standard science metrics have less of an impact on hiring decisions than is typically assumed. Simply informing scientists of the current importance of EQI-correlated factors in hiring and promotion decisions might then lessen the distorting impact of standard science metrics (e.g., job candidates may be less likely to oversell their studies for citations), even without a shift to using EQI scores in hiring decisions.
Next steps toward scalable EQIs
We started with the problem of metrics distorting the science being done by trainees about to enter the job market. This contrasted with my vision of the scientific freedom I had hoped tenure would enable. This vision of science without the distorting effects of suboptimal metrics helped motivate the development of the EQI. These improved metrics hold the promise of a kind of freedom for those with a passion for contributing to humanity’s scientific understanding of the universe and little interest in surface features like citation counts and publication in prestige journals. EQIs have the potential to enable a future system in which a scientist doing the right thing for science is also the career-optimal choice.
I anticipate the next steps toward truly useful and scalable EQIs will involve public discussion, careful EQI development, and validation experiments like the one outlined above. While my lab is motivated to pursue EQI/EQM development, it is my hope that the strong need for the kind of broad reform that EQIs can enable will drive formation and funding of an independent institute to oversee EQI development, dissemination, and annual updating.