Tuesday, May 07, 2013

Probability Analysis to Improve Scientific Peer Review


Use of a proposed Bayesian approach to the information from a sequential peer review process may improve the quality of decisions and reduce the effort required of peer reviewers. The process can be carried out with a limited amount of data, relatively easily obtained from an existing panel review process.

I have a suggestion for an alternative way to do peer review. It is applicable to organizations, such as the National Science Foundation and the National Institutes of Health, that regularly review large numbers of proposals with a significant number of reviewers per proposal.

As a preliminary for this method, the organization would ask that reviewers, in addition to their normal review tasks, provide a rating for each research proposal. The rating would be the subjective probability that the proposal would merit funding when the research was completed. Since these would be subjective probabilities, the reviewer might simply specify whether that probability is between 0.0 and 0.1, 0.1 and 0.2, 0.2 and 0.3, and so on up to 0.9 and 1.0. The results of this exercise would be a table such as the following:

Table 1

Each column corresponds to a range of average subjective probabilities that a proposal would merit funding: a proposal is assigned to a column if the average of its reviewers' probabilities, taking each rating at the midpoint of its assigned range, falls in that range. Each row corresponds to the proposals given the rating shown in the left hand column. Thus, of the 68 ratings of 0.0 to 0.1 given to proposals by reviewers, 51 corresponded to proposals with an average value between 0.0 and 0.1, 11 to proposals with an average value between 0.1 and 0.2, 5 to proposals with an average value between 0.2 and 0.3, and one to a proposal with an average value between 0.3 and 0.4. None of the ratings in this lowest range was given to a proposal whose average fell in a column further to the right.
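As an illustration, here is a minimal sketch in Python of how such a count table might be assembled. The function names and the data layout are assumptions of mine for illustration, not part of any agency's existing process:

    import numpy as np

    N_BINS = 10   # the ten ranges 0.0-0.1, 0.1-0.2, ..., 0.9-1.0

    def bin_index(p):
        # Map a probability in [0, 1] to its range index 0..9.
        return min(int(p * N_BINS), N_BINS - 1)

    def build_table1(proposals):
        # proposals: one list of reviewer range indices (0..9) per proposal.
        # Rows of the result: the range a single reviewer assigned.
        # Columns: the range of the proposal's average rating, with each
        # rating taken at the midpoint of its assigned range.
        counts = np.zeros((N_BINS, N_BINS), dtype=int)
        for reviewer_bins in proposals:
            midpoints = [(b + 0.5) / N_BINS for b in reviewer_bins]
            col = bin_index(float(np.mean(midpoints)))
            for row in reviewer_bins:   # every individual rating adds a count
                counts[row, col] += 1
        return counts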

From this table a second table would be derived such as the following:

Table 2

In the bottom row, labeled P(A), each entry is the probability of a proposal falling in the specified range of subjective probabilities, as estimated from the sample of proposals reviewed. Each probability is estimated by dividing the total at the bottom of the corresponding column of Table 1 by the grand total in its bottom right corner.

Given that the survival of research laboratories depends on writing proposals for worthwhile research projects, it seems likely that the distribution would be weighted toward good proposals. I have shown such values in the sample tables (which are manufactured for illustration, not based on real data).

Similarly, the right hand column of Table 2, labeled P(B), gives the probability of a reviewer assigning the specified rating to a randomly selected proposal. It is calculated from Table 1 by dividing the corresponding row total in its right hand column by the grand total in its bottom right corner.

Finally, the individual entries in the table, which might be labeled P(B|A), are the estimated probabilities that a reviewer assigns the rating in the given row, given that the average of the reviews falls in the range of the given column. Each is estimated by dividing the corresponding entry of Table 1 by the total of its column.
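Continuing the sketch, the three quantities in Table 2 could be derived from the Table 1 counts as follows; the function name and array layout are again my own illustration:

    import numpy as np

    def build_table2(counts):
        # counts: the Table 1 matrix (rows = reviewer rating range,
        # columns = range of the proposal's average rating).
        total = counts.sum()
        p_a = counts.sum(axis=0) / total   # bottom row: P(A) for each column
        p_b = counts.sum(axis=1) / total   # right hand column: P(B) for each row
        col_totals = counts.sum(axis=0)
        # P(B|A): probability of the row's rating given the column's average,
        # i.e. each Table 1 entry divided by its column total.
        with np.errstate(divide='ignore', invalid='ignore'):
            p_b_given_a = np.where(col_totals > 0, counts / col_totals, 0.0)
        return p_a, p_b, p_b_given_a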

Now consider Bayes Theorem:

    P(A|B) = P(B|A) · P(A) / P(B)

where A is the event that the proposal's average rating falls in a given range and B is the event that a single reviewer assigns a given rating.
Consider then a process by which a new proposal is received. It is assigned to a single reviewer, who assigns it to a probability range. The a priori probabilities of a random proposal falling in the specified categories are given by the bottom row of Table 2. Thus there are 10 a priori probabilities for the 10 different values A1, A2, ..., A10.

The a posteriori probability of each category is given by the equation above, using the values of P(B) and P(B|A) from Table 2.

Since the reviewers can be assumed to be independent, a second reviewer's rating can be used to produce a new a posteriori set of values from the first by the same procedure.
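A minimal sketch of this sequential updating, assuming the p_a and p_b_given_a arrays produced by the build_table2 sketch above:

    import numpy as np

    def update_posterior(prior, p_b_given_a, rating_bin):
        # One application of Bayes Theorem. prior: length-10 vector of
        # P(A) values; p_b_given_a: the 10x10 P(B|A) matrix from Table 2;
        # rating_bin: index 0..9 of the new reviewer's rating.
        likelihood = p_b_given_a[rating_bin, :]   # P(B|A) across the columns
        unnormalized = likelihood * prior
        # The normalizing constant equals P(B) under the current prior.
        return unnormalized / unnormalized.sum()

    def posterior_after_ratings(p_a, p_b_given_a, rating_bins):
        # Fold in any number of (assumed independent) reviewer ratings.
        post = np.asarray(p_a, dtype=float)
        for b in rating_bins:
            post = update_posterior(post, p_b_given_a, b)
        return post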

So What? A Screening Procedure

One could have a single reviewer evaluate a single proposal, giving his/her rating in terms of the subjective probability that the project would be worthwhile. With a conditional probability table such as that shown in Table 2, one could then use Bayes Theorem to calculate the a posteriori probability of each range of average ratings.

For example, if the first proposal review were a rating of 0.9 to 1.0, using the data in Table 2 one would calculate that the a posteriori probabilities were:
  • range 0.9 to 1.0: 83%
  • range 0.8 to 0.9: 13%
  • range 0.7 to 0.8: 3%
  • range 0.6 to 0.7: 0.5%
  • zero for all the other ranges.
That single review might be sufficient to accept the proposal. Suppose, however, that a second review of the proposal were obtained. The probabilities of the new rating could be calculated from the a posteriori probabilities given above and the data in Table 2. Thus, in the example, the probabilities of the second rating, given that the first was 0.9 to 1.0, would be:
  • probability of rating 0.9 to 1.0: 63%
  • probability of rating 0.8 to 0.9: 23%
  • probability of rating 0.7 to 0.8: 11%
  • probability of rating 0.6 to 0.7: 2.5%
  • probability of other ratings would be zero or negligibly small.
Thus a relatively small number of reviews would probably be sufficient to determine whether the proposal should be accepted or not.
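The predictive step used in this example is the sum, over the average-rating categories, of P(B|A) weighted by the current a posteriori probabilities. A minimal sketch of that computation, together with a simple stopping rule (both the rule and the threshold value are placeholders of my own, not part of the proposal above):

    import numpy as np

    def next_rating_distribution(posterior, p_b_given_a):
        # P(next rating = i) = sum over j of P(B_i | A_j) * P(A_j | data).
        return p_b_given_a @ posterior

    def screen(p_a, p_b_given_a, ratings, threshold=0.8):
        # Stop as soon as the posterior mass in the top range (0.9 to 1.0)
        # clears the hypothetical acceptance threshold.
        post = np.asarray(p_a, dtype=float)
        for k, b in enumerate(ratings, start=1):
            post = p_b_given_a[b, :] * post
            post /= post.sum()
            if post[-1] >= threshold:
                return 'accept', k, post
        return 'undecided', len(ratings), post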

Once data had been gathered to create a table of conditional probabilities such as that shown in Table 2, it would be a relatively straightforward job to create software to update the a posteriori probabilities for every proposal. All the user of the software would need to do is input each new rating to see the updated probabilities.

Refinements

At one point I looked at the correlation between ratings by reviewers and discovered that some individuals' ratings were more closely correlated with the average review than others'. Indeed, I found one reviewer whose ratings were negatively correlated with the consensus of the other reviewers. Thus, collecting data of the kind suggested might help in judging whether individuals should be invited to continue providing reviews.
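A minimal sketch of that check, assuming the ratings are arranged as a proposals-by-reviewers array (a layout of my own choosing):

    import numpy as np

    def consensus_correlations(rating_matrix):
        # rating_matrix: proposals x reviewers array of ratings, with
        # np.nan where a reviewer did not review a proposal. Returns,
        # for each reviewer, the correlation between their ratings and
        # the mean of the other reviewers' ratings on the same proposals.
        n_proposals, n_reviewers = rating_matrix.shape
        corrs = np.full(n_reviewers, np.nan)
        for r in range(n_reviewers):
            mine = rating_matrix[:, r]
            others = np.delete(rating_matrix, r, axis=1)
            consensus = np.nanmean(others, axis=1)  # assumes >= 2 reviews each
            mask = ~np.isnan(mine) & ~np.isnan(consensus)
            if mask.sum() > 2:                      # need a few shared proposals
                corrs[r] = np.corrcoef(mine[mask], consensus[mask])[0, 1]
        return corrs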

It is also likely that there will be consistent biases in the reviews of individual reviewers. Some reviewers, for example, have been relatively unwilling to give either very high or very low ratings to proposals, preferring to give middle-range ratings. Others are more willing to call a proposal either very good or very poor. Given such knowledge, the ratings given by reviewers might be adjusted to be more comparable to a standard distribution.
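One possible way to carry out such an adjustment, offered only as a sketch and as one reading of "adjusted to a standard distribution", is plain quantile matching against the pooled ratings of all reviewers:

    import numpy as np

    def standardize_reviewer(ratings, pooled_ratings):
        # Replace a reviewer's k-th smallest of n ratings with the
        # k/(n+1) quantile of the pooled ratings of all reviewers,
        # so every reviewer's adjusted ratings follow the same
        # reference distribution.
        ratings = np.asarray(ratings, dtype=float)
        ranks = ratings.argsort().argsort() + 1     # ranks 1..n
        probs = ranks / (len(ratings) + 1.0)
        return np.quantile(pooled_ratings, probs)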
