Saturday, May 11, 2013

From Estimating Probabilities of Ratings to Figures of Merit and Rankings

Two previous posts advocated a quantitative approach to peer review based on probability theory, Bayes' rule and information theory. This post focuses on a figure of merit for ranking submissions and on a sequential process that seeks to maximize information where it is most needed in deciding among submissions.
This is the third in a series of posts on quantitative approaches to reviewing scientific proposals and publications.
A funding agency might have to select 150 proposals out of 1000 submitted. When I was involved in that kind of decision making, we seemed to find ourselves dividing the proposals into three groups on the basis of peer review:
  • Those so highly rated that they clearly were to be funded;
  • Those rated so low that they clearly were not to be funded;
  • A third group near the "cut off line" which might either be funded or not funded.
It doesn't much matter whether a proposal is ranked first or fifth in a set of 1000 if one is to fund 150; in either case it would be funded and the ranking is purely an internal aid to decision making. Nor does it matter much whether a proposal is ranked 700th or 800th as in either case it would not be funded. But it would matter a great deal whether it were ranked 150th or 151st. That difference might determine which of the two was funded and which was not, with a significant influence on the careers of the scientists involved.

Note, however, that the ratings of proposals are subjective judgments. The rankings depend on estimates of what is likely to happen in the future if the research is funded. The judgment is made on the basis of a research proposal, and all research proposals are approximations. Moreover, reviewers are always less interested in the proposals that they are reviewing than in their own work, and are usually busy with other responsibilities. The uncertainty about the outcomes of the 150th and 151st proposals is almost certainly greater than the actual difference between those potential outcomes.

Still it is useful to have a defined procedure with quantitative indicators to formalize decision making. Such a procedure can be satisfying to both those managing the review process and those submitting proposals.
Consider the use of a figure of merit for proposals. I suppose that the standard approach would be to use the average of reviewer ratings for a proposal.

In the previously described procedure of sequential independent peer reviews, one might instead use the sum, over all possible ratings, of each rating's value times its probability for a given proposal (that is, the expected rating). In the example, with ten possible ratings, they might be assigned values one through ten. The highest possible figure of merit then would be 10, were a proposal to have a 100% probability of the highest rating; the lowest possible figure of merit would be 1, were a proposal to have a 100% probability of the lowest rating. As described in the previous posts, such an indicator would incorporate more information on the reviewers and the correlations among ratings than would a simple average of ratings.
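
As a minimal sketch of that calculation, the figure of merit is just a probability-weighted sum. The function name `figure_of_merit` and the example distributions are my own illustrations, and I assume the ratings carry the values 1 through 10 as in the example.

```python
def figure_of_merit(probs):
    """Expected rating: the sum over ratings of (rating value * probability).

    probs[i] is the estimated probability that the proposal merits
    rating i + 1, so ten ratings carry the values 1 through 10.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("rating probabilities must sum to 1")
    return sum(value * p for value, p in enumerate(probs, start=1))


# Illustrative distributions (not data from any real review).
certain_top = [0.0] * 9 + [1.0]      # 100% probability of the highest rating
certain_bottom = [1.0] + [0.0] * 9   # 100% probability of the lowest rating
uniform = [0.1] * 10                 # no information at all

print(figure_of_merit(certain_top))     # the maximum, 10
print(figure_of_merit(certain_bottom))  # the minimum, 1
print(figure_of_merit(uniform))         # about 5.5, the midpoint
```

The two extreme cases reproduce the bounds of 10 and 1 given above; everything else falls in between.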

At any point in the review process, proposals could be rank ordered by their figure of merit.
  • Before the first reviews, all proposals would have the same figure of merit, since all would be characterized by the same a priori probability distribution of proposals over ratings.
  • After one review was received for each proposal, all proposals would have one of ten values of the figure of merit, since the first review could have only one of ten ratings. However, at that point the probabilities of the values of the ratings for the second review could be estimated. These would form something like a distribution around the actual rating received. The proposals could then (if desired) be shown on a graph. The X axis would be the rank order of the proposal; the Y axis would be the figure of merit and the possible figures of merit after the second round of reviews.
  • After the second round of reviews there would be more values of the figure of merit for proposals and narrower bands of potential values of the figure of merit. These too might be graphed.
  • Eventually the graph of the figure of merit versus rank order of the proposal would appear almost continuous.
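
The first two steps in the list above can be sketched with Bayes' rule. This is only an illustration: the reviewer model (the `likelihood` function and its `spread` parameter) is an assumption I made up for the sketch, not the model from the earlier posts, and the prior is taken to be uniform purely for simplicity.

```python
def figure_of_merit(probs):
    """Expected rating, with ratings valued 1..len(probs)."""
    return sum(value * p for value, p in enumerate(probs, start=1))


def likelihood(review, true_rating, n, spread):
    """ASSUMED reviewer model: the chance of reporting `review` falls off
    geometrically with its distance from the rating the proposal merits."""
    weights = [spread ** abs(s - true_rating) for s in range(1, n + 1)]
    return weights[review - 1] / sum(weights)


def posterior(prior, review, spread=0.5):
    """Bayes' rule: P(rating | review) is proportional to
    P(review | rating) * P(rating)."""
    n = len(prior)
    unnormalized = [
        likelihood(review, r, n, spread) * prior[r - 1] for r in range(1, n + 1)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


prior = [0.1] * 10  # ASSUMPTION: uniform a priori distribution over ratings

# Before any review, every proposal shares the prior's figure of merit.
fom_before = figure_of_merit(prior)

# After one review, the figure of merit takes one of ten values,
# one for each rating the first reviewer could have given.
fom_after = [figure_of_merit(posterior(prior, review)) for review in range(1, 11)]
for review, fom in enumerate(fom_after, start=1):
    print(f"first review = {review:2d} -> figure of merit = {fom:.2f}")
```

Under this assumed model a single review pulls the figure of merit toward the rating given but not all the way to it, which is why later reviews can still move a proposal up or down the rank order.
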
At an early point in the process it would become apparent that no further reviews would be needed for some proposals. Those with very low estimated figures of merit could be eliminated from further reviewing, since they clearly would not be funded. So too, eventually, some proposals would have figures of merit so high, and so little variance about them, that they would surely be funded. Evaluation effort could then focus more on the proposals still in doubt.

For example, while a couple of reviews might be required for all the proposals, perhaps a third review might be needed for only half of them, and further reviews for smaller and smaller portions of the field. Such a procedure would greatly reduce the demand on reviewers.
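
One way such a triage step might look in code is sketched below. The decision rule (fund or reject only when the figure of merit is at least `k` standard deviations clear of the cut-off) and the names `triage`, `n_fund` and `k` are assumptions of mine for illustration; an agency would choose its own rule and thresholds.

```python
from math import sqrt


def mean_and_sd(probs):
    """Figure of merit (expected rating) and the standard deviation about it."""
    mean = sum(v * p for v, p in enumerate(probs, start=1))
    var = sum((v - mean) ** 2 * p for v, p in enumerate(probs, start=1))
    return mean, sqrt(var)


def triage(distributions, n_fund, k=2.0):
    """Sort proposals into 'fund', 'reject' and 'review again'.

    ASSUMED rule: the cut-off lies midway between the figures of merit of
    the last proposal inside the funding line and the first one outside it.
    Requires 0 < n_fund < number of proposals.
    """
    stats = {name: mean_and_sd(p) for name, p in distributions.items()}
    ranked = sorted(stats, key=lambda name: stats[name][0], reverse=True)
    cutoff = (stats[ranked[n_fund - 1]][0] + stats[ranked[n_fund]][0]) / 2
    decisions = {}
    for name, (mean, sd) in stats.items():
        if mean - k * sd > cutoff:
            decisions[name] = "fund"          # clearly above the line
        elif mean + k * sd < cutoff:
            decisions[name] = "reject"        # clearly below the line
        else:
            decisions[name] = "review again"  # still in the competitive range
    return decisions


# Illustrative rating distributions over ten ratings (made-up numbers).
proposals = {
    "A": [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.9],
    "B": [0.9, 0.1, 0, 0, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0.25, 0.5, 0.25, 0, 0, 0],
    "D": [0, 0, 0, 0.25, 0.5, 0.25, 0, 0, 0, 0],
}
decisions = triage(proposals, n_fund=2)
print(decisions)
```

With two proposals to fund, A and B are settled at once, and only C and D, which straddle the cut-off, would be sent out for another review.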

As a result, the range of likely values of the figure of merit would be broader far from the competitive range (where relatively few reviews were used) and narrower in the competitive range (where there would be more reviews).

Thus, the effect would be to get more and more precision in the figure of merit for the proposals in the competitive range.

Note too that, in a final selection, one could look at the probability distributions of the ratings of the borderline proposals to be sure that there was a suitably high probability that the proposals being funded merited a higher rating than the proposals being rejected.
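
That final check might be sketched as follows. I assume here that the rating distributions of the two proposals are independent, which is a simplification, and the function name `prob_higher` and the numbers are illustrative only.

```python
def prob_higher(probs_a, probs_b):
    """Probability that proposal A merits a strictly higher rating than B.

    ASSUMPTION: the two rating distributions are independent, so the joint
    probability of any pair of ratings is the product of the two marginals.
    """
    total = 0.0
    for a, pa in enumerate(probs_a, start=1):
        for b, pb in enumerate(probs_b, start=1):
            if a > b:
                total += pa * pb
    return total


# Two borderline proposals with made-up distributions over ten ratings.
funded = [0, 0, 0, 0, 0.25, 0.5, 0.25, 0, 0, 0]    # centered on rating 6
rejected = [0, 0, 0, 0.25, 0.5, 0.25, 0, 0, 0, 0]  # centered on rating 5

p = prob_higher(funded, rejected)
print(f"P(funded proposal merits the higher rating) = {p:.4f}")
```

In this made-up case the probability is only about 0.69, so a manager wanting more assurance than that could ask for further reviews of both proposals before deciding.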
