Friday, October 10, 2014

The New Tool of Big Data Raises New Interpretation Problems.


I came across an interesting article on the need for replication of scientific results. It quotes the great Irish scientist of the 17th century, Robert Boyle:
“Though the testimony of a single witness shall not suffice to prove the accused party guilty of murder; yet the testimony of two witnesses . . . shall ordinarily suffice to prove a man guilty,” Boyle wrote. In the same way, he said, a scientific experiment could not establish a “matter of fact” without multiple witnesses to its result.
We should now be more cautious about believing eyewitnesses in criminal proceedings, but Boyle's point was very good. Scientists should rely on peer review and replication to ensure the credibility of reported experimental results.

The article goes on to state:
Study after study finds that many scientific results are not reproducible, Heller and colleagues note. Experiments on the behavior of mice produce opposite results in different laboratories. Drugs that show promise in small clinical trials fail in larger trials. Genes linked to the risk of a disease in one study turn out not to have anything to do with that disease at all. 
Part of the problem, of course, is that it’s rarely possible to duplicate a complicated experiment perfectly. No two laboratories are identical, for instance. Neither are any two graduate students. (And a mouse’s behavior can differ depending on who took it out of its cage.) Medical trials test different patients each time.
As I have suggested in previous posts, scientific journals are more interested in publishing surprising results than in publishing new observations that further support already well-supported hypotheses. But surprising results are the ones more likely to stem from experimental error, statistical flukes, or failures in interpreting the experiment and its result. So I do not find it surprising that many published results fail the replication test. It is unfortunate when scientists do not volunteer to do the relatively thankless work involved in replication studies.

Incidentally, the long history of cold fusion research shows that replication is needed, and that when a claim is important enough, scientists do step forward to follow up on controversial research.

The article focuses on a relatively new kind of research. Today it is possible to study the genomes of a large number of people suffering from the same genetic disease to identify genes that might be implicated as causing the disease. These are called "genome-wide association studies" (GWAS).
A GWAS analysis of Crohn’s disease, for instance, tested over 600,000 genetic variations in more than 3,000 people. A follow-up analysis re-examined 126 of those genetic variations.
One would expect that, when looking at so large a number of genetic variations in so small a sample, many of the correlations with Crohn's disease would turn out to be spurious. Even the follow-up of 126 of the candidate causal genes (which may work in combination) might throw up false positives (and false negatives).
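To put rough numbers on that intuition, here is a minimal sketch in Python. The counts of 600,000 and 126 variations come from the article; the 0.05 significance cutoff and the assumption that the tests are independent are mine, not the study's, and the function names are purely illustrative. The sketch also shows the Bonferroni correction, one common (and conservative) way to compensate for running many tests.

```python
# Rough arithmetic on multiple testing. ALPHA and the independence of
# tests are assumptions for illustration, not details of the Crohn's study.

NUM_VARIANTS = 600_000  # genetic variations tested (from the article)
NUM_FOLLOW_UP = 126     # variations re-examined (from the article)
ALPHA = 0.05            # assumed conventional per-test significance cutoff


def expected_false_positives(num_tests, alpha):
    """Expected number of 'significant' results if no real effect exists."""
    return num_tests * alpha


def chance_of_any_false_positive(num_tests, alpha):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, family_alpha):
    """Per-test cutoff that keeps the overall false-positive rate near family_alpha."""
    return family_alpha / num_tests


if __name__ == "__main__":
    print("Expected false positives, full scan:",
          expected_false_positives(NUM_VARIANTS, ALPHA))
    print("Expected false positives, follow-up:",
          expected_false_positives(NUM_FOLLOW_UP, ALPHA))
    print("Chance of at least one false positive in follow-up:",
          chance_of_any_false_positive(NUM_FOLLOW_UP, ALPHA))
    print("Bonferroni per-test cutoff, full scan:",
          bonferroni_threshold(NUM_VARIANTS, ALPHA))
```

Under these assumptions, an uncorrected scan would flag tens of thousands of variations by chance alone, which is why the per-test cutoff has to be made far stricter than 0.05.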

The basic point is that big data analysis is an important emerging tool for scientific research, but it brings with it new problems in validating results.

I recently posted on this blog about my concerns over a criminal investigation in which a task force had thrown up thousands of people suspected of committing a series of crimes. The law officers got a "hot lead" on one of these suspects, and decided that he must be the criminal. They then changed the nature of the investigation, focusing only on evidence to convince a jury of the guilt of their chosen suspect. (They probably were right in the case.) I am doubly skeptical of the police approach to dealing with the results of "big data" analysis to identify large numbers of persons of interest from which to narrow down to still significant numbers of suspects. Failure of police to understand the statistics involved, and failure of lawyers and judges to properly inform juries on how to interpret the results of such investigations, may well lead to convictions of the innocent.
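The statistical trap here is the same base-rate problem, and a small sketch may make it concrete. Every number below is hypothetical and chosen only for illustration: a pool of 5,000 suspects containing exactly one guilty person, a piece of evidence that always matches the guilty person, and a 1-in-1,000 chance that it matches an innocent one. None of these figures comes from the actual investigation.

```python
# Bayes' rule applied to a large suspect pool. All figures are hypothetical
# assumptions for illustration, not data from any real investigation.

def probability_guilty_given_match(pool_size, match_rate_if_guilty,
                                   match_rate_if_innocent):
    """Posterior probability that a matching suspect is the guilty one,
    assuming exactly one guilty person in the pool and equal prior
    suspicion for everyone in it."""
    prior = 1.0 / pool_size
    guilty_and_match = match_rate_if_guilty * prior
    innocent_and_match = match_rate_if_innocent * (1.0 - prior)
    return guilty_and_match / (guilty_and_match + innocent_and_match)


if __name__ == "__main__":
    posterior = probability_guilty_given_match(
        pool_size=5000,               # hypothetical pool of suspects
        match_rate_if_guilty=1.0,     # hypothetical: evidence always fits the culprit
        match_rate_if_innocent=0.001, # hypothetical: 1-in-1,000 chance match
    )
    print(f"Chance the matching suspect is guilty: {posterior:.3f}")
```

With these made-up numbers, a "1 in 1,000" match still leaves the suspect more likely innocent than guilty, because about five innocent people in the pool would be expected to match as well. That is the kind of reasoning a jury would need explained.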
