Making the Web Work for Science
The World Wide Web was created for science. Tim Berners-Lee's invention of the Web (1989) was motivated by the need to manage information about experiments in high-energy physics. Today, anyone with a Net connection and a browser can find obscure music with a few mouse clicks, or custom-design a new pair of shoes, or auction the paintings in their attic to an audience of millions. Yet in comparison, science has been affected very little. To be sure, scientists have embraced some aspects of the Web, like putting databases and journals into digital formats. But the overall effect has been to support business as usual. There's been nothing in science like the transformational impact of the Web on news, shopping, or entertainment.
Are you looking to buy a Canon Powershot A590 digital camera? Type "Powershot A590" into Google and click on "shopping," and you'll see pages of stores offering the camera for sale, conveniently ordered by price, with buyer recommendations and links that you simply click on to purchase.
In contrast, imagine that you're a malaria researcher investigating glycophorin A as a drug target. A Google search under "malaria glycophorin A drug target" brings up 1300 documents, with no overall summary to guide you; and that's only the free literature, not the closed-access journal publications. Suppose you wade through those papers and find an interesting result you'd like to replicate. Can you click on a link to order the cell lines? Hardly. Getting those samples requires contacting the researchers and negotiating a materials transfer agreement with their institution, a process that could easily take months, if you can get the materials at all. A 2002 study (Campbell et al.) found that 47% of academic geneticists had been refused in their efforts to get data or materials from other academics.
If we lack a game-changing Web of Science, it's not for lack of vision or technology.
As far back as the 1980s, Bob Futrelle at Northeastern University was prototyping computer programs that you could ask about molecular biology research results, and back would come graphs of experiments – illustrations extracted from the pages of journal publications – together with machine-generated textual interpretations of the graphs. A single query could in principle let you view and compare the graphical results of dozens of experiments, mined from the literature. As Futrelle wrote (1992), "If future electronic documents are to be truly useful, we must devise ways to automatically turn them into knowledge bases."
Twenty years later, there are still no tools like this in common use as research aids. MIT computer engineers could implement them, but the MIT Libraries couldn't deploy them because the licenses the Libraries sign with academic journal publishers specifically prohibit data mining of papers and reuse of figures.
To liberate science from prohibitions on turning documents into knowledge bases, we need more mandates for open-access publishing like the one from the National Institutes of Health (NIH), and more institutional open-access policies like the one the MIT faculty unanimously adopted at their March faculty meeting.
One of the best steps the new administration can take to promote the progress of science is to remove barriers to realizing the Web's potential as a tool for scientific research. Here, an important message for policy makers is "First, do no harm." The "Fair Copyright in Research Works Act" (HR 6845), now before the House Judiciary Committee, would repeal the NIH mandate and forbid government agencies from making any similar mandates. This bill is downright destructive to scientific progress, and it should be scuttled. Quite the opposite should occur: NSF and other agencies should be directed to follow NIH's lead in ensuring that the results of publicly funded research are publicly available.
Open access to publications is only a first step. Imagine being able to treat the more than a thousand molecular-biology databases, together with the published literature in molecular biology, as a unified system where you can pose questions and get precise answers. A researcher interested in potential drug targets for Alzheimer’s disease might want a list of genes involved in signal transduction that are related to pyramidal neurons. Typing that into Google results in tens of thousands of hits, primarily titles of papers, with no real sense of what they have to do with the question. To get instead an actual list of genes requires software that can interpret the statements in the papers and combine information from different databases that use different vocabularies. One approach to that kind of massive-scale data integration relies on document markup (metadata) and the methods of the Semantic Web research effort that Berners-Lee now leads at MIT CSAIL (Computer Science and Artificial Intelligence Laboratory). CSAIL also hosts Science Commons, part of the independent public-interest organization Creative Commons, which is using this approach to create an open knowledge base of annotated biomedical abstracts, integrated with major neuroscience databases.
So a second priority for the administration should be to promote tools and standards that facilitate data mining and integration across research results. The NSF's Office of Cyberinfrastructure articulated such a program in 2007, but it was never fully implemented. It's time to do so.
Can we go further? How about the "fantasy" of researchers ordering biological materials by simply clicking on links in published papers? That's doable, provided institutions can pre-negotiate standard agreements so that requests for materials can be automated.
This is another area where Science Commons is working, collaborating with the Association of University Technology Managers (AUTM) to formulate those agreements, and with plasmid depositories like Addgene that can provide distribution and manufacturing. That infrastructure is already deployed through the Kauffman Foundation's Bridge Network for research and technology transfer. Even so, effecting real change will require confronting a university culture where scientists often respond to a first-to-publish reward system by withholding data and materials.
Amazon, Wikipedia, eBay, and YouTube show how the Web can transform culture and commerce. Scientific research is a lot harder than contributing to a collaborative encyclopedia or uploading videos. But if policy makers can focus on removing the obstacles to making the Web work for science, the payoff in innovation would be huge.
Berners-Lee, T. (1989). Information Management: A Proposal. CERN. <http://www.w3.org/History/1989/proposal>
Campbell, E. G., et al. (2002). Data Withholding in Academic Genetics. JAMA, 287(4), 473-480.
Futrelle, R. P. (1992). The Conversion of Diagrams to Knowledge Bases. In IEEE Workshop on Visual Languages (pp. 240-242). Seattle, WA: IEEE Computer Society Press.
NSF Office of Cyberinfrastructure. (2007, March). Cyberinfrastructure Vision for 21st Century Discovery. <http://www.nsf.gov/pubs/2007/nsf0728/nsf0728.pdf>