Data Initiatives Group - MIT Engineering and Science Libraries (ESL)
Explorations by library, information, and data professionals on the challenges of managing, sharing, and communicating scientific data
Glossary (in progress)
Cyberinfrastructure
Within the United States, the term cyberinfrastructure has been used to describe the computing and network infrastructure that enables research environments such as the “collaboratory, co-laboratory, grid community/network, virtual science community, and e-science community.” (O'Brien, E-Research, p. 66)
Data
The National Science Board report Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century (September 2005) defines and describes scientific and engineering data in the context of our efforts:
...the nature of data in a collection may be diverse, including numbers, images, video or audio streams, software and software versioning information, algorithms, equations, animations, or models/simulations.
...Data can also be distinguished by their origins – whether they are observational, computational, or experimental. This distinction is crucial to choices made for archiving and preservation. Observational data, such as direct observations of ocean temperature on a specific date, the attitude of voters before an election, or photographs of a supernova are historical records that cannot be recollected. Thus, these observational data are usually archived indefinitely. A different set of considerations applies to computational data, such as the results from executing a computer model or simulation. If comprehensive information about the model (including a full description of the hardware, software, and input data) is available, preservation in a long-term repository may not be necessary because the data can be reproduced. Thus, although the outputs of a model may not need to be preserved, archiving of the model itself and of a robust metadata set may be essential.
Experimental data such as measurements of patterns of gene expression, chemical reaction rates, or engine performance present a more complex picture. In principle, data from experiments that can be accurately reproduced need not be stored indefinitely. In practice, however, it may not be possible to reproduce precisely all of the experimental conditions, particularly where some conditions and experimental variables may not be known and when the costs of reproducing the experiment are prohibitive. In these instances, long-term preservation of the data is warranted. Thus, considerations of cost and reproducibility are key in considering policies for preservation of experimental data.
Finally, processing and curatorial activities generate derivative data. Initially, data may be gathered in raw form, for instance as a digital signal generated by an instrument or sensor. These raw data are frequently subject to subsequent stages of refinement and analysis, depending on the research objectives. There may be a succession of versions. While the raw data may be the most complete form, derivative data may be more readily usable by others. Thus, preservation of data in multiple forms may be warranted in many circumstances.
E-Science
The term e-science has been used to describe large-scale, distributed, collaborative science enabled by the Internet and related technologies. E-research is a broader term that includes nonscientific research but that also refers to large-scale, distributed, national, or global collaboration in research. It typically “entails harnessing the capacity of information and communication technology (ICT) systems, particularly the power of high-capacity distributed computing, and the vast distributed storage capacity fuelled by the reducing cost of memory, to study complex problems across the research landscape.”
FRBR
FRBR (Functional Requirements for Bibliographic Records) is a conceptual model developed by IFLA in the 1990s. The goal of FRBR is to link together explicitly those bibliographic records that are related to one another on a conceptual level. An example might be linking an edition of Huckleberry Finn with all versions of Huckleberry Finn, including translations. Moreover, books about Huckleberry Finn might be included. All of this is implemented in the structure of the cataloging record. Such groupings improve information retrieval. An example of a utility that groups records in a FRBR-like way is Google Scholar; when more than one edition of a book or more than one version of an article is available, all are presented in the appropriate categories.
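As a rough illustration of the grouping idea only (not the full FRBR model, which distinguishes several levels of entity), the sketch below collects hypothetical catalog records under the work they relate to. The record fields and labels are invented for this example.

```python
from collections import defaultdict

# Hypothetical catalog records; the fields and labels are invented for illustration.
records = [
    {"work": "Huckleberry Finn", "kind": "edition", "label": "English edition"},
    {"work": "Huckleberry Finn", "kind": "translation", "label": "French translation"},
    {"work": "Huckleberry Finn", "kind": "about", "label": "Critical study of the novel"},
    {"work": "The Tempest", "kind": "edition", "label": "English edition"},
]


def group_by_work(items):
    """Group record labels by work, then by kind of relationship to that work."""
    grouped = defaultdict(lambda: defaultdict(list))
    for item in items:
        grouped[item["work"]][item["kind"]].append(item["label"])
    # Convert nested defaultdicts to plain dicts for predictable display.
    return {work: dict(kinds) for work, kinds in grouped.items()}


catalog = group_by_work(records)
for kind, labels in catalog["Huckleberry Finn"].items():
    print(kind, labels)
```

A search for any one record can then return everything filed under the same work, which is the retrieval benefit described above.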
Grid
Hardware and software, as well as data, can be shared on a grid. This ultimate form of sharing involves software and hardware that are interconnected and always available, so that when a scientist needs extra computing power, machines within the grid can be put into service "to perform tasks that were traditionally restricted to supercomputers at a fraction of the cost." (Science in 2020, p. 14)
Informatics
...is the discipline of science which investigates the structure and properties (not specific content) of scientific information, as well as the regularities of scientific information activity, its theory, history, methodology and organization. OED - http://libraries.mit.edu/get/oed
...includes the science of information and the practice of information processing. Informatics studies the structure, behaviour, and interactions of natural and artificial systems that store, process and communicate information. It also develops its own conceptual and theoretical foundations. Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects. Used as a compound, in conjunction with the name of a discipline, as in medical informatics, bio-informatics, etc., it denotes the specialization of informatics to the management and processing of data, information and knowledge in the named discipline. Wikipedia - http://en.wikipedia.org/wiki/Informatics
Life Cycle Model
One of the models for dealing with data can be called “The Life Cycle Model.” This data model follows the workflow of data from conception to final product. The strengths of the Life Cycle Model include: the potential for a data set to grow; the ability to track the evolution of data over time; and the inclusion of provenance in the data model (Hunter).
Lineage
The generation and subsequent manipulation of a datum or set of data are very important to the researcher using data from “data warehouses.” The lineage of the data being used is an important issue in database management because data available in “warehouses” must be traced to their sources to ensure validity. Computer scientists create algorithms that support data lineage tracing. Workflow and lineage are components of provenance.
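As a minimal sketch of lineage tracing, assume each data set records the names of the data sets it was produced from. The data set names and the `derived_from` mapping below are hypothetical. The original sources can then be recovered by walking those links backwards:

```python
# Hypothetical lineage records: each data set maps to the data sets it was derived from.
derived_from = {
    "monthly_means": ["cleaned_readings"],
    "cleaned_readings": ["raw_sensor_a", "raw_sensor_b"],
    "raw_sensor_a": [],
    "raw_sensor_b": [],
}


def trace_sources(dataset, lineage):
    """Return the set of original (underived) sources that a data set depends on."""
    parents = lineage.get(dataset, [])
    if not parents:
        # No recorded parents: treat this data set as an original source.
        return {dataset}
    sources = set()
    for parent in parents:
        sources |= trace_sources(parent, lineage)
    return sources


print(sorted(trace_sources("monthly_means", derived_from)))
```

This simple recursion assumes the lineage records contain no cycles; real lineage-tracing algorithms must also handle transformations inside the warehouse, which this sketch ignores.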
Meaningful Data
Imagine a cyber-world where you could retrieve a list of wines from a particular region of France and get exactly what you needed. If you use Google to search for “Bordeaux wine region,” for example, you will get information on Bordeaux wines, individual Bordeaux vineyards, tours of the Bordeaux wine country, etc. With RDF a similar search would retrieve a simple list of wines from Bordeaux.
Metadata
Metadata is data about data. Two kinds of metadata are used when describing an object: administrative metadata and descriptive metadata. Metadata is useful because it allows users to access data according to content. However, creating useful metadata structures is difficult. In an effort to make metadata universal and easy to code, the metadata is frequently simplified to the point of uselessness. Moreover, to create a system with integrity it is necessary to have a unified set of meaningful descriptors from, for example, a thesaurus. Making metadata systems interoperable is difficult as well, because different organizations provide differing levels of metadata and use different vocabularies, which hinders meaningful interaction among data elements. Creating metadata is also difficult because researchers must enter descriptive metadata as they produce documents and data. This takes time and skill. Integrating metadata creation into the workflow of data generation is difficult.
Ontology
An ontology is a “systematic description of a given phenomenon, often includes controlled vocabulary and relationships, (and) captures nuances in meaning and enables knowledge sharing and reuse.” (NSF) As librarians we are familiar with ontologies. The Library of Congress Subject Headings (LCSH) are an ontology. So is the UMLS. The key to ontologies used by the Semantic Web is that they are distributed across the network, across organizations and even across individuals. Anyone may create an ontology. Of course, ontologies that aren't used are not very effective. And each ontology creator is responsible for his/her own authority control, a sticky issue not frequently mentioned in the Semantic Web literature. Unlike the LCSH, Semantic Web ontologies work together, invoking one another when necessary. A more general ontology may invoke a more specific ontology.
Provenance
Provenance is a term traditionally used in museums and archives to mean all the material accompanying an object or document. In the case of data, it could be defined as “context”. For example, the boiling point of water may be x at sea level but y at an altitude of 7,600 feet. Another way of describing provenance is by asking, “Who was responsible for what, when and where?” (Lagoze) The so-called truth of data is dependent on perspective – provenance. (Taylor) The evolution of data over time is critical to provenance as well.
RDF
RDF (Resource Description Framework) is a system that allows data elements to interact with one another on a human as well as machine level. HTML is useful for human understanding, but useless for machine understanding. Let's say I wanted to download a theater schedule from my favorite theater's web site onto my calendar at work. I could search for all the upcoming shows and copy and paste them into my calendar. But suppose I wanted my calendar to be able to “scrape” the data from the theater's web site into my calendar, how would I do that? This is possible using RDF. Here are some key elements of RDF:
- RDF allows computers, not just people, to exchange meaningful data
- Each element of information is stored in an RDF “triple-store”
- Each element of information uses a controlled vocabulary term from an appropriate ontology
- Each RDF element is assigned a unique identifier (URI) that gives it a web address, allowing it to be accessed and downloaded from any web-ready device
- RDF requires the use of an implementation language like XML, which is much more difficult to use than HTML
- RDF requires that people assign metadata to their information, which is time-consuming
- RDF requires ontologies, which are difficult to construct because they take a lot of work and consensus-building
- RDF requires that people abide by certain formatting rules, which may be difficult to implement since in HTML nearly “everything goes”
RDF Triple-store
The key to RDF is that meaning is conveyed via “triples” (subject::relationship::object), which are kept in a “triple-store.” An example of a triple (referring to the original theater example) might be: The Tempest::date::June 15, 2006. If the theater web site uses RDF to code its data, and my calendar is also encoded with RDF, the actual content can be scraped from one site to the other. This is just an example, because without an ontology this information is meaningless. RDF triples point to shared data elements that are meaningful because they are elements of ontologies.
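To make the subject::relationship::object idea concrete, here is a small sketch that holds the theater statements as plain Python tuples and looks them up by pattern. It is not a real RDF library. The `match` helper, the venue, and the second show are assumptions added for illustration; only The Tempest and its date come from the example above.

```python
# Each statement is a (subject, relationship, object) triple.
# Only "The Tempest" and its date come from the text; the rest is made up.
triples = [
    ("The Tempest", "date", "2006-06-15"),
    ("The Tempest", "venue", "Main Stage"),
    ("Hamlet", "date", "2006-07-01"),
]


def match(store, subject=None, relationship=None, obj=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return [
        (s, r, o)
        for (s, r, o) in store
        if (subject is None or s == subject)
        and (relationship is None or r == relationship)
        and (obj is None or o == obj)
    ]


# A calendar program could ask for every show date without parsing any HTML.
for show, _, date in match(triples, relationship="date"):
    print(show, date)
```

In real RDF the subject and relationship would be URIs drawn from an ontology rather than bare strings, which is what lets two independent programs agree on what "date" means.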
Science Commons
The Science Commons Data project explores ways to assure broad access to scientific data.
Science Commons is a branch of the nonprofit Creative Commons, devoted to easing unnecessary barriers to the flow of scientific knowledge and technical information. Science Commons works to encourage scientific innovation by making it easier for scientists, universities, and enterprises to share scientific literature, data, and materials. Our goal is to encourage stakeholders to create – through standardized licenses and other means – areas of free access and inquiry; a ‘science commons’ built out of private agreements, not imposed from above.
Creative Commons, with its philosophy of “some rights reserved,” depends on intellectual property. The idea is that authors can specify easily, cheaply and in a standardized way, just how their work may be used and shared. By choosing what they wish to share, and under what terms, those authors have created a privately constructed “cultural commons” composed of millions of licensed works. Science Commons plans to use many of the same tools – taking account of the considerable legal and institutional differences. CC licensing has been adopted by the Public Library of Science and BioMed Central, as well as by MIT’s Open CourseWare. At MIT, Science Commons shares space, personnel, and inspiration with MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). (Source: Science Commons FAQ) http://sciencecommons.org/
Semantic Web
The Semantic Web is a vision: the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications. In order to make this vision a reality for the Web, supporting standards, technologies and policies must be designed to enable machines to make more sense of the Web, with the result of making the Web more useful for humans. The Web can reach its full potential only if it becomes a place where data can be shared and processed by automated tools as well as by people. (W3C Semantic Web Activity Statement 2001; Chitty) http://www.w3.org/2001/sw/Activity
Semantic web for life sciences http://www.w3.org/2001/sw/hcls/charter#Scope
http://esw.w3.org/topic/SemanticWebForLifeSciences
Semantic web applied to chemical data (Taylor et al)
http://pubs.acs.org/cgi-bin/article.cgi/jcisd8/2006/46/i03/pdf/ci050378m.pdf
URI
A URI (Uniform Resource Identifier) allows information to be communicated universally across the Internet. An object is a series of ontological pieces (in sum, a triple) with an address. Each piece of the object also has a URI. For example, a triple like this: Amy's Honda = Honda::belongs to::Amy could be expressed as (these are fictional URIs): http://cars.ontology.org/Honda::http://relationships.ontology.org/BelongsTo::http://names.ontology.org/Amy
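One common way to write such a triple down is the N-Triples format, in which each URI is wrapped in angle brackets and the statement ends with a period. The sketch below formats the Honda example that way. The URIs are the fictional ones used above, and `to_ntriples` is a hypothetical helper written for this illustration, not part of any library.

```python
# The fictional URIs from the example above; they do not resolve to real resources.
subject = "http://cars.ontology.org/Honda"
relationship = "http://relationships.ontology.org/BelongsTo"
obj = "http://names.ontology.org/Amy"


def to_ntriples(s, p, o):
    """Format one triple of URIs as a line of N-Triples: <s> <p> <o> ."""
    return f"<{s}> <{p}> <{o}> ."


line = to_ntriples(subject, relationship, obj)
print(line)
```

The helper does no escaping or validation, so it is only suitable for simple URIs like these.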
Workflow
A workflow involves the steps taken by a researcher and/or a group of researchers and their machines, in the process of generating data. The workflow must be recorded as part of the metadata so that the process can be repeated, or understood, by future users of the data.
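As a sketch of what recording a workflow in the metadata might look like, the example below logs each step with who performed it, what was done, and when, and then stores the log with the data set's metadata. The field names, actors, and timestamps are hypothetical and do not follow any particular metadata standard.

```python
import json

# Hypothetical workflow log; the field names are illustrative, not a standard schema.
workflow = []


def record_step(log, actor, action, timestamp):
    """Append one workflow step, numbering steps in the order they occur."""
    log.append({
        "step": len(log) + 1,
        "actor": actor,
        "action": action,
        "timestamp": timestamp,
    })


record_step(workflow, "sensor-01", "collect raw temperature readings", "2006-06-15T09:00:00")
record_step(workflow, "researcher", "remove readings flagged as faulty", "2006-06-15T14:30:00")

# Store the workflow alongside the data as part of its metadata.
metadata = {"title": "Example temperature data set", "workflow": workflow}
print(json.dumps(metadata, indent=2))
```

Recording steps at the moment they happen, rather than reconstructing them afterwards, is one way to fold metadata creation into the process of generating the data.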
XML
XML (eXtensible Markup Language) is a markup language that is designed to describe a variety of kinds of data and make them interoperable. XML, like HTML, consists of tags that organize information. However, in the case of XML, the tags have meaning. A simple example of XML is:
<name>
  <first_name>Amy</first_name>
  <last_name>Stout</last_name>
</name>
What makes an XML document interoperable with other XML documents is that the meanings of the XML tags are defined in the Document Type Definition (DTD). XML documents subscribing to the same DTD can easily communicate with one another. Like a style sheet, a DTD can be contained within or outside of a document.
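Because the tags carry meaning, a program can ask for a value by name. The sketch below reads the name example using `xml.etree.ElementTree` from the Python standard library. The element names are written with underscores because XML element names cannot contain spaces, and no DTD is checked here.

```python
import xml.etree.ElementTree as ET

# The name example, written with element names that are valid XML.
document = """
<name>
  <first_name>Amy</first_name>
  <last_name>Stout</last_name>
</name>
"""

# Parse the text into a tree and look up values by tag name.
root = ET.fromstring(document.strip())
first = root.findtext("first_name")
last = root.findtext("last_name")
print(first, last)
```

Note that `ElementTree` only checks that the document is well-formed; validating it against a DTD would need a different tool.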
XML can be used to encode RDF. An example might look like this:
<Property ID="suds"
s:label="Soapsuds Index"
p:minimum="0.0"
p:maximum="1.0"/>
<Property ID="density"
s:label="suds density">
<s:range rdf:resource="#DensityValue"/>
</Property>
Note how much more complex XML is than HTML.