Research Context

 

This proposed research lies at the intersection of three problem domains.  Much progress has been made to date in each area.  This proposal defines a research collaboration at the intersection of these three domains, through which we believe we can demonstrate stronger results than we could through independent research conducted in isolation.  The problem domains are:

 

1.      Libraries and Institutional Information Management (à la DSpace, MIT Libraries)

2.      Personal and Collaborative Information Management (à la Haystack, David Karger)

3.      The Semantic Web (à la World Wide Web Consortium, Eric Miller)

 

 

The Libraries Problem Domain:

 

One of the core competencies of libraries has traditionally been support for the research activities of information consumers such as university faculty and students. Research activities typically include searching and browsing metadata to find relevant information in many formats, including books, journals, maps, visual images, archives, audio/visual material, datasets, computer files, and so on. In supporting information retrieval activities, libraries have evolved from isolated and eccentric modes of describing their collections and providing access to them towards international standards that support the broad portability of research methods across libraries and collections. Examples of these standards are ISBD (International Standard Bibliographic Description), ISAD(G) (General International Standard Archival Description), MARC (Machine-Readable Cataloging), Z39.50 (a standardized search and retrieval protocol), and more recently the Dublin Core metadata initiative. This standardization has allowed researchers worldwide to achieve good efficiency and effectiveness regardless of physical location or type of material sought.

 

However, as information consumers become increasingly familiar with and dependent on computer-based techniques for locating and accessing information, especially since the advent of the Web, libraries find themselves under growing pressure to support more flexible and powerful information-seeking activities. This demand manifests itself in two ways: the great success of web search engines like Google for doing quick, high-recall searches to identify large amounts of potentially useful information, and the demand for richer and more domain-specific description and search support than was possible with traditional standards such as ISBD and MARC. No longer satisfied with compromising accuracy to achieve portability of research methods, information consumers now want to have it both ways.

 

At the same time, constant technological improvements are empowering managers of information collections that were never well served by traditional library standards to find better solutions for describing and searching their information. This is a very useful development for users of this type of previously “unserved” information, but it is leading to the creation of metadata gulags that are isolated from mainstream library-based search systems, thus hiding them from many potential information seekers.

 

The need to be able to support a wide variety of metadata schemas, to integrate them, and to expose them all to simple, flexible search and retrieval mechanisms has become a major challenge for libraries in the Web era. And libraries are an analog of the corporate information enterprise in that the challenges and opportunities they and their customers face mirror those faced by companies and governments around the world. In this way libraries are the lead users of institutional information solutions that are being developed locally in advance of ready-made market solutions. Through partnering with libraries in addressing the challenges and opportunities described, we can gain deep insights and build the competencies required to create the enterprise and consumer information environments of the future.

 

The current libraries paradigm is stressed in several dimensions:

1.      Historically, libraries have optimized access and discovery of “locally accessible” assets.  The pragmatic meaning of “locally accessible” has shifted in the era of networked resources, driving demand for interoperability of systems, assets, metadata, and services.

2.      It is expensive for libraries to provide community-specific ways to describe or annotate content.

3.      Even when libraries do so, community-oriented information is held in distinct information silos or “catalogues” and does not integrate well with more traditional or generic library catalogues.

4.      It is difficult for communities to describe or annotate their own content on their own terms in a way that other users can access through the libraries.

5.      Decisions about who-does-what are hard to change.  Information initially created and managed by the community is difficult to hand off to the library, and vice-versa.

 

 

Haystack, Personal and Collaborative Information Management Problem Domain:

 

While libraries have historically focused on a corpus of assets and how they are described, research in personal and collaborative information management has focused on how individual users, and groups working together, interact with their own personal or community-based stores of information.  Traditional information retrieval systems focus upon uniform access by many users to a centralized corpus.  But projects like Haystack at MIT – led by David Karger – focus upon the relationship between a particular individual and her corpus:

 

“An individual's own haystack privileges information with which that user interacts, gathers data about those interactions, and uses this metadata to further personalize the retrieval process.

 

“When a person looks for information, he will often start with his own bookshelf. This ‘personal repository’ contains a collection of information, built up over time, that reflects the needs and knowledge of its owner. This makes it different, in crucial ways, from the library. For example, all the content was actively placed there by the user, who is familiar with it and believes it to be useful. In the user's area of expertise, it is often more up to date than the library: its owner, who is actively seeking information in his area of interest, often finds new information before the library gets around to it. Overall, a person's bookshelf contains the bulk of the information that he considers most valuable.

 

“An individual's bookshelf is also organized in an idiosyncratic fashion. While library materials are arranged according to a standardized classification scheme, individuals have been known to arrange their books by topic, chronology, usage pattern, or even size and color. Even users who make no active attempt to organize their books find them structured in some kind of most-recently-used hierarchy. Individuals exploit their idiosyncratic organization when searching for information: they may look for a blue book, or a book on the bottom shelf, or a book next to another book. At a library, users are limited to searching the standard classification.

 

“The Haystack project aims to make a digital IR system that is less like a library and more like a personal bookshelf. Fundamentally, this means building a system that adapts to its user, instead of forcing its user to adapt to the limitations of the system. A haystack provides automated data gathering (through active observation of user activity), customized information collection, and adaptation to individual query needs.”[i]

 

Four themes pervade this research:

  1. Many individual and community schemes for organizing and describing assets are idiosyncratic.  They reflect the particular needs of those individuals who opt in to their use, not the needs of the masses.  Support of such specialization has value.
  2. Valuable metadata can be obtained by observing interactions of users with the system, and by mining assets and metadata using a variety of algorithms.
  3. A flexible platform for modeling and managing metadata about assets, their use, their relationships to individual users, and their relationships to each other is essential to unleashing the value of assets and metadata.
  4. Humans require effective ways (user interfaces, query mechanisms) to interact with such a flexible information substrate.

 

 

The Semantic Web Problem Domain:

 

The web today is primarily designed for human consumption, and is “metadata free”:

“Information varies along many axes. One of these is the difference between information produced primarily for human consumption and that produced mainly for machines. At one end of the scale we have everything from the five-second TV commercial to poetry. At the other end we have databases, programs and sensor output. To date, the Web has developed most rapidly as a medium of documents for people rather than for data and information that can be processed automatically. The Semantic Web aims to make up for this.”[ii]

 

Interoperability on the web today requires too much human mediation:

“The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. It is the idea of having data on the Web defined and linked in a way that it can be used for more effective discovery, automation, integration, and reuse across various applications. The Web can reach its full potential if it becomes a place where data can be shared and processed by automated tools as well as by people.”[iii]

 

Many opportunities are emergent, not considered in advance:

“The Semantic Web can and will be built in parts, by people with varied interests. The parts may be created independently without prior commitment and yet work together to create a great whole. Imagine services that provide RSS feeds of trusted annotations on particular areas of interest (recipes, stock quotes, medicine, etc.). When metadata applications are built on a common metadata framework, it is a safe bet to expect the unexpected.”[iv]

 

Much information is distributed.  It will remain so, but must become interoperable:

“Information about any particular thing can be created by multiple users, served by various services and dispersed across multiple sites. Consider for example, information about the musician David Bowie. His concert schedule, tickets for these concerts, songs he's written, songs he's produced, album pricing, album reviews, his biography, books about him, movies he's in, etc., are across more than a dozen different sites. With the existing technology today, we are still in the "hunter gatherer" phase of using the Web. The Semantic Web is designed to provide the necessary infrastructure for enabling services and applications on the Web aggregate, and to integrate this information into a sum greater than the individual parts.”[v]
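
The aggregation described in the passage above can be illustrated with a minimal sketch. It assumes that each site publishes its statements as (subject, predicate, object) triples and that the sites share an identifier for the same thing. The site contents, the `ex:` identifiers, and the property names are all hypothetical, and the sketch uses plain Python tuples rather than any particular Semantic Web toolkit.

```python
from collections import defaultdict

# Hypothetical statements published independently by three sites.
# All identifiers and property names are illustrative assumptions.
SITE_A = [("ex:DavidBowie", "ex:wroteSong", "Heroes")]
SITE_B = [("ex:DavidBowie", "ex:performsAt", "ex:Concert42"),
          ("ex:Concert42", "ex:city", "Boston")]
SITE_C = [("ex:DavidBowie", "ex:wroteSong", "Heroes"),   # duplicate of SITE_A
          ("ex:DavidBowie", "ex:appearsIn", "Labyrinth")]


def aggregate(*sources):
    """Merge triples from many sources into one duplicate-free set."""
    merged = set()
    for source in sources:
        merged.update(source)
    return merged


def describe(graph, subject):
    """Collect everything the merged graph says about one subject."""
    description = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            description[p].add(o)
    return dict(description)


graph = aggregate(SITE_A, SITE_B, SITE_C)
bowie = describe(graph, "ex:DavidBowie")
```

Because every statement has the same three-part shape, merging sources is a simple set union, and no site needs to know in advance which other sites exist. The hard part in practice, which this sketch assumes away, is getting independent publishers to agree on shared identifiers.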

 


How are libraries, haystacks, and the semantic web related?

 

These three problem domains are bound together through their concern regarding various aspects of the “Information Investment Lifecycle” – depicted here graphically:

 

 

Individuals, communities, and institutions each use and derive value from assets, schemas, and metadata.  Increasingly such use is mediated by services (for example: discovery, recommendation, or fulfillment services, or even haystack as a user interface to interpret metadata and issue queries on a user’s behalf).

 

Individuals, communities, and institutions each hold their own unique mission, motivators, limitations, and constraints.  Together with the value derived from existing resources, these factors influence the decisions that each actor (individual, community, or institution) takes regarding how to invest available time and/or dollars in the creation or procurement of new assets, schemas, or metadata that will further enhance value in a virtuous cycle.

 


Today, as described in the briefs of each of the three intersecting problem domains, the interoperability barriers that exist between individual, community, and institutional information environments are real.  The following diagram depicts the landscape:

 

 

Assets, schemas, metadata, and services can each be homed in an individual’s information environment, a community information environment, or an institutional information environment.  Further, individuals, communities, and institutions each view available resources through their own lens of perceived value.  They each independently take investment decisions about how they will allocate available resources across:

-         additional asset procurement and/or creation

-         creation of new organizing and/or descriptive paradigms or schemas

-         metadata procurement and/or creation, and

-         development of and/or contracting for additional services that provide and/or consume assets/schemas/metadata

 

Historically the locus of resource distribution has been determined by the system housing the resources, and whether that system resides in an individual, community, or institutional context.  Several scenarios (the web, a personal office, haystack, library) are projected on the landscape above, and shown in the appendix.

 

In the future, interoperability on this landscape will be important in two dimensions:

-         Vertically, as individuals, communities, and institutions (or agent-based services, on their behalf) wish to conduct business and share assets, schemas, and metadata that span multiple organizational paradigms, or that were initially created in one organizational paradigm, but now need to be used in another.

-         Horizontally, as service-providing value chains develop and mature, requiring assets, schemas, metadata, and services – each developed and provided by different service providers – to work together.

 

Challenges and Opportunities

 

Through this work we would like to demonstrate how both the vertical barriers and horizontal barriers to interoperability can be reduced, thereby (1) allowing haystack-like solutions to span the individual, community, and institutional paradigms and (2) enabling agent and end-user services to draw upon assets, schemas and metadata created and maintained in any of those domains.  To ensure that the work remains grounded, we propose to conduct the work using libraries as an exemplar domain.  This section describes the domain-specific challenges and opportunities in doing so.

 

To move forward, libraries require a user interface that can provide a window into descriptions of information resources from a wide variety of sources.

 

This interface must effectively expose to humans the relevant parts of an information substrate that exhibits the following characteristics:

-         descriptions of information assets can come from a wide variety of sources

-         new information assets can easily be added

-         community-specific ways of talking about assets (schemas) can be added

-         communities and libraries can decide – and change their minds – about whether the designated “home” for their assets and/or their descriptions is in institutional/enterprise-controlled systems (e.g. library), or in community-controlled systems

 

Challenges

 

We would like to be able to support arbitrary metadata schemas for library collections that are domain- or discipline-based in addition to those that are library standards-based. Initially these could resemble traditional schemas such as Dublin Core metadata, but they should eventually include much more complex schemas that allow for structures such as hierarchies and groupings (whole-part or parent-child).

 

We would like to be able to support the flexible evolution of domain-based metadata schemas. Libraries could control the evolution of their standardized schemas to limit the impact of these changes on their practices and systems. But domain-based metadata changes rapidly and this should be supported since it allows metadata to reflect the real thinking of the domain rather than arbitrary abstractions of it.

 

We would like to be able to support arbitrary, ad hoc annotations to metadata schemas and instance metadata. These could be supplied by information consumers directly, by external domain experts, by collections managers, or by automated techniques such as collection data mining.

 

We need to be able to work with an increasing number of third-party service providers, both of digital content and of metadata about that content. Traditionally libraries had control of the descriptive metadata that was produced about their collections, but metadata is increasingly generated beyond libraries: either by authors themselves or by service bureaus such as publishers or production labs. This is especially true for technical metadata about digital objects, and this metadata should be kept and used if at all possible.

 

We would like to be able to store digital content with persistence and some assurance that this content will still be accessible and usable in the future. This is more difficult for digital content than for traditional analog resources.  To preserve digital assets, (1) the bits themselves must be stored and retrievable, (2) the format and representation of  the bits must also be stored and retrievable, so that the bits themselves can have meaning when retrieved in the future, and (3) the stored information must be fulfillable using off-ramp devices or appliances available in the future.  Meeting these challenges will require the creation of new metadata schemas to describe the attributes and properties of the digital objects. 
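
The three preservation requirements above suggest what a preservation metadata record might minimally carry. The following is a sketch under stated assumptions, not a proposed schema: the `PreservationRecord` type, its field names, and the example values are hypothetical. It uses a SHA-256 digest as one common way to verify that stored bits are unchanged.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationRecord:
    """Hypothetical preservation metadata for one stored digital object."""
    sha256: str          # (1) fixity value: verifies the bits on retrieval
    media_type: str      # (2) format of the bits, so they can be interpreted
    rendering_hint: str  # (3) what is needed to present the object to a user


def make_record(content: bytes, media_type: str, rendering_hint: str) -> PreservationRecord:
    """Describe an object at the time it is deposited."""
    digest = hashlib.sha256(content).hexdigest()
    return PreservationRecord(digest, media_type, rendering_hint)


def bits_intact(content: bytes, record: PreservationRecord) -> bool:
    """Check retrieved bits against the digest recorded at deposit time."""
    return hashlib.sha256(content).hexdigest() == record.sha256


original = b"An example digital object."
record = make_record(original, "text/plain; charset=us-ascii", "any plain-text viewer")
```

Calling `bits_intact(original, record)` returns `True`, while any altered copy of the content fails the check. The digest addresses only the first requirement; the format and rendering fields are placeholders for the much richer descriptions that the second and third requirements would need.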

 

Opportunities

 

Provide strong community support, allowing community-held domain-specific schemas and instance metadata while offering the permanence and structure of an institutional library.   The ever-increasing amount of digital content being sourced from diverse constituents in diverse communities means that libraries must manage multiple kinds of content.  They must also manage interactions and relationships with multiple communities of interest.  The kinds of content assets under management, the communities that produce them, and the ways of talking about them are not static.  They change with time as the communities themselves develop, ebb, and flow.

 

In the digital domain libraries have an opportunity to support multiple community-specific metadata overlays on the same information assets.  Libraries can offer choices about whether the responsibility for the assets and overlaying metadata lies with the community or with the library, and can provide options with low switching costs ranging from full community responsibility to full-service stewardship by the library.
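
One way to picture community-specific overlays with low switching costs is sketched below. This is a minimal illustration under stated assumptions, not a design for DSpace or Haystack: the `Overlay` and `Asset` types, the schema names, and the field values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Overlay:
    """One community's description of an asset, in that community's own schema."""
    schema: str   # hypothetical schema name
    steward: str  # who is responsible: "community" or "library"
    fields: dict = field(default_factory=dict)


@dataclass
class Asset:
    asset_id: str
    overlays: dict = field(default_factory=dict)  # schema name -> Overlay

    def add_overlay(self, overlay: Overlay) -> None:
        self.overlays[overlay.schema] = overlay

    def hand_off(self, schema: str, new_steward: str) -> None:
        """Change who is responsible for an overlay without touching its content."""
        self.overlays[schema].steward = new_steward

    def search_text(self) -> str:
        """Flatten every overlay into one string for generic keyword search."""
        values = []
        for overlay in self.overlays.values():
            values.extend(str(v) for v in overlay.fields.values())
        return " ".join(values).lower()


asset = Asset("asset-001")
asset.add_overlay(Overlay("dublin-core", "library",
                          {"title": "Harbor Survey Map", "date": "1890"}))
asset.add_overlay(Overlay("geology-community", "community",
                          {"rock_type": "granite"}))
asset.hand_off("geology-community", "library")
```

Keeping stewardship as a property of the overlay, rather than of the system that stores it, is what keeps the hand-off cheap in this sketch: responsibility changes while the asset and its descriptions stay where they are. The flattened search text stands in for general interdisciplinary recall across overlays that a generic catalogue would otherwise not see.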

 

Note that libraries themselves form a community that generates domain-specific schemas, the use of which helps libraries succeed in their mission.  For example, much research remains to be done in the field of digital preservation; schemas supporting digital preservation are not yet mature, but libraries continue to develop them, and we expect that libraries will source much metadata about existing assets as they represent and lead the digital preservation community.

 

Provide a unified, customizable search interface to all library assets that is easy for patrons and librarians to use.  In this new digital domain, libraries would ideally offer “One Stop Shopping” to all information resources of interest to their constituents, and act as a flexible “information clearinghouse”.

 

This interface could provide access to the richness of community-specific schemas and metadata (where they exist), while also providing good general interdisciplinary recall across all items.  The interface would provide access to items held “within” the library, as well as to external items to which the library has negotiated access, and would be configurable by individuals and communities to emphasize and optimize the presentation of assets and metadata that are prominent within their community.

 

Provide access to new external resources as they are identified as being of interest to the libraries’ constituents.  Note that this interest could be identified by the library itself, by communities, by individual patrons, or by some blend of all of these.

 

Position ongoing library operations as a systemic source of new information that can aid in organizing and identifying the utility of resources, in addition to simply providing access to those resources.  In the digital domain, libraries can explore increasingly adaptive, interactive, and collaborative techniques to accomplish their missions of selection, collection, organization, access, and preservation.  In addition to simply serving resources, libraries might become data sources themselves, offering recommender systems based on individual and/or collective patterns of use and interest.

 



[i] Eytan Adar, David Karger, and Lynn Andrea Stein.  “Haystack: Per-User Information Environments”.  Proceedings of the Eighth International Conference on Information and Knowledge Management.

[ii] Tim Berners-Lee, James Hendler, and Ora Lassila.  “The Semantic Web”.  Scientific American. May 1, 2001.

[iii] W3C Semantic Web Activity Statement, <http://www.w3.org/2001/sw/Activity>

[iv] Ibid.

[v] Ibid.