15.963 Electronic Commerce Project #1

Harvesting Customer Usage Information from the Web

Bill Heslam, Kirk Solo, Matt Steinfort

March 11, 1997

Introduction
Many of the articles we have read and speakers we have heard have described the web as a mechanism for improving a company’s knowledge of its customers. More specifically, speakers such as Jim Manzi have labeled the web as the mechanism that will allow companies to use this new knowledge to move from a broadcasting marketing model to a narrowcasting model. With the web, they claim, we will be able to determine exactly what a customer is interested in, see where they navigate, and witness how they behave leading up to a purchase decision. Following their logic, we can then use this information both for traditional marketing analysis and for customizing our message to the individual user’s preferences. For these claims to be true, however, companies must first be able to collect information about the preferences of their customers and potential customers. The question to ask in assessing these statements is therefore whether it is currently possible to retrieve this type of information from the web.
The purpose of our research briefing is to examine the industry’s current capability to use web usage information to gather customer preferences. To provide background for the reader, we first explain the basics of the technology involved in tracking usage. We then discuss the various strategies that can be employed and identify the strengths and weaknesses of each approach. Following this, we present a scan of new developments in the area as well as a brief overview of several companies that are attempting innovative solutions to the problem. Finally, we draw on our newly found knowledge of the subject to make our own prediction of what will be successful in the future.

Technology Overview
During the web’s recent rise to mass awareness, we have heard many fears from paranoid users and many claims by visionary marketers regarding how much information can be retrieved about users of a particular website. At the bottom of these (mostly incorrect) assertions is the general knowledge that some user information is conveyed to the website that a user accesses. The "some," in this case, appears to be the source of all the confusion. With the exception of the true techies, nobody seems to know exactly what information is included in this some. In order to debunk the misconceptions, it is useful to have an idea of how user information is retrieved by websites and what information they have access to.
Start with the basic process. A user logs onto a network through school, work, an Internet service provider (ISP), or an on-line service (e.g. America Online). The user then decides to browse the web and starts up a browser, such as Netscape or Microsoft’s Internet Explorer. Once the browser has started, the user navigates (through links or simply by typing in the URL) to a favorite web page. For the sake of example, say that the user selects the URL http://espnet.sportszone.com, reads the headlines, then clicks on one of the articles about the seedings for the NCAA Men’s Basketball tournament. At this point, the user has provided the web site with several pieces of information about himself. The goal of this paper is to make clear exactly what data the user has provided in this case, and what strategies companies are employing to make use of this information.
Before one can assess the effectiveness of a particular strategy, however, it is important to understand the technological capabilities that are available. To that end, this section presents an introduction to the various technologies involved in gathering usage information from the web. Three categories of technology can currently be deployed to harvest usage information from the web: HTTPD logging; cookies; and Java, JavaScript, and CGI routines.

HTTPD Logging
Most web servers generate several log files that record the activity on their pages. Every time a page is accessed or a document is retrieved, the web server makes a note of the activity in its logs. Three logs contain the bulk of the user information (the other logs mainly record errors and exceptions): the access log, the referer log, and the agent log.

Referer Log
The referer log tracks the web page that referred the user to the current site. Its purpose is to show where users are entering the company’s site from, which helps web administrators analyze how users are reaching their sites. They can then determine which ads on other sites are effective, whether other people are placing links to their sites on their own pages, and how often their sites are being reached from search engines.

The referer log contains:
· the page from which the user came
· the page upon which they landed

The following is an example of the data stored in the referer log:
http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html
http://webcrawler.com/cgi-bin/WebQuery -> /images.html
file:///Hard%20Drive/System%20Folder/Preferences/Netscape/Bookmarks.html
file:///localhost/usr/users/la/lav6/public_html/.lynx_bookmarks.html -> /transparent_images.html
file:///I|HTML/Referenc.htm -> /about_html.html
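As a minimal sketch of how entries like these could be tallied, the following Python snippet assumes the "referring page -> landing page" layout shown in the sample above. The function names, the "local file" label for bookmark entries, and the decision to skip entries with no arrow are our own illustrative assumptions, not part of any web server's tooling.

```python
from collections import Counter
from urllib.parse import urlsplit

SAMPLE = """\
http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html
http://webcrawler.com/cgi-bin/WebQuery -> /images.html
file:///I|HTML/Referenc.htm -> /about_html.html
"""


def parse_referer_line(line):
    """Split one 'referrer -> page' entry; return None if there is no arrow."""
    referrer, arrow, page = line.rpartition("->")
    if not arrow:
        return None
    return referrer.strip(), page.strip()


def count_referrers(text):
    """Count hits per referring host; file:// bookmarks are grouped together."""
    counts = Counter()
    for line in text.splitlines():
        parsed = parse_referer_line(line)
        if parsed is None:
            continue  # assumption: entries without a landing page are ignored
        referrer, _page = parsed
        parts = urlsplit(referrer)
        host = parts.netloc if parts.scheme in ("http", "https") else "local file"
        counts[host] += 1
    return counts


if __name__ == "__main__":
    for host, hits in count_referrers(SAMPLE).most_common():
        print(host, hits)
```

Grouping by host rather than by full URL is one possible design choice; it answers the question of which outside sites send traffic, at the cost of hiding which specific page carried the link.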

Access Log
The access log is a history of which specific pages have been accessed within the company’s web site. This is a great tool for identifying the most popular pages on the site. Webmasters can use this information to improve the quality of their sites.

The access log contains the following information:
· the user’s domain name or IP address
· client machine name and user name (although most servers do not record this information)
· date and time of access
· the request made by the client browser
· the name and version of the protocol used to send and receive the data with the client
· a server response code
· the number of bytes transferred to the client

The following is an example of the data stored in the access log:
128.103.120.11 - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 3515
128.103.120.11 - - [01/Dec/1995:16:02:26 -0500] "GET /files/giftrans.exe HTTP/1.0" 200 4872
128.103.120.11 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 1018
128.103.120.11 - - [01/Dec/1995:16:09:00 -0500] "GET /local_stuff.html HTTP/1.0" 200 1865
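To illustrate how these fields map onto a log line, here is a small Python sketch that parses the sample entries above. The regular expression and the field names (host, ident, user, and so on) are our own assumptions based on the layout of the sample; a real server's log format may be configured differently.

```python
import re
from collections import Counter
from datetime import datetime

# One named group per field listed above; the names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

SAMPLE = """\
128.103.120.11 - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 3515
128.103.120.11 - - [01/Dec/1995:16:02:26 -0500] "GET /files/giftrans.exe HTTP/1.0" 200 4872
128.103.120.11 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 1018
128.103.120.11 - - [01/Dec/1995:16:09:00 -0500] "GET /local_stuff.html HTTP/1.0" 200 1865
"""


def parse_access_line(line):
    """Turn one access log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def page_popularity(lines):
    """Count successful (status 200) requests per page."""
    hits = Counter()
    for line in lines:
        entry = parse_access_line(line)
        if entry is not None and entry["status"] == 200:
            hits[entry["path"]] += 1
    return hits


if __name__ == "__main__":
    for path, count in page_popularity(SAMPLE.splitlines()).most_common():
        print(count, path)
```

Note that the only identifying field here is the host address, which is why this kind of analysis stays at the aggregate level: several people behind one machine or proxy look like a single visitor.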

Agent Log
The agent log is used to track the browsers that were used to access the site. This is helpful for content designers as they can get an idea of what browsers they need to design for when they produce content.

The agent log contains the following information:
· type of browser
· the browser’s version
· the platform
· the operating system version
· proxy information if the user happens to be going through a proxy server

The following is a sample of the data stored in the agent log:
Mozilla/1.1 (Windows; I; 32bit)
Mozilla/1.1N (Macintosh; I; PPC)
Mozilla/1.1 (Windows; I; 32bit) via proxy gateway CERN-HTTPD/3.0 libwww/2.17
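The sketch below shows one way the pieces of an agent entry could be pulled apart in Python. The pattern is inferred only from the sample entries above, and the key names (product, version, details, proxy) are hypothetical labels of our own; agent strings from other browsers may not follow this shape.

```python
import re

# Inferred layout: product/version, optional "(a; b; c)" details,
# optional "via ..." proxy suffix.
AGENT_PATTERN = re.compile(
    r'(?P<product>[^/\s]+)/(?P<version>\S+)'
    r'(?:\s+\((?P<details>[^)]*)\))?'
    r'(?:\s+via\s+(?P<proxy>.+))?$'
)


def parse_agent(line):
    """Return the parts of one agent log entry, or None if it does not match."""
    match = AGENT_PATTERN.match(line.strip())
    if match is None:
        return None
    raw_details = match.group("details") or ""
    return {
        "product": match.group("product"),
        "version": match.group("version"),
        "details": [part.strip() for part in raw_details.split(";") if part.strip()],
        "proxy": match.group("proxy"),  # None when no proxy was involved
    }


if __name__ == "__main__":
    print(parse_agent("Mozilla/1.1N (Macintosh; I; PPC)"))
```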

Cookies
While the realization that current logging technology provides only limited user information will relieve the paranoids and disappoint the marketers, logging is not the only technology available for gathering user information. Another tool for tracking users’ usage patterns is known as a cookie. A cookie, at its simplest level, is just a message that is given to a web browser by a web server. The message can include any information that the web server wishes to include in addition to some type of unique identifier. It is stored in a file, cookies.txt, along with cookies from other companies. The message is then passed back to the web server each time the user accesses one of its pages. With this technology, the web server can uniquely identify the user and potentially customize the information that is presented to him or her. This is where the marketing folks start to get excited and the paranoids start to get sweaty palms. By giving your browser a cookie, the web site is essentially writing to your hard drive and then using this information to track your behavior while you are at their site. It is basically harmless, as the web server can only write to the cookie file information that you have provided to it, either voluntarily or involuntarily through the logs.

The following is a sample of the information contained in a cookies.txt file:
.focalink.com TRUE / FALSE 946641600 SB_ID ads02.22424845851486145682
.netscape.com TRUE / FALSE 946684799 NETSCAPE_ID 1000e010,10372d3f
.microsoft.com TRUE / FALSE 937422000 MC1 GUID=a587827b7bac11d08b1808002bb74f3f
.msn.com TRUE / FALSE 937396800 MC1 ID=a587827b7bac11d08b1808002bb74f3f
.nba.com FALSE / FALSE 1488383230 SWID 05A64F41-966F-11D0-BA42-00A0C9110F6B
.isn.com TRUE / FALSE 946684799 session 23099757
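To make the columns of this file easier to read, the following Python sketch splits one entry into named fields. The field names are our own labels for the seven columns (domain, a subdomain flag, path, a secure-connection flag, expiration time in seconds since 1970, cookie name, and cookie value). We split on any whitespace because the sample above is shown with spaces; this is an assumption, and a value that itself contains spaces would need a stricter tab-based split.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Cookie(NamedTuple):
    domain: str
    include_subdomains: bool
    path: str
    secure: bool
    expires: datetime
    name: str
    value: str


def parse_cookie_line(line):
    """Parse one cookies.txt entry into a Cookie record."""
    fields = line.split(None, 6)
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    domain, flag, path, secure, expires, name, value = fields
    return Cookie(
        domain=domain,
        include_subdomains=(flag == "TRUE"),
        path=path,
        secure=(secure == "TRUE"),
        expires=datetime.fromtimestamp(int(expires), tz=timezone.utc),
        name=name,
        value=value.strip(),
    )


if __name__ == "__main__":
    cookie = parse_cookie_line(
        ".netscape.com TRUE / FALSE 946684799 NETSCAPE_ID 1000e010,10372d3f"
    )
    print(cookie.domain, cookie.name, cookie.expires.isoformat())
```

Read this way, the sample makes the earlier point concrete: each entry is little more than an opaque identifier that the issuing site can recognize on the user's next visit.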

Java, JavaScript, and CGI Routines
While these three tools are similar in many respects, it is important to understand how they function, how they differ, and how they are best applied. Java and JavaScript share part of a name and some functionality, but are actually quite different development tools; there may be more differences than similarities. Both are programming languages with an object-oriented structure. Java can be thought of as a tool that creates small applications, or "applets", that will run on any platform that is Java enabled. These applets are compiled programs that are transferred across the Internet and run on the client’s machine. JavaScript, on the other hand, is simply program code that is inserted into the HTML document. It has the advantage of not having to be compiled, but its capabilities are limited compared with Java’s. CGI, or "Common Gateway Interface", routines run on the web server and allow for the dynamic creation of HTML documents. CGI routines can use information stored in a relational database to create web pages "on the fly". This allows the web server to create custom pages for a given user, based on information that has been gathered in the past, or even during the current session.
Using these tools, much of the information that can be gathered by a web server can actually be accessed in real time in an active document. Each of these development tools allows a web page to dynamically access information from the client that is currently visiting it.

This real-time data can be used in a variety of ways:
· It can be used to create and update the cookie files that can store a variety of user information.
· It can also call information from the log histories that were described above and create a web page that is specifically targeted toward an individual user based on prior log activity and/or stored preferences.
· It can gather usage and access information about a given user that can be used to augment the server log files.
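The first two uses in the list above can be sketched as a CGI-style routine in Python. This is an illustration under stated assumptions, not a description of any particular site: the cookie name visitor_id and the function handle_request are hypothetical, and the routine takes the CGI environment as a plain dictionary so that it can be tried without a web server. It relies on the CGI convention that the browser's cookies arrive in the HTTP_COOKIE environment variable.

```python
import uuid
from http.cookies import SimpleCookie


def handle_request(environ):
    """Build response headers and an HTML body for one page request.

    A returning visitor is recognized by the (hypothetical) visitor_id
    cookie; a new visitor is issued one.
    """
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    headers = ["Content-Type: text/html"]
    if "visitor_id" in cookie:
        visitor_id = cookie["visitor_id"].value
        greeting = "Welcome back"
    else:
        visitor_id = uuid.uuid4().hex
        reply = SimpleCookie()
        reply["visitor_id"] = visitor_id
        reply["visitor_id"]["path"] = "/"
        headers.append(reply.output())  # a "Set-Cookie: ..." header line
        greeting = "Welcome"
    body = "<html><body><p>%s, visitor %s.</p></body></html>" % (greeting, visitor_id)
    return headers, body


if __name__ == "__main__":
    headers, body = handle_request({"HTTP_COOKIE": "visitor_id=abc123"})
    print("\n".join(headers))
    print()
    print(body)
```

In a real deployment the greeting would be replaced by content chosen from stored preferences or prior log activity, keyed on the same identifier.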

Strengths and Weaknesses of Technological Approaches

HTTPD Logging
Strengths:
· Supported by most web servers
· Relatively easy to implement
· Plethora of 3rd party analysis tools
· Cannot be controlled by user
Weaknesses:
· Cannot identify unique customers
· Limited to macro analysis
· Historical

Java
Strengths:
· Platform independent
· Relatively powerful
· Real-time
Weaknesses:
· Some security concerns
· Requires Java enabled browser

JavaScript
Strengths:
· Easy to implement
· Real-time
Weaknesses:
· Limited capabilities
· Requires JavaScript enabled browser

CGI
Strengths:
· Allows for dynamic creation of web pages
Weaknesses:
· More complicated than simple HTML

Cookies
Strengths:
· Enables identification of unique users
Weaknesses:
· Can be refused by privacy conscious users

Issues with Current Strategies
This section provides an overview of the issues involved with the current strategies for retrieving usage information from the web.

Real-time vs. Historical
So far we have described an infrastructure that allows for the archival of information, and real-time acquisition of information. One obvious application of this technology is the analysis of the historical information on an aggregate level. This type of information can be used to determine general user preferences for products or web pages, and can even be used to establish some demographic information about the users. This is obviously a powerful new tool in the realm of market research, but is not necessarily a revolutionary concept. This technology starts to become very exciting when a service provider is able to couple the real-time and historical information. The first step is to capture real-time information about the user that has accessed your web site. The next step is to utilize the tools that we have described to establish the user’s history and preferences. Once the user has been identified, and their preferences captured, they can be given information that has been dynamically created just to meet their needs and preferences. All of the information that has been stored in the server logs or in their cookie file can be used to narrowcast information to that user. For this to be possible, it is necessary that the service provider is able to capture both historical and real-time information.
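The coupling described above can be sketched in a few lines of Python. Everything named here is hypothetical: we assume the historical data has already been reduced to a list of page paths per visitor identifier, that the first directory in a path stands for a content section, and that the site keeps one featured item per section.

```python
from collections import Counter


def section_of(path):
    """Treat the first directory of a path as its section ('home' for '/')."""
    parts = [part for part in path.split("/") if part]
    return parts[0] if parts else "home"


def favorite_section(paths):
    """Return the section a visitor has requested most often, or None."""
    counts = Counter(section_of(path) for path in paths)
    return counts.most_common(1)[0][0] if counts else None


def pick_content(visitor_id, history, content_by_section, default):
    """Combine the real-time identifier with stored history to choose content."""
    section = favorite_section(history.get(visitor_id, []))
    return content_by_section.get(section, default)


if __name__ == "__main__":
    history = {
        "abc123": ["/basketball/seedings.html", "/basketball/scores.html", "/golf/"],
    }
    content = {"basketball": "Tournament bracket", "golf": "Leaderboard"}
    print(pick_content("abc123", history, content, "Top headlines"))
    print(pick_content("unknown", history, content, "Top headlines"))
```

The fallback to a default item matters in practice: a first-time or unidentified visitor has no history, so the page must still be sensible without one.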

Direct Customer Participation vs. Indirect Customer Participation
One of the biggest distinctions among the various strategies is whether they require active contribution by the user. The technologies discussed above can at best compile rough characterizations of customers and attempt to infer preferences from their past behavior. Some industry participants believe that the only way to get more robust customer information, such as demographics, is to ask the customer. This introduces a new challenge, although not a technological one: how can companies convince users to provide this information? There are several strategies in practice today. The first is simply to keep track of information provided at the time of purchase, such as name and address, and match it with the historical data gathered through other means. The second, which is perhaps the boldest, is to demand that the user provide a certain level of information in order to access the site. This registration process also provides a customer profile that can be matched with other sources. The third, which has a great deal of promise, is to offer the user something that he or she values in return for the information. Companies such as Firefly, which is discussed later in this paper, are attempting to use this strategy to build their customer information bases.

Case Studies
This section presents a scan of new developments in the area, including companies that are developing software to address this problem and a few examples of companies that have chosen innovative approaches to gathering usage information.

Interse Corporation (www.interse.com)
Interse is the creator of a commercially available web log analyzer package called Market Focus 2.07, hailed by PCWeek as the most complete web log file analysis tool on the market. It allows webmasters to analyze their server statistics, fine-tune their sites’ performance, and understand their users’ behavior. Market Focus supports the standard HTTPD log file fields such as time of visit, host name, type of browser, and operating system, but also lets webmasters make use of cookies.

Firefly Inc. (www.firefly.com)
The impetus behind the creation of Firefly was twofold. First, it is an attempt to create an on-line community in which users create personal profiles that are matched with those of other users to generate recommendations on potentially interesting purchases of books and records, based on the preferences of users with similar tastes. Second, it represents a good opportunity for its founders to display their intelligent agent technology. It is a great example of a company that is coupling user input of preferences with usage information.
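The general idea of recommending items liked by users with similar tastes can be illustrated with a small Python sketch. We do not know how Firefly's agent technology actually works; the approach below (comparing sets of liked items with a simple overlap measure) is a generic stand-in chosen for illustration, and all names and data are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, profiles, top_n=3):
    """Suggest items the target has not listed, weighted by user similarity."""
    mine = profiles[target]
    scores = {}
    for user, items in profiles.items():
        if user == target:
            continue
        similarity = jaccard(mine, items)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


if __name__ == "__main__":
    profiles = {
        "ann": {"Album A", "Album B"},
        "bob": {"Album A", "Album B", "Album C"},
        "cat": {"Album B", "Album D"},
        "dan": {"Album E"},
    }
    print(recommend("ann", profiles))
```

The sketch also shows why this strategy depends on user participation: with no stated preferences in a profile, there is nothing to compare against other users.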

DoubleClick (www.doubleclick.net)
DoubleClick is a behind-the-scenes Internet advertising company that boasts of "the most highly targeted and cost-effective Internet advertising capabilities... [that] targets, schedules, delivers, tracks and reports online ad campaigns." The company makes significant use of cookies to determine which ads a customer has seen and thus to display only appropriate ads. It is behind many of the banners that appear on various websites.

Netscape (www.netscape.com)
Netscape is the creator of both the browsers and the web servers that provide most of the functionality discussed above. Within its on-line store, Netscape makes use of cookies to maintain persistent shopping carts. That is, if a user adds several items to the shopping cart in the on-line store and then leaves the site, the cookie allows the same shopping cart to be presented to the user upon return.
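A persistent cart of this kind could be sketched as follows in Python. We do not know how Netscape's store is implemented; this sketch assumes the cookie carries only an identifier (here the hypothetical name cart_id) and that the cart contents are kept on the server, keyed by that identifier.

```python
from http.cookies import SimpleCookie


class CartStore:
    """Server-side carts keyed by the identifier carried in a cookie."""

    def __init__(self):
        self._carts = {}

    def add_item(self, cart_id, item, quantity=1):
        cart = self._carts.setdefault(cart_id, {})
        cart[item] = cart.get(item, 0) + quantity

    def get_cart(self, cart_id):
        # Return a copy so callers cannot alter the stored cart by accident.
        return dict(self._carts.get(cart_id, {}))


def cart_id_from_header(cookie_header):
    """Extract the hypothetical cart_id cookie from a Cookie header value."""
    cookie = SimpleCookie(cookie_header)
    return cookie["cart_id"].value if "cart_id" in cookie else None


if __name__ == "__main__":
    store = CartStore()
    # First visit: the user adds items under the identifier in the cookie.
    store.add_item("42", "server software")
    store.add_item("42", "t-shirt", 2)
    # Later visit: the same cookie comes back, so the same cart is found.
    returning_id = cart_id_from_header("cart_id=42")
    print(store.get_cart(returning_id))
```

Keeping only an identifier in the cookie is one design choice; the alternative of writing the cart contents into the cookie itself avoids server storage but is limited by how much a cookie can hold.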

Internet Shopping Network (www.isn.com)
As we learned in class, ISN is a wholesaler of PCs and PC components. When we checked out the site for class, we noted that we had received a cookie. While it is not known exactly what they are using the cookie for, the speaker from ISN did mention that they are capable of determining the exact pages that a user accessed prior to making a purchase decision.
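One way such a page trail could be reconstructed is sketched below in Python. This is not ISN's method, which we do not know; it assumes that each logged request has already been tagged with a visitor identifier (for example from a cookie) and a timestamp, and the page names are invented for the example.

```python
def pages_before_purchase(entries, visitor, purchase_path):
    """List the pages one visitor requested, in time order, before purchasing.

    entries: iterable of (timestamp, visitor_id, path) tuples in any order.
    Returns None if the visitor never reached the purchase page.
    """
    trail = []
    for _timestamp, who, path in sorted(entries):
        if who != visitor:
            continue
        if path == purchase_path:
            return trail
        trail.append(path)
    return None


if __name__ == "__main__":
    entries = [
        (3, "a", "/cart.html"),
        (1, "a", "/"),
        (2, "b", "/"),
        (2, "a", "/modems.html"),
        (4, "a", "/order.html"),
    ]
    print(pages_before_purchase(entries, "a", "/order.html"))
```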

Conclusion
Until companies are able to provide demonstrable, or at least perceived, value in return for information, customers will have no incentive to provide the information that companies need to successfully implement the narrowcasting model. Until this value can be demonstrated, companies can use their log files to fine-tune their web sites’ effectiveness and develop macro-level understandings of how customers are using their sites. More aggressive companies can make use of cookies to uniquely identify their customers and track how they move through the website, but the problem of limited customer information remains. In the future, it will have to be companies like Firefly, which marry technological tracking with customer-provided information, that deliver a truly customized, targeted marketing campaign. Until that time, narrowcasting at the individual level remains a goal that is not quite within our reach.