15.963 Electronic Commerce Project #1
Harvesting Customer Usage Information from the Web
Bill Heslam Kirk Solo Matt Steinfort
March 11, 1997
Introduction
Many of the articles we have read and speakers we have heard have spoken
of the web as a mechanism to improve a company’s knowledge of its customers.
More specifically, speakers such as Jim Manzi have labeled the web as
the mechanism that will allow companies to use this new knowledge to move
from a broadcasting marketing model to a narrowcasting model. With the
web, they claim, we will be able to determine exactly what a customer is
interested in, see where they navigate, and witness how they behave leading
up to a purchase decision. Following their logic, we can then use this
information for both traditional marketing analysis and for customization
of our message to the individual user’s preferences. In order for these
claims to be true, however, companies must first be able to collect information
about their customers’ and potential customers’ preferences. The question
to ask in assessing these claims is therefore whether it is currently
possible to retrieve this type of information from the web. The purpose
of our research briefing, in light of this question, is to examine the
industry’s current capability to use web usage information to gather
customer preferences. In order to provide background for the
reader, we first explain the basics of the technology involved in tracking
usage. We then discuss the various strategies that can be employed and
identify the strengths and weaknesses of each approach. Following this,
we present a scan of new developments in the area as well as a brief
overview of several companies that are attempting innovative solutions
to the problem. Finally, we draw on our newfound knowledge of the subject
to make our own prediction of what will be successful in the future.
Technology Overview
During the web’s recent rise to mass awareness, we have heard many
fears from paranoid users and many claims by visionary marketers regarding
how much information can be retrieved about users of a particular website.
At the bottom of these (mostly incorrect) assertions is the general knowledge
that some user information is conveyed to the website that a user accesses.
The "some," in this case, appears to be the source of all the
confusion. With the exception of the true techies, nobody seems to know
exactly what information this "some" includes. In order to debunk the
misconceptions, it is useful to have an idea of how user information is
retrieved by websites and what information they have access to.
Start with the basic process. A user logs onto a network somewhere through
school, work, an Internet service provider (ISP), or an on-line service
(e.g. America Online). The user then decides to browse the web, and so
starts up a browser, such as Netscape or Microsoft’s Internet Explorer.
Once the browser has started, the user decides to go to a favorite location,
so he navigates somehow (through links or simply by typing in the URL) to
that web page. For the sake of example, say that the user selected the URL
http://espnet.sportszone.com, read the headlines, then clicked on one of
the articles about the seedings for the NCAA Men’s Basketball tournament.
At this point, the user has provided the web site with several pieces of
information about himself. The goal of this paper is to make it clear
exactly what data the user has provided in this case, and what strategies
companies are employing to make use of this information.
Before one can assess the effectiveness of a particular strategy, however,
it is important to have an understanding of the technological capabilities
that are available. To facilitate this understanding, this section will
present an introduction to the various technologies involved in gathering
usage information from the web. There are three categories of technology
that can currently be deployed to harvest usage information from the web:
HTTPD logging; cookies; and Java, JavaScript, and CGI Routines.
HTTPD Logging
Most web servers generate several log files that record the activities
that occur on their pages. Every time a page is accessed or a document
is retrieved, the web server makes a note of the activity in its logs.
There are three logs which contain the bulk of user information (the other
logs mainly record errors and exceptions). These logs are the access log,
the referer log, and the agent log.
Referer Log
The referer log tracks the web page that referred the user to the current
site. The intent of this log is to determine from where users are entering
the company’s site, which helps web administrators analyze how users are
getting there. They can then determine which ads on other sites are
effective, whether other people are linking to the site from their own
sites, and how often the site is reached from search engines.
The referer log contains:
· the page from which the user came
· the page upon which they landed
The following is an example of the data stored in the referer log:
http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html
http://webcrawler.com/cgi-bin/WebQuery -> /images.html
file:///Hard%20Drive/System%20Folder/Preferences/Netscape/Bookmarks.html
file:///localhost/usr/users/la/lav6/public_html/.lynx_bookmarks.html -> /transparent_images.html
file:///I|HTML/Referenc.htm -> /about_html.html
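As a rough illustration of how a webmaster might tally where visitors come from, the following Python sketch counts entries per referring host. It assumes each line of the log of referring pages has the form "source -> target", as in the sample above; the function name count_referring_hosts is hypothetical and not part of any server or analysis package.

```python
from collections import Counter
from urllib.parse import urlparse

def count_referring_hosts(lines):
    """Count how many log entries came from each referring host.

    Assumes each line looks like 'source_url -> /target_page'.
    Lines without '->' are skipped rather than guessed at.
    """
    hosts = Counter()
    for line in lines:
        if "->" not in line:
            continue
        source, _target = (part.strip() for part in line.split("->", 1))
        parsed = urlparse(source)
        # file:// referers are local bookmark files, so group them together.
        if parsed.scheme in ("http", "https"):
            host = parsed.netloc
        else:
            host = "(local file)"
        hosts[host] += 1
    return hosts

sample = [
    "http://sunsite.unc.edu/boutell/faq/tinter.htm -> /transparent_images.html",
    "http://webcrawler.com/cgi-bin/WebQuery -> /images.html",
    "file:///I|HTML/Referenc.htm -> /about_html.html",
]
for host, hits in count_referring_hosts(sample).most_common():
    print(host, hits)
```

Grouping by host rather than by full URL is a design choice for this sketch: it answers the "which sites send us traffic" question directly, at the cost of hiding which specific page carried the link.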
Access Log
The access log is a history of what specific pages have been accessed
within the company’s web site. This is a great tool for identifying the
most popular pages on the site. Webmasters can use this information to
improve the quality of their sites.
The access log contains the following information:
· the user’s domain name or IP address
· client machine name and user name (although most servers do not
record this information)
· date and time of access
· the request made by the client browser
· the name and version of the protocol used to send and receive
the data with the client
· a server response code
· the number of bytes transferred to the client
The following is an example of the data stored in the access log:
128.103.120.11 - - [01/Dec/1995:16:01:38 -0500] "GET /transparent_images.html HTTP/1.0" 200 3515
128.103.120.11 - - [01/Dec/1995:16:02:26 -0500] "GET /files/giftrans.exe HTTP/1.0" 200 4872
128.103.120.11 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 1018
128.103.120.11 - - [01/Dec/1995:16:09:00 -0500] "GET /local_stuff.html HTTP/1.0" 200 1865
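To make the field layout concrete, here is a minimal Python sketch that splits an access-log entry into the fields listed above and counts requests per page. It assumes each entry sits on a single line in the layout of the sample; the names LOG_PATTERN, parse_access_line, and most_requested_pages are hypothetical, not part of any server software.

```python
import re
from collections import Counter

# Layout assumed from the sample entries:
# host ident user [date] "method path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of fields for one entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def most_requested_pages(lines):
    """Count successful (status 200) requests per requested path."""
    counts = Counter()
    for line in lines:
        entry = parse_access_line(line)
        if entry and entry["status"] == "200":
            counts[entry["path"]] += 1
    return counts

sample = [
    '128.103.120.11 - - [01/Dec/1995:16:01:38 -0500] '
    '"GET /transparent_images.html HTTP/1.0" 200 3515',
    '128.103.120.11 - - [01/Dec/1995:16:08:33 -0500] "GET / HTTP/1.0" 200 1018',
]
print(most_requested_pages(sample).most_common())
```

Note that nothing in an entry identifies the person behind the request; the host field only identifies the machine (or proxy) that made it, which is why this kind of analysis stays at the aggregate level.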
Agent Log
The agent log is used to track the browsers that were used to access
the site. This is helpful for content designers as they can get an idea
of what browsers they need to design for when they produce content.
The agent log contains the following information:
· type of browser
· the browser’s version
· the platform
· the operating system version
· proxy information if the user happens to be going through a proxy
server
The following is a sample of the data stored in the agent log:
Mozilla/1.1 (Windows; I; 32bit)
Mozilla/1.1N (Macintosh; I; PPC)
Mozilla/1.1 (Windows; I; 32bit) via proxy gateway CERN-HTTPD/3.0 libwww/2.17
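The sample entries can be broken into the pieces listed above with a short pattern. This Python sketch assumes agent strings follow the shape seen in the sample (name/version, an optional parenthesized platform list, and an optional "via" proxy suffix); real agent strings vary widely, and the name parse_agent is hypothetical.

```python
import re

AGENT_PATTERN = re.compile(
    r'(?P<browser>[^/\s]+)/(?P<version>\S+)'   # e.g. Mozilla/1.1N
    r'(?: \((?P<details>[^)]*)\))?'            # e.g. (Macintosh; I; PPC)
    r'(?: via (?P<proxy>.+))?$'                # e.g. via proxy gateway ...
)

def parse_agent(line):
    """Split one agent-log entry into browser, version, platform details, proxy."""
    match = AGENT_PATTERN.match(line.strip())
    if not match:
        return None
    details = match.group("details")
    return {
        "browser": match.group("browser"),
        "version": match.group("version"),
        "details": [d.strip() for d in details.split(";")] if details else [],
        "proxy": match.group("proxy"),  # None when no proxy is reported
    }

entry = parse_agent(
    "Mozilla/1.1 (Windows; I; 32bit) via proxy gateway CERN-HTTPD/3.0 libwww/2.17"
)
print(entry)
```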
Cookies
While the realization that current logging technology only provides
limited user information will result in relieved paranoids and disappointed
marketers, this is not the only technology available that gathers user
information. Another tool for tracking users’ usage patterns is known as
a cookie. A cookie, at its simplest level, is just a message that is
given to a web browser by a web server. The message can include any information
that the web server wishes to include in addition to some type of unique
identifier. It is stored in a file, cookies.txt, along with cookies from
other companies. The message is then passed back to the web server each
time the user accesses one of its pages. With this technology, the web
server can uniquely identify the user and potentially customize the information
that is presented to him or her. Now is when the marketing folks start
to get excited and the paranoids start to get sweaty palms. By giving your
browser a cookie, the web site is essentially writing to your hard drive
and then using this information to track your behavior while you are at
their site. It is basically harmless, as the web server can only write
information you have provided it, either voluntarily or involuntarily through
the logs, to the cookie file.
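The exchange described above can be sketched in a few lines of Python using the standard http.cookies module. This is only an illustration of the hand-off, not any particular server’s implementation; the cookie name visitor_id and the helper names are assumptions made for the example.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical name chosen for this sketch

def issue_cookie():
    """Build the value of a Set-Cookie header carrying a fresh unique ID."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = uuid.uuid4().hex
    cookie[COOKIE_NAME]["path"] = "/"
    return cookie[COOKIE_NAME].OutputString()

def read_visitor_id(cookie_header):
    """Extract the ID from the Cookie header the browser sends back, if any."""
    cookie = SimpleCookie(cookie_header or "")
    return cookie[COOKIE_NAME].value if COOKIE_NAME in cookie else None

# First visit: the server hands the browser a cookie.
set_cookie_value = issue_cookie()
# Later visit: the browser returns only the name=value pair.
returned = set_cookie_value.split(";")[0]
print(read_visitor_id(returned))
```

The identifier here is random and carries no personal information by itself; it only lets the server recognize that two requests came from the same browser.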
The following is a sample of the information contained in a cookies.txt
file:
.focalink.com TRUE / FALSE 946641600 SB_ID ads02.22424845851486145682
.netscape.com TRUE / FALSE 946684799 NETSCAPE_ID 1000e010,10372d3f
.microsoft.com TRUE / FALSE 937422000 MC1 GUID=a587827b7bac11d08b1808002bb74f3f
.msn.com TRUE / FALSE 937396800 MC1 ID=a587827b7bac11d08b1808002bb74f3f
.nba.com FALSE / FALSE 1488383230 SWID 05A64F41-966F-11D0-BA42-00A0C9110F6B
.isn.com TRUE / FALSE 946684799 session 23099757
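Each line of the file holds seven fields: the domain, a flag saying whether all hosts in that domain may read the cookie, the path, a secure-connection flag, the expiration time in seconds since January 1, 1970, the cookie’s name, and its value. The Python sketch below reads one such line. It assumes whitespace-separated fields as they appear in the sample (the file on disk normally uses tabs), and the names FIELDS and parse_cookie_line are hypothetical.

```python
from datetime import datetime, timezone

FIELDS = ("domain", "all_hosts", "path", "secure", "expires", "name", "value")

def parse_cookie_line(line):
    """Turn one cookies.txt line into a dict, or None if it has too few fields."""
    parts = line.split(None, 6)
    if len(parts) != 7:
        return None
    record = dict(zip(FIELDS, parts))
    record["all_hosts"] = record["all_hosts"] == "TRUE"
    record["secure"] = record["secure"] == "TRUE"
    # The expiration is stored as seconds since 1970-01-01 UTC.
    record["expires"] = datetime.fromtimestamp(int(record["expires"]), tz=timezone.utc)
    return record

record = parse_cookie_line(
    ".netscape.com TRUE / FALSE 946684799 NETSCAPE_ID 1000e010,10372d3f"
)
print(record["domain"], record["name"], record["expires"].isoformat())
```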
Java, JavaScript, and CGI Routines
While these three tools are similar in many respects, it is important
to understand how they function, how they differ, and how they are best
applied. Java and JavaScript share part of a name, and some functionality,
but are actually quite different development tools. Both are programming
languages that use an object-oriented structure, but there may be more
differences than similarities between them.
Java can be thought of as a tool that creates small applications,
or "applets", that will run on any operating platform that is
Java enabled. These applets are actually compiled executable programs that
are transferred across the Internet and run on the client’s machine. JavaScript,
on the other hand, is simply application code that is inserted into the
HTML document. It has the advantage of not having to be compiled, but its
capabilities are limited compared with Java’s. CGI or "Common Gateway
Interface" routines allow for the dynamic creation of HTML documents.
CGI routines can use information stored in a relational database to create
web pages "on the fly". This allows the web server to create
custom pages for a given user, based on information that has been gathered
in the past, or even during the current session.
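A CGI routine receives request details through environment variables and writes an HTTP header, a blank line, and the page to standard output. The Python sketch below shows the idea under stated assumptions: the PREFERENCES dictionary is a hypothetical stand-in for the relational database mentioned above, and the cookie name visitor_id is invented for the example. HTTP_COOKIE is the standard CGI variable that carries the browser’s cookies.

```python
import os
from html import escape
from http.cookies import SimpleCookie

# Hypothetical stand-in for a preferences database keyed by visitor ID.
PREFERENCES = {"abc123": "college basketball"}

def build_page(environ):
    """Return a full CGI response (header, blank line, HTML) for this request."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    visitor = cookie["visitor_id"].value if "visitor_id" in cookie else None
    interest = PREFERENCES.get(visitor)
    if interest:
        greeting = f"Welcome back. Here are today's {escape(interest)} headlines."
    else:
        greeting = "Welcome. Here are today's top headlines."
    body = f"<html><body><p>{greeting}</p></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # When run by a web server as a CGI program, the request data is in os.environ.
    print(build_page(os.environ), end="")
```

Keeping the page-building logic in a function that takes the environment as an argument makes it easy to try out with a plain dictionary, without a web server.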
Using these tools, much of the information that can be gathered by a web
server can actually be accessed in real time in an active document. Each of
these development tools allows a web page to dynamically access information
from the client that is currently visiting the web page.
This real-time data can be used in a variety of ways:
· It can be used to create and update the cookie files that can
store a variety of user information.
· It can also call information from the log histories that were
described above and create a web page that is specifically targeted toward
an individual user based on prior log activity and/or stored preferences.
· It can gather usage and access information about a given user
that can be used to augment the server log files.
Strengths and Weaknesses of Technological Approaches
HTTPD Logging
Strengths:
· Supported by most web servers
· Relatively easy to implement
· Plethora of 3rd party analysis tools
· Cannot be controlled by user
Weaknesses:
· Cannot identify unique customers
· Limited to macro analysis
· Historical
Java
Strengths:
· Platform independent
· Relatively powerful
· Real-time
Weaknesses:
· Some security concerns
· Requires Java enabled browser
JavaScript
Strengths:
· Easy to implement
· Real-time
Weaknesses:
· Limited capabilities
· Requires JavaScript enabled browser
CGI
Strengths:
· Allows for dynamic creation of web pages
Weaknesses:
· More complicated than simple HTML
Cookies
Strengths:
· Enables identification of unique users
Weaknesses:
· Can be refused by privacy conscious users
Issues with Current Strategies
This section provides an overview of the issues involved with the current
strategies for retrieving usage information from the web.
Real-time vs. Historical
So far we have described an infrastructure that allows for the archival
of information, and real-time acquisition of information. One obvious application
of this technology is the analysis of the historical information on an
aggregate level. This type of information can be used to determine general
user preferences for products or web pages, and can even be used to establish
some demographic information about the users. This is obviously a powerful
new tool in the realm of market research, but is not necessarily a revolutionary
concept. This technology starts to become very exciting when a service
provider is able to couple the real-time and historical information. The
first step is to capture real-time information about the user that has
accessed your web site. The next step is to utilize the tools that we have
described to establish the user’s history and preferences. Once the user
has been identified, and their preferences captured, they can be given
information that has been dynamically created just to meet their needs
and preferences. All of the information that has been stored in the server
logs or in their cookie file can be used to narrowcast information to that
user. For this to be possible, the service provider must be able to
capture both historical and real-time information.
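The coupling of historical and real-time information can be sketched as two small steps: summarize past visits per visitor, then consult that summary when the same visitor shows up again. The Python below is a minimal illustration, assuming visitors are already identified (for example by a cookie ID) and that the first segment of a page’s path names its section; the function names and the sample paths are hypothetical.

```python
from collections import Counter, defaultdict

def build_history(entries):
    """Summarize past visits as per-visitor counts of site sections.

    entries: iterable of (visitor_id, path) pairs taken from historical logs.
    """
    history = defaultdict(Counter)
    for visitor_id, path in entries:
        section = path.strip("/").split("/")[0] or "home"
        history[visitor_id][section] += 1
    return history

def pick_section(history, visitor_id, default="home"):
    """Choose which section to feature for a visitor arriving right now."""
    visits = history.get(visitor_id)
    if not visits:
        return default  # unknown visitor: fall back to the general page
    return visits.most_common(1)[0][0]

past = [
    ("v1", "/ncaa/seedings.html"),
    ("v1", "/ncaa/scores.html"),
    ("v1", "/nba/index.html"),
    ("v2", "/"),
]
history = build_history(past)
print(pick_section(history, "v1"))
```

A visitor with no history simply receives the default page, which mirrors the point above: without both the stored history and the real-time identification, there is nothing to narrowcast with.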
Direct Customer Participation vs. Indirect Customer Participation
One of the biggest distinctions among the various strategies is whether
they require active contribution by the user. The technologies
discussed above can at best compile rough characterizations of the customers
and attempt to infer preferences based on their past behavior. Some industry
participants believe that the only way to get more robust customer information,
such as demographics, is to ask the customer. This introduces a new challenge,
although it is not a technological one. How can companies convince users
to provide them with this information? There are several strategies in
practice today. The first strategy is to simply keep track of information
provided at the time of purchase, such as address, name, etc. and match
this information with the historical data gathered through other means.
The second strategy, which is perhaps the boldest, is to simply demand
that the user provide a certain level of information in order to access
the site. This registration process would also provide a customer profile
that could be matched with other sources. The third strategy, which has
a great deal of promise, is to offer the user something that he or she
values in return for the information. Companies such as Firefly, which
is discussed later in this paper, are attempting to use this strategy to
build their customer information bases.
Case Studies
This section presents a scan of new developments in the area, including
companies that are developing software to address this problem and a few
examples of companies that have chosen innovative approaches to gathering
usage information.
Interse Corporation (www.interse.com)
Interse is the creator of a commercially available web log analyzer package
called Market Focus 2.07. Hailed by PCWeek as the most complete web log
file analysis tool on the market, it allows webmasters to analyze their
server statistics, fine-tune their performance, and understand their
users’ behavior. Market Focus supports the standard HTTPD log file options
such as time of visit, host name, type of browser and operating system,
but also lets webmasters make use of cookies.
Firefly Inc. (www.firefly.com)
The impetus behind the creation of Firefly was twofold. One, it is an attempt
to create an on-line community where users create a personal profile that
is matched with other users to generate recommendations on potentially
interesting purchases of books and records based on the preferences of
users with similar tastes. Two, it represented a good opportunity for its
founders to display their intelligent agent technology. It is a great example
of a company that is coupling user input of preferences with usage information.
DoubleClick (www.doubleclick.net)
DoubleClick is a behind-the-scenes Internet advertising company that boasts
of "the most highly targeted and cost-effective Internet advertising
capabilities ... [that] targets, schedules, delivers, tracks and reports
online ad campaigns." They make significant use of cookies to determine
which ads a customer has seen and thus display only appropriate ads. They
are behind many of the banners that appear on various websites.
Netscape (www.netscape.com)
Netscape is the creator of both the browsers and web servers that provide most of the functionality
discussed above. Within their on-line store, Netscape makes use of cookies
to maintain persistent shopping carts. That is, if a user is in their on-line
store and adds several items to the shopping cart and then leaves the site,
upon return the cookie will allow the same shopping cart to be presented
to the user.
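We do not know how Netscape’s store actually implements this. As a purely illustrative sketch, a persistent cart could be kept by writing the item codes into a long-lived cookie and reading them back on the next visit. The Python below assumes a cookie named cart, a "|" separator, and a 30-day lifetime, all of which are invented for the example; a real store might instead keep only a cart ID in the cookie and the contents on the server.

```python
from http.cookies import SimpleCookie

def cart_to_cookie(items, max_age_days=30):
    """Build a Set-Cookie header value that stores the cart's item codes."""
    cookie = SimpleCookie()
    cookie["cart"] = "|".join(items)
    cookie["cart"]["max-age"] = max_age_days * 24 * 60 * 60  # lifetime in seconds
    cookie["cart"]["path"] = "/"
    return cookie["cart"].OutputString()

def cookie_to_cart(cookie_header):
    """Rebuild the cart from the Cookie header sent on a return visit."""
    cookie = SimpleCookie(cookie_header or "")
    if "cart" not in cookie or not cookie["cart"].value:
        return []
    return cookie["cart"].value.split("|")

header = cart_to_cookie(["sku-101", "sku-202"])
# On a later visit the browser sends back only the name=value pair.
returned = header.split(";")[0]
print(cookie_to_cart(returned))
```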
Internet Shopping Network (www.isn.com)
As we learned in class, ISN is a wholesaler of PCs and PC components. When
we checked out the site for class, we noted that we had received a cookie.
While it is not known what exactly they are using the cookie for, the speaker
from ISN did mention that they were capable of determining the exact pages
that a user accessed prior to making a purchase decision.
Conclusion
Until companies are able to provide demonstrable, or at least perceived,
value in exchange for information, customers will not have any incentive
to provide them with the information that they need to successfully implement
the narrowcasting model. Until this value can be demonstrated, companies
can use their log files to fine-tune their web sites’ effectiveness and
develop macro-level understandings of how customers are using their sites.
More aggressive companies can make use of cookies to uniquely identify
their customers and track how they move through the website, but the problem
of limited customer information remains. In the future, it will be
companies like Firefly, which marry technological tracking with
customer-provided information, that deliver truly customized, targeted
marketing campaigns. Until that time, the reality of narrowcasting at the
individual level remains a goal that is not quite within our reach.