Answers to HW3

SMA 5506: Database Technology

Summer 2005

Lecture 16 & 17

Sample Solutions to HW3: Semantic Integration & XQuery

Question 1. In the lecture on Semantic Data Integration, you saw the TASC demo and the context definitions for the three primary data sources: Disclosure, Worldscope, and Datastream. You will use the first two sources for this assignment. For your reference, their context definitions are given in the following table:

	Disclosure	Worldscope
Currency used	Country of incorp.	US Dollars
Currency symbol	3 letters	3 letters
Scale factor	1	1000
Company names	Disclosure names	Worldscope names
Date style	‘American Style /’	‘American Style /’

The simplified schemas of the sources are given below (bold to indicate the key):

Disclosure DB:

DiscAF(company_name:string, latest_annual_data:string, net_income:number, location_of_incorp:string).

Worldscope DB:

WorldcAF(company_name:string, latest_annual_financial_datE:string, SALES:NUMBER).

There are also a few auxiliary data sources useful for answering questions in this assignment. You may need to use some or all of the following relations/functions:

currency_map(char3_currency:STRING, char2_currency:STRING)

Use it to translate currency symbols from 3 chars to 2 chars and vice versa, e.g., SQL statement select char3_currency from currency_map where char2_currency=’US’ will return USD in the result set. Note either column is a key attribute.

currency_types (country:STRING, currency:STRING)

Use it to find out the official currency of a given country, e.g., select currency from currency_types where country=’USA’ will return USD.

· datexform(date1:STRING, format1:STRING, date2:STRING, format2:STRING)

Use it to translate between date formats, e.g., select date 1 from datexform where format1=’European Style –’ and date2=’01/02/03’ and formate2=”American Style /’ will return the 02-01-03. Note that US uses mm/dd/yy, whereas Europe uses dd-mm-yy.

· olsen(exchanged:STRING, expressed:STRING, rate:NUMBER, date:STRING),

Use it to obtain exchange rate between two currencies. Note that olsen uses American style for its date and 3-char format for currencies. For example, select rate from olsen where exchanged=’USD’ and expressed=’SGD’ and date=’01/02/03’ will return the exchange rate for the number of SGD’s for 1 USD on January 2^nd, 2003. This is a Web source with relatively slow speed compared with the other sources; it also lacks set capability, meaning it processes one record at a time.

name_map_ds_ws(ds_names:STRING, ws_names:STRING)

pair-wise translation between Disclosure names and Worldscope names. A same company may use slightly different names in different DBs. Fro example, Daimler-Benz may appear as Daimler-Benz in one DB, as Daimler-Benz AG in another, and yet Daimler Benz Corp in a third DB. This relation and the next two allow you to translate company names.

name_map_dt_ds(dt_names:STRING, ds_names:STRING)

pair-wise translation between Datastream names and Disclosure names

name_map_dt_ws(dt_names:STRING, ws_names:STRING),

pair-wise translation between Datastream names and Worldscope names

A Worldscope user now wants to find out the latest net income and sales of all companies that appear in both databases. In the COIN framework, a user can express a query in his own context and let COIN to resolve context conflicts. So the above user, who is in Worldscope context, can issue the following SQL to the COIN system to get what he wants:

select WorldcAF.COMPANY_NAME, DiscAF.LATEST_ANNUAL_DATA, DiscAF.NET_INCOME, WorldcAF.SALES

from DiscAF, WorldcAF

where DiscAF.COMPANY_NAME = WorldcAF.COMPANY_NAME

and DiscAF.LATEST_ANNUAL_DATA = WorldcAF.LATEST_ANNUAL_FINANCIAL_DATE;

Please answer the following questions:

1) List and describe as complete as possible the context conflicts that must be resolved when this query is evaluated.

2) Rewrite the SQL into one that resolves these conflicts so that results will be in the user’s context.

3) Suppose all the sources are distributed on the network. Discuss how you can optimize the SQL from step 2. State your assumptions clearly and keep your discussions concise. (Useful hint: Assume that the number of companies that appear in both databases is not large.)

Question 2: You have seen the census database in Access SQL relational DB and in XML. Its relational schema is:

census(State, Abbr, Year, Populations, Cars, Drivers).

Here is an excerpt for the cencus.xml data file:

<State>Alabama</State>

</Census>

<State>Alaska</State>

</Census>

…

</dataroot>

Due to migration of people in the U.S., the population in certain states may be decreasing. Please answer the following questions.

1) Write a query in SQL that find the states whose population in 2000 is less than that of 1998, return the name of the state and the population in 1998 and 2000.

2) Write the corresponding XQuery against the xml file.

Sample Solutions to Question 1

Note: We deliberately made the query executable in the TASC demo on COIN website. The solutions given here will be somewhat different from what the demo generates. As you know, there is always room for optimization.

We need study the context definitions of both sources and receiver and the attributes in the SQL to determine semantic conflicts; the following table summarizes the results:

Sources: Worldscope and Disclosure. Refer to the background section for contexts.

Receiver: Worldscope

Note: One of the sources, Worldscope, matches the receiver’s context, so no context conflicts exist there. We only have to be concerned about the Disclosure source context and the Worldscope receiver context.

Attribute/SQL fragment	Modifier	Disclosure context	Worldscope context	Description
DiscAF.NET_INCOME	Currency	Company -> Country incorp. -> Official currency	US Dollars	Disclosure uses the official currency of the country of incorporation, whereas receiver Worldscope uses US dollars. Sales data from Worldscope has no conflict because receiver is in the same context. Auxiliary sources: currency_types, olsen
DiscAF.NET_INCOME	Scale factor	1	1000	Disclosure source uses 1, whereas Worldscope receiver uses 1000. Sales data has no conflict. Auxiliary sources: none
DiscAF.COMPANY_NAME = WorldcAF.COMPANY_NAME	Name format	Disclosure name	Worldscope name	Disclosure and Worldscope use different formats for company names, e.g., DAIMLER BENZ CORP in Disclosure, DAIMLER-BENZ AG in Worldscope. Auxiliary sources: name_map_ds_ws

COIN demo generates conflicts table, organized by modifiers of semantic types defined in the ontology. From the readings, you know that attributes in source relations are elevated to semantic types and modifiers help specify contexts. The following table generated from the demo summarizes the same set of conflicts:

Semantic Type	Column	Source	Modifier	Modifier value in source context	Modifier value in target context	Conversion Function
Company Financials	IV	DiscAF(Name, Name, Shares, Income, Sales, Assets, Incorp)	currency	disclosure : MV1	worldscope : USD	attr(V13, fyEnding, V12), value(V12, V11, V10), olsen_p(V9, V8, V7, V6), value(V9, V11, V5), value(V8, V11, V4), value(V6, V11, V10), value(V7, V11, V3), V2 is V1 * V3
Company Financials	Income	Same as above	Scale Factor	disclosure : 1	worldscope : 1000	V5 is V4 / V3, V2 is V1 * V5
Company Name	Name	Same as above	format	disclosure : ds_name	worldscope : ws_name	Name_map(V4, V3, V2, V1)

The last column has the conversion functions needed to resolve the conflict.

The query should resolve all semantic conflicts. Some of the auxiliary sources will be used in resolving semantic conflicts. For example, the JOIN on company names cannot be evaluated directly because of naming format conflict. Instead, the source name_map_ds_ws should be used to convert names from source format to receiver format before performing equality evaluation. Similarly, currency conflict requires the olsen source. Because Disclosure uses official currency of country of incorporation, we also need currecy_types to find the official currency using the LOCATION_OF_INVCOPRP attribute value. Overall, there are a total of 5 data sources involved although the original SQL has only 2 data sources.

The SQL that resolves the identified conflicts can be done in several sub queries or in a single comprehensive query, like the following one:

Select WorldcAF.COMPANY_NAME, DiscAF.LATEST_ANNUAL_DATA, DiscAF.NET_INCOME*0.001*olsen.rate, Worldc.SALES

From WorldcAF, DiscAF, name_map_ds_ws, currency_types, olsen

Where DiscAF.COMPANY_NAME=. name_map_ds_ws.ds_names and

name_map_ds_ws .ws_names=WorldcAF.COMPANY_NAME and

DiscAF.LATEST_ANNUAL_DATA= WorldcAF.LATEST_ANNUAL_FINANICAL_DATE and

Disc.LATEST_ANNUAL_DATA=olsen.date and

DiscAF.LOCATION_OF_INCORP=currency_types.country and

Currency_types.currency=olsen.exchanged and

olsen.expressed=”USD”;

The TASC demo generates the following SQL to resolve conflicts:

select name_map_ds_ws.ws_names, olsen.date, ((discaf.net_income*0.001)*olsen.rate), worldcaf.sales

from   (select ds_names, ws_names

        from   name_map_ds_ws) name_map_ds_ws,

       (select country, currency

        from   currencytypes

        where  currency <> 'USD') currencytypes,

       (select exchanged, 'USD', rate, date

        from   olsen

        where  expressed='USD') olsen,

       (select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp

        from   discaf) discaf,

       (select company_name, latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets, country_of_incorp

        from   worldcaf) worldcaf

where  name_map_ds_ws.ws_names = worldcaf.company_name

and    olsen.date = discaf.latest_annual_data

and    discaf.latest_annual_data = worldcaf.latest_annual_financial_date

and    name_map_ds_ws.ds_names = discaf.company_name

and    currencytypes.country = discaf.location_of_incorp

and    currencytypes.currency = olsen.exchanged

and    currencytypes.currency <> 'USD'

union

select name_map_ds_ws2.ws_names, discaf2.latest_annual_data, (discaf2.net_income*0.001), worldcaf2.sales

from   (select ds_names, ws_names

        from   name_map_ds_ws) name_map_ds_ws2,

       (select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp

        from   discaf) discaf2,

       (select country, 'USD'

        from   currencytypes

        where  currency='USD') currencytypes2,

       (select company_name, latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets, country_of_incorp

        from   worldcaf) worldcaf2

where  name_map_ds_ws2.ws_names = worldcaf2.company_name

and    discaf2.latest_annual_data = worldcaf2.latest_annual_financial_date

and    name_map_ds_ws2.ds_names = discaf2.company_name

and    discaf2.location_of_incorp = currencytypes2.country

This SQL is semantically equivalent to the much simpler SQL. The demo system rewrites the SQL into a format the system can execute, and to some extent, to a format the system thinks to be “optimal”. On high level, the first half (before UNION) takes care of companies in countries where official currency is not US Dollars, while the second half takes care of those whose official currency is US Dollars.

Assumptions: We assume that both WorldcAF and DiscAF tables are large. However, the number of records that have matching company names and annual financial data dates are small.

We can think about optimization from both conventional DBMS and distributed DBMS aspects. From conventional DBMS aspect, we should use some optimization heuristics, such as SELECT before JOIN, perform JOIN that results in smaller result set before other JOINs, etc. The following query in relational algebra makes use of these basic heuristics:

Π_{COMPAY_NAME,
LATEST_ANNUAL_DATA, NET_INCOME*0.001*rate, SALES} ((((Π_{ws_names,
LATEST_ANNUAL_DATA, NET_INCOME, LOCATION_OF_INCORP}(DiscAF⋈_{COMPANY_NAME=ds_names} name_map_ds_ws)) ⋈_{ds_names=COMPANY_NAME and LATEST_ANNUAL_DATA=
LATEST_ANNUAL_FINANCIAL_DATE} WorldcAF) ⋈_{LOCATION_OF_INCORP=country} currency_types) ⋈_{currency=exchanged and LATEST_ANNUAL_DATA=date} (σ_{expressed=”USD”}(olsen)) )

With the assumptions made earlier, we perform JOIN on company names and financial dates before JOIN on currency types. If the first JOIN results in, say, 4 records only, the JOIN on currency types will be fast; if we did otherwise, thousands of records in DiscAF will JOIN with currency types, which then need to JOIN with WorldcAF on financial dates and company names. Last, we JOIN with olsen source. Had we performed this JOIN with DiscAF first, we would have to retrieve thousands of exchange rates, instead of 4 in this case. For olsen source, we also perform SELECT before JOINing with other intermediate relations.

With distributed DBMS, network communication cost is often high and capabilities of source vary. Therefore, we should limit unnecessary transmission of data and avoid letting slow sources do unnecessary work.

Example of limit the amount of data transmission: suppose we join two tables of different sizes on the network. We can ship both tables to the user location to perform the JOIN locally, or ship the smaller table to the larger table for JOIN and return results to the user location. Obviously, the latter is better. In the case of our SQL, the name mapping table is often a smaller lookup table, therefore it should be shipped to the DiscAF location to perform the JOIN.

As we described earlier, olsen is a slow web source. The SQL in relational algebra purposely delays the JOIN with it to limit queries to only those that have survived all the other JOINs.

In addition, we can arrange parallel operations of several sub queries: a) we can fetch data from WorldcAF and DiscAF in parallel; b) once a smaller set of currencies is obtained from the join between WorldcAF and DiscAF, we can send request for currency conversions in parallel to olsen source.

To summarize, the query can be executed using the following plan:

Step 1: Perform the following three tasks in parallel:

a) Ship name_map_ds_ws to DiscAF and perform this JOIN:

DiscAF⋈_{COMPANY_NAME=ds_names} name_map_ds_ws

b) Fetch WorldcAF and ship it to DiscAF

c) Fetch currency_types and ship it to DiscAF

Step 2: Perform the following three tasks in serial:

a) Perform PROJECT to the result set from JOIN in 1-a):

Π_{ws_names, LATEST_ANNUAL_DATA,
NET_INCOME, LOCATION_OF_INCORP}

b) Perform JOIN with WorldcAF (suppose data has already arrived)

c) Perform JOIN with currency_types

Step 3: For each currency type in the result set from step 2, query olsen web source for exchange rate in parallel.

Step 4: Merge results and perform the final PROJECT to get final answer:

Π_{COMPAY_NAME,
LATEST_ANNUAL_DATA, NET_INCOME*0.001*rate, SALES}

Sample Solutions to Question 2

1. SQL

Select s98.State, s98.Population, s00.Population

From Census s98, Census s00

Where s98.Year=1998 and s00.Year=2000 and s98.State=s00.State and s98.Population>s00.Population

2. XQuery

for $s00 in doc("census.xml")//Census[Year=2000]

let $s98 := doc("census.xml")//Census

[Year=1998 and ./State=$s00/State]

where xn:int($s00/Population) < xn:int($s98/Population)

return <Rec>

{$s98/State,$s98/Population, $s00/Population}

</Rec>

Note: in XPath, . stands for current node, .. stands for parent node.

For your curiosity, here is the output when the query is run against the database used in lecture:

<Rec>

<State>Maine</State>

</Rec> <Rec>

<State>West Virginia</State>

</Rec>