SMA 5506: Database Technology
Summer 2005
Lecture 16 & 17
Sample
Solutions to HW3: Semantic Integration & XQuery
Question 1. In the lecture on Semantic Data Integration, you saw the TASC demo and the context definitions for the three primary data sources: Disclosure, Worldscope, and Datastream. You will use the first two sources for this assignment. For your reference, their context definitions are given in the following table:
|
|
Disclosure |
Worldscope |
|
Currency used |
Country of incorp. |
US Dollars |
|
Currency symbol |
3 letters |
3 letters |
|
Scale factor |
1 |
1000 |
|
Company names |
Disclosure names |
Worldscope names |
|
Date style |
‘American Style /’ |
‘American Style /’ |
The simplified schemas of the sources are given below (bold to indicate the key):
Disclosure DB:
DiscAF(company_name:string, latest_annual_data:string,
net_income:number, location_of_incorp:string).
Worldscope DB:
WorldcAF(company_name:string, latest_annual_financial_datE:string,
SALES:NUMBER).
There are also a few auxiliary data sources useful for answering questions in this assignment. You may need to use some or all of the following relations/functions:
Use
it to translate currency symbols from 3 chars to 2 chars and vice versa, e.g.,
SQL statement select char3_currency from
currency_map where char2_currency=’US’ will return USD in the result set. Note either column is a key attribute.
Use
it to find out the official currency of a given country, e.g., select currency from currency_types where
country=’
·
datexform(date1:STRING,
format1:STRING, date2:STRING, format2:STRING)
Use
it to translate between date formats, e.g.,
select date 1 from datexform where format1=’European Style –’ and
date2=’01/02/03’ and formate2=”American Style /’ will return the
·
olsen(exchanged:STRING, expressed:STRING, rate:NUMBER, date:STRING),
Use
it to obtain exchange rate between two currencies. Note that olsen uses
American style for its date and 3-char format for currencies. For example, select rate from olsen where exchanged=’USD’
and expressed=’SGD’ and date=’01/02/03’ will return the exchange rate for
the number of SGD’s for 1 USD on January 2nd, 2003. This is a Web
source with relatively slow speed compared with the other sources; it also
lacks set capability, meaning it processes one record at a time.
pair-wise
translation between Disclosure names and Worldscope names. A same company may
use slightly different names in different DBs. Fro example, Daimler-Benz may
appear as Daimler-Benz in one DB, as Daimler-Benz AG in another, and yet
Daimler Benz Corp in a third DB. This relation and the next two allow you to
translate company names.
pair-wise
translation between Datastream names and Disclosure names
pair-wise
translation between Datastream names and Worldscope names
A Worldscope user now wants to find out the latest net income and sales of all companies that appear in both databases. In the COIN framework, a user can express a query in his own context and let COIN to resolve context conflicts. So the above user, who is in Worldscope context, can issue the following SQL to the COIN system to get what he wants:
select
WorldcAF.COMPANY_NAME, DiscAF.LATEST_ANNUAL_DATA, DiscAF.NET_INCOME, WorldcAF.SALES
from
DiscAF, WorldcAF
where
DiscAF.COMPANY_NAME = WorldcAF.COMPANY_NAME
and
DiscAF.LATEST_ANNUAL_DATA = WorldcAF.LATEST_ANNUAL_FINANCIAL_DATE;
Please answer the following questions:
1) List and describe as complete as possible the context conflicts that must be resolved when this query is evaluated.
2) Rewrite the SQL into one that resolves these conflicts so that results will be in the user’s context.
3) Suppose all the sources are distributed on the network. Discuss how you can optimize the SQL from step 2. State your assumptions clearly and keep your discussions concise. (Useful hint: Assume that the number of companies that appear in both databases is not large.)
Question 2: You have seen the census database in Access SQL relational DB and in XML. Its relational schema is:
census(State, Abbr, Year, Populations, Cars, Drivers).
Here is an excerpt for the cencus.xml data file:
<dataroot>
<Census>
<State>Alabama</State>
<Abbr>AL</Abbr>
<Year>2000</Year>
<Population>4447</Population>
<Cars>3960</Cars>
<Drivers>3521</Drivers>
</Census>
<Census>
<State>
<Abbr>AK</Abbr>
<Year>2000</Year>
<Population>627</Population>
<Cars>594</Cars>
<Drivers>465</Drivers>
</Census>
…
</dataroot>
Due to migration of people in the
1) Write a query in SQL that find the states whose population in 2000 is less than that of 1998, return the name of the state and the population in 1998 and 2000.
2) Write the corresponding XQuery against the xml file.
Sample Solutions to
Question 1
Note: We deliberately made
the query executable in the TASC demo on COIN website. The solutions given here
will be somewhat different from what the demo generates. As you know, there is
always room for optimization.
1)
We need study the context
definitions of both sources and receiver and the attributes in the SQL to
determine semantic conflicts; the following table summarizes the results:
Sources:
Worldscope and Disclosure. Refer to the background section for contexts.
Receiver:
Worldscope
Note: One of the sources,
Worldscope, matches the receiver’s context, so no context conflicts exist
there. We only have to be concerned
about the Disclosure source context and the Worldscope receiver context.
|
Attribute/SQL fragment |
Modifier |
Disclosure context |
Worldscope context |
Description |
|
DiscAF.NET_INCOME |
Currency |
Company -> Country
incorp. -> Official currency |
US Dollars |
Disclosure uses the
official currency of the country of incorporation, whereas receiver
Worldscope uses US dollars. Sales data from Worldscope has no conflict
because receiver is in the same context. Auxiliary sources: currency_types,
olsen |
|
Scale factor |
1 |
1000 |
Disclosure source uses 1, whereas Worldscope
receiver uses 1000. Sales data has no conflict. Auxiliary sources: none |
|
|
DiscAF.COMPANY_NAME =
WorldcAF.COMPANY_NAME |
Name format |
Disclosure name |
Worldscope name |
Disclosure and Worldscope
use different formats for company names, e.g., DAIMLER BENZ CORP in
Disclosure, DAIMLER-BENZ AG in Worldscope. Auxiliary sources: name_map_ds_ws |
COIN demo generates conflicts
table, organized by modifiers of semantic types defined in the ontology. From
the readings, you know that attributes in source relations are elevated to
semantic types and modifiers help specify contexts. The following table
generated from the demo summarizes the same set of conflicts:
|
Semantic Type |
Column |
Source |
Modifier |
Modifier value in source context |
Modifier value in target context |
Conversion Function |
|
Company Financials |
IV |
DiscAF(Name, Name, Shares, Income, Sales, Assets, Incorp) |
currency |
disclosure : MV1 |
worldscope : USD |
attr(V13, fyEnding, V12), value(V12, V11, V10), olsen_p(V9, V8, V7,
V6), value(V9, V11, V5), value(V8, V11, V4), value(V6, V11, V10), value(V7,
V11, V3), V2 is V1 * V3 |
|
Company Financials |
Income |
Same as above |
Scale Factor |
disclosure : 1 |
worldscope : 1000 |
V5 is V4 / V3, V2 is V1 * V5 |
|
Company Name |
Name |
Same as above |
format |
disclosure : ds_name |
worldscope : ws_name |
Name_map(V4, V3, V2, V1) |
The last column has the
conversion functions needed to resolve the conflict.
2)
The query should resolve all
semantic conflicts. Some of the
auxiliary sources will be used in resolving semantic conflicts. For example,
the JOIN on company names cannot be evaluated directly because of naming format
conflict. Instead, the source name_map_ds_ws
should be used to convert names from source format to receiver format before
performing equality evaluation. Similarly,
currency conflict requires the olsen
source. Because Disclosure uses official currency of country of incorporation,
we also need currecy_types to find
the official currency using the LOCATION_OF_INVCOPRP
attribute value. Overall, there are a total of 5 data sources involved although
the original SQL has only 2 data sources.
The SQL that resolves the
identified conflicts can be done in several sub queries or in a single
comprehensive query, like the following one:
Select WorldcAF.COMPANY_NAME, DiscAF.LATEST_ANNUAL_DATA,
DiscAF.NET_INCOME*0.001*olsen.rate, Worldc.SALES
From WorldcAF, DiscAF,
name_map_ds_ws, currency_types, olsen
Where DiscAF.COMPANY_NAME=.
name_map_ds_ws.ds_names and
name_map_ds_ws
.ws_names=WorldcAF.COMPANY_NAME and
DiscAF.LATEST_ANNUAL_DATA=
WorldcAF.LATEST_ANNUAL_FINANICAL_DATE and
Disc.LATEST_ANNUAL_DATA=olsen.date
and
DiscAF.LOCATION_OF_INCORP=currency_types.country
and
Currency_types.currency=olsen.exchanged
and
olsen.expressed=”USD”;
The TASC demo generates the
following SQL to resolve conflicts:
select name_map_ds_ws.ws_names, olsen.date, ((discaf.net_income*0.001)*olsen.rate), worldcaf.salesfrom (select ds_names, ws_names from name_map_ds_ws) name_map_ds_ws, (select country, currency from currencytypes where currency <> 'USD') currencytypes, (select exchanged, 'USD', rate, date from olsen where expressed='USD') olsen, (select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp from discaf) discaf, (select company_name, latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets, country_of_incorp from worldcaf) worldcafwhere name_map_ds_ws.ws_names = worldcaf.company_nameand olsen.date = discaf.latest_annual_dataand discaf.latest_annual_data = worldcaf.latest_annual_financial_dateand name_map_ds_ws.ds_names = discaf.company_nameand currencytypes.country = discaf.location_of_incorpand currencytypes.currency = olsen.exchangedand currencytypes.currency <> 'USD'unionselect name_map_ds_ws2.ws_names, discaf2.latest_annual_data, (discaf2.net_income*0.001), worldcaf2.salesfrom (select ds_names, ws_names from name_map_ds_ws) name_map_ds_ws2, (select company_name, latest_annual_data, current_shares_outstanding, net_income, net_sales, total_assets, location_of_incorp from discaf) discaf2, (select country, 'USD' from currencytypes where currency='USD') currencytypes2, (select company_name, latest_annual_financial_date, current_outstanding_shares, net_income, sales, total_assets, country_of_incorp from worldcaf) worldcaf2where name_map_ds_ws2.ws_names = worldcaf2.company_nameand discaf2.latest_annual_data = worldcaf2.latest_annual_financial_dateand name_map_ds_ws2.ds_names = discaf2.company_nameand discaf2.location_of_incorp = currencytypes2.country
This SQL is semantically
equivalent to the much simpler SQL. The demo system rewrites the SQL into a
format the system can execute, and to some extent, to a format the system
thinks to be “optimal”. On high level, the first half (before
3)
Assumptions:
We assume that both WorldcAF and DiscAF tables are large. However, the number
of records that have matching company names and annual financial data dates are
small.
We can think about
optimization from both conventional DBMS and distributed DBMS aspects. From conventional
DBMS aspect, we should use some optimization heuristics, such as SELECT before
JOIN, perform JOIN that results in smaller result set before other JOINs, etc. The following query in relational algebra
makes use of these basic heuristics:
ΠCOMPAY_NAME,
LATEST_ANNUAL_DATA, NET_INCOME*0.001*rate, SALES ((((Πws_names,
LATEST_ANNUAL_DATA, NET_INCOME, LOCATION_OF_INCORP (DiscAF⋈COMPANY_NAME=ds_names
name_map_ds_ws)) ⋈ ds_names=COMPANY_NAME and LATEST_ANNUAL_DATA=
LATEST_ANNUAL_FINANCIAL_DATE WorldcAF) ⋈
LOCATION_OF_INCORP=country currency_types) ⋈
currency=exchanged and LATEST_ANNUAL_DATA=date (σexpressed=”USD”(olsen))
)
With the assumptions made
earlier, we perform JOIN on company names
and financial dates before JOIN on currency types. If the first JOIN
results in, say, 4 records only, the JOIN on currency types will be fast; if we did otherwise, thousands of
records in DiscAF will JOIN with currency types, which then need to JOIN with
WorldcAF on financial dates and company names. Last, we JOIN with olsen source. Had we performed this JOIN
with DiscAF first, we would have to retrieve thousands of exchange rates,
instead of 4 in this case. For olsen source, we also perform SELECT
before JOINing with other intermediate relations.
With distributed DBMS,
network communication cost is often high and capabilities of source vary.
Therefore, we should limit unnecessary transmission of data and avoid letting
slow sources do unnecessary work.
Example of limit the amount
of data transmission: suppose we join two tables of different sizes on the
network. We can ship both tables to the user location to perform the JOIN
locally, or ship the smaller table to the larger table for JOIN and return
results to the user location. Obviously, the latter is better. In the case of
our SQL, the name mapping table is often a smaller lookup table, therefore it
should be shipped to the DiscAF location to perform the JOIN.
As we described earlier, olsen is a slow web source. The SQL in
relational algebra purposely delays the JOIN with it to limit queries to only
those that have survived all the other JOINs.
In addition, we can arrange
parallel operations of several sub queries: a) we can fetch data from WorldcAF
and DiscAF in parallel; b) once a smaller set of currencies is obtained from
the join between WorldcAF and DiscAF, we can send request for currency
conversions in parallel to olsen
source.
To summarize, the query can
be executed using the following plan:
Step 1: Perform the following three tasks in parallel:
a) Ship name_map_ds_ws to DiscAF and perform this JOIN:
DiscAF⋈COMPANY_NAME=ds_names
name_map_ds_ws
b) Fetch WorldcAF and ship it to DiscAF
c) Fetch currency_types and ship it to DiscAF
Step 2: Perform the following three tasks in serial:
a) Perform PROJECT to the result set from JOIN in 1-a):
Πws_names, LATEST_ANNUAL_DATA,
NET_INCOME, LOCATION_OF_INCORP
b) Perform JOIN with WorldcAF (suppose data has already
arrived)
c) Perform JOIN with currency_types
Step 3: For each currency type in the result set from step
2, query olsen web source for exchange rate in parallel.
Step 4: Merge results and perform the final PROJECT to get
final answer:
ΠCOMPAY_NAME,
LATEST_ANNUAL_DATA, NET_INCOME*0.001*rate, SALES
Sample Solutions to Question 2
1. SQL
Select
s98.State, s98.Population, s00.Population
From
Census s98, Census s00
Where
s98.Year=1998 and s00.Year=2000 and s98.State=s00.State and s98.Population>s00.Population
2. XQuery
for
$s00 in doc("census.xml")//Census[Year=2000]
let
$s98 :=
doc("census.xml")//Census
[Year=1998
and ./State=$s00/State]
where
xn:int($s00/Population) < xn:int($s98/Population)
return
<Rec>
{$s98/State,$s98/Population,
$s00/Population}
</Rec>
Note: in XPath, . stands for
current node, .. stands for parent node.
For your curiosity, here is
the output when the query is run against the database used in lecture:
<Rec>
<State>
<Population>1278</Population>
<Population>1275</Population>
</Rec>
<Rec>
<State>
<Population>1812</Population>
<Population>1808</Population>
</Rec>