DRAFT DRAFT DRAFT

MIT Information Technology
MIT ID Database V2.0 System Design

mitid@mit.edu
(http://web.mit.edu/mitid/www/v2)
Last modified: March 31, 1998

Table of Contents

  1. Overview
  2. MIT ID Database Design Overview
  3. Issues

Overview

This System Design document outlines the design changes planned for the MIT ID Database service, as described in the Version 2.0 Project Charter. The new Graduate Admissions End User Application, which is planned to begin rolling out to departmental offices during the spring of 1998, will be used by departmental admissions offices to maintain applicant data within the MITSIS Mid-tier database. Once all required applicant data has been received, a Graduate Admissions administrative application will be used to "clean up" the data in the Mid-tier before migrating it to the MITSIS Backend database.

[Figure: Graduate Admissions Overview]

Version 2.0 changes to the MIT ID Database service include those required to enable the Graduate Admissions administrative application to use the MIT ID lookup and assignment functionality of the MIT ID Database. Specifically, these include the following functional changes:

MIT ID Database Design Overview

The MIT ID Database service consists of the following primary components:

The MITID DLL (Dynamically Linked Library) implements the MITID API (Application Programmer Interface) to be integrated with departmental systems. A standalone application that uses the DLL is also provided. It functions as a sample application and enables people to use the service before it becomes fully integrated into their departmental systems.

The MIT ID Server consists of a server daemon process handling transactional lookup and assignment requests. All applications that use the MIT ID Database do so through the MIT ID Server as it is this server that actually implements the search algorithm.

The MITID Database is the Institute system of record for MIT ID numbers. It is not the system of record for all information about people. It includes a minimal set of biographical information about people in order to support its lookup functionality. This information is entered when the ID is created. Changes to the biographical information are applied via a feed from the Data Warehouse.

The following diagram depicts the dataflow between the different components of the MIT ID Database service.

This high-level design will not change in V2.0; however, the implementation of each component will. These changes are described in the following sections.

DLL Design Changes

The MIT ID Application Programmer Interface (API), implemented via a Dynamically Linked Library (DLL), is the mechanism through which departmental systems gain access to the MIT ID Database.

Port to 32 bit

The V1.0 DLL (people.dll) is a 16-bit DLL developed for use under Windows 3.1; it also works for 16-bit applications under Windows 95 and Windows NT. The V2.0 DLL (mitid32.dll) will be a 32-bit port for Windows NT. Windows-specific code will be used only where necessary so that the DLL may be ported to additional platforms in the future.

Port to Kerberos V5

The V1.0 DLL uses Kerberos V4 to authenticate the end user. The V2.0 DLL will add support for Kerberos V5 using the GSSAPI.

Port to PowerBuilder 5.0

The MITID standalone application will be ported to 32-bit PowerBuilder 5.0 to take advantage of the V2.0 changes. In addition, the application will be modified to accept a formatted file of records and to process this batch so as to determine which records are an EXACT_MATCH, NO_MATCH or POSSIBLE_MATCH under the new algorithm.
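The batch mode described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the actual application is a PowerBuilder program calling the DLL). The pipe-delimited record layout, the function names and the `fake_lookup` stand-in for the server query are all assumptions; only the three status names come from this document.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Status names come from the design text; everything else is hypothetical.
STATUSES = ("EXACT_MATCH", "NO_MATCH", "POSSIBLE_MATCH")

Record = Dict[str, str]


def parse_record(line: str) -> Record:
    """Parse one line of an assumed 'last|first|dob|ssn' batch file format."""
    last, first, dob, ssn = (field.strip() for field in line.split("|"))
    return {"last_name": last, "first_name": first, "dob": dob, "ssn": ssn}


def process_batch(lines: List[str],
                  lookup: Callable[[Record], str]) -> Dict[str, List[Record]]:
    """Group batch records by the status that the lookup returns for each."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_record(line)
        status = lookup(record)
        if status not in STATUSES:
            raise ValueError(f"unexpected status: {status!r}")
        groups[status].append(record)
    return dict(groups)


def fake_lookup(record: Record) -> str:
    """Stand-in for the server query, for demonstration only."""
    known = {"123456789": "Smith"}
    name = known.get(record["ssn"])
    if name is None:
        return "NO_MATCH"
    return "EXACT_MATCH" if name == record["last_name"] else "POSSIBLE_MATCH"


batch = ["Smith|Ann|1970-01-02|123456789",
         "Smyth|Ann|1970-01-02|123456789",
         "Jones|Bob|1965-05-06|987654321"]
result = process_batch(batch, fake_lookup)
```

In a flow like this, only the records grouped under POSSIBLE_MATCH would need manual review.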

Improve API Usability

The API will undergo substantial re-design to enable better batch processing support and to simplify its use while making it more configurable and extendable.

The V1.0 API was designed to work primarily in an interactive mode where the displayed result list would be manually reviewed for each query to determine if there was a match or not. The V2.0 matching algorithm will be improved so that only questionable results will require a manual review.

The V1.0 API includes a number of data structures that each client application must declare and use. V2.0 will combine these structures into a single structure, referred to as the MITID_OBJECT. The MITID_OBJECT will encompass all of the data used by the DLL. The V2.0 API will include access functions for applications to read data from and write data to the MITID_OBJECT.

The MITID_OBJECT will include references to the following structures: PROCESS_INFO, QUERY_INFO, RESULTS_INFO, VERSION_INFO. Some of these structures have existing counterparts in V1.0 but will be extended to include additional information such as "host name" and "port id" for the server, connection_status, and query_options. The V2.0 MITID_OBJECT structure will replace the following 1.0 structures: PROCESS_STRUCTURE (a.k.a. CONTEXT), PERSON_STRUCTURE, QUERY_PERSON_TYPE.
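The shape of the MITID_OBJECT can be sketched as below. This is an illustration in Python, not the DLL's actual declarations: the four sub-structure names and the items "host name", "port id", connection_status and query_options come from the text above, while every other field name, default value and the two accessor functions are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessInfo:
    """PROCESS_INFO: connection details (field names are assumptions)."""
    host_name: str = ""
    port_id: int = 0
    connection_status: str = "DISCONNECTED"


@dataclass
class QueryInfo:
    """QUERY_INFO: the person being looked up plus search options."""
    fields: Dict[str, str] = field(default_factory=dict)
    query_options: Dict[str, bool] = field(default_factory=dict)


@dataclass
class ResultsInfo:
    """RESULTS_INFO: overall status and the returned records."""
    status: str = "NO_MATCH"
    records: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class VersionInfo:
    """VERSION_INFO: which API version the client speaks."""
    api_version: str = "2.0"


@dataclass
class MitidObject:
    """MITID_OBJECT: the single structure a client application holds."""
    process: ProcessInfo = field(default_factory=ProcessInfo)
    query: QueryInfo = field(default_factory=QueryInfo)
    results: ResultsInfo = field(default_factory=ResultsInfo)
    version: VersionInfo = field(default_factory=VersionInfo)


def set_query_field(obj: MitidObject, name: str, value: str) -> None:
    """Hypothetical write accessor: store one query field."""
    obj.query.fields[name] = value


def get_query_field(obj: MitidObject, name: str) -> str:
    """Hypothetical read accessor: return a query field, or '' if unset."""
    return obj.query.fields.get(name, "")


obj = MitidObject()
obj.process.host_name = "mitid.example.edu"  # placeholder host
set_query_field(obj, "last_name", "Smith")
```

Keeping all state in one object means a client declares a single variable and goes through accessors, rather than declaring several structures as in V1.0.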

Since the 2.0 server will continue to support 16-bit clients using the 1.0 API and DLL, the 2.0 API need not be backward compatible. NOTE: Future server updates will make the 1.0 API obsolete, so all applications should move to the 2.0 API. The following table illustrates the migration of the API from 1.0 to 2.0. Specific changes to the Version 2.0 API can be found in the document MIT ID 2.0 Application Programmer Interface.

Server Design Changes

The server component of the MIT ID Database is an Oracle Pro*C program used in conjunction with the INET Daemon server, which enables the server program to communicate with any Internet client application via a socket. The server authenticates its clients via Kerberos and then determines whether they are authorized to use the service. Authentication is currently handled via Kerberos V4, but the server will be changed to also handle Kerberos V5 clients. Authorizations are currently implemented via the MIT ID Database but may in the future move to the Roles Database; no major changes in the authorization mechanism are planned for this release. Additional changes may be made to improve the maintainability and operability of the server application.

Server backward compatibility is required for this release so that existing deployed V1.0 clients do not need to be replaced until they are ready to go to this new 32 bit implementation. While the server will continue to support its current functionality to V1.0 clients, new functionality will only be available to the V2.0 clients. This includes improvements to the matching algorithm and results feedback.

The following tables summarize the V2.0 Server Connection Process and Search Algorithm.
 

V2.0 Server Connection Process

Step                                        V2.0 Change
Client connects                             No change
Determine client version                    Per-connection configurations enable simultaneous connections from both V1.0 and V2.0 clients
Client connection authenticated (Kerberos)  Either V4 or V5
Parse connection request (search, assign)   V2.0 clients will be able to specify additional options
Search/assignment performed                 See the Search Algorithm below
Results returned                            The V2.0 result set, as well as each individual result record, will be identified as EXACT_MATCH, NO_MATCH or POSSIBLE_MATCH
Connection closes                           No change
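The per-connection configuration idea in the connection process above can be sketched as follows. This is an illustrative Python model, not the Pro*C server code. The class, function and option names (including `omit_soundex`) are hypothetical, and the specific choices shown (V1.0 clients limited to Kerberos V4, V2.0 clients accepted on either version, options from V1.0 clients silently dropped) are assumptions rather than decisions recorded in this document.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ConnectionConfig:
    """Per-connection settings chosen once the client version is known."""
    client_version: str
    kerberos_versions: Tuple[int, ...]  # Kerberos versions accepted (assumed)
    accepts_options: bool               # may the client send search options?
    per_record_status: bool             # is each result record given a status?


def config_for(client_version: str) -> ConnectionConfig:
    """Pick the configuration for one connection (values are assumptions)."""
    if client_version == "1.0":
        return ConnectionConfig("1.0", (4,), False, False)
    if client_version == "2.0":
        return ConnectionConfig("2.0", (4, 5), True, True)
    raise ValueError(f"unsupported client version: {client_version!r}")


def accept_request(client_version: str, kerberos_version: int,
                   request_type: str,
                   options: Optional[Dict[str, bool]] = None) -> Dict[str, object]:
    """Validate a request against the connection's configuration."""
    cfg = config_for(client_version)
    if kerberos_version not in cfg.kerberos_versions:
        raise PermissionError("Kerberos version not accepted for this client")
    if request_type not in ("search", "assign"):
        raise ValueError(f"unknown request type: {request_type!r}")
    # Options from clients that may not send them are dropped, not rejected.
    effective = dict(options or {}) if cfg.accepts_options else {}
    return {"request": request_type, "options": effective,
            "per_record_status": cfg.per_record_status}


v1 = accept_request("1.0", 4, "search", {"omit_soundex": True})
v2 = accept_request("2.0", 5, "search", {"omit_soundex": True})
```

Because the configuration is resolved per connection, V1.0 and V2.0 clients can be served at the same time by the same server process.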
 
 

Search Algorithm

Cast Wide Net

Existing in V1.0: Get any/all remotely matching records from the database:
  • Matching SSN
  • Matching first name
  • Matching soundex on 'lastName'
  • Matching soundex on 'maybe_lastName', where "maybe_lastName" includes phonetically similar names:
    • PH -> F
    • F -> PH
    • KN -> N
    • N -> KN
    • C -> K
    • K -> C
  • Matching MMDD
  • Matching DDMM (for date mix-ups)

V2.0 Changes: Clients will be able to specify search options to control the size of this initial data set. (Specific options are yet to be determined but may include options to omit soundex records or suspected obsolete records. An OBSOLETE_RECORD indicator may be used by the database to flag records to be ignored by the search algorithm. See Database Design Changes for more details.)

Evaluate Results

Existing in V1.0: Assign a numeric value to each field and sum the values for the record:
  • SSN - MatchDigitPercent
  • DOB - MatchDigitPercent
  • LastName - matchAlphaPercentage
  • MiddleName - matchAlphaPercentage
  • FirstName - matchAlphaPercentage

MatchAlphaPercentage is based on:
  • Matching characters
  • Matching positions
  • Target length

MatchDigitPercentage is based on:
  • Number of matching digits
  • Matching positions
  • Target length

V2.0 Changes: V1.0 clients will continue to use the existing matching algorithm, while V2.0 clients will use a different search algorithm that can more specifically identify the individual results as EXACT_MATCH, NO_MATCH or POSSIBLE_MATCH. (Specific changes yet to be determined.)

Results Returned and Displayed

Existing in V1.0:
  • SearchPersonMatch - One Exact Match (100.0)
  • SearchPersonList - Multiple Results Found
  • SearchPersonNone - No Results Found

V2.0 Changes: (Specific changes yet to be determined.)
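Parts of the V1.0 search steps summarized above can be sketched in a few lines. This Python sketch is illustrative only and is not the server's Pro*C implementation. The substitution pairs and the MMDD/DDMM swap come from the table. The percentage formula (same character at the same position, divided by target length) is an assumption built from the three listed factors, since the exact formula is not given here, and all function names are hypothetical. Soundex matching is omitted.

```python
from typing import Dict, Set

# Substitution pairs listed for "maybe_lastName" in the design text.
SUBSTITUTIONS = [("PH", "F"), ("F", "PH"), ("KN", "N"),
                 ("N", "KN"), ("C", "K"), ("K", "C")]


def maybe_last_names(last_name: str) -> Set[str]:
    """Return variants made by applying one substitution at one position."""
    name = last_name.upper()
    variants: Set[str] = set()
    for old, new in SUBSTITUTIONS:
        start = name.find(old)
        while start != -1:
            variants.add(name[:start] + new + name[start + len(old):])
            start = name.find(old, start + 1)
    return variants


def date_keys(month: int, day: int) -> Set[str]:
    """MMDD plus the swapped DDMM key, to catch day/month mix-ups."""
    return {f"{month:02d}{day:02d}", f"{day:02d}{month:02d}"}


def match_percent(candidate: str, target: str) -> float:
    """Assumed formula: percent of target positions with the same character."""
    if not target:
        return 0.0
    hits = sum(1 for c, t in zip(candidate.upper(), target.upper()) if c == t)
    return 100.0 * hits / len(target)


def record_score(candidate: Dict[str, str], target: Dict[str, str]) -> float:
    """Sum the per-field percentages over the fields present in the target."""
    return sum(match_percent(candidate.get(name, ""), value)
               for name, value in target.items())


variants = maybe_last_names("Fisher")
keys = date_keys(3, 12)
score = record_score({"last": "Smyth", "first": "Ann"},
                     {"last": "Smith", "first": "Ann"})
```

The purely mechanical substitution can produce implausible variants (for example, N -> KN applied to a name that already contains KN); that is acceptable for a wide-net step because the scoring step ranks the candidates afterwards.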
 

Database Design Changes

The only significant changes to the database in V2.0 will be to assist with the cleanup of MIT ID data problems (see MIT ID Data Problems) and to improve the data feeds between the MIT ID Database and the Warehouse.

Data Cleanup

When multiple MIT IDs are found for a single person, human intervention will be required to determine which record (and therefore which MIT ID) the search algorithm in the server application should use. A manual mechanism will set an OBSOLETE_RECORD indicator that the search algorithm can use to filter out the other records. By default the OBSOLETE_RECORD indicator will be zero (0). Other values will either be a valid MIT ID that references the correct record or indicate that the correct ID is unknown. Keeping these records in the database for historical reasons will prevent their MIT IDs from accidentally being assigned again.
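The use of the OBSOLETE_RECORD indicator described above can be sketched as follows. This is a Python illustration under stated assumptions: the indicator is stored as text, "0" marks an active record, any other value is either the MIT ID of the correct record or the hypothetical marker "UNKNOWN", and the record layout, sample IDs and function names are invented for the example.

```python
from typing import Dict, List, Optional

ACTIVE = "0"         # default indicator value per the design text
UNKNOWN = "UNKNOWN"  # hypothetical marker for "correct ID not known"

Record = Dict[str, str]


def searchable(records: List[Record]) -> List[Record]:
    """Keep only the records the search algorithm should consider."""
    return [r for r in records if r["obsolete_record"] == ACTIVE]


def resolve(mit_id: str, by_id: Dict[str, Record]) -> Optional[str]:
    """Follow OBSOLETE_RECORD references to the current MIT ID, if known."""
    seen = set()
    while mit_id in by_id and mit_id not in seen:
        seen.add(mit_id)  # guards against accidental reference loops
        indicator = by_id[mit_id]["obsolete_record"]
        if indicator == ACTIVE:
            return mit_id
        if indicator == UNKNOWN:
            return None
        mit_id = indicator  # indicator holds the MIT ID of the correct record
    return None


records = [
    {"mit_id": "900000001", "last_name": "Smith", "obsolete_record": "0"},
    {"mit_id": "900000002", "last_name": "Smith", "obsolete_record": "900000001"},
    {"mit_id": "900000003", "last_name": "Jones", "obsolete_record": "UNKNOWN"},
]
by_id = {r["mit_id"]: r for r in records}
active = searchable(records)
```

Obsolete records stay in the table, so their IDs remain reserved, but they no longer appear in search results.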

Data Feeds

The data feeds between the Warehouse and the MIT ID Database keep the data in these two databases synchronized. Updates are needed in both directions. The current feed updates all records in the MIT ID Database even if there are no changes, and it does not recognize when records have disappeared from the feed. The new feed will update only changed records and will use the OBSOLETE_RECORD indicator to flag records that have disappeared from the feed. Updating only changed records will be important because the number of records is expected to increase significantly when past students are added to the Warehouse.
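The change-only feed described above amounts to a comparison between the stored records and the incoming feed. The sketch below shows that comparison in Python under stated assumptions: both sides are keyed by MIT ID, each record is a flat dictionary of biographical fields, and the function name and sample data are hypothetical. How feed records with no stored counterpart are handled is not covered by this sketch.

```python
from typing import Dict, List

Record = Dict[str, str]


def plan_feed_update(current: Dict[str, Record],
                     feed: Dict[str, Record]) -> Dict[str, List[str]]:
    """Compare the feed with the stored records, keyed by MIT ID.

    Returns the IDs to update (present in both, data differs) and the IDs
    to flag via OBSOLETE_RECORD (stored but no longer present in the feed).
    """
    update = sorted(k for k in feed if k in current and feed[k] != current[k])
    flag = sorted(k for k in current if k not in feed)
    return {"update": update, "flag_obsolete": flag}


current = {
    "900000001": {"last_name": "Smith", "first_name": "Ann"},
    "900000002": {"last_name": "Jones", "first_name": "Bob"},
    "900000003": {"last_name": "Lee", "first_name": "Kim"},
}
feed = {
    "900000001": {"last_name": "Smith", "first_name": "Ann"},   # unchanged
    "900000002": {"last_name": "Jones", "first_name": "Rob"},   # changed
}
plan = plan_feed_update(current, feed)
```

Unchanged records generate no work at all, which is what keeps the feed manageable as the record count grows.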

Currently there is no data feed from the MIT ID Database back to the Warehouse; this will need to change in V2.0 so that changes in each of these two systems remain synchronized.

Additionally, a set of scripts is needed to assist with the manual cleanup of data problems as they are identified. In particular, a mechanism is needed to better identify and resolve multiple-ID issues, that is, cases where multiple entries for a single person result in multiple IDs. With the addition of the OBSOLETE_RECORD indicator, these scripts do not need to be in production immediately; however, being able to identify multiple IDs proactively and to eliminate them from search results in a timely manner is critical to the success of the MIT ID Database.

Open Issues and Questions

ISSUE/QUESTION: No support for previous names
  There is currently no mechanism to allow searches on previous or alternative names; however, some departmental systems do provide this capability.
  STATUS: OPEN
  DATE: 03/19/1998

ISSUE/QUESTION: Kerberos V4 and V5
  Will we need to support both Kerberos V4 and V5 in the client?
  STATUS: OPEN
  DATE: 03/19/1998

ISSUE/QUESTION: Server Search Options and algorithm changes
  Need to solidify the Server Search Options to be available in this release, as well as the changes to the search algorithm that determine result status.
  STATUS: OPEN
  DATE: 03/19/1998

ISSUE/QUESTION: Data Feed
  Need to solidify changes.
  STATUS: OPEN
  DATE: 03/19/1998

ISSUE/QUESTION: Vendor Product Integration
  Need to specify how vendor products not using the MITID DLL will be handled.
  STATUS: OPEN
  DATE: 03/19/1998