The following document was converted from a MS Word Document. This document was used during intitial design discussions to help determine the scope of the project and may or may not reflect the actual implementation design.

People Database and Graduate Admissions
Preliminary Planning Document

By: Norm Wright

SIS Team/Grad. Admissions Team

Jonathan Ives

People Database/IT Integration Team

Date: November 5, 1997

Version 5

Table of Contents 1. Overview *

1.1. Introduction *

1.2. Expected Audience *

1.3. Next Steps *

2. What is the MIT People Database? * 2.1. Overview *

2.2. What is the status of the People Database? *

2.3. What is the Technology behind the People Database? *

2.4. Who is Responsible for the People Database? *

3. Why integrate in with the People Database? * 3.1. Business Functions to be met *

3.2. How do duplicate MIT Id’s get generated? *

3.3. Duplicate checking needs to be done anyway *

4. Where in the Process should the Integration occur? *

5. Outline of Matching Process and Suspense Resolution? *

5.1. Applicants matched to People Database in Batch *

5.2. Processing (1) Definite Matches to an existing Person *

5.3. Processing (2) Definitely does NOT match anyone. *

5.4. Processing (3) may or may not match some person(s) *

6. Potential Barriers to Using the People Database * 6.1. The People Database does not have a Full history of Students *

6.2. The API is in written in C *

6.3. The algorithm needs to be tested in a batch environment *

6.4. The MIT Id Assignment Scheme needs to be coordinated *

6.5. People Database may not have sufficient information to resolve questionable matches *

6.6. People Database does not have a separate Name Suffix *

6.7. MITSIS "creates" other people *

7. Proposed Direction and Known Issues * 7.1. Outline of approach *

7.2. Need a high hit ratio *

7.3. Current API May need more info for suspense resolution *

7.4. We will probably have to adjust the matching logic *

7.5. Code will have to be upgraded *

7.6. C DLL’s upgraded to 32 bit *

7.7. No need to Port to MAC *

7.8. Who will do this work? *

7.9. Could this be designed more generically? *

7.10. Kerberos 4 vers. Kerberos 5 *

1. Appendix – The Current MITSIS Admissions Matching Algorithm * 1.1. Where is this algorithm? *

1.2. General Rules *

1.3. Wide net of "potential" matches is cast *

1.4. People caught in wide net are compared to "new" person *

1.5. Each person caught in the net is categorized as follows: *

1.6. The rules for this classification are stored in a table, but match the following guidelines: *

1.7. Date of Birth Matching *

1.8. Name Matching *

1.9. ID Matching *

Overview

Introduction

The Graduate Admissions system is currently being re-designed and integrated in with the MITSIS system. During this process, we hope to integrate the new system in with MIT’s "People" database. This document discusses that issue.

Expected Audience

There are three groups of people for whom this document is intended. All members of these groups are programmers or have a technical background. The groups are:

The Graduate Admissions Re-design Team
MITSIS Team
People Database Team

Next Steps

This document has helped frame the discussions between the chief participants from MITSIS and the People Database Team. It has formed the basis for planning and designing the integration of the People Database with MITSIS. A separate document outlining the design and implementation of that integration is being developed.

What is the MIT People Database?

Overview

The People database is the result of an IT integration effort to create a common infrastructure service for identifying people associated with MIT.

It is a database that contains basic biographic information about any person who has been associated with MIT including students, employees, contractors, and spouses. Not all contractors and spouses are currently available.

It has information from both MITSIS and the Personnel systems.

It provides access to this database via an API which allows a user to query the database. The API performs a complex and sophisticated search of the existing data and provides for the creation of new people and the assignment of an MIT Id to them.

It has a graphical client server front end that provides end user access to the database using the API. The front end can be downloaded off the WWW.

What is the status of the People Database?

The People Database was originally developed several years ago (3 years?) by Scott Thorne, et. al. and is currently at release Version 1.

It uses the Data Warehouse as its source for information from the personnel and student systems.

Consideration for a new release of the People Database is underway. As of October 1997, this project is in the Discovery Phase with no target completion date. Information about this project is available at:

http://web.mit.edu/peopledb/www/v2

What is the Technology behind the People Database?

The database is Oracle. The version is currently behind slightly -- V7.1.4.1.1. An upgrade to V7.3.2 is planned but not currently scheduled.

The Mid Tier is a stand-a-lone Unix C program that listens to a particular port for requests. It establishes a connection with the client and processes requests by interrogating the database and returning results to the client. The client interaction is managed via a well defined API.

The API is accessible through calls to an API that has been deployed as a windows DLL.

The DLL was developed in C and is currently available only in a 16 bit Windows 3.11 version, which runs on Windows 95 and Windows NT as a 16 bit application.

The connection is fully encrypted and authenticated using Kerberos V. 4.

A Powerbuilder client was built in PowerBuilder 4.0 and is deployed only in the Windows Environment (3.1, 95, NT 4.0). The PowerBuilder application makes calls to the C DLL.

» Could JAVA call the Mid Tier?

Who is Responsible for the People Database?

http://web.mit.edu/peopeldb/www

Mail can be sent to:

peopledb@mit.edu

The main contact person is:

Jonathan Ives

Jives@mit.edu

x3-0590

E19-349L

Why integrate in with the People Database?

Business Functions to be met

It is hoped that by integrating the People Database in with the new Admissions system, and therefore MITSIS, we will be able to solve three main business needs:

Identification and removal of duplicates before decision letters.

Admissions has stated a business need to identify and clean up any duplicates prior to sending out decision letters. It is very important to make sure the right decision letter is sent to the right person and that the same person not get two different letters.

Identification and removal of duplicates before matriculating students

The Registrars Office is charged as part of academic accreditation standards with maintaining a unified academic history for each student that matriculates. Undergraduate and graduate records are supposed to be connected together.

The assignment of a single/unique/unchanging MIT Id

How do duplicate MIT Id’s get generated?

The Admissions Office is NOT always the 1^st point of contact. The Admissions Database is the primary entry point for MOST BUT NOT ALL students. It is where a basic biographic record for a prospective student is usually created and when that record is passed over to the Registrars Office a new student gets created in the MITSIS database and an MIT Id gets assigned. There is an implicit assumption in this process that the student is an entirely new person to MIT and that she did not exist in any other system prior to her creation and Id assignment in MITSIS.

This assumption of course is not necessarily true. On the graduate level, a new student may have been an employee or the spouse of another student. A new undergraduate student could have been the son or daughter of an employee who is on the health plan. These situations create serious problems and confusions. The person gets multiple Ids cards with different numbers. Information about the person cannot be readily matched and entered in the various systems.

It is hoped that integration with the "People" database will solve these problems and eliminate the creation of duplicate Id’s for the same person.

Duplicate checking needs to be done anyway

This needs to be done because it is NOT possible to force a thorough searching of the database before allowing the user to enter data into the database. Often the user does not have sufficient information to make a fully informed decision. As a result, duplicates are sure to appear.

In addition, duplicate checking needs to be done when a person becomes a student in order to make sure the Registrars Office maintains a unified academic history for the person.

It would be nice to do one set of "matching" against just the People Database to identify both duplicates within Admissions itself against Registrars Office and people whose point of 1^st contact was not Admissions.

Where in the Process should the Integration occur?

In the admissions process, people progress through a series of stages.

Contact The person sent scores to MIT or requested some information.

Inquiry The person has requested an application.

» Application The person has submitted an application

Admitted The person has been admitted by MIT

Student The person has said that she will matriculate

In theory, the earlier in the process we do the integration with the people database, the better. But in practice, before the Application stage the information that we have on the person is often fragmentary, is prone to errors. Therefore, integration with the people database cannot occur before this point.

Also, at the Application stage – before a decision is made – the admissions system has an internal business requirement to "clean up" the data and that duplicates be identified and eliminated. This needs to be done so that a decision can be correctly made and attributed to the proper person and so that same person does not get both an admit letter and a deny letter.

Currently Freshmen have their MIT Id’s assigned at the Application stage, when the data is passed over to MITSIS. This may seem a bit early, but there are several reasons for this:

SFAO needs Id’s in MITSIS to allow them to do application tracking.
MIT Ids are essential for obtaining the certificates for WWW access to MIT data.

We plan to assign Id’s to Graduate Students at the Application stage as part of the Graduate Admissions Redesign.

» Taking all this into account, it appears that the integration with the People database should occur at the Application stage also.

Outline of Matching Process and Suspense Resolution?

Applicants matched to People Database in Batch

All applicants that do not have an MIT Id need to be programmatically matched against the People database. The matching puts the applicant into one of three categories:

Definitely matches an existing person in the People Database.
Definitely does NOT match anyone in the People Database.
May or may not match some existing person(s) in the People Database.

» The reason we only need to search on people who do not already have an MIT Id is because the system does not allow anyone to enter an MIT Id. Therefore, if one exists, it is because the new person was already manually matched to an existing person (and we hope they did a good job of the matching).

Processing (1) Definite Matches to an existing Person

If the applicant definitely matches a single person in the people database, then that person’s MIT ID should be assigned to the applicant.

But before this can be done, the system should "double check" that the MIT ID does not already exist in MITSIS. The existing person’s ID from the People database should be used to query MITSIS. If the ID already exists in MITSIS then there probably is a duplicate between the applicant and a student in MITSIS.

Remember only people without ID’s are matched to the People database in the first place; so the original applicant and the MITSIS person who already has the ID must have different PIDMs.

Initially, at least, all these suspected duplicates should put into a suspense table for manual review.

If a suspected duplicate has been reviewed and the user says "Yes" the applicant definitely is the person who already exists in MITSIS then the system should go through and change the PIDM on the original applicant to match the PIDM of the real person. Either that or it should copy the original data over to the real PIDM and mark the original applicant as invalid. The assumption is that for whatever reason, the person was not found when originally browsing the MITSIS database but that all the data should go under the proper PIDM.

Processing (2) Definitely does NOT match anyone.

If an applicant falls into category (2) above by definitely NOT matching anyone in the People Database then she should be added to the People Database and have an MIT Id assigned to her.

This MIT Id should be posted into MITSIS for the applicant.

An MIT Id can also be assigned as the result of a suspense resolution where the person says the applicant definitely is new.

Processing (3) may or may not match some person(s)

An applicant may fall into category (3) above by potentially – but not definitely – matching some person in the People Database or by matching more than one person in the People database. When either of these two situations occur the applicant along with supporting information about the "potential" matches are put into suspense for manual review and resolution. The user must then make a decision:

Is the applicant really one of the people listed in the suspense screen?
Is the applicant an entirely new person?

» In either case, the system should process the record in the same manner as the batch program had automatically determined that there was a "definite" match. See the two preceding sections.

Potential Barriers to Using the People Database

The People Database does not have a Full history of Students

MITSIS currently contains data on all students going back to 1960. The people database contains only "recent" students (It goes back to students who were active since 1994 or so).

To solve this we need to expand the current MITSIS pass to the Data Warehouse to include all students in the both "active" and "archive" sections of the database.

This will add between 70,000 to 100,000 more "old students" to the people database. Jonathan Ives has checked with the DBA and was told that there is sufficient space to hold the NEW people. Also, performance probably should NOT be an issue, but it should be checked and indexes added if needed.

Currently, MITSIS is not sending the Date of Birth to the Data Warehouse (and therefore to the People Database).

Since the People database is fed via the Data Warehouse, the MITSIS DW extract needs to be changed to accommodate these changes. The MITSIS programming needed to accomplish this is almost complete and all we need is coordination with the Data Warehouse to get the new columns and additional people included in the file. This is planned for sometime after November.

» We need to investigate whether the DW can implement a "FULL" refresh of so many people nightly or whether we need to implement an incremental update.

The API is in written in C

The API that was written in version 1.0 of the People Database was written in C. This is a concern because MITSIS (and therefore Grad. Admissions) is using Stored Procedures for much of its processing. If the matching logic were a PL/SQL stored procedure then MITSIS and the People Database could be integrated together very tightly.

» The problem is that while C can call PL/SQL, the reverse is not true. PL/SQL cannot effectively call C.

One of the reasons the search algorithm was written in C is because of its inherent complexity and the lack of alternatives at the time, 1995. Technology may now have caught up and it may now be possible to write the algorithm in PL/SQL using cursor variables and PL/SQL tables.

There are other reasons why the API was in C including, portability and support for systems without direct access to Oracle.

We have three options:

The people database API and search algorithms can be ported to a PL/SQL package. Then triggers could be written to directly call this PL/SQL package.
The Admissions system could have a "Batch" program that runs periodically looking for newly entered applications. When one is found the program would call the People Database using the C API. Since there is an existing interface for PowerBuilder, this batch program could be written in PowerBuilder along with the necessary suspense resolution screens.
Since the Admissions Application has a PowerBuilder front end, it could be designed to make calls directly to the people.dll as it was designed. This would involve designing the UI to search the people database at some point during application processing. A big concern for this approach is that this would be making it the responsibility of each end user to do this duplicate checking and resolution.

The algorithm needs to be tested in a batch environment

The current People Database subscribers do not use it in a heavy duty "batch" environment. Until we pump large volumes of messy data against it we will not know whether adjustments will be needed to the basic algorithm.

In particular, the current algorithm does not determine if a person DEFINITELY DOES OR DOES NOT MATCH people on the People Database. Instead, it "ranks" them according to how close they match, relying on the user to browse the list and make the DEFINITIVE decision. This is fine for an interactive application, but with the large number of people to be checked, it would be impractical for the user to sit there all day manually reviewing each case.

The MIT Id Assignment Scheme needs to be coordinated

MITSIS and the People Database use an identical scheme for assigning new MIT Ids to people.

The only difference is that the People database uses the prefix "92" (is it 92 or 90?) before attaching 7 random digits where MITSIS rotates between prefixes of "95" and "99".

As the People database takes on more and more responsibility for assigning Ids, MITSIS should give up its number space to the People Database.

People Database may not have sufficient information to resolve questionable matches

People Restricts Access to SSN

Currently the SSN can be used to "query" the People database, but it cannot be returned for viewing. Often the SSN is just mistyped and a person reviewing the problem on a suspense screen would need to see the SSN to see that it was simply mistyped.

Jonathan Ives was concerned because his understanding was that the MIT ID was established so that SSN would not and need not be used. Unfortunately, although SSN is a good identifier, we are not permitted to display it. While some departments still may have access to this information within their systems, from what he understood they are not permitted to see this information for records coming from other departmental systems (this would also be true for home address info). Using SSN for search purposes is somewhat of a convenience. What we might consider is being able to modify the algorithm to only return records with exact match on SSN.

Norm’s understanding was that MIT wanted to get away from using the SSN as the KEY to the person’s computer based records at MIT. A separate issue is deciding who needs to have access to it and for what reasons.

People Database does not have Address

Often when searching and resolving suspense, a person’s address helps to resolve the issue. The people database does not have this information.

» There are privacy issues that need to be worked out here along with the access to the SSN.

People Database does not have a separate Name Suffix

Often the absence or existence of a suffix helps to tell apart a Jr. and Sr. who live at the same address.

MITSIS "creates" other people

In addition to somebody becoming "admitted", certain groups of students get created directly in MITSIS. This is a problem that we hope to address, but it will NOT be addressed immediately and the "multiple" points of entry creates potential headaches due to timing and related issues.

Such groups include:

HST special students
Cross-registered students
Visiting students
CAES students
Etc.

» This will not be a major problem if we do refreshes every night between MITSIS and the People Database.

Proposed Direction and Known Issues

Outline of approach

Create a PowerBuilder application that attaches simultaneously to both the MITSIS mid tier database and the people database. This application will replace logic currently provided by the MITSIS Cobol Matching Algorithm that is part of the current Admissions/MITSIS interface.

The new application will be written use the EXISTING People Database PowerBuilder UI CODE as its basis.

Instead of PROMPTING the user for NAME, SSN and DOB, the application will read that information from MITSIS and automatically call the People database API.

Depending on the response from the People database API, the application will do one of the following:

If the People API says that DEFINITELY NO ONE MATCHES, then the application will automatically call the MIT ID assignment routine in the People Database. After it creates the person in the People database (and gets the MIT ID back), it turns around and assigns that ID to the person in the MITSIS database.
If the People API says that the person in MITSIS DEFINITELY MATCHES A PARTICULAR PERSON in the People Database, then the application will take that person’s ID from the People database and assign it to the person in the MITSIS database.
Finally, if the People API CANNOT DEFINITELY DETERMINE A MATCH, the list of "potential" matches is displayed and the user is prompted to make a decision. Once the user makes the decision, either creating a new ID or choosing an existing person, then the application records that decision in the MTISIS database.

Need a high hit ratio

In order for this approach to work, the People Database matching logic must produce a fairly high hit ratio. 90% to 95% of the time, the application should DEFINITELY MATCH or DEFINITELY NOT MATCH. If too many records require manual review then the approach will become too cumbersome to implement.

Current API May need more info for suspense resolution

As noted previously, the current People Database restricts view access to DOB, SSN and Address information, all of which are important in allowing a user to determine whether or not two people REALLY are the same person. This issue still needs to be addressed.

» We may be able to solve this by getting the "missing" data on students directly from MITSIS.

We will probably have to adjust the matching logic

It appears that the current People database matching logic does NOT RETURN DEFINITE answers. We may be able to simply use the "closeness" number that is returned, but this will have to be checked.

Code will have to be upgraded

The current People Database PowerBuilder code is in version 4.0. To allow us to integrate into MITSIS properly, we would need to upgrade it to version 5.0.

C DLL’s upgraded to 32 bit

We may also have to upgrade the client API DLL’s to 32 bit too so they will work with PB 5.0.

No need to Port to MAC

Since this would be a single application running on one machine, we might be able to justify setting it up to run on just an NT machine.

Who will do this work?

Norman Wright can probably do the MITSIS database work (stored procedures, etc.)

Jonathan Ives can probably handle the C work required to port the Client API to 32 bit and any adjustments to the Matching Logic.

We need someone to port the existing PB code from version 4.0 to 5.0.

We would then need to make the changes required to integrate it into MITSIS.

Could this be designed more generically?

Could this process be designed from a generic enough perspective that it would NOT be specific to MITSIS? I.e. could it be designed to take in a list of people and spit out a new list of the people containing the MIT ID? Basically it would be a "batch" version of the People Database.

» There may be such a need to but it would probably not need to be written in PowerBuilder in a client server mode. Instead it would probably be written to run all on the server.

Kerberos 4 vers. Kerberos 5

The current People DLL uses a Kerberos 4 implementation. The Grad. Admissions system uses Kerberos 5 in its single sign on kludge. We may need to upgrade the People DLL to Kerberos 5.

Appendices

Appendix – The Current MITSIS Admissions Matching Algorithm

Where is this algorithm?

This algorithm is implemented as part of the current MITSIS Admissions Interface program and is written in Pro/Cobol.

General Rules

All names are stripped of spaces and punctuation and mapped to uppercase before any comparisons are made.

Wide net of "potential" matches is cast

First a wide net is cast for people who "potentially" might be the person in question. This is done in several ways:

Anyone with the same Date of Birth
Anyone with the same Last name
Anyone with the same first 3 characters of the Last name
and same first initial
Anyone with the same SSN, MIT Id or Old Id

Anyone with the same (or similar) address.

People caught in wide net are compared to "new" person

For each potential person selected in the wide net above we compare the three main pieces of data to see if it:

Is Missing
An Exact Match
Partially (or Potentially) matches
Does NOT Match

Date of Birth
Name
ID (SSN and/or MIT Id and Old Id)

Each person caught in the net is categorized as follows:

Definitely the same person
Definitely not the same person
May potentially be the same person but needs manual confirmation
Appear to be the same person but have something different that is odd.

If anyone appears to be the same person but has something different that is odd then the record is thrown to suspense for manual review.
If anyone in the list is in the category that s/he may potentially be the same person, but needs manual confirmation then the record is thrown into suspense too.
If it comes out that the system says that 2 or more people definitely match the new person then that record is thrown into suspense for review too.
If only one definite match then that person is the match.
Finally if everyone in the list is definitely not the same as the new person then the person is definitely new and should be added to the database.

The rules for this classification are stored in a table, but match the following guidelines:

Definitely the same person

Name, ID and Date of birth all match exactly
ID and Date of birth match exactly
Name and ID match exactly and Date of birth either is missing or partially matches
Name and Date of birth match exactly and ID is missing or partially matches
Etc.

Definitely not the same person

None of the data matches
Name and Date of birth does not match
Name partially matches but date of birth does not match and ID is missing
Etc.

May be the same person

Name matches exactly and ID and Date of Birth are missing
Name and Date of Birth partially match and ID is missing
Etc.

Appear to be the same person but have something different that is odd.

Name and ID match but the Dates of birth are different
Name and Date of birth match but the Ids are different
Etc.

Date of Birth Matching

Missing Birth Date

If either is Blank.

Exact Match

They match exactly year, month and day.

Partial Match

Mistyped -- 7 of the 8 digits match.
Months and Days are transposed.
Year digits have been transposed (i.e. 1957 for 1975).
Month digits have been transposed (i.e. 01 (Jan.) for 10 (Oct.)
Days have been transposed (i.e. 13 for 31).
Current Year put in for Year of Birth by mistake

Months and Days match, but one Year is <= 15 years old
There are no 15 year old students at MIT.

Does Not Match

If all the above fail.

Name Matching

Missing

Not allowed – must have at least a last name as a condition of doing the search.

Exact Match

The last, first and middle names match exactly character for character.
The names match character for character but are parsed differently.

First and last names match and the middles match because either one is missing or it is just a middle initial and it matches the beginning of the other.
Last name matches, middle names match (or middle initials match) and the

Partial Match

The first and last names have been switched.
The first and middle names have been switched
Using middle and dropping first name entirely
Using middle name and initializing the first name.
First and middle are combined
Middle and last have been combined
Initializing of first name (i.e. F. Scott Fitzgerald)
First name is abbreviated
Middle name not initialized but abbreviated
One part of the name is entirely contained entirely in the other and the rest match
Maiden name became middle name (Must both be female)
Maiden name added to middle name (Must both be female)
Maiden name is now middle initial
Mistyped if only one part of the name off by 1 character and that part is >= 5 characters in length.
Same but if part of name is >= 9 characters then it can be off by 2.
Mistyped if dropped one character in a name >= 3 in length.

Does not match

Both suffixes are present but different!

I.e. one is Jr. the other Sr.

ID Matching

Missing

If no Ids are present.

Matches exactly

Both ID’s exist and match exactly.

Matches Partially

Off by 1 digit.
A transposition of two adjacent characters

Does not match

If an MIT Id and the initial two digits do not match.
If an old Id (888 type number) and the initial three digits do not match.

People Database and Graduate Admissions Preliminary Planning Document

People Database and Graduate Admissions
Preliminary Planning Document