A Pattern Matching Approach to Parsing

Pattern Matching --- Erin Rhode

Motivation and Execution

Since the strings being parsed are relatively simple and the data follows a general format, searching for common patterns seems to be the easiest approach to this problem. The pattern matcher comes with files representing the lexicon:

units for dosages -- includes mg, cc, puffs, etc.
routes for ways the medication can be taken -- includes po (by mouth), topical, nasal, etc.
frequencies -- includes qd (once a day), bid (twice a day), tid (three times a day), etc.
Dictionary.txt lists many drugs that may be prescribed -- includes flonase, centrum silver, aspirin, etc.

To run the pattern matcher, save the parser and lexicon files in the same directory as the file containing the sentences to be parsed (one sentence per line). At an Athena prompt, run:

athena % parsemed.pl file-with-sentences

The parser reads one line at a time, removing superfluous words such as "please" and "taken" and converting number words (like "one") to their digit form ("1"). (Currently the parser only converts "one", "two", and "three" this way, but I have written a converter that handles any number from one to nine thousand nine hundred ninety-nine.) Once the sentence is in this form, the parser first tries to match each route from the "routes" file against the sentence. If it finds a route, it stores it in a variable and replaces it in the sentence with "ROUTE". (The reason for this will be explained later.) The parser then searches for frequencies in the same manner, replacing a found frequency with "FREQ". Next it searches for digits followed by a unit, calls this the dose (e.g. "40 mg"), and replaces it with "DOSE". Finally it searches for drugs from the Dictionary.txt file, stores any drugs found in a variable, and replaces them with "DRUG".
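The lexicon passes described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in parsemed.pl: the function names (normalize, alternation, extract, match_lexicons) are hypothetical, the short in-line lists stand in for the real lexicon files, and lowercasing the input and matching only the first occurrence of each category are simplifying assumptions.

```python
import re

# Tiny stand-in lexicons (assumption); the real parser reads these from the
# "units", "routes", "frequencies", and Dictionary.txt files.
UNITS = ["mg", "cc", "puffs"]
ROUTES = ["po", "topical", "nasal"]
FREQUENCIES = ["qd", "bid", "tid"]
DRUGS = ["flonase", "centrum silver", "aspirin", "pravachol", "vitamin d"]

FILLER_WORDS = {"please", "taken"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}


def normalize(sentence):
    """Drop filler words and turn small number words into digits."""
    words = []
    for word in sentence.lower().split():
        if word in FILLER_WORDS:
            continue
        words.append(NUMBER_WORDS.get(word, word))
    return " ".join(words)


def alternation(terms):
    """Build a regex alternation, longest term first so multi-word entries win."""
    ordered = sorted(terms, key=len, reverse=True)
    return "|".join(re.escape(term) for term in ordered)


def extract(sentence, pattern, placeholder):
    """Return the first match and the sentence with that match replaced."""
    match = re.search(pattern, sentence)
    if match is None:
        return None, sentence
    replaced = sentence[:match.start()] + placeholder + sentence[match.end():]
    return match.group(0), replaced


def match_lexicons(sentence):
    """Run the route, frequency, dose, and drug passes in that order."""
    fields = {}
    text = normalize(sentence)
    fields["route"], text = extract(text, r"\b(?:%s)\b" % alternation(ROUTES), "ROUTE")
    fields["freq"], text = extract(text, r"\b(?:%s)\b" % alternation(FREQUENCIES), "FREQ")
    dose_pattern = r"\b\d+(?:\.\d+)?\s*(?:%s)\b" % alternation(UNITS)
    fields["dose"], text = extract(text, dose_pattern, "DOSE")
    fields["drug"], text = extract(text, r"\b(?:%s)\b" % alternation(DRUGS), "DRUG")
    return fields, text
```

With these stand-in lexicons, match_lexicons("Vitamin D one po qd") leaves the sentence as "DRUG 1 ROUTE FREQ" with no dose recorded yet. Because the sketch lowercases the input, the uppercase placeholders cannot be mistaken for lexicon entries in later passes.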

Once it has matched the sentence against the lexicon files, the parser looks for quantities, which are generally written as "#100", meaning a quantity of 100, or stated explicitly as "quantity 100". Either pattern is easy to search for without a lexicon. There is one other quantity format, which also provides refill information: "#100 x 3", meaning "quantity of 100 with 3 refills". This pattern is also easy to find without a lexicon. Other refill formats include "1 year" or "3 refills", patterns that are again easy to match without a refill lexicon. The final category it tries to match is "Directions", a category originally intended for all phrases not otherwise parsed. In practice, this matcher looks for common directions that start with "for", "from", or "as per": "for spasms", "as per psychiatry", etc. Any remaining words are simply output as "Unparsed words".
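The quantity, refill, and directions patterns need nothing beyond regular expressions. The following Python sketch shows one way they might be written. The function names are hypothetical, and the choices to keep a duration such as "1 year" as whole text and to end a directions phrase at the next comma or semicolon are assumptions, not details taken from parsemed.pl.

```python
import re


def match_quantity_and_refills(sentence):
    """Return (quantity, refills, remaining sentence) using regexes only."""
    quantity = None
    refills = None

    # "#100 x 3" means quantity 100 with 3 refills.
    combined = re.search(r"#\s*(\d+)\s*x\s*(\d+)", sentence, re.IGNORECASE)
    if combined:
        quantity, refills = combined.group(1), combined.group(2)
        sentence = sentence[:combined.start()] + sentence[combined.end():]
    else:
        # "#100" or an explicit "quantity 100".
        plain = re.search(r"(?:#\s*|\bquantity\s+)(\d+)", sentence, re.IGNORECASE)
        if plain:
            quantity = plain.group(1)
            sentence = sentence[:plain.start()] + sentence[plain.end():]

    if refills is None:
        # "3 refills" is stored as "3"; a duration such as "1 year" is kept
        # whole (an assumption about how durations should be reported).
        refill = re.search(r"\b(\d+)\s+(refills?|years?)\b", sentence, re.IGNORECASE)
        if refill:
            is_count = refill.group(2).lower().startswith("refill")
            refills = refill.group(1) if is_count else refill.group(0)
            sentence = sentence[:refill.start()] + sentence[refill.end():]

    return quantity, refills, " ".join(sentence.split())


def match_directions(sentence):
    """Return (directions, remaining sentence) for for/from/as per phrases."""
    match = re.search(r"\b(?:for|from|as per)\b[^,;]*", sentence, re.IGNORECASE)
    if match is None:
        return None, " ".join(sentence.split())
    remaining = sentence[:match.start()] + sentence[match.end():]
    return match.group(0).strip(), " ".join(remaining.split())
```

The combined "#100 x 3" form is tried first so that its refill count is not left behind as a stray number when the plain "#100" pattern matches.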

After pattern matching for all these categories, the parser has one final step. Often a dosage is given without a unit, as in "Vitamin D one po qd". After the previous pattern matching is done, this phrase looks like "DRUG 1 ROUTE FREQ". In all of the test sentences given, the dosage immediately followed the drug name. Thus, for its final step, the parser applies one small syntactic rule: it searches for a number that immediately follows "DRUG". If it finds one, that number is stored as the dosage.
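This fallback rule amounts to a single pattern over the placeholder sentence. A minimal Python sketch follows. The name fill_unitless_dose is hypothetical, and replacing the number with "DOSE" afterwards is an assumption made for consistency with the earlier passes.

```python
import re


def fill_unitless_dose(sentence, dose=None):
    """If no dose was found earlier, use a number directly after DRUG."""
    if dose is not None:
        # A dose with a unit was already matched; leave everything alone.
        return dose, sentence
    match = re.search(r"\bDRUG\s+(\d+(?:\.\d+)?)\b", sentence)
    if match is None:
        return None, sentence
    dose = match.group(1)
    # Swap only the number for the placeholder, keeping "DRUG" in place.
    sentence = sentence[:match.start(1)] + "DOSE" + sentence[match.end(1):]
    return dose, sentence
```

For "DRUG 1 ROUTE FREQ" this returns the dose "1" and the sentence "DRUG DOSE ROUTE FREQ". A number anywhere else in the sentence is left untouched.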

After all the matching is completed, the pattern matcher outputs the data in the following format, leaving fields blank if no equivalent pattern was found:

##Pravachol 20 mg qd, #90 x 3
Name: Pravachol 
Dose:  20 mg 
Route:  
Freq: qd 
Quant: 90 
Refill: 3 
Directions:
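
A report in this layout could be produced along the following lines. This is a Python sketch with a hypothetical format_record function; the exact spacing after each colon in the real output is not reproduced, and empty fields are printed as the bare label.

```python
# Field labels in the order used by the report above.
FIELDS = ["Name", "Dose", "Route", "Freq", "Quant", "Refill", "Directions"]


def format_record(sentence, parsed):
    """Format one parsed sentence; fields missing from `parsed` stay blank."""
    lines = ["##" + sentence]
    for field in FIELDS:
        value = parsed.get(field, "")
        # rstrip() drops the trailing space left behind by an empty value.
        lines.append(f"{field}: {value}".rstrip())
    return "\n".join(lines)
```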

Of the 100 test phrases provided to us, the pattern matcher was able to completely parse 92. In the remaining 8, generally only one phrase went unparsed, a phrase that should belong in the directions category. In one case, the word "in" was left over. Given the simplicity of this method, these results indicate that it is highly effective.

One major drawback to this parser is that it has no way of dealing with a sentence such as: "metformin 500 mg tablets one bid. Hold for creatinine greater than 1.4." It will correctly parse the first part of the phrase, but does not recognize anything in the second part of the phrase. There is no easy way of adding this functionality to a pattern matcher. This case would be much better handled in an Earley parser, such as the one explored in the other half of this project.

Files

The pattern matcher: parsemed.pl
The lexicon files:
units
routes
frequencies
Dictionary.txt
The input file: MedPhrases100.txt
The output: output.txt