Earley & Semantic Parser -- Amanda C. Smith


Motivation and Execution

One approach to building a medical data parser is to adapt existing parsing tools to the new task. To this end, I modified the Earley and semantic parsing tools from Labs Three and Four to create a parser that successfully extracts the relevant data from all of the medical phrases in the test file.

One difficulty of this approach is that these tools were designed for ordinary natural language, which generally follows consistent grammatical rules and has a syntactic tree structure. Prescriptions are very different: there is no standard structure or set of rules for how they are written, and they certainly don't follow a tree pattern. Phrases like the ones in the test file are better described as a collection of fairly standard sub-phrases, such as medicine names and doses (a number followed by a unit), which can be strung together in any order with extraneous words between, before, and after them. I considered two approaches to this problem. One was to create a single structure that nearly any medical phrase could fall into, which would look like this:
S -> Misc Medicine Dose Route Frequency Quantity Refill Misc Medicine Dose Route Frequency Quantity Refill... etc.
where all of these categories would be optional. The other was to write many top-level structures, like so:
S -> Medicine Dose
S -> Medicine Dose Route
S -> Medicine Route Frequency etc.
I chose the latter approach. The former needs only one top-level rule, but that rule would be complicated and difficult to understand or modify, and would probably be slow to parse. The latter requires many rules, but each one is easily understood and it is clear what sort of structure each represents. It is also easily extensible: my rule file covers only the structures that occur in the test phrase file, but adding other structures is not at all difficult.
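To illustrate why the many-rule approach is easy to extend, here is a minimal sketch in which each top-level rule is stored as a plain sequence of categories. The names TOP_LEVEL_RULES and matching_rule are hypothetical and do not come from medical.py, and the flat comparison below is not the Earley algorithm; it only shows that covering a new structure amounts to adding one more entry.

```python
# Hypothetical encoding of the top-level S rules as category sequences.
# The real rule file format used by medical.py may differ.
TOP_LEVEL_RULES = [
    ("Medicine", "Dose"),
    ("Medicine", "Dose", "Route"),
    ("Medicine", "Route", "Frequency"),
]


def matching_rule(categories):
    """Return the top-level rule whose right-hand side equals the given
    sequence of phrase categories, or None if no rule covers it."""
    target = tuple(categories)
    for rhs in TOP_LEVEL_RULES:
        if rhs == target:
            return rhs
    return None


# Covering a new structure is a one-line addition.
TOP_LEVEL_RULES.append(("Medicine", "Dose", "Frequency"))
```

With the single-rule approach, the same change would mean editing one long rule that every phrase depends on.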

The remaining rules mainly string words together to form different types of phrases, for instance:
Dose -> Number Unit
Such a rule simply produces a string consisting of the number followed by the unit, and marks the whole string as a dose. Unlike the natural language cases of verbs, prepositional phrases and the like, merely producing a string is more useful here than expressing the relation between the two words through a function. The parser only needs to recognize that, for instance, "40 mg" is a medication dose. It does not need to understand "mg," "40," or how the two relate to each other.
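A rule of this kind can be sketched as follows, assuming each rule carries a semantic function that combines the text of its children. The names RULES, join_words and apply_rule are hypothetical and are not taken from semantic.py; the point is only that the semantic value of a Dose is a joined string rather than a function application.

```python
def join_words(*parts):
    """Semantic function that concatenates child strings with spaces."""
    return " ".join(parts)


# Hypothetical rule table: (left-hand side, right-hand side) -> semantic function.
RULES = {
    ("Dose", ("Number", "Unit")): join_words,
}


def apply_rule(lhs, children):
    """Combine children, given as (category, text) pairs, into one labeled node."""
    rhs = tuple(category for category, _ in children)
    combine = RULES[(lhs, rhs)]
    return (lhs, combine(*(text for _, text in children)))


dose = apply_rule("Dose", [("Number", "40"), ("Unit", "mg")])
# dose is ("Dose", "40 mg")
```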

The lexicon mainly lists the expected medical terms, such as drug names and dosage units, and classifies some key words such as "quantity" and "refills". There is also a "trash" category consisting of words which can usually be deleted from a prescription phrase. For instance, in a phrase such as "add hydrocortizone, apply topically", neither "add" nor "apply" is necessary, and the phrase is parsed as "Medicine: hydrocortizone, route: topically".

The parser.py file (available below) has some new lookup behavior for words that are not in its lexicon. If such a word contains a digit, it is marked as a number; if it contains a digit and begins with a hash mark, it is marked as a quantity. This lets the parser recognize numbers, and certain types of numbers, without every possible or likely number having to be entered into the lexicon. If the word does not contain a digit, it is marked as "misc". The previous behavior of the parser was simply to fail on any phrase containing a word it could not identify. That behavior should not hold for the medical parser, because we want to extract information despite extraneous words. The "misc" words are dealt with by grouping them into "misc" nodes and finally assigning them a "misc" category in the prescription data. This sums up how the behavior of the medical parser differs from that of the parser used in the labs.
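The fallback lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than the code in parser.py: the LEXICON contents, the function name categorize, and the exact category labels are hypothetical. The ordering of the checks is the one described in the text: an explicit lexicon entry wins, then the hash-mark quantity case, then plain numbers, and finally "misc".

```python
# Hypothetical lexicon entries; the real ones live in the project's rule file.
LEXICON = {
    "hydrocortizone": "Medicine",
    "topically": "Route",
    "mg": "Unit",
    "add": "Trash",
    "apply": "Trash",
}


def categorize(word):
    """Return a category for word, falling back to digit-based guesses
    and finally to Misc instead of rejecting the phrase."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    has_digit = any(ch.isdigit() for ch in key)
    if has_digit and key.startswith("#"):
        return "Quantity"  # a digit plus a leading hash mark, e.g. "#30"
    if has_digit:
        return "Number"    # any other unknown word containing a digit
    return "Misc"          # unknown word with no digit


labels = [categorize(w) for w in ["apply", "40", "mg", "#30", "please"]]
# labels is ["Trash", "Number", "Unit", "Quantity", "Misc"]
```

Checking the hash-mark case before the plain digit case matters, since a word like "#30" satisfies both conditions.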

Files

The only two files altered from the semantic and Earley parsers are semantic.py and parser.py, provided below. To use the medical parser, make a copy of the lab semantic and Earley parser files, replace semantic.py and parser.py in that copy with the versions provided here, and run the parser with the medical.py rules file. I have also provided a copy of the test phrase file. All of these phrases are parsed correctly by this medical parser.

semantic.py
parser.py
medical.py
medphrases.txt