6.00: Introduction to Computer Science and Programming
Problem Set 6: RSS Feed Filter
Handed out: Tuesday, October 27, 2009.
Due: Tuesday, November 3, 2009.
In problem set 6, you will build a program to monitor news feeds over the Internet. Your program will filter the news, alerting the user when it notices a news story that matches that user's interests (for example, the user may be interested in a notification whenever a story related to the Red Sox is posted).
Please download and save these files to the same directory.
You will also need the following three modules for your ps6.py to work. Please download and save these files to the same directory. You should not modify these modules, but you may read them if you want.
Many websites have content that is updated on an unpredictable schedule. News sites, such as Google News, are a good example of this. One tedious way to keep track of this changing content is to load the website in your browser and periodically hit the refresh button. Fortunately, this process can be streamlined and automated by connecting to the website's RSS feed with an RSS feed reader (e.g. Sage) instead of a web browser. An RSS reader will periodically collect updated content and draw your attention to it.
RSS stands for "Really Simple Syndication". An RSS feed consists of (periodically changing) data stored in an XML-format file residing on a web server. For this project the details are unimportant: you don't need to know what XML is, nor do you need to know how to access these files over the network. We will use a special Python module to deal with these low-level details. The higher-level details in the notes below, describing the structure of the Google News RSS feed, should be enough for our purposes.
First, let's talk about one specific RSS feed: Google News. The URL
for the Google News feed is:
http://news.google.com/?output=rss
If you try to load this URL in your browser, you'll probably see your browser's interpretation of the XML code generated by the feed. You can view the XML source with your browser's "View Page Source" function, though it probably will not make much sense to you. Abstractly, whenever you connect to the Google RSS feed, you receive a list of items. Each entry in this list represents a single news item. In a Google News feed, every entry has the following fields: guid, title, subject, summary, and link.
This is, unfortunately, a little trickier than we'd like it to be, because each of these RSS feeds is structured a little bit differently than the others. So, our goal in Part I is to come up with a unified, standard representation that we'll use to store a news story.
Why do we want this? When all is said and done, we want an application that aggregates several RSS feeds from various sources, and can act on all of them in the exact same way: we should be able to read the New York Times's RSS feed, Google News's RSS feed, The Tech's RSS feed, and the RSS feeds from the really whiny blogs of the 6.00 TAs, all in one place.
We want to store this information in an object that we can then pass around in the rest of our program. Your task, in this problem, is to write a class, NewsStory, with a constructor that takes guid, title, subject, summary, link as arguments and stores them appropriately, and with at least the following methods:
getGuid()
getTitle()
getSubject()
getSummary()
getLink()
The solution to this problem should be relatively short and very straightforward.
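As a rough illustration of the shape such a class might take, here is a minimal sketch. The attribute names and the sample values are assumptions made for this example; only the class name, the constructor arguments, and the getter names come from the description above.

```python
class NewsStory(object):
    """Feed-independent container for a single news item."""

    def __init__(self, guid, title, subject, summary, link):
        # Store each field unchanged; the attribute names are an assumption.
        self.guid = guid
        self.title = title
        self.subject = subject
        self.summary = summary
        self.link = link

    def getGuid(self):
        return self.guid

    def getTitle(self):
        return self.title

    def getSubject(self):
        return self.subject

    def getSummary(self):
        return self.summary

    def getLink(self):
        return self.link


# Hypothetical sample values, for illustration only.
story = NewsStory("guid-001", "Red Sox win", "Sports",
                  "The Red Sox won last night.", "http://example.com/story")
print(story.getTitle())
```

Because every feed is converted into this one representation, the rest of the program never needs to know which feed a story came from.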
Parsing is the process of turning a data stream into a format more convenient for us to work with. We have provided you with code that will retrieve and parse the Google and Yahoo news feeds.
Given a set of news stories, your program will generate alerts for a subset of those stories. Stories with alerts will be displayed to the user, and the other stories will be silently discarded. We will represent alerting rules as triggers. A trigger is a rule that is evaluated over a single news story and may fire to generate an alert. For example, a simple trigger could fire for every news story whose title contained the word "Microsoft". Another trigger may be set up to fire for all news stories where the summary contained the word "Boston". Finally, a more specific trigger could be set up to fire only when a news story contained both the words "Microsoft" and "Boston" in the summary.
In order to simplify our code, we will use object polymorphism. We will define a trigger interface and then implement a number of different classes that implement that trigger interface in different ways.
Each trigger class provides an evaluate method that takes a news item (NewsStory object) as an input and returns True if an alert should be generated for that item.
The class below implements the trigger interface. It fires for every news item, so it doesn't actually do anything all that interesting:
class Trigger:
    def evaluate(self, story):
        """
        Returns True if an alert should be generated
        for the given news item, or False otherwise.
        """
        return True
Having a trigger that always fires isn't interesting. Let's write some that are. A user may want to be alerted about news items that contain specific words. For instance, a simple trigger could fire for every news item whose title contained the word "Microsoft". In the following questions, we ask you to create a word trigger interface and implement three classes that implement triggers of this sort.
The trigger should fire when the whole word is present. For example, a trigger for "soft" should fire on:
This is a little tricky, especially the case with the apostrophe. For the purpose of your parsing, pretend that a space or any character in string.punctuation is a word separator.
The split and replace methods of strings will almost certainly be useful here. You may also find the string methods lower and upper helpful.
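The word-separator rule can be sketched as a small helper. The function name split_into_words and the sample sentences are assumptions made for this illustration; the approach simply combines lower, replace, and split as suggested above.

```python
import string


def split_into_words(text):
    """Lowercase text, then split it on spaces and punctuation characters."""
    text = text.lower()
    # Turn every punctuation character into a space so that split()
    # treats it as a word separator.
    for mark in string.punctuation:
        text = text.replace(mark, " ")
    return text.split()


print(split_into_words("Soft's the new pink!"))
print("soft" in split_into_words("Soft's the new pink!"))      # True
print("soft" in split_into_words("Microsoft announced a deal"))  # False
```

Checking membership in the resulting list, rather than searching the raw string, is what keeps "soft" from matching inside "Microsoft".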
Implement a word trigger interface class, WordTrigger. It should take in a string word as an argument to the class's constructor.
WordTrigger should be a subclass of Trigger. It has one new method, isWordIn. isWordIn takes in one string argument: a body of text. It returns True if the whole word word is present in the body of text, False otherwise, as described in the above examples. This method should not be case-sensitive. Implement this method.
Because this is an interface class, we will not be directly instantiating any WordTriggers. WordTrigger should inherit its evaluate method from Trigger.
Implement a word trigger class, TitleTrigger, that fires when a news item's title contains a given word. The word should be an argument to the class's constructor. This trigger should not be case-sensitive (it should treat "Intel" and "intel" as being equal).
For example, an instance of this type of trigger could be used to generate an alert whenever the word "Intel" occurred in the title of a news item. Another instance could generate an alert whenever the word "Microsoft" occurred in the title of an item.
Think carefully about what methods should be defined in TitleTrigger and what methods should be inherited from the superclass.
Once you've implemented TitleTrigger, the TitleTrigger unit
tests in our test suite should pass.
Implement a word trigger class, SubjectTrigger, that fires when a news item's subject contains a given word. The word should be an argument to the class's constructor. This trigger should not be case-sensitive.
Once you've implemented SubjectTrigger, the SubjectTrigger unit
tests in our test suite should pass.
Implement a word trigger class, SummaryTrigger, that fires when a news item's summary contains a given word. The word should be an argument to the class's constructor. This trigger should not be case-sensitive.
Once you've implemented SummaryTrigger, the SummaryTrigger unit
tests in our test suite should pass.
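The three word triggers differ only in which field of the story they inspect, so the shared matching logic can live in WordTrigger. The sketch below is one possible arrangement, not the reference solution: the attribute name self.word, the StubStory class, and the sample text are assumptions made so that the example runs on its own.

```python
import string


class Trigger(object):
    def evaluate(self, story):
        return True


class WordTrigger(Trigger):
    def __init__(self, word):
        # Lowercase once so that matching is not case-sensitive.
        self.word = word.lower()

    def isWordIn(self, text):
        text = text.lower()
        # Treat every punctuation character as a word separator.
        for mark in string.punctuation:
            text = text.replace(mark, " ")
        return self.word in text.split()


class TitleTrigger(WordTrigger):
    def evaluate(self, story):
        return self.isWordIn(story.getTitle())


class SubjectTrigger(WordTrigger):
    def evaluate(self, story):
        return self.isWordIn(story.getSubject())


class SummaryTrigger(WordTrigger):
    def evaluate(self, story):
        return self.isWordIn(story.getSummary())


class StubStory(object):
    """Hypothetical stand-in for NewsStory, used only in this demo."""

    def __init__(self, title, subject, summary):
        self.title, self.subject, self.summary = title, subject, summary

    def getTitle(self):
        return self.title

    def getSubject(self):
        return self.subject

    def getSummary(self):
        return self.summary


story = StubStory("Intel's new chip", "Technology",
                  "A faster processor was announced.")
print(TitleTrigger("intel").evaluate(story))        # True
print(SubjectTrigger("tech").evaluate(story))       # False: not a whole word
print(SummaryTrigger("Processor").evaluate(story))  # True
```

Each subclass inherits the constructor and isWordIn and overrides only evaluate.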
So the triggers above are mildly interesting, but we want to do better: we want to 'compose' the earlier triggers, to set up more powerful alert rules. For instance, we may want to raise an alert only when both "google" and "stock" were present in the news item (an idea we can't express right now).
Note that these triggers are not word triggers and should not be subclasses of WordTrigger.
Implement a NOT trigger (NotTrigger). This trigger should produce its output by inverting the output of another trigger. That other trigger should be an argument to the NOT trigger's constructor (why its constructor? Because we can't change evaluate; that would break our polymorphism). So, given a trigger T and a news item x, the output of the NOT trigger's evaluate method should be: not T.evaluate(x).
When this is done, the NotTrigger unit tests should pass.
Implement an AND trigger (AndTrigger).
This trigger should take two triggers as arguments to its constructor, and should fire on a news story if both of the inputted triggers would.
When this is done, the AndTrigger unit tests should pass.
Implement an OR trigger (OrTrigger).
This trigger should take two triggers as arguments to its constructor, and should fire if either one (or both) of its inputted triggers would.
When this is done, the OrTrigger unit tests should pass.
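The composition idea can be sketched as follows. Each composite stores the triggers it was given and delegates to their evaluate methods, so it works with any object that implements the trigger interface. NeverTrigger and the attribute names are hypothetical and exist only so the example runs on its own.

```python
class Trigger(object):
    def evaluate(self, story):
        return True


class NeverTrigger(Trigger):
    """Hypothetical trigger that never fires, used only in this demo."""

    def evaluate(self, story):
        return False


class NotTrigger(Trigger):
    def __init__(self, trigger):
        self.trigger = trigger

    def evaluate(self, story):
        return not self.trigger.evaluate(story)


class AndTrigger(Trigger):
    def __init__(self, first, second):
        self.first = first
        self.second = second

    def evaluate(self, story):
        return self.first.evaluate(story) and self.second.evaluate(story)


class OrTrigger(Trigger):
    def __init__(self, first, second):
        self.first = first
        self.second = second

    def evaluate(self, story):
        return self.first.evaluate(story) or self.second.evaluate(story)


always, never = Trigger(), NeverTrigger()
story = None  # these demo triggers ignore the story entirely
print(NotTrigger(never).evaluate(story))         # True
print(AndTrigger(always, never).evaluate(story))  # False
print(OrTrigger(always, never).evaluate(story))   # True
```

Because composites are themselves triggers, they can be nested, for example AndTrigger(always, NotTrigger(never)).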
At this point, you have no way of writing a trigger that matches on "New York City" -- the only thing you know how to write will also fire on "New students at York University love the city". It's time to fix this. Since here you're asking for an exact match, we will require that the cases match, but we'll be a little more flexible on word matching. So, "New York City" will match:
Implement a phrase trigger (PhraseTrigger) that fires when a given phrase is in any of the subject, title, or summary. You may find the Python operator in helpful, as in:
>>> print "New York City" in "In the heart of New York City's famous cafe"
True
>>> print "New York City" in "I love new york city"
False
When this is done, the PhraseTrigger unit tests should pass.
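A minimal sketch of a phrase trigger built on the in operator might look like this. StubStory and the attribute names are assumptions for the sake of a runnable example; the test strings are the ones from the interpreter session above.

```python
class Trigger(object):
    def evaluate(self, story):
        return True


class PhraseTrigger(Trigger):
    def __init__(self, phrase):
        # Keep the phrase exactly as given: matching is case-sensitive.
        self.phrase = phrase

    def evaluate(self, story):
        fields = (story.getSubject(), story.getTitle(), story.getSummary())
        return any(self.phrase in field for field in fields)


class StubStory(object):
    """Hypothetical stand-in for NewsStory, used only in this demo."""

    def __init__(self, title, subject, summary):
        self.title, self.subject, self.summary = title, subject, summary

    def getTitle(self):
        return self.title

    def getSubject(self):
        return self.subject

    def getSummary(self):
        return self.summary


trigger = PhraseTrigger("New York City")
hit = StubStory("Travel notes", "Travel",
                "In the heart of New York City's famous cafe")
miss = StubStory("Travel notes", "Travel", "I love new york city")
print(trigger.evaluate(hit))   # True
print(trigger.evaluate(miss))  # False
```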
At this point, you can run ps6.py, and it will fetch and display Google and Yahoo news items for you in little pop-up windows. How many news items? All of them.
Right now, the code we've given you in ps6.py gets all of the feeds every minute, and displays the result. This is nice, but, remember, the goal here was to filter out only the stories we wanted.
Write a function, filter_stories(stories, triggerlist), that takes in a list of news stories and a list of triggers, and returns only the stories for which a trigger fires.
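One way to sketch this filtering step is shown below. To keep the example self-contained, the stories are plain strings and SubstringTrigger is a hypothetical trigger invented for the demo; the real function would receive NewsStory objects and the trigger classes defined earlier.

```python
def filter_stories(stories, triggerlist):
    """Return the stories for which at least one trigger fires."""
    filtered = []
    for story in stories:
        for trigger in triggerlist:
            if trigger.evaluate(story):
                filtered.append(story)
                break  # one firing trigger is enough; avoids duplicates
    return filtered


class SubstringTrigger(object):
    """Hypothetical trigger for this demo: fires if text occurs in the story."""

    def __init__(self, text):
        self.text = text

    def evaluate(self, story):
        return self.text in story


stories = ["Red Sox win", "Stocks fall", "Sox and Stocks", "Weather report"]
triggers = [SubstringTrigger("Sox"), SubstringTrigger("Stocks")]
print(filter_stories(stories, triggers))
```

The break matters when several triggers fire on the same story: without it, that story would appear in the result more than once.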
Right now, your triggers are specified in your Python code, and to change them, you have to edit your program. This is very user-unfriendly. (Imagine if you had to edit the source code of your web browser every time you wanted to add a bookmark!)
Instead, we want you to read your trigger configuration from
a triggers.txt file, every time your application starts,
and use the triggers specified there.
# subject trigger named t1
t1 SUBJECT world

# title trigger named t2
t2 TITLE Intel

# phrase trigger named t3
t3 PHRASE New York City

# composite trigger named t4
t4 AND t2 t3

# the trigger set contains t1 and t4
ADD t1 t4
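The example file specifies that four triggers should be created, and that two of those triggers should be added to the trigger set:
A trigger that fires when the subject contains the word "world" (t1).
A trigger that fires when the title contains the word "Intel" and the news item contains the phrase "New York City" (t4).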
The two other triggers (t2 and t3) are created but not added to the trigger set directly. They are used as arguments for the composite AND trigger's definition.
Each line in this file does one of the following:
Blank: blank lines are ignored. A line that consists only of whitespace is a blank line.
Comments: Any line that begins with a # character is ignored.
Trigger definitions: Lines that do not begin with the keyword ADD define named triggers. The first element in a trigger definition is the name of the trigger. The name can be any combination of letters without spaces, except for "ADD". The second element of a trigger definition is a keyword (e.g., TITLE, PHRASE, etc.) that specifies the kind of trigger being defined. The remaining elements of the definition are the trigger arguments. What arguments are required depends on the trigger type:
Trigger addition: A trigger definition should create a trigger and associate it with a name but should not automatically add that trigger to the trigger set. One or more ADD lines in the file will specify which triggers should be in the trigger set. An addition line begins with the ADD keyword. Following ADD are the names of one or more previously defined triggers. These triggers will be added to the trigger set.
Finish implementing readTriggerConfig(filename). We've written code to open the file and throw away all the lines that don't begin with instructions (e.g. comments, blank lines). Your job is to finish the implementation.
readTriggerConfig should return the list of triggers
specified in the configuration file.
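The bookkeeping involved can be sketched as follows: a dictionary maps trigger names to trigger objects, and a separate list collects the triggers named on ADD lines. This is only an illustration under several assumptions. The function name build_trigger_set and the StubTrigger class are hypothetical, the input is assumed to be the list of lines left after comments and blank lines have been removed, and only the keywords that appear in the example file are handled. A real implementation would construct the actual trigger classes instead of stubs.

```python
class StubTrigger(object):
    """Hypothetical stand-in that only records how it was defined."""

    def __init__(self, kind, args):
        self.kind = kind
        self.args = args


def build_trigger_set(lines):
    named = {}        # trigger name -> trigger object
    trigger_set = []  # triggers selected by ADD lines
    for line in lines:
        parts = line.split()
        if parts[0] == "ADD":
            # Every name after ADD refers to a previously defined trigger.
            for name in parts[1:]:
                trigger_set.append(named[name])
        else:
            name, kind = parts[0], parts[1]
            if kind == "PHRASE":
                # A phrase may contain spaces, so rejoin the remaining
                # elements (this collapses runs of whitespace).
                args = [" ".join(parts[2:])]
            elif kind == "AND":
                # Composite arguments are names of earlier triggers.
                args = [named[arg] for arg in parts[2:]]
            else:
                # Word triggers such as TITLE and SUBJECT take one word.
                args = parts[2:]
            named[name] = StubTrigger(kind, args)
    return trigger_set


example_lines = [
    "t1 SUBJECT world",
    "t2 TITLE Intel",
    "t3 PHRASE New York City",
    "t4 AND t2 t3",
    "ADD t1 t4",
]
result = build_trigger_set(example_lines)
print([trigger.kind for trigger in result])
```

Looking names up in the dictionary is what lets t4 refer to t2 and t3 even though neither of them is added to the trigger set directly.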
Once that's done, modify the code at the bottom to use the trigger list specified in your configuration file, instead of the one we hard-coded for you:
# TODO: Question 11
# After implementing readTriggerConfig, uncomment this line
# triggerlist = readTriggerConfig("triggers.txt")
After completing Question 11, you can try running ps6.py, and depending on your triggers.txt file, various RSS news items should pop up for easy reading. The code runs an infinite loop, checking the RSS feed for new stories every 60 seconds.
1. Save
All your code should be
in a single file called ps6.py.
2. Time and collaboration info
At the start of the file, in a comment, write down
the number of hours (roughly) you spent on this problem set, and the names of whomever you collaborated with. For example:
# Problem Set 6
# Name: Jane Lee
# Collaborators (Discussion): John Doe
# Collaborators (Identical Solution): Alice Smith
# Time: 1:30
#
3. Submit
To submit a
file, upload it to your
workspace. If there is some error uploading to your workspace,
email your file to 6.00-staff [at] mit.edu.
You may upload new versions of the problem set until the 11:59pm deadline.