Notes: Recitation 5

From 6.00 reference wiki

Complexity

To analyze an algorithm is to determine the amount of resources (such as time and storage) necessary to execute it. Most algorithms are designed to work with inputs of arbitrary length. Usually the efficiency or complexity of an algorithm is stated as a function relating the input length to the number of steps (time complexity) or storage locations (space or memory complexity) required to execute the algorithm.

Time Complexity

The time complexity of an algorithm can be defined as:

The number of basic steps of computation needed by that algorithm, as a function of the input size.

Basic step: an operation that takes a fixed, constant amount of time, such as an arithmetic operation, a comparison, or an assignment.
Input size: a measure of the size of the input, such as the number of elements in an input list.

Orders of Growth: Asymptotic Upper Bounds

In the theoretical analysis of algorithms, it is common to estimate complexity in the asymptotic sense, i.e., to estimate the time complexity for reasonably large inputs.

In particular, we focus on the worst-case running time of the algorithm.

For example, we can write down the worst-case running time of an algorithm as an equation.

  time = 3 + 5n

where n is the number of elements in an input list. The running time of this algorithm is linear in the length of the input list; we can see this because the highest-order term in this equation is n (i.e., linear). We write this as follows:

  O(n)

This is sometimes referred to as big-O notation. When we say an algorithm runs in O(n) time, we mean that the running time of that algorithm increases no faster than a linear function of the input length. So, for example, if the algorithm ran in 10 seconds on an input of length x, it would run in around 20 seconds on a list of length 2*x, on the same machine.

O(n) specifies an upper bound on the running time of the algorithm.

 time <= Kn

where K is a constant (here, any K >= 8 works for n >= 1).

Of course, in this case we could also have said:

 time <= K * n * n

But we are interested in a tight upper bound.
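The bounds above can be checked numerically. The sketch below is for illustration only: the step-count function is the hypothetical time = 3 + 5n from earlier, and the constant K = 8 is our own choice.

```python
def steps(n):
    """Hypothetical step count of a linear-time algorithm: time = 3 + 5n."""
    return 3 + 5 * n

for n in range(1, 1000):
    assert steps(n) <= 8 * n        # tight linear bound: O(n)
    assert steps(n) <= 8 * n * n    # also valid, but a loose quadratic bound
```

Both assertions hold, but only the linear bound is tight: 3 + 5n <= 8n whenever n >= 1.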

Example

Quadratic

Consider the following code:

  def count_duplicates(L):
      dupe = {}
      for i in range(len(L)):
          if L[i] not in dupe:                 # first time we see this element
              dupe[L[i]] = 1
              for j in range(i+1, len(L)):     # scan the rest of the list
                  if L[i] == L[j]:
                      dupe[L[i]] += 1
      return dupe

Suppose n is the number of elements in the list L

Let's try to derive the time taken by this algorithm in basic-steps.

Even for this relatively simple function, this turns out to be a fairly complex expression:

  time = 2 + n + n + (n - num_duplicates)(1 + 2(n - avg_index) + num_duplicates)

But we don't really need to do this. All we care about is a tight upper-bound on the worst-case running time. Some thought shows that in the worst case, there are no duplicated elements, and so the inner loop sum simplifies to a standard arithmetic sum (avg_index = n/2).

Then we can specify and simplify an upper bound on time.

  time <= 2 + 2n + n(1 + 2n - n)
       <= 2 + 3n + n*n 
       <= 6 * n * n

Which is true for n >= 1.

Therefore, we say that the time complexity of this algorithm is quadratic, or O(n²). Notice that we can throw away constants and other lower-order terms by using <= and an appropriate constant for the highest-order term.

Once you gain experience, you will not need to go through the above calculation to figure out the complexity of an algorithm. The bounds of the nested for loops should tip you off that this is a quadratic algorithm.

Of course, there is a better, linear time algorithm to do this (this should look familiar from the problem sets):

   def count_duplicates(L):
       dupe = {}
       for i in range(len(L)):
           dupe[L[i]] = dupe.get(L[i], 0) + 1
       return dupe
The above algorithm is O(n) because the dictionary's get method, like dictionary insertion, runs in constant time on average, regardless of how many entries the dictionary holds.
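A quick sanity check that the quadratic and linear versions produce the same counts (the example input here is our own, not from the problem set):

```python
def count_duplicates_quadratic(L):
    dupe = {}
    for i in range(len(L)):
        if L[i] not in dupe:
            dupe[L[i]] = 1
            for j in range(i + 1, len(L)):
                if L[i] == L[j]:
                    dupe[L[i]] += 1
    return dupe

def count_duplicates_linear(L):
    dupe = {}
    for i in range(len(L)):
        dupe[L[i]] = dupe.get(L[i], 0) + 1
    return dupe

L = [1, 2, 1, 3, 2, 1]
assert count_duplicates_quadratic(L) == count_duplicates_linear(L) == {1: 3, 2: 2, 3: 1}
```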

Logarithmic

Binary search over a sorted list is logarithmic, or O(log N), where N is the number of elements in the list.
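Here is a sketch of an iterative binary search (the function name and return convention are our own). Each iteration halves the search range, so roughly log2(N) iterations suffice for a list of N elements:

```python
def binary_search(L, target):
    """Return an index i such that L[i] == target, or None if target is absent.

    L must be sorted in ascending order.
    """
    lo, hi = 0, len(L) - 1
    while lo <= hi:
        mid = (lo + hi) // 2         # midpoint of the current range
        if L[mid] == target:
            return mid
        elif L[mid] < target:
            lo = mid + 1             # discard the lower half
        else:
            hi = mid - 1             # discard the upper half
    return None

assert binary_search([1, 3, 5, 7, 9], 7) == 3
assert binary_search([1, 3, 5, 7, 9], 4) is None
```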

Abstraction and Pseudocode

Now that you know Python, your first instinct when given a problem may be to jump in and start coding. For anything other than very simple code (unless you are a very experienced programmer), this approach can get you into serious trouble.

Before you begin, you should abstractly think of the problem you are trying to solve, figure out a solution and write down some pseudocode. The idea behind pseudocode is that it will keep you from worrying about language-specific details, so that you can focus on constructing a correct algorithm, before implementing it. There are no formal rules for pseudocode, but you should try to be specific about what your program will do. This is especially important for the most complicated parts of the program.

As an example, consider the following pseudocode for the build_compression_map function from the problem set 3 solution.

   build_compression_map
      1. Build a word-frequency table.
          (count number of times each unique word occurs in the file)
      2. Sort words by frequency.
          (descending order, most frequent word comes first)
      3. Integer code for word is the sorted ranking of that word.
          (most frequent word is encoded as 0)

This pseudocode is almost an English description, but clearly breaks down the problem into three sub-problems. I often write down this kind of pseudocode as comments in my code, before I start writing a complicated function, filling in the code for each step under its comment.
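The three steps above can be sketched in Python. Note that this is an illustrative sketch, not the actual problem-set solution: the real function reads words from a file, while here we take a list of words, and tie-breaking among equally frequent words is left unspecified.

```python
def build_compression_map(words):
    # 1. Build a word-frequency table.
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    # 2. Sort words by frequency, descending (most frequent word comes first).
    ranked = sorted(freq, key=lambda w: freq[w], reverse=True)
    # 3. The integer code for a word is its sorted ranking.
    return dict((word, code) for code, word in enumerate(ranked))

codes = build_compression_map(["the", "cat", "the", "dog", "the", "cat"])
assert codes["the"] == 0   # most frequent word is encoded as 0
```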

Graphs

One example where this sort of abstract reasoning before coding is important is when we are trying to enumerate all the paths between two nodes in a graph.

Recall that, mathematically, a graph is a set of vertices (the nodes) and a set of relationships between those vertices (the edges).

When we are trying to decide what the path-enumeration algorithm should look like, it is easier to think in terms of nodes and edges. The alternative would be to think in terms of Python code. But how is a graph represented? A list of lists? A dictionary with lists as values? A list with dictionaries as values? A string? We will avoid deciding until later.

The representation is not relevant to the correctness of an algorithm, so we avoid that for now. You should be aware that sometimes representations can affect the time complexity of an algorithm, but writing out pseudocode will usually alert us when this is happening.

Let us write some pseudocode. We have the inputs the nodes a and b, and the graph G, and we want to find all paths starting at node a and ending at node b in G:

 find_all_paths(a, b, G):
    1. Let P be the set of found paths 
        (P is initially empty)
    2. Find a single path p from a to b, such that p is not in P.
        (use Depth-First-Search)
    3. If no such path p exists, we're done.
       Otherwise add p to P and repeat step 2.

Now, this is a reasonable initial sketch, but it is not enough. In particular, our description of step 2 is too vague. It is too abstract (or equivalently, it is at too high a level of abstraction to be useful).

Here is a more detailed description of the algorithm:

 find_all_paths(p, d, G, V, P):
    Let
       p is a sequence of nodes, the path so far (initially just the start node)
       x is the last node in p
       d is the destination node
       G is the graph
       V is the set of visited nodes (initially empty)
       P is the set of found paths (initially empty)     
    If (x has no children)
       Add p to P only if x is the destination node d.
    Else
       Iterate over every child y of x that isn't in V
           If y is d, then p+y is a path from start to destination.
             Add p+y to P.
           Else
             Recursively search for extensions of p that lead to d.
                 Call find_all_paths(p+y, d, G, V+y, P)
    Backtrack one step along the path p
       If p is just the start node, then we're done.
       Otherwise, backtrack to find paths that aren't extensions of p.

Compare this to the dfs4 code from lecture. Note how the code backtracks and how the visited set V is maintained.
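One possible Python rendering of the pseudocode above (this is our own sketch, not the dfs4 code from lecture). It assumes the graph G is represented as a dictionary mapping each node to a list of its children; the algorithm itself does not depend on that choice.

```python
def find_all_paths(p, d, G, V, P):
    """Append to P every path that extends p and ends at destination d.

    p: the path so far, a list of nodes (initially just the start node)
    d: the destination node
    G: the graph, a dict mapping each node to a list of its children
    V: the set of nodes already visited on this path
    P: the list of found paths
    """
    x = p[-1]                      # x is the last node in p
    for y in G.get(x, []):         # iterate over every child y of x
        if y in V:                 # skip nodes already on this path
            continue
        if y == d:
            P.append(p + [y])      # p+y is a path from start to destination
        else:
            # recursively search for extensions of p that lead to d
            find_all_paths(p + [y], d, G, V | {y}, P)
    return P                       # returning from the call backtracks one step

G = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
paths = find_all_paths(['a'], 'd', G, {'a'}, [])
assert sorted(paths) == [['a', 'b', 'd'], ['a', 'c', 'd']]
```

The backtracking in the pseudocode corresponds here to each recursive call returning: because p + [y] and V | {y} build new objects rather than mutating p and V, the caller's path and visited set are unchanged when the call returns.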

This page was last modified 17:34, 10 March 2006 by 6.00 reference wiki user Asfandyar.