18.06 Spring2019 pset 2¶

due Wednesday 2/20 at 10:55am¶

Note about pedagogical style: Your professor tends to use the technique of "foreshadowing." This means that you may see a notion in the homework that we believe you are ready to attack at this point in the class, but the bigger picture emerges later. It is well known that most students learn best this way. Of course if you start googling around, you can see the bigger picture earlier, at least with some other author's notation or point of view. I can't and wouldn't stop you, but rest assured you will see concepts over and over again in expanding waves in this class.

If you need to know where we are heading, I am planning to cover all of the results (but not the methods in) Strang's chapters 2 and 3 by leading up to the SVD (singular value decomposition) without any mention of pivot variables, free variables, or eigenvalues. This means you are less likely to understand how to hand compute the various concepts, and more likely to understand them conceptually. This gives us time to also show you some applications. It is indeed an experiment, which is why I welcome feedback, positive and negative, during my many, office hours, but I honestly feel in my heart that I am providing a valuable approach.

The problem sets are meant to be learning experiences, and perhaps more challenging than you are familiar with. This is not going to be the style of class where you find a similar problem in the lecture notes and just repeat. You may have to reach back into the 18.02 prerequisites for this class. You may learn a lot by working with other students in the class. There is a lot to be learned at the math learning center, in recitations, during the TAs office hours, and my numerous office hours. No promises that 18.06 is easy, only that you will learn a lot!

All questions are 10 points, problem 10 will be a freebie if you execute it. (Coming soon)

1. Warmup questions (10 points, -1 for wrong answers until 0 is reached)¶

Reminder: in mathematical contexts the distinction between an n-vector and an nx1 single column matrix tends to be blurred. On computers this can and does cause problems. In the following, we will blur the distinction. For these problems, some of you will know how to do it from the prerequisites, some will ask your friends or your recitation instructor, but in any event you will strengthen your general feel for linear algebra.
Some places worth looking for answers are Strang Section 1.2, page 194, VMLS page 58. The definition of a normal, the definition of a gradient, the definition of a plane and the definition of a hyperplane which generalizes a plane. Also ||v|| = sqrt(vᵀv), is the length of a vector v, sometimes pronounced the norm of v, or the magnitude of v.

   1a. If vᵀw=0 we say that v and w are a) perpendicular b) orthogonal c) at right angles d) all the above.
   1b. The equation 2x+3y+4z = 0 describes a a) line  b)  plane through 0  c)  plane not through 0 d) sphere
   1c. A normal to your answer in 1b is __________?
   1d. The equation 2x+3y+4z = 2019 describes a a) line  b)  plane through 0  c)  plane not through 0 d) sphere
   1e. A normal to your answer in 1d is __________?
   1f. The equation x²+y²+z²=1 describes a a) line  b)  plane through 0  c)  plane not through 0 d) sphere
   1g. A normal to your answer in 1f is __________?  (Hint: ok to use intuition from two or three dimensions)
   1h. Calculate the gradient of f(x,y,z) = 2x+3y+4z
   1i. Calculate the gradient of f(x,y,z) = x²+y²+z² 
   1j. The inequality 2x+3y+4z > 0 describes a region of 3 dimensional space.  Describe that region:_____
   1k. The inequality 2x+3y+4z < 2019 describes a region of 3 dimensional space.  Describe that region:_____
   1l.  Using vector language in Rⁿ, (no elements or indices)  write an equation that describes every v on the hyperplane through 0 with given normal vector w (not 0).  
   1m. Give an example of a hyperplane parallel to the hyperplane in 1l above that does  not go through 0.
   1n. Suppose given a nonzero w that we consider the hyperplane of all v for which  wᵀv=1.  Find the unique vector v (in terms of w) that is on this hyperplane in the direction w.  (It is helpful to define ||w|| as the length of w.)

2. (Real) Vector Spaces (10 points) (Strang page 123, Lecture 2 slides)¶

a) If $v$ is a vector in any vector space, what is the complete list of vectors that must also be in this vector space?

b) Must a vector space contain a zero vector? Why?

c) What is the zero vector in $R^4$? in $R^{2,3}$? What is the zero function? What is the zero (differential) operator from functions to functions? (Hint: The zero operator takes in an arbitrary function and returns....)

d) State in words the difference between the zero function and the zero operator, as they seem so similar.

3. Am I a vector space? (The basic question here is whether every linear combination is in the space. If there is no zero, then I'm for sure not a vector space.)¶

a) All vectors in $R^n$ whose entries sum to 0.

b) All matrices in $R^{m,n}$ whose entries when squared sum to 1.

c) Cubic polynomials of the form $f(x) = a + bx + cx^2 + dx^3$, where $a,b,c,d$ are abitrary, possibly 0.

d) All linear functions of the form $f(x) = mx +b$, where $f(17)=0$. ($m$ or $b$ may possibly be 0)

e) All linear functions of the form $f(x) = mx +b$, where $f(0)=17$.($m$ or $b$ may possibly be 0)

4. Nullspace¶

A fact you are supposed to already know about matrices is that $M(cv+dw) = cMv+dMw$, if $M$ is any matrix, $c$ and $d$ are any scalars, and $v$ and $w$ are any vectors.

Suppose for some particular rectangular matrix $M$, we have $Mv=0$ and $Mw=0$ for some particular vectors $v$ and $w$. Is it true that if $x$ is a linear combination of $v$ and $w$ then $Mx$ is also 0? Why?

The set of vectors $x$ for which $Mx=0$ is known as the nullspace of $M$. For an mxn matrix, we point out two extremes among the typically infinitely many possibilities: it could be the zero vector in $R^n$ or it could be all of $R^n$. (We will see there are usually lots of possibilities in between.)

What is the nullspace of the $n$ by $n$ identity matrix? The $n$ by $n$ zero matrix? The $m$ by $n$ matrix of all ones?

5. Column Space¶

The column space of a matrix $M$ is the set of all vectors $y$ which can be written in the form $Mx$ for some vector $x$. Why is the column space a vector space?

What is the column space of the $n$ by $n$ identity matrix? The $n$ by $n$ zero matrix? The $m$ by $n$ matrix of all ones?

6. Using the SVD¶

In [13]:

using LinearAlgebra

In Julia if you are given an m x n matrix $A$, Julia returns a matrix $U$, a vector $s$ and a Matrix $V$ for which A = U * Diagonal(s) * V' (up to floating point roundoff errors)

If m >n, $U$ is m by n, Diagonal(s) is n by n, and $V$ is n by n.
If m < n $U$ is m by m, Diagonal(s) is m by m, and $V$ is n by m.

You can remember this because the result is m by n, and the Diagonal(s) is always square whose size is the smaller of m and n.

Here are some examples:
(Don't forget to execute the using LinearAlgebra) or else svd won't be a known function

In [47]:

A = rand(4,5)
U,s,V = svd(A)
println( size(U), size(Diagonal(s)), size(V))

(4, 4)(4, 4)(5, 4)

In [48]:

A = rand(5,4)
U,s,V = svd(A)
println( size(U), size(Diagonal(s)), size(V))

(5, 4)(4, 4)(4, 4)

What are the sizes of $U$, $Diagonal(s)$ and $V$ for
a) A 10x10 matrix?
b) A 10x20 matrix?
c) A 20x10 matrix?

In [ ]:

7. It is possible to solve Ax=b for an invertible A using the svd. It's slightly less efficient than LU but usually quite reasonably accurate, even at least as accurate as LU.¶

Here is an example from Svetlana's problems posted in Piazza:

x + 2y + 3z =1
y + z = 2
3x + y - z =3

Solution:

In [52]:

# Form the Matrix
A = [1 2 3 ; 0 1 1;3 1 -1]

Out[52]:

3×3 Array{Int64,2}:
 1  2   3
 0  1   1
 3  1  -1

In [54]:

# For the right hand side b
b = [1, 2, 3]

Out[54]:

3-element Array{Int64,1}:
 1
 2
 3

In [59]:

U,s,V = svd(A)
V*((U'b)./s)

Out[59]:

3-element Array{Float64,1}:
 -0.9999999999999999
  4.000000000000002 
 -1.9999999999999993

Don't let floating point fool you, the answer given is -1,4,-2. Check that the solution is right. Explain mathematically why V*((U'b)./s) gives the right answer.

On an exam, you will not be expected to compute an SVD using eigenvalues or a computer. The SVD story will unfold during the course of the semester.

8. Suppose $W$ is an unknown 2x2 matrix, and we want to approximately find a $W$ that best fits $Wxᵢ=yᵢ$ where x₁,x₂,x₃,y₁,y₂,y₃ are each known 2 vectors? How would you find $W$ using QR?¶

Hint: this is tricky because the unknowns are the four elements in $W$. I think this problem is hard.

9. A Permutation of n¶

is a rearrangement of the numbers 1 through n. Here is a random permutation of 1:5

In [70]:

using Random
p = randperm(5)

Out[70]:

5-element Array{Int64,1}:
 5
 4
 2
 1
 3

A Permutation matrix is row reordering of the identity matrix

In [80]:

ident = Matrix{Int64}(I,5,5)

Out[80]:

5×5 Array{Int64,2}:
 1  0  0  0  0
 0  1  0  0  0
 0  0  1  0  0
 0  0  0  1  0
 0  0  0  0  1

In [81]:

ident[p,:]

Out[81]:

5×5 Array{Int64,2}:
 0  0  0  0  1
 0  0  0  1  0
 0  1  0  0  0
 1  0  0  0  0
 0  0  1  0  0

What is closest to the number of parameters that are needed for an n x n permutation matrix? a) n^2 b) n c)n! d)n^n

In [ ]:

10. a Julia application of linear algebra¶

Understanding this code is not required for this class, but is an extra for those interested. All of you might want to look just a little and execute the lines though.

The signed distance to a hyperplane is one of those standard machine learning (and elsewhere!) codes. In one commonly used notation, hyperplanes are written as the set of x for which θᵀx + θ₀ = 0 where θ is in Rⁿ and θ₀ is in R. The formula for the distance of a point p to this hyperplane is
distance = (θᵀp +θ₀ )/||θ||

In [163]:

using LinearAlgebra
# Formula in Julia looks a lot like the math
d(p,θ,θ₀) = (θ'p .+ θ₀)/norm(θ)

# If you would like to annotate with types, this is also allowed
# this says p and θ must be vectors and θ₀ a number
d(p::Vector, θ::Vector, θ₀::Number) = (θ'p .+ θ₀)/norm(θ)

Out[163]:

d (generic function with 3 methods)

In [164]:

n = 7
d( rand(n), rand(n), rand() )

Out[164]:

0.6736808399014581

Python is a very popular language these days. It is really nice to have lots of friends who know the same language. Python suffers, however, from the two language problem: that efficient machine learning code tends to use C or C++, and even numpy tends to mix metaphors being sort of python like and sort of something different.

Are we above or below the hyperplane?

In [165]:

which_side(p,θ,θ₀) = d(p,θ,θ₀)<0 ? "I'm below" : "I'm above" # Ternary operator

Out[165]:

which_side (generic function with 1 method)

In [166]:

which_side(randn(n), rand(n), rand() )

Out[166]:

"I'm below"

What if we have a matrix of 5 data points, each column is in R^n?

In [167]:

P = randn(n,5)

Out[167]:

7×5 Array{Float64,2}:
 -0.0135899  -1.85682   -1.4147    -1.73322   -0.29654 
 -0.0583988   0.39595   -0.748555  -0.760963   0.468719
  0.898581    1.31427   -0.790358  -0.1995     0.90457 
 -0.0752133   0.334317  -0.557758  -2.391      0.729908
  0.432025    1.12397   -1.70417    1.31616    1.78537 
 -0.494314   -2.58893    0.245257  -1.75916    0.439534
 -1.50217    -1.00493   -1.25288   -0.790197   1.21259

In [168]:

distances = d( P, rand(n), rand() )  # Code evaluates five distances

Out[168]:

1×5 Array{Float64,2}:
 -0.0509957  -1.44167  -1.57092  -2.49945  1.62281

In [169]:

sign.(distances)  # elementwise sign function tells you above or below

Out[169]:

1×5 Array{Float64,2}:
 -1.0  -1.0  -1.0  -1.0  1.0

Carrying around a θ and θ₀ separately seems silly. How likely are you to remember that this is a pair next month? Let's wrap them together:

In [170]:

struct Hyperplane
    θ  :: Vector
    θ₀ :: Number
end

In [171]:

h = Hyperplane( rand(3), rand() )

Out[171]:

Hyperplane([0.754178, 0.452019, 0.854368], 0.7806995443589713)

In [172]:

dump(h)  # see what's inside

Hyperplane
  θ: Array{Float64}((3,)) [0.754178, 0.452019, 0.854368]
  θ₀: Float64 0.7806995443589713

Let's make it easy to write the more natural distance to a hyperplane function

In [173]:

d(p::Vector,  h::Hyperplane )  = d(p,h.θ,h.θ₀)

Out[173]:

d (generic function with 3 methods)

In [174]:

h = Hyperplane( rand(n), rand() )
d( randn(n), h)

Out[174]:

-0.7457277112775967

This professor encourages good coding, where mathemtical abstractions are wrapped with their names rather than being isolated as greek letters.

Let's generate a vector of 10 hyperplanes:

In [175]:

hyperplanes = [Hyperplane( rand(3), rand() ) for i=1:10 ]

Out[175]:

10-element Array{Hyperplane,1}:
 Hyperplane([0.639999, 0.381686, 0.451886], 0.1268770727445261) 
 Hyperplane([0.782602, 0.790227, 0.132017], 0.06124709194394651)
 Hyperplane([0.638825, 0.459649, 0.0769363], 0.9779856818512844)
 Hyperplane([0.601499, 0.857207, 0.776075], 0.35331470556440636)
 Hyperplane([0.500564, 0.459139, 0.621231], 0.8541667481679043) 
 Hyperplane([0.527508, 0.95329, 0.435191], 0.054942045856016586)
 Hyperplane([0.838488, 0.390493, 0.190962], 0.39390438127219274)
 Hyperplane([0.275477, 0.896726, 0.339034], 0.4137173415074926) 
 Hyperplane([0.166933, 0.977787, 0.0165396], 0.8839908695144154)
 Hyperplane([0.0228206, 0.547961, 0.996911], 0.7564915268371468)

Notice this is 10-element vector whose elements are hyperplanes !!

Let's see if a random point is above or below these hyperplanes.

In [176]:

p = randn(3)

Out[176]:

3-element Array{Float64,1}:
 0.13038386350675657
 0.7420231369911667 
 0.22137717632262296

In [178]:

d.( [p] , hyperplanes)

Out[178]:

10-element Array{Float64,1}:
 0.6811138314488024
 0.6954398194536037
 1.7949674760014023
 0.9510484436008779
 1.518376859487798 
 0.7905034014117159
 0.8843732377089485
 1.1930979743272325
 1.648023020172752 
 1.2187964026961917

or maybe more readable

In [179]:

[  d(p, h) for h ∈ hyperplanes ]

Out[179]:

10-element Array{Float64,1}:
 0.6811138314488024
 0.6954398194536037
 1.7949674760014023
 0.9510484436008779
 1.518376859487798 
 0.7905034014117159
 0.8843732377089485
 1.1930979743272325
 1.648023020172752 
 1.2187964026961917

For those really into 6.036, feel free to take a look at a (very short) sample perceptron training implementation.

In [ ]: