Note about pedagogical style: Your professor tends to use the technique of "foreshadowing." This means that you may see a notion in the homework that we believe you are ready to attack at this point in the class, but the bigger picture emerges later. It is well known that most students learn best this way. Of course if you start googling around, you can see the bigger picture earlier, at least with some other author's notation or point of view. I can't and wouldn't stop you, but rest assured you will see concepts over and over again in expanding waves in this class.
If you need to know where we are heading, I am planning to cover all of the results (but not the methods in) Strang's chapters 2 and 3 by leading up to the SVD (singular value decomposition) without any mention of pivot variables, free variables, or eigenvalues. This means you are less likely to understand how to hand compute the various concepts, and more likely to understand them conceptually. This gives us time to also show you some applications. It is indeed an experiment, which is why I welcome feedback, positive and negative, during my many, office hours, but I honestly feel in my heart that I am providing a valuable approach.
The problem sets are meant to be learning experiences, and perhaps more challenging than you are familiar with. This is not going to be the style of class where you find a similar problem in the lecture notes and just repeat. You may have to reach back into the 18.02 prerequisites for this class. You may learn a lot by working with other students in the class. There is a lot to be learned at the math learning center, in recitations, during the TAs office hours, and my numerous office hours. No promises that 18.06 is easy, only that you will learn a lot!
All questions are 10 points, problem 10 will be a freebie if you execute it. (Coming soon)
Reminder: in mathematical contexts the distinction between an n-vector and an nx1 single column matrix tends to be blurred. On computers this can and does cause problems. In the following, we will blur the distinction. For these problems, some of you will know how to do it from the prerequisites, some will ask your friends or your recitation instructor, but in any event you will strengthen your general feel for linear algebra.
Some places worth looking for answers are Strang Section 1.2, page 194, VMLS page 58. The definition of a normal, the definition of a gradient, the definition of a plane and the definition of a hyperplane which generalizes a plane. Also ||v|| = sqrt(vᵀv), is the length of a vector v, sometimes pronounced the norm of v, or the magnitude of v.
1a. If vᵀw=0 we say that v and w are a) perpendicular b) orthogonal c) at right angles d) all the above.
1b. The equation 2x+3y+4z = 0 describes a a) line b) plane through 0 c) plane not through 0 d) sphere
1c. A normal to your answer in 1b is __________?
1d. The equation 2x+3y+4z = 2019 describes a a) line b) plane through 0 c) plane not through 0 d) sphere
1e. A normal to your answer in 1d is __________?
1f. The equation x²+y²+z²=1 describes a a) line b) plane through 0 c) plane not through 0 d) sphere
1g. A normal to your answer in 1f is __________? (Hint: ok to use intuition from two or three dimensions)
1h. Calculate the gradient of f(x,y,z) = 2x+3y+4z
1i. Calculate the gradient of f(x,y,z) = x²+y²+z²
1j. The inequality 2x+3y+4z > 0 describes a region of 3 dimensional space. Describe that region:_____
1k. The inequality 2x+3y+4z < 2019 describes a region of 3 dimensional space. Describe that region:_____
1l. Using vector language in Rⁿ, (no elements or indices) write an equation that describes every v on the hyperplane through 0 with given normal vector w (not 0).
1m. Give an example of a hyperplane parallel to the hyperplane in 1l above that does not go through 0.
1n. Suppose given a nonzero w that we consider the hyperplane of all v for which wᵀv=1. Find the unique vector v (in terms of w) that is on this hyperplane in the direction w. (It is helpful to define ||w|| as the length of w.)
1a. d) all the above.
1b. b) plane through $0$.
1c. $(2,3,4)$.
1d. c) plane not through $0$.
1e. $(2,3,4)$.
1f. d) sphere
1g. $(x,y,z)$.
1h. $(2,3,4)$.
1i. $(2x,2y,2z)$
1j. The plane $2x+3y+4z = 0$ divides $\mathbb{R}^3$ into two half spaces. The region $2x+3y+4z > 0$ is the half space containing the point $(2,3,4)$.
1k. The plane $2x+3y+4z = 2019$ divides $\mathbb{R}^3$ into two half spaces. The region 2x+3y+4z < 2019 is the half space containing the point $(0,0,0)$.
1l. $w^T v = 0$.
1m. Any $w^Tv = c$ where $c \ne 0$.
1n. $v = \frac{w}{w^Tw}$
a) If $v$ is a vector in any vector space, what is the complete list of vectors that must also be in this vector space?
b) Must a vector space contain a zero vector? Why?
c) What is the zero vector in $R^4$? in $R^{2,3}$? What is the zero function? What is the zero (differential) operator from functions to functions? (Hint: The zero operator takes in an arbitrary function and returns....)
d) State in words the difference between the zero function and the zero operator, as they seem so similar.
a) The set of all $cv$ where $c \in \mathbb{R}$ is the complete list of vectors that must also be in this vector space.
b) This is by definition of a vector space. Alternatively, $0v = 0$.
c) The zero in $\mathbb{R}^4$ is $(0,0,0,0)$. The zero in $\mathbb{R}^{2,3}$ is $\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$. Given some domain $D$, the zero function is the function $z(x) = 0$ for every $x \in D$. The zero operator $Z$ is the operator which takes a function $f$ and maps it to the zero function $z$.
d) The zero function maps to the point $0$ whereas the zero operator maps to the zero function.
a) All vectors in $R^n$ whose entries sum to 0.
b) All matrices in $R^{m,n}$ whose entries when squared sum to 1.
c) Cubic polynomials of the form $f(x) = a + bx + cx^2 + dx^3$, where $a,b,c,d$ are arbitrary, possibly 0.
d) All linear functions of the form $f(x) = mx +b$, where $f(17)=0$. ($m$ or $b$ may possibly be 0)
e) All linear functions of the form $f(x) = mx +b$, where $f(0)=17$.($m$ or $b$ may possibly be 0)
a) This is a vector space. If $(a_1,\ldots,a_n), (b_1,\ldots,b_n) \in \mathbb{R}^n$ such that $$ a_1 + \cdots + a_n = 0, \quad b_1 + \cdots + b_n = 0 $$ then for any $c,d \in \mathbb{R}$ we have $$ c(a_1,\ldots,a_n) + d(b_1,\ldots,b_n) = (ca_1 + db_1,\ldots,ca_n + db_n) $$ where $$ ca_1 + db_1 + \cdots + ca_n + db_n = c(a_1 + \cdots + a_n) + d(b_1 + \cdots + b_n) = 0. $$
b) This is not a vector space. Indeed, the zero matrix is not in this space since the squared sum of the entries is $0$.
c) This is a vector space. If $f_1(x) = a_1 + b_1 x + c_1 x^2 + d_1 x^3$ and $f_2(x) = a_2 + b_2 x + c_2 x^2 + d_1 x^3$, then for any $e_1,e_2 \in \mathbb{R}$ we have $$ e_1 f_1(x) + e_2 f_2(x) = (e_1 a_1 + e_2 a_2) + (e_1 b_1 + e_2 b_2) x + (e_1 c_1 + e_2 c_2) x^2 + (e_1 d_1 + e_2 d_2) x^3. $$
d) This is a vector space. If $f(x) = mx + b$ and $g(x) = nx + c$, then for any $d,e \in \mathbb{R}$ we have $$ d f(x) + e g(x) = (dm + en)x + (db + ec). $$ Furthermore, if $f(17) = 0$ and $g(17) = 0$, then $$ d f(17) + e g(17) = 0. $$
e) This is not a vector space. Indeed, the zero function $z$ satisfies $z(0) = 0 \ne 17$.
A fact you are supposed to already know about matrices is that $M(cv+dw) = cMv+dMw$, if $M$ is any matrix, $c$ and $d$ are any scalars, and $v$ and $w$ are any vectors.
Suppose for some particular rectangular matrix $M$, we have $Mv=0$ and $Mw=0$ for some particular vectors $v$ and $w$. Is it true that if $x$ is a linear combination of $v$ and $w$ then $Mx$ is also 0? Why?
The set of vectors $x$ for which $Mx=0$ is known as the nullspace of $M$. For an mxn matrix, we point out two extremes among the typically infinitely many possibilities: it could be the zero vector in $R^n$ or it could be all of $R^n$. (We will see there are usually lots of possibilities in between.)
What is the nullspace of the $n$ by $n$ identity matrix? The $n$ by $n$ zero matrix? The $m$ by $n$ matrix of all ones?
True. If $x$ is a linear combination of $v,w$, we may write $x = cv + dw$ for some $c,d \in \mathbb{R}$. Then $$ Mx = M(cv + dw) = c Mv + d Mw = 0. $$
Suppose $I$ is the $n\times n$ identity matrix. Then the only vector which satisfies $Ix = 0$ is $x = 0$. Thus the null space of $I$ is $\{0\}$.
Suppose $Z$ is the $n\times n$ zero matrix. Then every vector $x \in \mathbb{R}^n$ satisfies $Zx = 0$. Thus the null space of $Z$ is $\mathbb{R}^n$.
Let $A$ be the $m\times n$ matrix of all $1$s. Then $$ A = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix} $$ exactly when $$ x_1 + \cdots + x_n = 0. $$ Thus the null space of $A$ is $\{(x_1,\ldots,x_n) \in \mathbb{R}^n: x_1 + \cdots + x_n = 0 \}$. This is the hyperplane through the origin with normal vector $(1,\ldots,1)$.
The column space of a matrix $M$ is the set of all vectors $y$ which can be written in the form $Mx$ for some vector $x$. Why is the column space a vector space?
What is the column space of the $n$ by $n$ identity matrix? The $n$ by $n$ zero matrix? The $m$ by $n$ matrix of all ones?
If $y_1,y_2$ are in the column space of $M$, then $y_1 = Mx_1, y_2 = Mx_2$. Then for any $c,d \in \mathbb{R}$ we have $$ cy_1 + dy_2 = cMx_1 + dMx_2 = M(cx_1 + dx_2) $$ so that $cy_1 + dy_2$ is in the column space of $M$. This shows the column space is a vector space.
Suppose $I$ is the $n\times n$ identity matrix. Then for each $x \in \mathbb{R}^n$, we have $x = Ix$. Thus the column space of $I$ is $\mathbb{R}^n$.
Suppose $Z$ is the $n\times n$ zero matrix. Then $0 = Zx$ for every $x \in \mathbb{R}^n$. Thus the column space of $Z$ is $\{0\}$.
Suppose $A$ is the $m\times n$ matrix of all $1$s. Then $$ A = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_1 + \cdots + x_n \\ \vdots \\ x_1 + \cdots + x_n \end{pmatrix}. $$ Thus the column space of $A$ is $\{(c,\ldots,c): c \in \mathbb{R} \}$. This is the line in the direction of $(1,\ldots,1)$.
using LinearAlgebra
In Julia if you are given an m x n matrix $A$, Julia returns a matrix $U$, a vector $s$ and a Matrix $V$ for which A = U * Diagonal(s) * V' (up to floating point roundoff errors)
If m >n, $U$ is m by n, Diagonal(s) is n by n, and $V$ is n by n.
If m < n $U$ is m by m, Diagonal(s) is m by m, and $V$ is n by m.
You can remember this because the result is m by n, and the Diagonal(s) is always square whose size is the smaller of m and n.
Here are some examples:
(Don't forget to execute the using LinearAlgebra) or else svd won't be a known function
A = rand(4,5)
U,s,V = svd(A)
println( size(U), size(Diagonal(s)), size(V))
(4, 4)(4, 4)(5, 4)
A = rand(5,4)
U,s,V = svd(A)
println( size(U), size(Diagonal(s)), size(V))
(5, 4)(4, 4)(4, 4)
What are the sizes of $U$, $Diagonal(s)$ and $V$ for
a) A 10x10 matrix?
b) A 10x20 matrix?
c) A 20x10 matrix?
If $A$ is $10\times 10$, then $U$ is $10\times 10$, $D$ is $10\times 10$, and $V$ is $10\times 10$.
If $A$ is $10\times 20$, then $U$ is $10\times 10$, $D$ is $10\times 10$, and $V$ is $10\times 20$.
If $A$ is $20\times 10$, then $U$ is $20\times 10$, $D$ is $10\times 10$, and $V$ is $10\times 10$.
Here is an example from Svetlana's problems posted in Piazza:
x + 2y + 3z =1
y + z = 2
3x + y - z =3
Solution:
# Form the Matrix
A = [1 2 3 ; 0 1 1;3 1 -1]
3×3 Array{Int64,2}: 1 2 3 0 1 1 3 1 -1
# For the right hand side b
b = [1, 2, 3]
3-element Array{Int64,1}: 1 2 3
U,s,V = svd(A)
V*((U'b)./s)
3-element Array{Float64,1}: -0.9999999999999999 4.000000000000002 -1.9999999999999993
Don't let floating point fool you, the answer given is -1,4,-2. Check that the solution is right. Explain mathematically why V*((U'b)./s) gives the right answer.
On an exam, you will not be expected to compute an SVD using eigenvalues or a computer. The SVD story will unfold during the course of the semester.
For a square matrix $A$ with SVD $UDV$, the matrices $U,V$ are orthogonal.
We have $$ UDVx = Ax = b \quad \implies \quad x = V^T D^{-1} U^Tb $$ which corresponds to the given command.
Hint: this is tricky because the unknowns are the four elements in $W$. I think this problem is hard.
Write $x_i = (x_{i1}, x_{i2})$ and $y_i = (y_{i1},y_{i2})$. We want to find $W = \begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix}$ that best approximates $$ y_i = Wx_i, \quad \quad i = 1,2,3 $$ which we may expand out as $$ y_{ij} = w_{j1} x_{i1} + w_{j2} x_{i2}, \quad \quad i = 1,2,3; j = 1,2 $$ in the sense of least squares; that is we want $W$ to minimize $$ \sum_{i=1}^3 \sum_{j=1}^2 (y_{ij} - w_{j1} x_{i1} - w_{j2} x_{i2})^2. $$
We may express these equations as a single matrix equation $$ Y = Xw $$ where $$ Y = \begin{pmatrix} y_{11} \\ y_{21} \\ y_{31} \\ y_{12} \\ y_{22} \\ y_{32} \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & x_{12} & 0 & 0 \\ x_{21} & x_{22} & 0 & 0 \\ x_{31} & x_{32} & 0 & 0 \\ 0 & 0 & x_{11} & x_{12} \\ 0 & 0 & x_{21} & x_{22} \\ 0 & 0 & x_{31} & x_{32} \end{pmatrix}, \quad w = \begin{pmatrix} w_{11} \\ w_{12} \\ w_{21} \\ w_{22} \end{pmatrix} $$ If we factorize $X = QR$ where $Q^T Q = I$ and $R$ is upper triangular with positive entires, then we know that the best fitting $w$ is given by $$ w = \begin{pmatrix} w_{11} \\ w_{12} \\ w_{21} \\ w_{22} \end{pmatrix} = R^{-1} Q^T Y.$$
is a rearrangement of the numbers 1 through n. Here is a random permutation of 1:5
using Random
p = randperm(5)
5-element Array{Int64,1}: 5 4 2 1 3
A Permutation matrix is row reordering of the identity matrix
ident = Matrix{Int64}(I,5,5)
5×5 Array{Int64,2}: 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
ident[p,:]
5×5 Array{Int64,2}: 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0
What is closest to the number of parameters that are needed for an n x n permutation matrix? a) n^2 b) n c)n! d)n^n
The closest number of parameters of $n$, given by (b).
Note that each permutation matrix has a single $1$ in each column and row. So each permutation matrix may be parametrized by an $n$-vector $(m_1,\ldots,m_n)$ where the $i$th number $m_i$ gives the column index of the $1$ in the $i$th row.
Understanding this code is not required for this class, but is an extra for those interested. All of you might want to look just a little and execute the lines though.
The signed distance to a hyperplane is one of those standard machine learning (and elsewhere!) codes. In one commonly used notation, hyperplanes are written as the set of x for which θᵀx + θ₀ = 0 where θ is in Rⁿ and θ₀ is in R. The formula for the distance of a point p to this hyperplane is
distance = (θᵀp +θ₀ )/||θ||
using LinearAlgebra
# Formula in Julia looks a lot like the math
d(p,θ,θ₀) = (θ'p .+ θ₀)/norm(θ)
# If you would like to annotate with types, this is also allowed
# this says p and θ must be vectors and θ₀ a number
d(p::Vector, θ::Vector, θ₀::Number) = (θ'p .+ θ₀)/norm(θ)
d (generic function with 3 methods)
n = 7
d( rand(n), rand(n), rand() )
0.6736808399014581
Python is a very popular language these days. It is really nice to have lots of friends who know the same language. Python suffers, however, from the two language problem: that efficient machine learning code tends to use C or C++, and even numpy tends to mix metaphors being sort of python like and sort of something different.
Are we above or below the hyperplane?
which_side(p,θ,θ₀) = d(p,θ,θ₀)<0 ? "I'm below" : "I'm above" # Ternary operator
which_side (generic function with 1 method)
which_side(randn(n), rand(n), rand() )
"I'm below"
What if we have a matrix of 5 data points, each column is in R^n?
P = randn(n,5)
7×5 Array{Float64,2}: -0.0135899 -1.85682 -1.4147 -1.73322 -0.29654 -0.0583988 0.39595 -0.748555 -0.760963 0.468719 0.898581 1.31427 -0.790358 -0.1995 0.90457 -0.0752133 0.334317 -0.557758 -2.391 0.729908 0.432025 1.12397 -1.70417 1.31616 1.78537 -0.494314 -2.58893 0.245257 -1.75916 0.439534 -1.50217 -1.00493 -1.25288 -0.790197 1.21259
distances = d( P, rand(n), rand() ) # Code evaluates five distances
1×5 Array{Float64,2}: -0.0509957 -1.44167 -1.57092 -2.49945 1.62281
sign.(distances) # elementwise sign function tells you above or below
1×5 Array{Float64,2}: -1.0 -1.0 -1.0 -1.0 1.0
Carrying around a θ and θ₀ separately seems silly. How likely are you to remember that this is a pair next month? Let's wrap them together:
struct Hyperplane
θ :: Vector
θ₀ :: Number
end
h = Hyperplane( rand(3), rand() )
Hyperplane([0.754178, 0.452019, 0.854368], 0.7806995443589713)
dump(h) # see what's inside
Hyperplane θ: Array{Float64}((3,)) [0.754178, 0.452019, 0.854368] θ₀: Float64 0.7806995443589713
Let's make it easy to write the more natural distance to a hyperplane function
d(p::Vector, h::Hyperplane ) = d(p,h.θ,h.θ₀)
d (generic function with 3 methods)
h = Hyperplane( rand(n), rand() )
d( randn(n), h)
-0.7457277112775967
This professor encourages good coding, where mathemtical abstractions are wrapped with their names rather than being isolated as greek letters.
Let's generate a vector of 10 hyperplanes:
hyperplanes = [Hyperplane( rand(3), rand() ) for i=1:10 ]
10-element Array{Hyperplane,1}: Hyperplane([0.639999, 0.381686, 0.451886], 0.1268770727445261) Hyperplane([0.782602, 0.790227, 0.132017], 0.06124709194394651) Hyperplane([0.638825, 0.459649, 0.0769363], 0.9779856818512844) Hyperplane([0.601499, 0.857207, 0.776075], 0.35331470556440636) Hyperplane([0.500564, 0.459139, 0.621231], 0.8541667481679043) Hyperplane([0.527508, 0.95329, 0.435191], 0.054942045856016586) Hyperplane([0.838488, 0.390493, 0.190962], 0.39390438127219274) Hyperplane([0.275477, 0.896726, 0.339034], 0.4137173415074926) Hyperplane([0.166933, 0.977787, 0.0165396], 0.8839908695144154) Hyperplane([0.0228206, 0.547961, 0.996911], 0.7564915268371468)
Notice this is 10-element vector whose elements are hyperplanes !!
Let's see if a random point is above or below these hyperplanes.
p = randn(3)
3-element Array{Float64,1}: 0.13038386350675657 0.7420231369911667 0.22137717632262296
d.( [p] , hyperplanes)
10-element Array{Float64,1}: 0.6811138314488024 0.6954398194536037 1.7949674760014023 0.9510484436008779 1.518376859487798 0.7905034014117159 0.8843732377089485 1.1930979743272325 1.648023020172752 1.2187964026961917
or maybe more readable
[ d(p, h) for h ∈ hyperplanes ]
10-element Array{Float64,1}: 0.6811138314488024 0.6954398194536037 1.7949674760014023 0.9510484436008779 1.518376859487798 0.7905034014117159 0.8843732377089485 1.1930979743272325 1.648023020172752 1.2187964026961917
For those really into 6.036, feel free to take a look at a (very short) sample perceptron training implementation.