Due 11am Wednesday March 30. This will be the last problem set covered on exam 2.
Suppose that $A$ is an $m \times n$ "tall" matrix ($m > n$) with full column rank (rank $n$). Consider the "wide" (underdetermined, i.e. fewer equations than unknowns) system of equations:
$$ A^T x = b $$ You are also given the QR factorization $A = QR$ of $A$, where $Q$ is an $m \times n$ matrix with orthonormal columns and $R$ is an $n \times n$ invertible upper-triangular matrix. (For example, maybe you got this by Gram-Schmidt on $A$.) This corresponds to a factorization $A^T = R^T Q^T$ called an "LQ" factorization of $A^T$ (because $L = R^T$ is lower-triangular), which is used in practice to find minimum-norm solutions to underdetermined systems of equations (as opposed to least-squares solutions of overdetermined systems), as you will explore in this problem.
(a) $A^T x = b$ has a solution for any $b \in \mathbb{R}^n$ because ____________ but the solution $x$ is not unique because __________.
(b) We can find a unique solution $x_0 \in C(A)$ of $A^T x_0 = b$, given by the formula ____________ in terms of $A,b$.
(c) Alternatively, any $x \in C(A)$ can be written $x = Qy$ for some $y\in \mathbb{R}^n$ because ____________. We can find the same solution $x_0 = Qy$ of $A^T x_0 = b$ as in (b), with $y$ given by the formula ____________ in terms of $Q,R,b$ (plug in $A=QR$ and simplify!).
(d) Is the solution $x_0$ in (b) or (c) generally the same as the "particular" solution we would have gotten from the exam-1 approach of setting the free variables equal to zero? Why or why not? (Maybe try with $A^T = [1\;1]$ and $b=[1]$.)
(e) In terms of $x_0$ from (b) or (c), any solution $x$ of $A^T x = b$ (the "complete" solution) can be written as $x = x_0 + x_1$ where $x_1$ is any vector in the subspace ____________. The solution $\hat{x}$ with the minimum norm $\Vert x \Vert$ is therefore $\hat{x} = \_\_\_\_\_\_$. (Hint: write $\Vert x \Vert^2 = x^Tx$ in terms of $x_0$ and $x_1$, remembering orthogonality of subspaces.)
(f) In Julia, x̂ = W \ b computes the minimum-norm solution $\hat{x}$ to $Wx = b$ for any "wide" full-row-rank matrix $W$. Check that this matches your answer from parts (c) and (e) for a random $A$ using the code below:
using LinearAlgebra
A = randn(10, 3) # a random full-rank "tall" matrix
QR = qr(A) # its QR factorization
Q = Matrix(QR.Q) # the 10x3 "thin" Q factor
R = QR.R # and the 3x3 R factor
b = randn(3) # a random right-hand side
x̂ = A' \ b # the minimum-norm solution to Aᵀx = b using the built-in \
y = ?????? # your solution from (c), a formula in terms of Q, R, and/or b
x₀ = Q * y
x̂ₑ = ???? # your solution from (e)
x̂ₑ ≈ x̂ # should print "true"
(a) "$A^T x = b$ has a solution for any $b \in \mathbb{R}^n$ because $C(A^T)$ has dimension $n$ and $C(A^T)=\mathbb R^n$ but the solution $x$ is not unique because the dimension of $N(A^T)$ is $m-n>0$ and hence $N(A^T)$ is non-zero."
More precisely, since we know that $A$ has dimensions $m\times n$ and rank $r=n$, we know the dimensions of all four fundamental subspaces. In particular, the dimension of $C(A^T)$ is the rank $n$, and since $C(A^T)$ is a subspace of $\mathbb R^n$, we get $C(A^T)=\mathbb R^n$, so any vector $b$ is in $C(A^T)$. The dimension of $N(A^T)$ is $m-r=m-n>0$, so the nullspace of $A^T$ contains nonzero vectors and the complete solution set $x+N(A^T)$ always contains more than one vector.
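As an optional numerical sanity check (not part of the problem), we can confirm these dimension counts in Julia for a random tall matrix:

using LinearAlgebra
m, n = 10, 3
A = randn(m, n)            # a random tall matrix (full column rank with probability 1)
rank(A')                   # returns n = 3: C(Aᵀ) is all of ℝⁿ, so Aᵀx = b is always solvable
size(nullspace(A'), 2)     # returns m - n = 7: N(Aᵀ) contains nonzero vectors, so solutions are not unique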
(b) "We can find a unique solution $x_0 \in C(A)$ of $A^T x_0 = b$, given by the formula $\boxed{x_0=A(A^TA)^{-1}b}$ in terms of $A,b$."
We want to find a vector $x_0$ such that $A^Tx_0=b$ and $x_0=Az_0$. Let us find $z_0$: combining these two conditions, we have $A^TAz_0=b$. Note that since $A$ is full column-rank, $A^TA$ is invertible (proved in lectures), so $A^TAz_0=b$ has a unique solution $z_0=(A^TA)^{-1}b$. Hence the desired $x_0$ is also unique and $x_0=A(A^TA)^{-1}b$.
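As a quick check of this formula (a sketch with a random full-rank $A$ and right-hand side $b$, using Julia's built-in linear algebra):

using LinearAlgebra
A = randn(10, 3)            # random full column-rank A
b = randn(3)
x₀ = A * ((A'A) \ b)        # x₀ = A(AᵀA)⁻¹b; x₀ = Az₀ is in C(A) by construction
A' * x₀ ≈ b                 # should print "true": x₀ solves Aᵀx₀ = b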
(c) "Alternatively, any $x \in C(A)$ can be written $x = Qy$ for some $y\in \mathbb{R}^n$ because for any $x\in C(A)$ we have $z$ such that $x=Az=QRz$, so we can set $y=Rz$. We can find the same solution $x_0 = Qy$ of $A^T x_0 = b$ as in (b), with $y$ given by the formula $\boxed{y=(R^T)^{-1}b}$ in terms of $Q,R,b$ (plug in $A=QR$ and simplify!)."
This expression for $y$ is obtained by using $x_0=A (A^TA)^{-1}b$ from part (b), so $y=R(A^TA)^{-1}b$. Since $Q$ has orthonormal columns, we have $Q^TQ=I$ and $A^TA=R^TQ^TQR=R^TR$. Hence, since $R$ is invertible, $y=R(R^TR)^{-1}b=RR^{-1}(R^T)^{-1}b=(R^T)^{-1}b$.
(d) The solution $x_0$ is generally different from the "particular" solution. The hint provides an example: when $A^T = [1\;1]$ and $b=[1]$, the solution $x_0$ from part (b) is $\dfrac{1}{2}\begin{bmatrix}1\\1\end{bmatrix}$, while the particular solution from the exam-1 approach (sometimes called a "basic" solution), which sets the free variable $x_2$ to $0$, is $\begin{bmatrix}1\\0\end{bmatrix}$.
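We can see this difference directly in Julia (a minimal check, using the fact from part (f) below that the built-in \ returns the minimum-norm solution for a wide full-row-rank matrix):

x₀ = [1 1] \ [1.0]        # minimum-norm solution from (b)/(c): [0.5, 0.5]
xₚ = [1.0, 0.0]           # exam-1 "particular" solution: free variable x₂ set to 0
[1 1] * xₚ ≈ [1.0]        # both solve Aᵀx = b, but x₀ ≠ xₚ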
(e) "In terms of $x_0$ from (b) or (c), any solution $x$ of $A^T x = b$ (the "complete" solution) can be written as $x = x_0 + x_1$ where $x_1$ is any vector in the subspace $\boxed{N(A^T)}$. The solution $\hat{x}$ with the minimum norm $\Vert x \Vert$ is therefore $\boxed{\hat{x} = x_0}$. (Hint: write $\Vert x \Vert^2 = x^Tx$ in terms of $x_0$ and $x_1$, remembering orthogonality of subspaces.)"
To show that $\hat{x}=x_0$, note that we want to minimize $\Vert x\Vert^2=x_0^Tx_0+x_1^Tx_0+x_0^Tx_1+x_1^Tx_1$, where $x_0$ is our solution from previous parts, which is fixed, and $x_1$ is an arbitrary vector from $N(A^T)$. Recall that subspaces $C(A)$ and $N(A^T)$ are orthogonal complements of each other, hence $x_0$ is always orthogonal to $x_1$. Hence $\Vert x\Vert^2=x_0^Tx_0+x_1^Tx_1=\Vert x_0\Vert^2+\Vert x_1\Vert^2$ and we need to minimize $\Vert x_1\Vert^2$, which is achieved when $x_1=0$.
(f) In Julia, x̂ = W \ b computes the minimum-norm solution $\hat{x}$ to $Wx = b$ for any "wide" full-row-rank matrix $W$. Check that this matches your answer from parts (c) and (e) for a random $A$ using the code below:
using LinearAlgebra
A = randn(10, 3) # a random full-rank "tall" matrix
QR = qr(A) # its QR factorization
Q = Matrix(QR.Q) # the 10x3 "thin" Q factor
R = QR.R # and the 3x3 R factor
b = randn(3) # a random right-hand side
x̂ = A' \ b # the minimum-norm solution to Aᵀx = b using the built-in \
10-element Vector{Float64}:
  0.2122948675625042
 -0.05057152739601002
  0.08655272655122272
 -0.30100333944745666
 -0.12011586997211171
 -0.10275806015260548
  0.07650186369816225
 -0.0616304499080549
  0.00041573730377877927
  0.1162350744786731
y = R' \ b # solution from (c): y = (Rᵀ)⁻¹b, in terms of R and b
x₀ = Q * y
x̂ₑ = x₀ # solution from (e): the minimum-norm solution is x₀ itself
10-element Vector{Float64}:
  0.21229486756250404
 -0.050571527396009966
  0.08655272655122272
 -0.3010033394474565
 -0.12011586997211157
 -0.10275806015260544
  0.0765018636981623
 -0.06163044990805486
  0.00041573730377877753
  0.11623507447867301
x̂ₑ ≈ x̂ # should print "true"
true
(a) Apply Gram-Schmidt to the polynomials $\{1, x, x^2\}$ to find an orthonormal basis $\{q_1,q_2,q_3\}$ of polynomials under the inner ("dot") product (different from the one used in class): $$ f \cdot g = \int_0^\infty f(x) g(x) e^{-x} dx $$ (Unlike the Legendre polynomials in class, normalize your polynomials $q_k$ to have $\Vert q_k \Vert = \sqrt {q_k \cdot q_k} = 1$ under this inner product, i.e. to be really orthonormal.)
(b) Consider the function $f(x) = \begin{cases} x & x < 1 \\ 0 & x \ge 1 \end{cases}$. Find the degree-1 polynomial $p(x)=\alpha x+\beta$ that is the "best fit" to $f(x)$ in the sense of minimizing $$ \Vert f - \alpha x - \beta \Vert^2 = \int_0^\infty \left[ f(x) - p(x) \right]^2 e^{-x} dx $$ In particular, find $p(x)$ by performing the orthogonal projection (with this dot product) of $f(x)$ onto .........?
(a) Before we solve the problem, recall that $\int_0^{\infty}x^n e^{-x} dx=n!$, which can be proved using induction and integration by parts.
We have three functions $a(x) = 1$, $b(x) = x$, and $c(x) = x^2$. Let's first use Gram-Schmidt to construct an orthogonal set of functions $A(x), B(x), C(x)$: \begin{align} A(x) &= 1\\ B(x) &= b(x) - \frac{A\cdot b}{A\cdot A}A(x) \end{align} Now $A\cdot A = \int_0^{\infty} e^{-x} dx = 1$ and $A\cdot b = \int_0^{\infty} xe^{-x} dx = 1$, so that \begin{align} B(x) &= x - 1 \end{align} Then \begin{align} C(x) = c(x) - \frac{A\cdot c}{A \cdot A}A(x) - \frac{B\cdot c}{B \cdot B}B(x), \end{align} where $A\cdot c = \int_0^{\infty} x^2 e^{-x} dx =2$, $B\cdot c = \int_0^{\infty} x^2(x-1) e^{-x} dx = 4$, and $ B\cdot B = \int_0^{\infty} (x-1)^2 e^{-x} dx = 1$, so that \begin{align} C(x) &= x^2 - \frac{2}{1}\times 1 - \frac{4}{1} (x-1)\\ &= x^2 - 4x + 2. \end{align} Finally we normalize $A, B, C$. We already found that $A\cdot A = B \cdot B = 1$. We can calculate $\Vert C \Vert^2 = C\cdot C = \int_0^{\infty} (x^2 - 4x + 2)^2 e^{-x} dx = 4$, so that \begin{align} \boxed{q_1(x) = 1, \;\; q_2(x) = x-1, \;\; q_3(x) = \frac{x^2 - 4x +2}{2}.} \end{align}
You might be interested to know that this is actually a well-known family of orthogonal polynomials, the Laguerre polynomials.
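As an optional numerical check of orthonormality under this inner product (a sketch, assuming the QuadGK.jl package, which is not part of the assignment):

using QuadGK
inner(f, g) = quadgk(x -> f(x) * g(x) * exp(-x), 0, Inf)[1]   # the given inner product f·g
q₁(x) = 1.0
q₂(x) = x - 1
q₃(x) = (x^2 - 4x + 2) / 2
inner(q₁, q₂), inner(q₁, q₃), inner(q₂, q₃)   # all ≈ 0 (orthogonality)
inner(q₁, q₁), inner(q₂, q₂), inner(q₃, q₃)   # all ≈ 1 (unit norm)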
(b) We find $p(x)$ by calculating the orthogonal projection of $f(x)$ onto the two-dimensional space of functions $\alpha x +\beta$, which is spanned by $1$ and $x$. Note that we have already computed an orthonormal basis for this space in part (a): $q_1(x)=1$ and $q_2(x)=x-1$. Hence the orthogonal projection is given by $$p(x)=(q_1 \cdot f) q_1(x)+ (q_2 \cdot f) q_2(x)$$ We have $q_1\cdot f=\int_0^{\infty} f(x) e^{-x} dx=\int_0^1 x e^{-x} dx= (-x-1)e^{-x}{\Big|}^{x=1}_{x=0}=1-2e^{-1}$ and $q_2\cdot f=\int_0^{\infty} (x-1)f(x) e^{-x} dx=\int_0^1 (x^2-x) e^{-x} dx= (-x^2-x-1)e^{-x}{\Big|}^{x=1}_{x=0}=1-3e^{-1}$. Hence $$ \boxed{p(x)=(1-2e^{-1})+(1-3e^{-1})(x-1)=(1-3e^{-1})x+e^{-1}} $$
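Again as an optional check (assuming QuadGK.jl), the two projection coefficients can be computed numerically; since $f(x)$ vanishes for $x \ge 1$, the integrals only run over $[0,1]$:

using QuadGK
c₁ = quadgk(x -> x * exp(-x), 0, 1)[1]            # q₁·f = ∫₀¹ x e⁻ˣ dx
c₂ = quadgk(x -> (x^2 - x) * exp(-x), 0, 1)[1]    # q₂·f = ∫₀¹ (x² − x) e⁻ˣ dx
c₁ ≈ 1 - 2 * exp(-1)      # should print "true"
c₂ ≈ 1 - 3 * exp(-1)      # should print "true"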
In a common variant of least-squares fitting called "ridge" or "Tikhonov" regression to get more robust solutions to noisy fitting problems, one minimizes: $$ f(x) = \Vert b - Ax \Vert^2 + \lambda \Vert x \Vert^2 $$ instead of just $\Vert b - Ax \Vert^2$, for some "penalty" parameter $\lambda \ge 0$. As $ \lambda$ gets bigger and bigger, this favors smaller solutions $x$. Here, $A$ is $m \times n$.
(a) Write this same $f(x)$ as $f(x) = \Vert c - Bx \Vert^2$ for some matrix $B$ and some vector $c$ defined in terms of $A,\lambda,b$. Hint: $\Vert x \Vert^2 + \Vert y \Vert^2$ equals the squared length of a single vector ______ made by stacking $x$ and $y$.
(b) Since part (a) rewrote $f(x)$ as an ordinary-looking least-squares problem, use the normal equations for $B$ and $c$ to write down an equation $(n\times n \mbox{ matrix})\hat{x} = (\mbox{right-hand side})$ for the minimizer $\hat{x}$ in terms of $A,\lambda,b$.
(c) For $\lambda > 0$, show that (b) has a unique solution even if $A$ is not full column rank (i.e. $A^TA$ is singular). In particular, why does your matrix on the left-hand-side have a nullspace of {0}?
A really common strategy in mathematics is to try to rewrite new problems so that they look like old problems, letting us re-use the old solutions!
(a) Following the hint, note that $\Vert x\Vert^2+\Vert y\Vert^2$ is the squared length of $\begin{bmatrix}x\\y\end{bmatrix}$ (we stack the coordinates of $x$ and $y$). Hence $f(x)=\Vert b - Ax \Vert^2 + \lambda \Vert x \Vert^2$ is the squared length of the vector $\begin{bmatrix}b-Ax\\\sqrt{\lambda}x\end{bmatrix}$, a vector in $\mathbb R^{m+n}$ whose top part has $m$ rows and whose bottom part has $n$ rows. We can rewrite this vector as $\begin{bmatrix}b\\ 0\end{bmatrix}-\begin{bmatrix}Ax\\\sqrt{\lambda}x\end{bmatrix}=\begin{bmatrix}b\\ 0\end{bmatrix}-\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}x$, where $B=\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}$ is the $(m+n)\times n$ matrix whose top $m\times n$ block is $A$ and whose bottom $n\times n$ block is $\sqrt{\lambda}$ times the identity matrix $I_n$ of size $n$. Hence $f(x)=\Vert c-Bx\Vert^2$ where $$ \boxed{B=\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}\qquad c= \begin{bmatrix}b\\ 0\end{bmatrix}} $$
(b) The normal equations have the form $B^TB\hat x=B^Tc$. Plugging the values from part (a) we have $$ B^TB=\begin{bmatrix}A^T&\sqrt{\lambda}I_n\end{bmatrix}\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}=A^TA+\lambda I_n $$ $$ B^Tc=\begin{bmatrix}A^T&\sqrt{\lambda}I_n\end{bmatrix}\begin{bmatrix}b\\ 0\end{bmatrix}=A^Tb $$ So, the equation for $\hat x$ is $$ \boxed{ (A^TA+\lambda I_n)\hat x=A^Tb } $$
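A short Julia check (a sketch with random data) that the stacked least-squares problem from (a) and the normal equations from (b) give the same minimizer:

using LinearAlgebra
m, n, λ = 10, 3, 0.1
A = randn(m, n); b = randn(m)
B = [A; sqrt(λ) * Matrix(I, n, n)]    # the (m+n)×n matrix from part (a)
c = [b; zeros(n)]
x̂_ls = B \ c                          # ordinary least squares on the stacked system
x̂_ne = (A'A + λ * I) \ (A'b)          # normal equations from part (b)
x̂_ls ≈ x̂_ne                           # should print "true"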
(c) Assume that $\lambda>0$ and $x\in N(A^TA+\lambda I_n)$, so $A^TAx+\lambda x=0$. Then we also have $x^TA^TAx+\lambda x^Tx=0$, or, in other words, $\Vert Ax\Vert^2+\lambda \Vert x\Vert ^2=0$. Since both $\Vert Ax\Vert^2$ and $\lambda\Vert x\Vert ^2$ are non-negative, this implies that $\lambda\Vert x\Vert ^2=0$. Using that $\lambda\neq 0$ we get $\Vert x\Vert ^2=0$, hence $x=0$. (Here we have mimicked the proof of $N(A^TA)\subset N(A)$ from the lecture.)
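To illustrate (c), here is a sketch with a deliberately rank-deficient $A$: $A^TA$ is singular, yet the ridge matrix $A^TA+\lambda I_n$ is still invertible for $\lambda>0$, so $\hat x$ is uniquely determined:

using LinearAlgebra
A = randn(10, 2)
A = [A  A[:, 1] + A[:, 2]]     # 10×3 matrix of rank 2, so AᵀA (3×3) is singular
b = randn(10); λ = 0.1
rank(A'A)                      # returns 2 < 3
rank(A'A + λ * I)              # returns 3: the nullspace of AᵀA + λI is {0}
x̂ = (A'A + λ * I) \ (A'b)      # the unique ridge solution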