Due 11am Wednesday March 30. This will be the last problem set covered on exam 2.
Suppose that $A$ is an $m \times n$ "tall" matrix ($m > n$) with full column rank (rank $n$). Consider the "wide" (underdetermined, i.e. fewer equations than unknowns) system of equations:
$$ A^T x = b $$ You are also given the QR factorization $A = QR$ of $A$, where $Q$ is an $m \times n$ matrix with orthonormal columns and $R$ is an $n \times n$ invertible upper-triangular matrix. (For example, maybe you got this by Gram-Schmidt on $A$.) This corresponds to a factorization $A^T = R^T Q^T$ called an "LQ" factorization of $A^T$ (because $L = R^T$ is lower-triangular), which is used in practice to find minimum-norm solutions to underdetermined systems of equations (as opposed to least-squares solutions of overdetermined systems), as you will explore in this problem.
(a) $A^T x = b$ has a solution for any $b \in \mathbb{R}^n$ because ____________ but the solution $x$ is not unique because __________.
(b) We can find a unique solution $x_0 \in C(A)$ of $A^T x_0 = b$, given by the formula ____________ in terms of $A,b$.
(c) Alternatively, any $x \in C(A)$ can be written $x = Qy$ for some $y\in \mathbb{R}^n$ because ____________. We can find the same solution $x_0 = Qy$ of $A^T x_0 = b$ as in (b), with $y$ given by the formula ____________ in terms of $Q,R,b$ (plug in $A=QR$ and simplify!).
(d) Is the solution $x_0$ in (b) or (c) generally the same as the "particular" solution we would have gotten from the exam-1 approach of setting the free variables equal to zero? Why or why not? (Maybe try with $A^T = [1\;1]$ and $b=[1]$.)
(e) In terms of $x_0$ from (b) or (c), any solution $x$ of $A^T x = b$ (the "complete" solution) can be written as $x = x_0 + x_1$ where $x_1$ is any vector in the subspace ____________. The solution $\hat{x}$ with the minimum norm $\Vert x \Vert$ is therefore $\hat{x} = \_\_\_\_\_\_$. (Hint: write $\Vert x \Vert^2 = x^Tx$ in terms of $x_0$ and $x_1$, remembering orthogonality of subspaces.)
(f) In Julia, x̂ = W \ b computes the minimum-norm solution $\hat{x}$ to $Wx = b$ for any "wide" full-row-rank matrix $W$. Check that this matches your answer from parts (c) and (e) for a random $A$ using the code below:
using LinearAlgebra
A = randn(10, 3) # a random full-rank "tall" matrix
QR = qr(A) # its QR factorization
Q = Matrix(QR.Q) # the 10x3 "thin" Q factor
R = QR.R # and the 3x3 R factor
b = randn(3) # a random right-hand side
x̂ = A' \ b # the minimum-norm solution to Aᵀx = b using the built-in \
y = ?????? # your solution from (c), a formula in terms of Q, R, and/or b
x₀ = Q * y
x̂ₑ = ???? # your solution from (e)
x̂ₑ ≈ x̂ # should print "true"
(a) "$A^T x = b$ has a solution for any $b \in \mathbb{R}^n$ because $C(A^T)$ has dimension $n$ and $C(A^T)=\mathbb R^n$ but the solution $x$ is not unique because the dimension of $N(A^T)$ is $m-n>0$ and hence $N(A^T)$ is non-zero."
More precisely, since we know that $A$ has dimensions $m\times n$ and rank $r=n$, we know the dimensions of all four fundamental subspaces. In particular, the dimension of $C(A^T)$ is the rank $n$, and since $C(A^T)$ is a subspace of $\mathbb R^n$, we get $C(A^T)=\mathbb R^n$, so any vector $b$ is in $C(A^T)$. The dimension of $N(A^T)$ is $m-r=m-n>0$, so the nullspace of $A^T$ contains nonzero vectors and the complete solution set $x+N(A^T)$ always contains more than one vector.
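As an optional numerical sanity check (not part of the problem), we can confirm these dimension counts in Julia for a random tall matrix:

using LinearAlgebra
m, n = 10, 3
A = randn(m, n)            # a random tall matrix (full column rank with probability 1)
rank(A')                   # returns n = 3: C(Aᵀ) is all of ℝⁿ, so Aᵀx = b is always solvable
size(nullspace(A'), 2)     # returns m - n = 7: N(Aᵀ) contains nonzero vectors, so solutions are not unique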
(b) "We can find a unique solution $x_0 \in C(A)$ of $A^T x_0 = b$, given by the formula $\boxed{x_0=A(A^TA)^{-1}b}$ in terms of $A,b$."
We want to find a vector $x_0$ such that $A^Tx_0=b$ and $x_0=Az_0$. Let us find $z_0$: combining these two conditions, we have $A^TAz_0=b$. Note that since $A$ is full column-rank, $A^TA$ is invertible (proved in lectures), so $A^TAz_0=b$ has a unique solution $z_0=(A^TA)^{-1}b$. Hence the desired $x_0$ is also unique and $x_0=A(A^TA)^{-1}b$.
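As a quick check of this formula (a sketch with a random full-rank $A$ and right-hand side $b$, using Julia's built-in linear algebra):

using LinearAlgebra
A = randn(10, 3)            # random full column-rank A
b = randn(3)
x₀ = A * ((A'A) \ b)        # x₀ = A(AᵀA)⁻¹b; x₀ = Az₀ is in C(A) by construction
A' * x₀ ≈ b                 # should print "true": x₀ solves Aᵀx₀ = b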
(c) "Alternatively, any $x \in C(A)$ can be written $x = Qy$ for some $y\in \mathbb{R}^n$ because for any $x\in C(A)$ we have $z$ such that $x=Az=QRz$, so we can set $y=Rz$. We can find the same solution $x_0 = Qy$ of $A^T x_0 = b$ as in (b), with $y$ given by the formula $\boxed{y=(R^T)^{-1}b}$ in terms of $Q,R,b$ (plug in $A=QR$ and simplify!)."
This expression for $y$ is obtained by using $x_0=A (A^TA)^{-1}b$ from part (b), so $y=R(A^TA)^{-1}b$. Since $Q$ has orthonormal columns, we have $Q^TQ=I$ and $A^TA=R^TQ^TQR=R^TR$. Hence, since $R$ is invertible, $y=R(R^TR)^{-1}b=RR^{-1}(R^T)^{-1}b=(R^T)^{-1}b$.
(d) The solution $x_0$ is generally different from the "particular" solution. The hint provides an example: when $A^T = [1\;1]$ and $b=[1]$, the solution $x_0$ from part (b) is $\dfrac{1}{2}\begin{bmatrix}1\\1\end{bmatrix}$, while the particular solution from the exam-1 approach (sometimes called a "basic" solution), which sets the free variable $x_2$ to $0$, is $\begin{bmatrix}1\\0\end{bmatrix}$.
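We can see this difference directly in Julia (a minimal check, using the fact from part (f) below that the built-in \ returns the minimum-norm solution for a wide full-row-rank matrix):

x₀ = [1 1] \ [1.0]        # minimum-norm solution from (b)/(c): [0.5, 0.5]
xₚ = [1.0, 0.0]           # exam-1 "particular" solution: free variable x₂ set to 0
[1 1] * xₚ ≈ [1.0]        # both solve Aᵀx = b, but x₀ ≠ xₚ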
(e) "In terms of $x_0$ from (b) or (c), any solution $x$ of $A^T x = b$ (the "complete" solution) can be written as $x = x_0 + x_1$ where $x_1$ is any vector in the subspace $\boxed{N(A^T)}$. The solution $\hat{x}$ with the minimum norm $\Vert x \Vert$ is therefore $\boxed{\hat{x} = x_0}$. (Hint: write $\Vert x \Vert^2 = x^Tx$ in terms of $x_0$ and $x_1$, remembering orthogonality of subspaces.)"
To show that $\hat{x}=x_0$, note that we want to minimize $\Vert x\Vert^2=x_0^Tx_0+x_1^Tx_0+x_0^Tx_1+x_1^Tx_1$, where $x_0$ is our solution from previous parts, which is fixed, and $x_1$ is an arbitrary vector from $N(A^T)$. Recall that subspaces $C(A)$ and $N(A^T)$ are orthogonal complements of each other, hence $x_0$ is always orthogonal to $x_1$. Hence $\Vert x\Vert^2=x_0^Tx_0+x_1^Tx_1=\Vert x_0\Vert^2+\Vert x_1\Vert^2$ and we need to minimize $\Vert x_1\Vert^2$, which is achieved when $x_1=0$.
(f) In Julia, x̂ = W \ b computes the minimum-norm solution $\hat{x}$ to $Wx = b$ for any "wide" full-row-rank matrix $W$. Check that this matches your answer from parts (c) and (e) for a random $A$ using the code below:
using LinearAlgebra
A = randn(10, 3) # a random full-rank "tall" matrix
QR = qr(A) # its QR factorization
Q = Matrix(QR.Q) # the 10x3 "thin" Q factor
R = QR.R # and the 3x3 R factor
b = randn(3) # a random right-hand side
x̂ = A' \ b # the minimum-norm solution to Aᵀx = b using the built-in \
10-element Vector{Float64}:
  0.2122948675625042
 -0.05057152739601002
  0.08655272655122272
 -0.30100333944745666
 -0.12011586997211171
 -0.10275806015260548
  0.07650186369816225
 -0.0616304499080549
  0.00041573730377877927
  0.1162350744786731
y = R' \ b # solution from (c): y = (Rᵀ)⁻¹b, in terms of R and b
x₀ = Q * y
x̂ₑ = x₀ # solution from (e): the minimum-norm solution is x₀ itself
10-element Vector{Float64}:
  0.21229486756250404
 -0.050571527396009966
  0.08655272655122272
 -0.3010033394474565
 -0.12011586997211157
 -0.10275806015260544
  0.0765018636981623
 -0.06163044990805486
  0.00041573730377877753
  0.11623507447867301
x̂ₑ ≈ x̂ # should print "true"
true
(a) Apply Gram-Schmidt to the polynomials $\{1, x, x^2\}$ to find an orthonormal basis $\{q_1,q_2,q_3\}$ of polynomials under the inner ("dot") product (different from the one used in class): $$ f \cdot g = \int_0^\infty f(x) g(x) e^{-x} dx $$ (Unlike the Legendre polynomials in class, normalize your polynomials $q_k$ to have $\Vert q_k \Vert = \sqrt {q_k \cdot q_k} = 1$ under this inner product, i.e. to be really orthonormal.)
(b) Consider the function $f(x) = \begin{cases} x & x < 1 \\ 0 & x \ge 1 \end{cases}$. Find the degree-1 polynomial $p(x)=\alpha x+\beta$ that is the "best fit" to $f(x)$ in the sense of minimizing $$ \Vert f - \alpha x - \beta \Vert^2 = \int_0^\infty \left[ f(x) - p(x) \right]^2 e^{-x} dx $$ In particular, find $p(x)$ by performing the orthogonal projection (with this dot product) of $f(x)$ onto .........?
(a) Before we solve the problem, recall that $\int_0^{\infty}x^n e^{-x} dx=n!$, which can be proved using induction and integration by parts.
We have three functions $a(x) = 1$, $b(x) = x$, and $c(x) = x^2$. Let's first use Gram-Schmidt to construct an orthogonal set of functions $A(x), B(x), C(x)$: \begin{align} A(x) &= 1\\ B(x) &= b(x) - \frac{A\cdot b}{A\cdot A}A(x) \end{align} Now $A\cdot A = \int_0^{\infty} e^{-x} dx = 1$ and $A\cdot b = \int_0^{\infty} xe^{-x} dx = 1$, so that \begin{align} B(x) &= x - 1 \end{align} Then \begin{align} C(x) = c(x) - \frac{A\cdot c}{A \cdot A}A(x) - \frac{B\cdot c}{B \cdot B}B(x), \end{align} where $A\cdot c = \int_0^{\infty} x^2 e^{-x} dx =2$, $B\cdot c = \int_0^{\infty} x^2(x-1) e^{-x} dx = 4$, and $ B\cdot B = \int_0^{\infty} (x-1)^2 e^{-x} dx = 1$, so that \begin{align} C(x) &= x^2 - \frac{2}{1}\times 1 - \frac{4}{1} (x-1)\\ &= x^2 - 4x + 2. \end{align} Finally we normalize $A, B, C$. We already found that $A\cdot A = B \cdot B = 1$. We can calculate $\Vert C \Vert^2 = C\cdot C = \int_0^{\infty} (x^2 - 4x + 2)^2 e^{-x} dx = 4$, so that \begin{align} \boxed{q_1(x) = 1, \;\; q_2(x) = x-1, \;\; q_3(x) = \frac{x^2 - 4x +2}{2}.} \end{align}
You might be interested to know that this is actually a well-known family of orthogonal polynomials, the Laguerre polynomials.
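As an optional numerical check of orthonormality under this inner product (a sketch, assuming the QuadGK.jl package, which is not part of the assignment):

using QuadGK
inner(f, g) = quadgk(x -> f(x) * g(x) * exp(-x), 0, Inf)[1]   # the given inner product f·g
q₁(x) = 1.0
q₂(x) = x - 1
q₃(x) = (x^2 - 4x + 2) / 2
inner(q₁, q₂), inner(q₁, q₃), inner(q₂, q₃)   # all ≈ 0 (orthogonality)
inner(q₁, q₁), inner(q₂, q₂), inner(q₃, q₃)   # all ≈ 1 (unit norm)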
(b) We find $p(x)$ by calculating the orthogonal projection of $f(x)$ onto the two-dimensional space of functions $\alpha x +\beta$, which is spanned by $1$ and $x$. Note that we have already computed an orthonormal basis for this space in part (a): $q_1(x)=1$ and $q_2(x)=x-1$. Hence the orthogonal projection is given by $$p(x)=(q_1 \cdot f) q_1(x)+ (q_2 \cdot f) q_2(x)$$ We have $q_1\cdot f=\int_0^{\infty} f(x) e^{-x} dx=\int_0^1 x e^{-x} dx= (-x-1)e^{-x}{\Big|}^{x=1}_{x=0}=1-2e^{-1}$ and $q_2\cdot f=\int_0^{\infty} (x-1)f(x) e^{-x} dx=\int_0^1 (x^2-x) e^{-x} dx= (-x^2-x-1)e^{-x}{\Big|}^{x=1}_{x=0}=1-3e^{-1}$. Hence $$ \boxed{p(x)=(1-2e^{-1})+(1-3e^{-1})(x-1)=(1-3e^{-1})x+e^{-1}} $$
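Again as an optional check (assuming QuadGK.jl), the two projection coefficients can be computed numerically; since $f(x)$ vanishes for $x \ge 1$, the integrals only run over $[0,1]$:

using QuadGK
c₁ = quadgk(x -> x * exp(-x), 0, 1)[1]            # q₁·f = ∫₀¹ x e⁻ˣ dx
c₂ = quadgk(x -> (x^2 - x) * exp(-x), 0, 1)[1]    # q₂·f = ∫₀¹ (x² − x) e⁻ˣ dx
c₁ ≈ 1 - 2 * exp(-1)      # should print "true"
c₂ ≈ 1 - 3 * exp(-1)      # should print "true"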
In a common variant of least-squares fitting called "ridge" or "Tikhonov" regression to get more robust solutions to noisy fitting problems, one minimizes: $$ f(x) = \Vert b - Ax \Vert^2 + \lambda \Vert x \Vert^2 $$ instead of just $\Vert b - Ax \Vert^2$, for some "penalty" parameter $\lambda \ge 0$. As $ \lambda$ gets bigger and bigger, this favors smaller solutions $x$. Here, $A$ is $m \times n$.
(a) Write this same $f(x)$ as $f(x) = \Vert c - Bx \Vert^2$ for some matrix $B$ and some vector $c$ defined in terms of $A,\lambda,b$. Hint: $\Vert x \Vert^2 + \Vert y \Vert^2$ equals the squared length of a single vector ______ made by stacking $x$ and $y$.
(b) Since part (a) rewrote $f(x)$ as an ordinary-looking least-squares problem, use the normal equations for $B$ and $c$ to write down an equation $(n\times n \mbox{ matrix})\hat{x} = (\mbox{right-hand side})$ for the minimizer $\hat{x}$ in terms of $A,\lambda,b$.
(c) For $\lambda > 0$, show that (b) has a unique solution even if $A$ is not full column rank (i.e. $A^TA$ is singular). In particular, why does your matrix on the left-hand-side have a nullspace of {0}?
A really common strategy in mathematics is to try to rewrite new problems so that they look like old problems, letting us re-use the old solutions!
(a) Following the hint, note that $\Vert x\Vert^2+\Vert y\Vert^2$ is the squared length of $\begin{bmatrix}x\\y\end{bmatrix}$ (we stack the coordinates of $x$ and $y$). Hence $f(x)=\Vert b - Ax \Vert^2 + \lambda \Vert x \Vert^2$ is the squared length of the vector $\begin{bmatrix}b-Ax\\\sqrt{\lambda}x\end{bmatrix}$, a vector in $\mathbb R^{m+n}$ whose top part has $m$ rows and whose bottom part has $n$ rows. We can rewrite this vector as $\begin{bmatrix}b\\ 0\end{bmatrix}-\begin{bmatrix}Ax\\\sqrt{\lambda}x\end{bmatrix}=\begin{bmatrix}b\\ 0\end{bmatrix}-\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}x$, where $B=\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}$ is the $(m+n)\times n$ matrix whose top $m\times n$ block is $A$ and whose bottom $n\times n$ block is $\sqrt{\lambda}$ times the identity matrix $I_n$ of size $n$. Hence $f(x)=\Vert c-Bx\Vert^2$ where $$ \boxed{B=\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}\qquad c= \begin{bmatrix}b\\ 0\end{bmatrix}} $$
(b) The normal equations have the form $B^TB\hat x=B^Tc$. Plugging the values from part (a) we have $$ B^TB=\begin{bmatrix}A^T&\sqrt{\lambda}I_n\end{bmatrix}\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}=A^TA+\lambda I_n $$ $$ B^Tc=\begin{bmatrix}A^T&\sqrt{\lambda}I_n\end{bmatrix}\begin{bmatrix}b\\ 0\end{bmatrix}=A^Tb $$ So, the equation for $\hat x$ is $$ \boxed{ (A^TA+\lambda I_n)\hat x=A^Tb } $$
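A short Julia check (a sketch with random data) that the stacked least-squares problem from (a) and the normal equations from (b) give the same minimizer:

using LinearAlgebra
m, n, λ = 10, 3, 0.1
A = randn(m, n); b = randn(m)
B = [A; sqrt(λ) * Matrix(I, n, n)]    # the (m+n)×n matrix from part (a)
c = [b; zeros(n)]
x̂_ls = B \ c                          # ordinary least squares on the stacked system
x̂_ne = (A'A + λ * I) \ (A'b)          # normal equations from part (b)
x̂_ls ≈ x̂_ne                           # should print "true"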
(c) Assume that $\lambda>0$ and $x\in N(A^TA+\lambda I_n)$, so $A^TAx+\lambda x=0$. Then we also have $x^TA^TAx+\lambda x^Tx=0$, or, in other words, $\Vert Ax\Vert^2+\lambda \Vert x\Vert ^2=0$. Since both $\Vert Ax\Vert^2$ and $\lambda\Vert x\Vert ^2$ are non-negative, this implies that $\lambda\Vert x\Vert ^2=0$. Using that $\lambda\neq 0$ we get $\Vert x\Vert ^2=0$, hence $x=0$. (Here we have mimicked the proof of $N(A^TA)\subset N(A)$ from the lecture.)
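To illustrate (c), here is a sketch with a deliberately rank-deficient $A$: $A^TA$ is singular, yet the ridge matrix $A^TA+\lambda I_n$ is still invertible for $\lambda>0$, so $\hat x$ is uniquely determined:

using LinearAlgebra
A = randn(10, 2)
A = [A  A[:, 1] + A[:, 2]]     # 10×3 matrix of rank 2, so AᵀA (3×3) is singular
b = randn(10); λ = 0.1
rank(A'A)                      # returns 2 < 3
rank(A'A + λ * I)              # returns 3: the nullspace of AᵀA + λI is {0}
x̂ = (A'A + λ * I) \ (A'b)      # the unique ridge solution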