Due 11am Wednesday March 30. This will be the last problem set covered on exam 2.
Suppose that A is an m\times n "tall" matrix (m>n) with full column rank (rank n). Consider the "wide" (underdetermined, i.e. fewer equations than unknowns) system of equations:
A^T x = b
You are also given the QR factorization A=QR of A, where Q is an m\times n matrix with orthonormal columns and R is an n\times n invertible upper-triangular matrix. (For example, maybe you got this by Gram-Schmidt on A.) This corresponds to a factorization A^T = R^T Q^T called an "LQ" factorization of A^T (because L = R^T is lower-triangular), which is used in practice to find minimum-norm solutions to underdetermined systems of equations (as opposed to least-squares solutions of overdetermined systems), as you will explore in this problem.
(a) A^T x = b has a solution for any b \in \mathbb{R}^n because ____________ but the solution x is not unique because __________.
(b) We can find a unique solution x_0 \in C(A) of A^T x_0 = b, given by the formula ____________ in terms of A,b.
(c) Alternatively, any x \in C(A) can be written x = Qy for some y \in \mathbb{R}^n because ____________. We can find the same solution x_0 = Qy of A^T x_0 = b as in (b), with y given by the formula ____________ in terms of Q,R,b (plug in A=QR and simplify!).
(d) Is the solution x_0 in (b) or (c) generally the same as the "particular" solution we would have gotten from the exam-1 approach of setting the free variables equal to zero? Why or why not? (Maybe try with A^T = [1\;1] and b=[1].)
(e) In terms of x_0 from (b) or (c), any solution x of A^T x = b (the "complete" solution) can be written as x = x_0 + x_1 where x_1 is any vector in the subspace ____________. The solution \hat{x} with the minimum norm \Vert x \Vert is therefore \hat{x} = \_\_\_\_\_\_. (Hint: write \Vert x \Vert^2 = x^Tx in terms of x_0 and x_1, remembering orthogonality of subspaces.)
(f) In Julia, x̂ = W \ b for any "wide" full-row-rank matrix W computes the minimum-norm solution \hat{x} to Wx = b. Check that this matches your answers from parts (c) and (e) for a random A using the code below:
using LinearAlgebra
A = randn(10, 3) # a random full-rank "tall" matrix
QR = qr(A) # its QR factorization
Q = Matrix(QR.Q) # the 10x3 "thin" Q factor
R = QR.R # and the 3x3 R factor
b = randn(3) # a random right-hand side
x̂ = A' \ b # the minimum-norm solution to Aᵀx = b using the built-in \
y = ?????? # your solution from (c), a formula in terms of Q, R, and/or b
x₀ = Q * y
x̂ₑ = ???? # your solution from (e)
x̂ₑ ≈ x̂ # should print "true"
(a) "A^T x = b has a solution for any b \in \mathbb{R}^n because C(A^T) has dimension n and C(A^T)=\mathbb R^n but the solution x is not unique because the dimension of N(A^T) is m-n>0 and hence N(A^T) is non-zero."
More precisely, since we know that A has dimensions m\times n and rank r=n, we know the dimensions of all four fundamental subspaces. In particular, the dimension of C(A^T) is the rank n, and since C(A^T) is a subspace of \mathbb R^n we get C(A^T)=\mathbb R^n, so any vector b is in C(A^T). The dimension of N(A^T) is m-r=m-n>0, so the nullspace of A^T contains nonzero vectors and the complete solution x+N(A^T) always contains more than one vector.
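As a quick numerical sanity check (just a sketch, reusing the random A from the code above), these dimension counts are easy to confirm in Julia:
rank(A') # 3 = n, so C(Aᵀ) = ℝ³ and Aᵀx = b is solvable for every b
size(nullspace(A'), 2) # 7 = m - n > 0, so N(Aᵀ) is nontrivial and x is not unique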
(b) "We can find a unique solution x_0 \in C(A) of A^T x_0 = b, given by the formula \boxed{x_0=A(A^TA)^{-1}b} in terms of A,b."
We want to find a vector x_0 such that A^Tx_0=b and x_0=Az_0. Let us find z_0: combining these two conditions, we have A^TAz_0=b. Note that since A is full column-rank, A^TA is invertible (proved in lectures), so A^TAz_0=b has a unique solution z_0=(A^TA)^{-1}b. Hence the desired x_0 is also unique and x_0=A(A^TA)^{-1}b.
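We can verify this formula numerically (a quick sketch, again reusing A and b from the code above):
x₀ = A * ((A'*A) \ b) # x₀ = A(AᵀA)⁻¹b, which lies in C(A) by construction
A' * x₀ ≈ b # should print "true": x₀ solves Aᵀx₀ = b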
(c) "Alternatively, any x \in C(A) can be written x = Qy for some y\in \mathbb{R}^n because for any x\in C(A) we have z such that x=Az=QRz, so we can set y=Rz. We can find the same solution x_0 = Qy of A^T x_0 = b as in (b), with y given by the formula \boxed{y=(R^T)^{-1}b} in terms of Q,R,b (plug in A=QR and simplify!)."
This expression for y is obtained by using that x_0=A (A^TA)^{-1}b from part (b), so y=R(A^TA)^{-1}b. Since Q has orthonormal columns, we have Q^TQ=I and A^TA=R^TQ^TQR=R^TR. Hence y=R(R^TR)^{-1}b=(R^T)^{-1}b.
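Numerically (a sketch reusing Q, R, and b from the code above), this agrees with part (b):
y = R' \ b # y = (Rᵀ)⁻¹b, a triangular solve since Rᵀ is lower-triangular
Q * y ≈ A * ((A'*A) \ b) # should print "true": the same x₀ as in (b)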
(d) The solution x_0 is generally different from the "particular" solution. The hint provides an example: when A^T = [1\;1] and b=[1], the solution x_0 from part (b) is \dfrac{1}{2}\begin{bmatrix}1\\1\end{bmatrix}, while the particular solution (sometimes called a "basic" solution) sets the free variable x_2 to 0 and equals \begin{bmatrix}1\\0\end{bmatrix}.
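This is easy to see numerically (a quick sketch; recall from part (f) that \ returns the minimum-norm solution for a wide matrix):
W = [1 1] # Aᵀ from the hint
W \ [1] # [0.5, 0.5], the minimum-norm solution, not the basic solution [1, 0]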
(e) "In terms of x_0 from (b) or (c), any solution x of A^T x = b (the "complete" solution) can be written as x = x_0 + x_1 where x_1 is any vector in the subspace \boxed{N(A^T)}. The solution \hat{x} with the minimum norm \Vert x \Vert is therefore \boxed{\hat{x} = x_0}. (Hint: write \Vert x \Vert^2 = x^Tx in terms of x_0 and x_1, remembering orthogonality of subspaces.)"
To show that \hat{x}=x_0, note that we want to minimize \Vert x\Vert^2=x_0^Tx_0+x_1^Tx_0+x_0^Tx_1+x_1^Tx_1, where x_0 is our solution from previous parts, which is fixed, and x_1 is an arbitrary vector from N(A^T). Recall that subspaces C(A) and N(A^T) are orthogonal complements of each other, hence x_0 is always orthogonal to x_1. Hence \Vert x\Vert^2=x_0^Tx_0+x_1^Tx_1=\Vert x_0\Vert^2+\Vert x_1\Vert^2 and we need to minimize \Vert x_1\Vert^2, which is achieved when x_1=0.
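Continuing the numerical sketch with the random A and b from above, we can confirm the orthogonality argument:
x₀ = A * ((A'*A) \ b) # the solution in C(A) from (b)
N = nullspace(A') # a basis for N(Aᵀ), of dimension m - n = 7
x₁ = N * randn(size(N, 2)) # a random vector in N(Aᵀ)
norm(x₀ + x₁)^2 ≈ norm(x₀)^2 + norm(x₁)^2 # true: the cross terms x₀ᵀx₁ vanish
norm(x₀ + x₁) ≥ norm(x₀) # true: adding any x₁ can only increase the norm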
(f) In Julia, x̂ = W \ b for any "wide" full-row-rank matrix W computes the minimum-norm solution \hat{x} to Wx = b. Check that this matches your answers from parts (c) and (e) for a random A using the code below:
using LinearAlgebra
A = randn(10, 3) # a random full-rank "tall" matrix
QR = qr(A) # its QR factorization
Q = Matrix(QR.Q) # the 10x3 "thin" Q factor
R = QR.R # and the 3x3 R factor
b = randn(3) # a random right-hand side
x̂ = A' \ b # the minimum-norm solution to Aᵀx = b using the built-in \
10-element Vector{Float64}:
  0.2122948675625042
 -0.05057152739601002
  0.08655272655122272
 -0.30100333944745666
 -0.12011586997211171
 -0.10275806015260548
  0.07650186369816225
 -0.0616304499080549
  0.00041573730377877927
  0.1162350744786731
y = R' \ b # the solution from (c): y = (Rᵀ)⁻¹b, via a triangular solve instead of an explicit inverse
x₀ = Q * y
x̂ₑ = x₀ # your solution from (e)
10-element Vector{Float64}:
  0.21229486756250404
 -0.050571527396009966
  0.08655272655122272
 -0.3010033394474565
 -0.12011586997211157
 -0.10275806015260544
  0.0765018636981623
 -0.06163044990805486
  0.00041573730377877753
  0.11623507447867301
x̂ₑ ≈ x̂ # should print "true"
true
(a) Apply Gram-Schmidt to the polynomials {1, x, x^2} to find an orthonormal basis {q_1,q_2,q_3} of polynomials under the inner ("dot") product (different from the one used in class): f \cdot g = \int_0^\infty f(x) g(x) e^{-x} dx (Unlike the Legendre polynomials in class, normalize your polynomials q_k to have \Vert q_k \Vert = \sqrt {q_k \cdot q_k} = 1 under this inner product, i.e. to be really orthonormal.)
(b) Consider the function f(x) = \begin{cases} x & x < 1 \\ 0 & x \ge 1 \end{cases}. Find the degree-1 polynomial p(x)=αx+β that is the "best fit" to f(x) in the sense of minimizing \Vert f - \alpha x - \beta \Vert^2 = \int_0^\infty \left[ f(x) - p(x) \right]^2 e^{-x} dx In particular, find p(x) by performing the orthogonal projection (with this dot product) of f(x) onto .........?
(a) Before we solve the problem, recall that \int_0^{\infty}x^n e^{-x} dx=n!, which can be proved using induction and integration by parts.
We have three functions a(x) = 1, b(x) = x and c(x) = x^2. Let's first use Gram-Schmidt to construct an orthogonal set of functions A(x), B(x), C(x): \begin{align} A(x) &= 1\\ B(x) &= b(x) - \frac{A\cdot b}{A\cdot A}A(x) \end{align} Now A\cdot A = \int_0^{\infty} e^{-x} dx = 1 and A\cdot b = \int_0^{\infty} xe^{-x} dx = 1, so that \begin{align} B(x) &= x - 1 \end{align} Then \begin{align} C(x) = c(x) - \frac{A\cdot c}{A \cdot A}A(x) - \frac{B\cdot c}{B \cdot B}B(x), \end{align} and A\cdot c = \int_0^{\infty} x^2 e^{-x} dx =2, B\cdot c = \int_0^{\infty} x^2(x-1) e^{-x} dx = 4, and B\cdot B = \int_0^{\infty} (x-1)^2 e^{-x} dx = 1, so that \begin{align} C(x) &= x^2 - \frac{2}{1}\times 1 - \frac{4}{1} (x-1)\\ &= x^2 - 4x + 2. \end{align} Finally we normalize A, B, C. We already found that A\cdot A = B \cdot B = 1. We can calculate \Vert C \Vert^2 = C\cdot C = \int_0^{\infty} (x^2 - 4x + 2)^2 e^{-x} dx = 4, so that \begin{align} \boxed{q_1(x) = 1, \;\; q_2(x) = x-1, \;\; q_3(x) = \frac{x^2 - 4x +2}{2}.} \end{align}
You might be interested to know that this is actually a well-known family of orthogonal polynomials, the Laguerre polynomials.
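As a numerical sanity check (a sketch assuming the external QuadGK.jl package is installed for the integrals), we can verify orthonormality under this inner product:
using QuadGK
q₁(x) = 1.0
q₂(x) = x - 1
q₃(x) = (x^2 - 4x + 2) / 2
dotw(f, g) = quadgk(x -> f(x) * g(x) * exp(-x), 0, Inf)[1] # the weighted dot product
[dotw(f, g) for f in (q₁, q₂, q₃), g in (q₁, q₂, q₃)] # ≈ the 3×3 identity matrix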
(b) We find p(x) by calculating the orthogonal projection of f(x) onto the two-dimensional space of functions \alpha x +\beta, which are spanned by 1,x. Note that we have already computed an orthonormal basis for this space in part (a), it is q_1(x)=1 and q_2(x)=x-1. Hence the orthogonal projection is given by p(x)=(q_1 \cdot f) q_1(x)+ (q_2 \cdot f) q_2(x) We have q_1\cdot f=\int_0^{\infty} f(x) e^{-x} dx=\int_0^1 x e^{-x} dx= (-x-1)e^{-x}{\Big|}^{x=1}_{x=0}=1-2e^{-1} and q_2\cdot f=\int_0^{\infty} (x-1)f(x) e^{-x} dx=\int_0^1 (x^2-x) e^{-x} dx= (-x^2-x-1)e^{-x}{\Big|}^{x=1}_{x=0}=1-3e^{-1}. Hence \boxed{p(x)=(1-2e^{-1})+(1-3e^{-1})(x-1)=(1-3e^{-1})x+e^{-1}}
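The same QuadGK sketch from part (a) can check these coefficients and the best-fit line:
f(x) = x < 1 ? x : 0.0 # the piecewise function being fit
c₁ = dotw(q₁, f) # ≈ 1 - 2/e ≈ 0.2642
c₂ = dotw(q₂, f) # ≈ 1 - 3/e ≈ -0.1036
p(x) = c₁ * q₁(x) + c₂ * q₂(x) # the projection: p(x) = (1 - 3/e)x + 1/e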
In a common variant of least-squares fitting called "ridge" or "Tikhonov" regression to get more robust solutions to noisy fitting problems, one minimizes: f(x) = \Vert b - Ax \Vert^2 + \lambda \Vert x \Vert^2 instead of just \Vert b - Ax \Vert^2, for some "penalty" parameter \lambda \ge 0. As \lambda gets bigger and bigger, this favors smaller solutions x. Here, A is m \times n.
(a) Write this same f(x) as f(x) = \Vert c - Bx \Vert^2 for some matrix B and some vector c defined in terms of A,\lambda,b. Hint: \Vert x \Vert^2 + \Vert y \Vert^2 equals the length of a single vector ______ made by stacking x and y.
(b) Since part (a) rewrote f(x) as an ordinary-looking least-squares problem, use the normal equations for B and c to write down an equation (n\times n \mbox{ matrix})\hat{x} = (\mbox{right-hand side}) for the minimizer \hat{x} in terms of A,\lambda,b.
(c) For \lambda > 0, show that (b) has a unique solution even if A is not full column rank (i.e. A^TA is singular). In particular, why does your matrix on the left-hand-side have a nullspace of {0}?
A really common strategy in mathematics is to try to rewrite new problems so that they look like old problems, letting us re-use the old solutions!
(a) Following the hint, note that \Vert x\Vert^2+\Vert y\Vert^2 is the squared length of \begin{bmatrix}x\\y\end{bmatrix} (we stack the coordinates of x and y). Hence f(x)=\Vert b - Ax \Vert^2 + \lambda \Vert x \Vert^2 is the squared length of the vector \begin{bmatrix}b-Ax\\\sqrt{\lambda}x\end{bmatrix}, a vector in \mathbb R^{m+n} whose top part has m rows and whose bottom part has n rows. We can rewrite this vector as \begin{bmatrix}b\\ 0\end{bmatrix}-\begin{bmatrix}Ax\\\sqrt{\lambda}x\end{bmatrix}=\begin{bmatrix}b\\ 0\end{bmatrix}-\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}x, where B=\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix} is an (m+n)\times n matrix whose top m\times n part is the matrix A and whose bottom n\times n part is \sqrt{\lambda} times the identity matrix I_n of size n. Hence f(x)=\Vert c-Bx\Vert^2 where \boxed{B=\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}\qquad c= \begin{bmatrix}b\\ 0\end{bmatrix}}
(b) The normal equations have the form B^TB\hat x=B^Tc. Plugging in the values from part (a), we have B^TB=\begin{bmatrix}A^T&\sqrt{\lambda}I_n\end{bmatrix}\begin{bmatrix}A\\\sqrt{\lambda}I_n\end{bmatrix}=A^TA+\lambda I_n and B^Tc=\begin{bmatrix}A^T&\sqrt{\lambda}I_n\end{bmatrix}\begin{bmatrix}b\\ 0\end{bmatrix}=A^Tb. So the equation for \hat x is \boxed{ (A^TA+\lambda I_n)\hat x=A^Tb }
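We can check (a) and (b) against each other numerically in Julia (a quick sketch with arbitrary sizes and λ):
using LinearAlgebra
A, b, λ = randn(10, 3), randn(10), 0.1
B = [A; sqrt(λ) * I] # the stacked (m+n)×n matrix from (a); I infers its size from A
c = [b; zeros(3)] # the stacked right-hand side
(B \ c) ≈ (A'*A + λ*I) \ (A'*b) # should print "true": least-squares on (B, c) solves (b)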
(c) Assume that \lambda>0 and x\in N(A^TA+\lambda I_n), so A^TAx+\lambda x=0. Then we also have x^TA^TAx+\lambda x^Tx=0, or, in other words, \Vert Ax\Vert^2+\lambda \Vert x\Vert ^2=0. Since both \Vert Ax\Vert^2 and \lambda\Vert x\Vert ^2 are non-negative, this implies that \lambda\Vert x\Vert ^2=0. Using that \lambda\neq 0 we get \Vert x\Vert ^2=0, hence x=0. (Here we have mimicked the proof of N(A^TA)\subset N(A) from the lecture.)
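Continuing the sketch above, even a deliberately rank-deficient A gives a unique \hat{x} once \lambda > 0:
A = randn(10, 2) * randn(2, 3) # a rank-2 "tall" matrix, so AᵀA is singular
rank(A'*A) # 2 < n = 3
rank(A'*A + 0.1*I) # 3: AᵀA + λI is invertible, so the solution x̂ is unique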