Any $n$ linearly independent vectors in $\mathbb{R}^n$ form a basis, but some basis choices can have distorting effects on data. In particular, nearly parallel vectors can have unpleasant results as a basis.
Consider the vector space $\mathbb{R}^2$ and the nearly parallel basis vectors $\vec{a}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\vec{a}_2 = \begin{pmatrix} 1 \\ 10^{-6} \end{pmatrix}$.
(a) Write the vectors $\vec{b} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \vec{a}_1 x_1 + \vec{a}_2 x_2$ and $\vec{c} = \begin{pmatrix} 1 \\ 0.1 \end{pmatrix} = \vec{a}_1 y_1 + \vec{a}_2 y_2$ in the basis $\vec{a}_1, \vec{a}_2$: find the coefficients $\vec{x}$ and $\vec{y}$.
(b) Compare $\Vert \vec{c}-\vec{b}\Vert$ to $\Vert \vec{y}-\vec{x}\Vert$: what is the ratio $\Vert \vec{y}-\vec{x}\Vert / \Vert \vec{c}-\vec{b}\Vert$ (to 2 significant digits)?
(c) Suppose that you instead have an orthonormal basis given by the columns of $Q$ (an $m\times n$ matrix … not just $\mathbb{R}^2$ in this part). For any vectors $\vec{b} = Q\vec{x}$ and $\vec{c} = Q\vec{y}$ in this basis, show that $\Vert \vec{y}-\vec{x}\Vert = \Vert \vec{c}-\vec{b}\Vert$: orthonormal bases "preserve distances".
(a) Since $\vec{a}_1 = \vec{b}$, we can let $\boxed{x_1 = 1}$ and $\boxed{x_2 = 0}$ to get $\vec{b} = \vec{a}_1 \cdot 1 + \vec{a}_2 \cdot 0$.
$\vec{c}$ is not so easy by inspection. More generally, we need to solve: $$ \vec{a}_1 y_1 + \vec{a}_2 y_2 = \begin{pmatrix} \vec{a}_1 \vec{a}_2 \end{pmatrix} \vec{y} = \underbrace{\begin{pmatrix} 1 & 1 \\ 0 & 10^{-6} \end{pmatrix}}_A \vec{y} = \vec{c} = \begin{pmatrix} 1 \\ 0.1 \end{pmatrix} $$ This $A$ is already upper triangular, so we can proceed by backsubstitution to obtain $\boxed{y_2 = 10^5}$ and $\boxed{y_1 = 1 - 10^5}$.
(b) First we compute $$\Vert\vec{c} - \vec{b}\Vert = \left\Vert \begin{pmatrix} 1 - 1 \\ 0 - .1 \end{pmatrix}\right\Vert = \left\Vert\begin{pmatrix} 0 \\ - .1 \end{pmatrix}\right\Vert = .1$$. Next, we see that $$\Vert\vec{y} - \vec{x}\Vert = \left\Vert\begin{pmatrix}(1-10^5) -1 \\ 10^5 - 0 \end{pmatrix}\right\Vert = 10^5 \sqrt{2}$$. The ratio is $\frac{10^5 \sqrt{2}}{.1} = \boxed{10^6 \sqrt{2}} \approx 1.4 \times 10^6$.
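We can confirm parts (a) and (b) numerically with a quick Julia check (a small sketch; the only input is the matrix $A$ of basis vectors defined above):

# numerical check of parts (a) and (b):
using LinearAlgebra
A = [1 1
     0 1e-6]       # columns are a₁ and a₂
b = [1, 0]
c = [1, 0.1]
x = A \ b          # coefficients of b in the basis a₁, a₂  (= [1, 0])
y = A \ c          # coefficients of c (≈ [1-10⁵, 10⁵])
norm(y - x) / norm(c - b)   # ≈ 1.4×10⁶, i.e. √2 × 10⁶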
Notice what has happened here: the basis vectors were nearly parallel and led to a huge distortion of distances, since a tiny change in the original coordinate system corresponded to a huge change in the new coordinate system. The matrix $A$ is "nearly singular", and we will eventually call it ill-conditioned (and define this more precisely); we will see that such ill-conditioned matrices can dramatically amplify small changes, which is typically not desirable. For example, it means that if you make a small error in an input vector $\vec{b}$ (due, say, to real-world measurement error), you get a huge change in the solution $\vec{x} = A^{-1} \vec{b}$.
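As a preview of that, Julia's built-in `cond` computes a matrix's condition number, which quantifies this amplification; for our nearly parallel basis it is about $2\times 10^6$, the same order of magnitude as the ratio found in part (b):

using LinearAlgebra
cond([1 1; 0 1e-6])   # ≈ 2.0×10⁶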
(c) We can just substitute in the definitions: $$ \Vert\vec{c} - \vec{b}\Vert = \Vert Q\vec{y} - Q\vec{x}\Vert = \sqrt{(Q\vec{y} - Q\vec{x})^T (Q\vec{y} - Q\vec{x})} \\ = \sqrt{[Q(\vec{y} - \vec{x})]^T Q(\vec{y} - \vec{x})} = \sqrt{(\vec{y} - \vec{x})^T \underbrace{Q^T Q}_I (\vec{y} - \vec{x})} = \Vert\vec{y} - \vec{x}\Vert . $$ where we have used the transpose property $(Qz)^T = z^T Q^T$ and the fact that orthonormality of the columns of $Q$ means that $Q^T Q = I$.
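Here is a quick numerical illustration of part (c), using one arbitrarily chosen $4\times 3$ matrix $Q$ with orthonormal columns (any such $Q$ would do):

# illustration of part (c): an orthonormal basis preserves distances
using LinearAlgebra
Q = [1/√2  1/√2  0
     1/√2 -1/√2  0
     0     0     1
     0     0     0]           # Qᵀ Q = I, but Q Qᵀ ≠ I
x = [1.0, 2, 3]
y = [4.0, 5, 6]
norm(y - x), norm(Q*y - Q*x)  # the two distances should agree (up to roundoff)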
Suppose we have four unknowns $\vec{x} = [x_1, x_2, x_3, x_4] \in \mathbb{R}^4$, and we want to solve the four nonlinear equations: $$ \underbrace{\begin{pmatrix} -5 & -1 & -2 \\ -1 & -2 & 4 \\ -2 & 4 & -8 \end{pmatrix}}_A \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = x_4 \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \, , \\ x_1^2 + x_2^2 + x_3^2 = 1 \, . $$
(a) Rewrite this equation in the form $\vec{f}(\vec{x}) = \vec{0}$, and compute both $\vec{f}(\vec{x})$ and its Jacobian matrix $J(\vec{x})$.
(b) Fill in the Julia functions below to compute $f$ and $J$, and check that $\vec{f}(\vec{x} + \vec{\delta}) - \vec{f}(\vec{x})\approx J(\vec{x}) \vec{\delta}$ for $x=[1,2,3,4]$ and a "random" small vector $\delta = [-2,3,4,1] \times 10^{-8}$, using the sample code below.
(c) Starting with `x=[1,2,3,4]`, run a few steps of Newton's method (either copy-and-paste a few steps of `@show x = x - J(x) \ f(x)` by hand, or write a loop), and verify that it converges quickly. Check that after a few steps you have a very accurate solution: make sure that `A*x[1:3]` is nearly `x[4]*x[1:3]` and that `x[1:3]'*x[1:3]` is nearly 1. How many Newton steps did it take for `x[4]` to have 4 correct decimal digits?
(a) Let's simplify the notation a bit by defining $\boxed{\vec{x}_{1:3} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}}$. Then we have four equations: $$ A\vec{x}_{1:3} = x_4 \vec{x}_{1:3} \\ \vec{x}_{1:3}^T \vec{x}_{1:3} = 1 $$ To get it into the form $f(\vec{x}) = 0$, we just need to move everything to the left-hand side, i.e. we have: $$ A\vec{x}_{1:3} - x_4 \vec{x}_{1:3} = (A - x_4 I)\vec{x}_{1:3} = \vec{0} \\ \vec{x}_{1:3}^T \vec{x}_{1:3} - 1 = 0 $$ which is equivalent to a vector of four equations: $$ \boxed{\vec{f}(\vec{x}) = \begin{pmatrix} (A - x_4 I)\vec{x}_{1:3} \\ \vec{x}_{1:3}^T \vec{x}_{1:3} - 1 \end{pmatrix}} = \vec{0} \, . $$
There are various ways to get the Jacobian matrix for $\vec{f}$. In 18.02 style, we could write down all 4 equations in $x_1,x_2,x_3,x_4$ and take all 16 partial derivatives, then put them into a matrix. It's a lot easier (once you get used to it) to use the 18.06 style of looking at whole matrices and vectors at once, and applying the product rule: $$ d\vec{f} = \begin{pmatrix} d(A - x_4 I)\vec{x}_{1:3} +(A - x_4 I)d\vec{x}_{1:3} \\ d\vec{x}_{1:3}^T \vec{x}_{1:3} + \vec{x}_{1:3}^T d\vec{x}_{1:3} \end{pmatrix} = \begin{pmatrix} -dx_4 \vec{x}_{1:3} +(A - x_4 I)d\vec{x}_{1:3} \\ 2\vec{x}_{1:3}^T d\vec{x}_{1:3} \end{pmatrix} = \boxed{\underbrace{\begin{pmatrix} A - x_4 I & -\vec{x}_{1:3} \\ 2\vec{x}_{1:3}^T & 0 \end{pmatrix}}_{J(\vec{x})}} \underbrace{\begin{pmatrix} d\vec{x}_{1:3} \\ dx_4 \end{pmatrix}}_{d\vec{x}} $$ Notice that we've once again used the trick that you can divide matrices and vectors into blocks (like the $3\times 3$ block $A - x_4 I$), and still use the "rows-times-columns" rule to multiply (as long as we don't commute terms, since they are no longer scalars).
Alternatively, proceeding component-by-component in 18.02 style: The first component $f_1(x) = (-5-x_4)x_1 - x_2 - 2x_3$ has partial derivatives $\frac{\partial f_1}{\partial x_1} = -5-x_4$, $\frac{\partial f_1}{\partial x_2} = -1$, $\frac{\partial f_1}{\partial x_3} = -2$, $\frac{\partial f_1}{\partial x_4} = -x_1$. The pattern is clear: for row $j$, the first three partial derivatives form the $j$-th row of $A - x_4 I$, and the last partial derivative $\frac{\partial f_j}{\partial x_4}$ is always $-x_j$. The final row of the Jacobian can be computed directly as $(2x_1, 2x_2, 2x_3, 0)$. Putting this together, $$J(\vec{x}) = \begin{pmatrix} -5-x_4 & -1 & -2 & -x_1 \\ -1 & -2-x_4 & 4 & -x_2 \\ -2 & 4 & -8-x_4 & - x_3 \\ 2x_1 & 2x_2 & 2x_3 & 0\end{pmatrix},$$ which is the same as the matrix above.
# part (b)
using LinearAlgebra
A = [ -5 -1 -2
      -1 -2  4
      -2  4 -8]
function f(x)
    return [ (A - x[4]*I)*x[1:3]  # (A-x₄I)x₁₂₃
             x[1:3]'x[1:3] - 1 ]  # x₁₂₃ᵀx₁₂₃ - 1
end
function J(x)
    # return a 4x4 matrix of the Jacobian
    return [ A-x[4]*I -x[1:3]
             2x[1:3]'  0      ]
end
J (generic function with 1 method)
# part (b) continued: check
x = [1,2,3,4]
δ = [-2,3,4,1] * 1e-8 # a "random" small vector
# these should be very close in value:
@show f(x+δ) - f(x)
@show J(x)*δ
# to be more quantitative, let's compute the relative error between δf and Jδx
norm(f(x+δ)-f(x) - J(x)*δ) / norm(f(x+δ)-f(x))
f(x + δ) - f(x) = [6.000000141170858e-8, -1.999999810209374e-8, -3.500000076428478e-7, 3.2000000160792297e-7]
J(x) * δ = [6.0e-8, -2.0000000000000027e-8, -3.5000000000000004e-7, 3.2000000000000006e-7]
1.7056622631393434e-8
Yup, they are pretty darn close!
How close is "darn close"? The overall magnitude of the vectors is tiny because $\delta$ is a tiny vector, so that's irrelevant. The right way to look at it is to look at how many significant digits match, and in this case it looks like they match to about 7-8 digits, which is enough to satisfy ourselves that we implemented f(x)
and J(x)
correctly (or at least consistently).
Note that they will not match exactly, because the relationship is only exact in the limit $\Vert\delta\Vert \to 0$; the derivative $J\delta$ drops higher-order terms. (But we don't want to make `δ` too much smaller on the computer because we will run out of digits — remember, the computer only does arithmetic to 15-16 digits. See also this notebook on finite-difference checks for more information.)
A more precise way to compare two vectors $a$ and $b$ is not just to ask whether $a-b$ is "small", but small compared to what? Typically, you want $\Vert a - b\Vert$ to be small compared to $\Vert a \Vert$ and/or $\Vert b \Vert$, or equivalently you want the relative error $\Vert a - b\Vert / \Vert a \Vert$ to be small. On the last line of the code above I computed the relative error
$$
\frac{\Vert \delta f - J \delta x \Vert}{\Vert \delta f \Vert}
$$
and it turns out to be `1.7e-8`, consistent with our observation above that the derivative and difference matched to almost 8 significant digits.
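As an extra experiment (a small sketch reusing the `f`, `J`, and `x` defined above), we can shrink `δ` over several orders of magnitude: the relative error first shrinks (the dropped higher-order terms get smaller) but eventually grows again once the ≈16-digit roundoff takes over.

# optional experiment: how the relative error depends on the size of δ
for scale in (1e-4, 1e-6, 1e-8, 1e-12)
    δ2 = [-2,3,4,1] * scale
    @show scale, norm(f(x+δ2) - f(x) - J(x)*δ2) / norm(f(x+δ2) - f(x))
end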
# part (c)
x = [1,2,3,4]
# a single Newton step:
#@show x = x - J(x) \ f(x)
# Here is a loop:
for i = 1:10
    @show x = x - J(x) \ f(x)
end
x = x - J(x) \ f(x) = [-0.1952807160292922, 1.8429617575264445, 1.336452400325468, 1.2416598860862478]
x = x - J(x) \ f(x) = [-0.41009227663755143, 1.1694709608315415, 0.6547187442392998, 0.8281625115675232]
x = x - J(x) \ f(x) = [-0.33044311645139424, 0.8791948182276725, 0.48653309970019226, 0.6486537276345775]
x = x - J(x) \ f(x) = [-0.31363721207805906, 0.832271623035906, 0.46055892881484034, 0.5934595790624397]
x = x - J(x) \ f(x) = [-0.31315062703635266, 0.8309587069065312, 0.4598334470658898, 0.5903675561447501]
x = x - J(x) \ f(x) = [-0.3131502414279795, 0.8309576722193772, 0.45983287537419326, 0.590362680184185]
x = x - J(x) \ f(x) = [-0.3131502414277405, 0.8309576722187336, 0.459832875373838, 0.5903626801781222]
x = x - J(x) \ f(x) = [-0.31315024142774045, 0.8309576722187335, 0.45983287537383793, 0.5903626801781224]
x = x - J(x) \ f(x) = [-0.3131502414277405, 0.8309576722187336, 0.459832875373838, 0.5903626801781224]
x = x - J(x) \ f(x) = [-0.3131502414277405, 0.8309576722187335, 0.45983287537383793, 0.5903626801781223]
Notice that it takes about 5 steps to converge to 4 digits, after which the first four significant digits of `x` no longer change.
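One (hypothetical) way to count the steps explicitly is to rerun the iteration in a fresh variable and compare its 4th component at each step against the converged value `0.5903626801781224` printed above, stopping when they agree to within `0.5e-4` (i.e. 4 correct decimal digits):

# count Newton steps until x[4] has 4 correct decimal digits
xnew = [1.0, 2, 3, 4]
for i = 1:10
    xnew = xnew - J(xnew) \ f(xnew)
    if abs(xnew[4] - 0.5903626801781224) < 0.5e-4
        println("x[4] has 4 correct decimal digits after $i Newton steps")
        break
    end
end

This should report 5 steps here, matching the count above.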
# part (c): checks
@show f(x)
@show A*x[1:3]
@show x[4]*x[1:3]
@show x[1:3]'*x[1:3] # same as x[1]^2 + x[2]^2 + x[3]^2
f(x) = [1.1102230246251565e-16, 0.0, 0.0, -1.1102230246251565e-16]
A * x[1:3] = [-0.18487221582770685, 0.4905663984856252, 0.2714681687397116]
x[4] * x[1:3] = [-0.18487221582770696, 0.49056639848562517, 0.27146816873971147]
(x[1:3])' * x[1:3] = 0.9999999999999999
0.9999999999999999
Did the Newton iteration converge to a correct solution `x`? The code above checks:

- `A*x[1:3]` and `x[4]*x[1:3]` match closely (again to ≈15 digits).
- `x[1:3]'*x[1:3]` ≈ 1 (to ≈15 digits).

Therefore this looks like a correct solution.
In class, we showed how to differentiate some matrix-valued functions: $d(A^2) = A dA + dA A$, which is a linear operator acting on $dA$, and also $d(A^{-1}) = -A^{-1} dA A^{-1}$.
(a) Write $d(A^4)$ as a linear operator acting on $dA$.
(b) We can compute $f(A) = x^T A^{-1} y$ (a scalar-valued function of a square invertible matrix $A$) by solving only one linear system $z = A^{-1}y$ (i.e. solving $Az = y$) and then taking a dot product. Show that $df$ (also a scalar!), for any infinitesimal change $dA$, can be computed by solving only one additional linear system $\_\_\_\_\_\_ = \_\_\_\_\_\_$ (in addition to solving $Az = y$), and then taking some matrix/vector products. (This is called an "adjoint solve" by engineers or "backpropagation" by computer scientists.)
(c) Let $A = \begin{pmatrix} a_1 & a_3 \\ a_2 & a_4 \end{pmatrix}$ be a $2\times 2$ matrix. Let $\text{vec}(A) = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4\end{pmatrix} = \vec{a}$ be the vector formed by "stacking" the columns of $A$ (this is computed by `vec(A)` in Julia). Equivalently, $\vec{a}$ are the coefficients you get from expanding $A$ in the basis ______________. Let $dA = \begin{pmatrix} da_1 & da_3 \\ da_2 & da_4 \end{pmatrix}$. Show that $\text{vec}(A dA + dA A) = (\text{some matrix}) \text{vec}(dA)$: find the $4\times 4$ "some matrix" (which expresses the linear operator $A dA + dA A$ in this "vectorized" basis). Check your answer in Julia for an arbitrary choice of `A` and `dA`:
(a) Since $A^4 = (A^2)^2$ we can apply the product rule to $d(A^4) = d(A^2 \cdot A^2)$. That gives $d(A^2) A^2 + A^2 d(A^2)$. From class we know $d(A^2)$ as given above. Substituting, we get $(AdA + dA A) A^2 + A^2 (AdA + dA A) = \boxed{A \, dA \, A^2 + dA \, A^3 + A^3 \, dA + A^2 \, dA \, A}$. (Note that this only equals $4A^3\,dA$, like the scalar case, in the unlikely event that $dA$ and $A$ commute!)
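As a sanity check, here is a small finite-difference sketch (analogous to the check in problem 2, using an arbitrary $2\times 2$ matrix `M` and tiny perturbation `dM`):

# finite-difference check of the d(A⁴) formula
using LinearAlgebra
M  = [1.0 2; 3 4]
dM = [5.0 -6; 7 8] * 1e-8
lhs = (M + dM)^4 - M^4
rhs = M*dM*M^2 + dM*M^3 + M^3*dM + M^2*dM*M
norm(lhs - rhs) / norm(lhs)   # should be tiny (≈10⁻⁸ here), from the dropped O(dM²) terms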
(b) Using the formula from class and re-parenthesizing (i.e. using associativity) $$ df = x^T (-A^{-1}dA A^{-1})y = -(x^T A^{-1}) dA (A^{-1} y) = - \underbrace{(\underbrace{(A^{-1})^T}_{=(A^T)^{-1}} x)^T}_{w^T} dA \underbrace{(A^{-1} y)}_z \, . $$ Therefore, we need to solve the original system $Az = y$, and one additional linear system for $w = (A^{-1})^T x = (A^T)^{-1} x$: we solve $\boxed{A^T w = x}$. The key fact that we relied on here is $(A^{-1})^T =(A^T)^{-1}$, which we showed in class.
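Here is a small numerical check of this formula (a sketch with an arbitrary invertible matrix `B` and vectors `u`, `v` playing the roles of $A$, $x$, $y$):

# check: df from the adjoint-solve formula vs. an explicit finite difference
using LinearAlgebra
B = [3.0 1 0; 1 4 2; 0 2 5]   # an arbitrary invertible matrix
u = [1.0, 2, 3]               # plays the role of x
v = [4.0, 5, 6]               # plays the role of y
z = B \ v                     # the original solve   Bz = v
w = B' \ u                    # the adjoint solve    Bᵀw = u
dB = randn(3, 3) * 1e-8       # a tiny random perturbation
df_exact   = u' * ((B + dB) \ v) - u' * (B \ v)
df_adjoint = -w' * dB * z
df_exact, df_adjoint          # these two numbers should nearly match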
(c) By parameterizing $A$ this way, we are effectively writing $$ A = \begin{pmatrix} a_1 & a_3 \\ a_2 & a_4 \end{pmatrix} = \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}}_{A_1} a_1 + \underbrace{\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}}_{A_2} a_2 + \underbrace{\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}}_{A_3} a_3 + \underbrace{\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}}_{A_4} a_4 \, , $$ so $\vec{a}$ are the coefficients of $A$ in the basis $\boxed{A_1,A_2,A_3,A_4}$ from above.
Now, let us explicitly compute $$AdA + dAA = \begin{pmatrix} a_1 & a_3 \\ a_2 & a_4\end{pmatrix} \begin{pmatrix}da_1 & da_3 \\ da_2 & da_4 \end{pmatrix} + \begin{pmatrix}da_1 & da_3 \\ da_2 & da_4 \end{pmatrix}\begin{pmatrix} a_1 & a_3 \\ a_2 & a_4\end{pmatrix} = \begin{pmatrix} a_1 da_1 + a_3 da_2 + da_1 a_1 + da_3 a_2 & a_1 da_3 + a_3 da_4 + da_1a_3 + da_3 a_4 \\ a_2 da_1 + a_4 da_2 + da_2 a_1 + da_4 a_2 & a_2 da_3 + a_4 da_4 + da_2 a_3 + da_4 a_4\end{pmatrix}.$$ Vectorized this is $$\begin{pmatrix} a_1 da_1 + a_3 da_2 + da_1 a_1 + da_3 a_2 \\ a_2 da_1 + a_4 da_2 + da_2 a_1 + da_4 a_2 \\ a_1 da_3 + a_3 da_4 + da_1a_3 + da_3 a_4 \\ a_2 da_3 + a_4 da_4 + da_2 a_3 + da_4 a_4\end{pmatrix}.$$
Factoring out the $da_j$ terms, this is $$\mathrm{vec}(AdA + dAA ) = \boxed{\begin{pmatrix}2a_1 & a_3 & a_2 & 0 \\ a_2 & a_1 + a_4 & 0 & a_2 \\ a_3 & 0 & a_1 + a_4 & a_3 \\ 0 & a_3 & a_2 & 2a_4 \end{pmatrix}} \underbrace{\begin{pmatrix} da_1 \\ da_2 \\ da_3 \\ da_4 \end{pmatrix}}_{\mathrm{vec}(dA)} \, ,$$ where the boxed term is our desired matrix.
# check your answer for part (c):
A = [1 2
     3 4]
a = vec(A)
dA = [7  9
      6 -3]
some_matrix = [2*a[1]  a[3]       a[2]       0
               a[2]    a[1]+a[4]  0          a[2]
               a[3]    0          a[1]+a[4]  a[3]
               0       a[3]       a[2]       2*a[4]]  # your answer from (c)
vec(A*dA + dA*A) ≈ some_matrix * vec(dA) # should return "true"
true
(From Strang, section 4.4.)
(a) If $A$ has three orthogonal columns of lengths 1,2,3, what is $A^T A$?
(b) Give an example of a matrix $Q$ with orthonormal columns but $QQ^T \ne I$.
(c) Give an example of two orthogonal vectors that are not linearly independent.
(d) If we have a basis $a_1 = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, a_2, a_3$ of orthogonal vectors in $\mathbb{R}^3$ that are not normalized to length 1, and $$\underbrace{\begin{pmatrix} -1 \\ 3 \\ 2 \end{pmatrix}}_b = Ax = a_1 x_1 + a_2 x_2 + a_3 x_3\,,$$ then $x_1 = \_\_\_\_\_\_\_\_\_\_\_\_$.
(a) Let $A = \begin{pmatrix} | & | & | \\ a_1 & a_2 & a_3 \\ | & | & | \end{pmatrix} $. Then $A^TA = \begin{pmatrix} a_1^T a_1 & a_1^T a_2 & a_1^T a_3 \\ a_2^T a_1 & a_2^T a_2 & a_2^T a_3 \\ a_3^T a_1 & a_3^T a_2 & a_3^T a_3 \end{pmatrix}$. By the orthogonality of the vectors, the off diagonal elements are zero, and the diagonal entries are the squares of the lengths. So $\boxed{A^TA = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 9 \end{pmatrix}}$.
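For instance, with one arbitrary concrete choice of such an $A$:

# one concrete A with orthogonal columns of lengths 1, 2, 3:
a1 = [1, 1, 0] / √2    # length 1
a2 = [1, -1, 0] * √2   # length 2
a3 = [0, 0, 3]         # length 3
A = [a1 a2 a3]
A' * A                 # ≈ [1 0 0; 0 4 0; 0 0 9]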
(b) Take for example the rectangular matrix $$Q = \begin{pmatrix} 1 & 0 \\ 0 & 1\\ 0 & 0 \end{pmatrix}.$$ Then $$QQ^T = \begin{pmatrix} 1& 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} \neq I.$$ (We did an example like this in lecture.)
(c) The vector $v_1 = [0,0]$ is orthogonal to $v_2 = [1,0]$ because $v_1^Tv_2 = 0$. But any $\lambda \neq 0$ gives a nontrivial linear combination $\lambda v_1 + 0 \cdot v_2 = 0$, so $v_1$ and $v_2$ are not linearly independent. (Any set of vectors containing the zero vector is linearly dependent.)
(d) Multiply both sides of the equation on the left by $\vec{a}_1^T$. Then $$ \vec{a}_1^T \vec{b} = \vec{a}_1^T \vec{a}_1 x_1 + \vec{a}_1^T\vec{a}_2 x_2 + \vec{a}_1^T\vec{a}_3 x_3 = \Vert\vec{a}_1\Vert^2 x_1 + 0 + 0\,,$$ because $\vec{a}_1$ is orthogonal to $\vec{a}_2$ and $\vec{a}_3$. Plugging in numbers we get $1\cdot (-1) + 2\cdot3 + 3 \cdot 2 = (1^2 + 2^2 + 3^2) x_1$, so $14x_1 = 11$, or $\boxed{x_1 = \frac{11}{14}}$.
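As a quick numerical check (choosing, hypothetically, $\vec{a}_2$ and $\vec{a}_3$ to be any vectors orthogonal to $\vec{a}_1$ and to each other):

# check part (d): the first coefficient is a₁ᵀb / a₁ᵀa₁ = 11/14
using LinearAlgebra
a1 = [1, 2, 3]
a2 = [-2, 1, 0]          # orthogonal to a1
a3 = cross(a1, a2)       # = [-3, -6, 5], orthogonal to a1 and a2
b  = [-1, 3, 2]
x  = [a1 a2 a3] \ b
x[1], a1'b / (a1'a1)     # both ≈ 11/14 ≈ 0.7857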
(If you remember one thing about orthogonal bases, remember this: you can get basis coefficients simply by dot products if the basis vectors are orthogonal.)