Linear algebra basics
[5 points] Sensitivity Analysis in Linear Systems Consider a nonsingular matrix A \in \mathbb{R}^{n \times n} and a vector b \in \mathbb{R}^n. Suppose that due to measurement or computational errors, the vector b is perturbed to \tilde{b} = b + \delta b.
- Derive an upper bound for the relative error in the solution x of the system Ax = b in terms of the condition number \kappa(A) and the relative error in b.
- Provide a concrete example using a 2 \times 2 matrix where \kappa(A) is large (say, \geq 100500).
- Derive an upper bound for the relative error in the solution x of the system Ax = b in terms of the condition number \kappa(A) and the relative error in b.
[5 points] Effect of Diagonal Scaling on Rank Let A \in \mathbb{R}^{n \times n} be a matrix with rank r. Suppose D \in \mathbb{R}^{n \times n} is a diagonal matrix. Determine the rank of the product DA. Explain your reasoning.
[8 points] Unexpected SVD Compute the Singular Value Decomposition (SVD) of the following matrices:
- A_1 = \begin{bmatrix} 2 \\ 2 \\ 8 \end{bmatrix}
- A_2 = \begin{bmatrix} 0 & x \\ x & 0 \\ 0 & 0 \end{bmatrix}, where x is the sum of your birthdate numbers (day + month).
[10 points] Effect of normalization on rank Assume we have a set of data points x^{(i)}\in\mathbb{R}^{n},\,i=1,\dots,m, and decide to represent this data as a matrix X = \begin{pmatrix} | & & | \\ x^{(1)} & \dots & x^{(m)} \\ | & & | \\ \end{pmatrix} \in \mathbb{R}^{n \times m}.
We suppose that \text{rank}\,X = r.
In the problem below, we ask you to find the rank of some matrix M related to X. In particular, you need to find relation between \text{rank}\,X = r and \text{rank}\,M, e.g., that the rank of M is always larger/smaller than the rank of X or that \text{rank}\,M = \text{rank}\,X \big / 35. Please support your answer with legitimate arguments and make the answer as accurate as possible.
Note that border cases are possible depending on the structure of the matrix X. Make sure to cover them in your answer correctly.
In applied statistics and machine learning, data is often normalized. One particularly popular strategy is to subtract the estimated mean \mu and divide by the square root of the estimated variance \sigma^2. i.e. x \rightarrow (x - \mu) \big / \sigma. After the normalization, we get a new matrix \begin{split} Y &:= \begin{pmatrix} | & & | \\ y^{(1)} & \dots & y^{(m)} \\ | & & | \\ \end{pmatrix},\\ y^{(i)} &:= \frac{x^{(i)} - \frac{1}{m}\sum_{j=1}^{m} x^{(j)}}{\sigma}. \end{split} What is the rank of Y if \text{rank} \; X = r? Here \sigma is a vector and the division is element-wise. The reason for this is that different features might have different scales. Specifically: \sigma_i = \sqrt{\frac{1}{m}\sum_{j=1}^{m} \left(x_i^{(j)}\right)^2 - \left(\frac{1}{m}\sum_{j=1}^{m} x_i^{(j)}\right)^2}.
[20 points] Image Compression with Truncated SVD Explore image compression using Truncated Singular Value Decomposition (SVD). Understand how varying the number of singular values affects the quality of the compressed image. Implement a Python script to compress a grayscale image using Truncated SVD and visualize the compression quality.
- Truncated SVD: Decomposes an image A into U, S, and V matrices. The compressed image is reconstructed using a subset of singular values.
- Mathematical Representation:
A \approx U_k \Sigma_k V_k^T
- U_k and V_k are the first k columns of U and V, respectively.
- \Sigma_k is a diagonal matrix with the top k singular values.
- Relative Error: Measures the fidelity of the compressed image compared to the original. \text{Relative Error} = \frac{\| A - A_k \|}{\| A \|}
import matplotlib.pyplot as plt import matplotlib.animation as animation import numpy as np from skimage import io, color import requests from io import BytesIO def download_image(url): response = requests.get(url) img = io.imread(BytesIO(response.content)) return color.rgb2gray(img) # Convert to grayscale def update_plot(i, img_plot, error_plot, U, S, V, original_img, errors, ranks, ax1, ax2): # Adjust rank based on the frame index if i < 70: rank = i + 1 else: rank = 70 + (i - 69) * 10 reconstructed_img = ... # YOUR CODE HERE # Calculate relative error relative_error = ... # YOUR CODE HERE errors.append(relative_error) ranks.append(rank) # Update the image plot and title img_plot.set_data(reconstructed_img) ax1.set_title(f"Image compression with SVD\n Rank {rank}; Relative error {relative_error:.2f}") # Remove axis ticks and labels from the first subplot (ax1) ax1.set_xticks([]) ax1.set_yticks([]) # Update the error plot error_plot.set_data(ranks, errors) ax2.set_xlim(1, len(S)) ax2.grid(linestyle=":") ax2.set_ylim(1e-4, 0.5) ax2.set_ylabel('Relative Error') ax2.set_xlabel('Rank') ax2.set_title('Relative Error over Rank') ax2.semilogy() # Set xticks to show rank numbers ax2.set_xticks(range(1, len(S)+1, max(len(S)//10, 1))) # Adjust the step size as needed plt.tight_layout() return img_plot, error_plot def create_animation(image, filename='svd_animation.mp4'): U, S, V = np.linalg.svd(image, full_matrices=False) errors = [] ranks = [] fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(5, 8)) img_plot = ax1.imshow(image, cmap='gray', animated=True) error_plot, = ax2.plot([], [], 'r-', animated=True) # Initial empty plot for errors # Add watermark ax1.text(1, 1.02, '@fminxyz', transform=ax1.transAxes, color='gray', va='bottom', ha='right', fontsize=9) # Determine frames for the animation initial_frames = list(range(70)) # First 70 ranks subsequent_frames = list(range(70, len(S), 10)) # Every 10th rank after 70 frames = initial_frames + subsequent_frames ani = animation.FuncAnimation(fig, update_plot, frames=len(frames), fargs=(img_plot, error_plot, U, S, V, image, errors, ranks, ax1, ax2), interval=50, blit=True) ani.save(filename, writer='ffmpeg', fps=8, dpi=300) # URL of the image url = "" # Download the image and create the animation image = download_image(url) create_animation(image)
Convergence rates
[6 points] Determine (it means to prove the character of convergence if it is convergent) the convergence or divergence of a given sequences
- r_{k} = \frac{1}{\sqrt{k+5}}.
- r_{k} = 0.101^k.
- r_{k} = 0.101^{2^k}.
[8 points] Let the sequence \{r_k\} be defined by r_{k+1} = \begin{cases} \frac{1}{2}\,r_k, & \text{if } k \text{ is even}, \\ r_k^2, & \text{if } k \text{ is odd}, \end{cases} with initial value 0 < r_0 < 1. Prove that \{r_k\} converges to 0 and analyze its convergence rate. In your answer, determine whether the overall convergence is linear, superlinear, or quadratic.
[6 points] Determine the following sequence \{r_k\} by convergence rate (linear, sublinear, superlinear). In the case of superlinear convergence, determine whether there is quadratic convergence. r_k = \dfrac{1}{k!}
[8 points] Consider the recursive sequence defined by r_{k+1} = \lambda\,r_k + (1-\lambda)\,r_k^p,\quad k\ge0, where \lambda\in [0,1) and p>1. Which additional conditions on r_0 should be satisfied for the sequence to converge? Show that when \lambda>0 the sequence converges to 0 with a linear rate (with asymptotic constant \lambda), and when \lambda=0 determine the convergence rate in terms of p. In particular, for p=2 decide whether the convergence is quadratic.
Line search
[10 points] Consider a strongly convex quadratic function f: \mathbb{R}^n \rightarrow \mathbb{R}, and let us start from a point x_k \in \mathbb{R}^n moving in the direction of the antigradient -\nabla f(x_k), note that \nabla f(x_k)\neq 0. Show that the minimum of f along this direction as a function of the step size \alpha, for a decreasing function at x_k, satisfies Armijo’s condition for any c_1 in the range 0 \leq c_1 \leq \frac{1}{2}. Specifically, demonstrate that the following inequality holds at the optimal \alpha^*: \varphi(\alpha) = f(x_{k+1}) = f(x_k - \alpha \nabla f(x_k)) \leq f(x_k) - c_1 \alpha \|\nabla f(x_k)\|_2^2
Implementing and Testing Line Search Conditions in Gradient Descent [36 points] x_{k+1} = x_k - \alpha \nabla f(x_k) In this assignment, you will modify an existing Python code for gradient descent to include various line search conditions. You will test these modifications on two functions: a quadratic function and the Rosenbrock function. The main objectives are to understand how different line search strategies influence the convergence of the gradient descent algorithm and to compare their efficiencies based on the number of function evaluations.
import numpy as np import matplotlib.pyplot as plt from scipy.optimize import minimize_scalar np.random.seed(214) # Define the quadratic function and its gradient def quadratic_function(x, A, b): return 0.5 * np.dot(x.T, np.dot(A, x)) - np.dot(b.T, x) def grad_quadratic(x, A, b): return np.dot(A, x) - b # Generate a 2D quadratic problem with a specified condition number def generate_quadratic_problem(cond_number): # Random symmetric matrix M = np.random.randn(2, 2) M = np.dot(M, M.T) # Ensure the matrix has the desired condition number U, s, V = np.linalg.svd(M) s = np.linspace(cond_number, 1, len(s)) # Spread the singular values A = np.dot(U, np.dot(np.diag(s), V)) # Random b b = np.random.randn(2) return A, b # Gradient descent function def gradient_descent(start_point, A, b, stepsize_func, max_iter=100): x = start_point.copy() trajectory = [x.copy()] for i in range(max_iter): grad = grad_quadratic(x, A, b) step_size = stepsize_func(x, grad) x -= step_size * grad trajectory.append(x.copy()) return np.array(trajectory) # Backtracking line search strategy using scipy def backtracking_line_search(x, grad, A, b, alpha=0.3, beta=0.8): def objective(t): return quadratic_function(x - t * grad, A, b) res = minimize_scalar(objective, method='golden') return res.x # Generate ill-posed problem cond_number = 30 A, b = generate_quadratic_problem(cond_number) # Starting point start_point = np.array([1.0, 1.8]) # Perform gradient descent with both strategies trajectory_fixed = gradient_descent(start_point, A, b, lambda x, g: 5e-2) trajectory_backtracking = gradient_descent(start_point, A, b, lambda x, g: backtracking_line_search(x, g, A, b)) # Plot the trajectories on a contour plot x1, x2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400)) Z = np.array([quadratic_function(np.array([x, y]), A, b) for x, y in zip(x1.flatten(), x2.flatten())]).reshape(x1.shape) plt.figure(figsize=(10, 8)) plt.contour(x1, x2, Z, levels=50, cmap='viridis') plt.plot(trajectory_fixed[:, 0], trajectory_fixed[:, 1], 'o-', label='Fixed Step Size') plt.plot(trajectory_backtracking[:, 0], trajectory_backtracking[:, 1], 'o-', label='Backtracking Line Search') # Add markers for start and optimal points plt.plot(start_point[0], start_point[1], 'ro', label='Start Point') optimal_point = np.linalg.solve(A, b) plt.plot(optimal_point[0], optimal_point[1], 'y*', markersize=15, label='Optimal Point') plt.legend() plt.title('Gradient Descent Trajectories on Quadratic Function') plt.xlabel('x1') plt.ylabel('x2') plt.savefig("linesearch.svg") plt.show()The code above plots this Start by reviewing the provided Python code. This code implements gradient descent with a fixed step size and a backtracking line search on a quadratic function. Familiarize yourself with how the gradient descent function and the step size strategies are implemented.
[10/36 points] Modify the gradient descent function to include the following line search conditions:
- Dichotomy
- Sufficient Decrease Condition
- Wolfe Condition
- Polyak step size
\alpha_k = \frac{f(x_k) - f^*}{\|\nabla f(x_k)\|_2^2},
where f^* is the optimal value of the function. It seems strange to use the optimal value of the function in the step size, but there are options to estimate it even without knowing the optimal value.
- Sign Gradient Method: \alpha_k = \frac{1}{\|\nabla f(x_k)\|_2}, Test your modified gradient descent algorithm with the implemented step size search conditions on the provided quadratic function. Plot the trajectories over iterations for each condition. Choose and specify hyperparameters for inexact line search conditions. Choose and specify the termination criterion. Start from the point x_0 = (-1, 2)^T.
[8/36 points] Compare these 7 methods from the budget perspective. Plot the graph of function value from the number of function evaluations for each method on the same graph.
[10/36 points] Plot trajectory for another function with the same set of methods f(x_1, x_2) = 10(x_2 − x_1^2)^2 + (x_1 − 1)^2 with x_0 = (-1, 2)^T. You might need to adjust hyperparameters.
[8/36 points] Plot the same function value from the number of function calls for this experiment.
Matrix calculus
[6 points] Find the gradient \nabla f(x) and hessian f^{\prime\prime}(x), if f(x) = \frac{1}{2}\Vert A - xx^T\Vert ^2_F, A \in \mathbb{S}^n
[6 points] Find the gradient \nabla f(x) and hessian f''(x), if f(x) = \dfrac{1}{2} \Vert Ax - b\Vert^2_2.
[8 points] Find the gradient \nabla f(x) and hessian f''(x), if f(x) = \frac1m \sum\limits_{i=1}^m \log \left( 1 + \exp(a_i^{T}x) \right) + \frac{\mu}{2}\Vert x\Vert _2^2, \; a_i, x \in \mathbb R^n, \; \mu>0
[8 points] Compute the gradient \nabla_A f(A) of the trace of the matrix exponential function f(A) = \text{tr}(e^A) with respect to A. Hint: Use the definition of the matrix exponential. Use the definition of the differential df = f(A + dA) - f(A) + o(\Vert dA \Vert) with the limit \Vert dA \Vert \to 0.
[20 points] Principal Component Analysis through gradient calculation. Let there be a dataset \{x_i\}_{i=1}^N, x_i \in \mathbb{R}^D, which we want to transform into a dataset of reduced dimensionality d using projection onto a linear subspace defined by the matrix P \in \mathbb{R}^{D \times d}. The orthogonal projection of a vector x onto this subspace can be computed as P(P^TP)^{-1}P^Tx. To find the optimal matrix P, consider the following optimization problem: F(P) = \sum_{i=1}^N \|x_i - P(P^TP)^{-1}P^Tx_i\|^2 = N \cdot \text{tr}\left((I - P(P^TP)^{-1}P^T)^2 S\right) \to \min_{P \in \mathbb{R}^{D \times d}}, where S = \frac{1}{N} \sum_{i=1}^N x_i x_i^T is the sample covariance matrix for the normalized dataset.
Find the gradient \nabla_P F(P), calculated for an arbitrary matrix P with orthogonal columns, i.e., P : P^T P = I.
Hint: When calculating the differential dF(P), first treat P as an arbitrary matrix, and then use the orthogonality property of the columns of P in the resulting expression.
Consider the eigendecomposition of the matrix S: S = Q \Lambda Q^T, where \Lambda is a diagonal matrix with eigenvalues on the diagonal, and Q = [q_1 | q_2 | \ldots | q_D] \in \mathbb{R}^{D \times D} is an orthogonal matrix consisting of eigenvectors q_i as columns. Prove the following:
- The gradient \nabla_P F(P) equals zero for any matrix P composed of d distinct eigenvectors q_i as its columns.
- The minimum value of F(P) is achieved for the matrix P composed of eigenvectors q_i corresponding to the largest eigenvalues of S.
Automatic differentiation and jax
Benchmarking Hessian-Vector Product (HVP) Computation in a Neural Network via JAX [22 points]
You are given a simple neural network model (an MLP with several hidden layers using a nonlinearity such as GELU). The model’s parameters are defined by the weights of its layers. Your task is to compare different approaches for computing the Hessian-vector product (HVP) with respect to the model’s loss and to study how the computation time scales as the model grows in size.
Model and Loss Definition: [2/22 points] Here is the code for the model and loss definition. Write a method
get_params()that returns the flattened vector of all model weights.import jax import jax.numpy as jnp import time import matplotlib.pyplot as plt import numpy as np import pandas as pd from tqdm.auto import tqdm from jax.nn import gelu # Определение MLP модели class MLP: def __init__(self, key, layer_sizes): self.layer_sizes = layer_sizes keys = jax.random.split(key, len(layer_sizes) - 1) self.weights = [ jax.random.normal(k, (layer_sizes[i], layer_sizes[i + 1])) for i, k in enumerate(keys) ] def forward(self, x): for w in self.weights[:-1]: x = gelu(jnp.dot(x, w)) return jnp.dot(x, self.weights[-1]) def get_params(self): ### YOUR CODE HERE ### return NoneHessian and HVP Implementations: [2/22 points] Write a function
# Функция для вычисления Гессиана def calculate_hessian(model, params): def loss_fn(p): x = jnp.ones((1, model.layer_sizes[0])) # Заглушка входа return jnp.sum(model.forward(x)) ### YOUR CODE HERE ### #hessian_fn = return hessian_fn(params)that computes the full Hessian H of the loss function with respect to the model parameters using JAX’s automatic differentiation.
Naive HVP via Full Hessian: [2/22 points] Write a function
naive_hvp(hessian, vector)that, given a precomputed Hessian H and a vector v (of the same shape as the parameters), computes the Hessian-vector product using a straightforward matrix-vector multiplication.Efficient HVP Using Autograd: [4/22 points] Write a function
python def hvp(f, x, v): return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)that directly computes the HVP without explicitly forming the full Hessian. This leverages the reverse-mode differentiation capabilities of JAX.Timing Experiment: Consider a family of models with an increasing number of hidden layers.
ns = np.linspace(50, 1000, 15, dtype=int) # The number of hidden layers num_runs = 10 # The number of runs for averagingFor each model configuration:
- Generate the model and extract its parameter vector.
- Generate a random vector v of the same dimension as the parameters.
- Measure (do not forget to use
.block_until_ready()to ensure accurate timing and proper synchronization) the following:- Combined Time (Full Hessian + Naive HVP): The total time required to compute the full Hessian and then perform the matrix-vector multiplication.
- Naive HVP Time (Excluding Hessian Computation): The time required to perform the matrix-vector multiplication given a precomputed Hessian.
- Efficient HVP Time: The time required to compute the HVP using the autograd-based function.
- Repeat each timing measurement for a fixed number of runs (e.g., 10 runs) and record both the mean and standard deviation of the computation times.
Visualization and Analysis: [12/22 points]
- Plot the timing results for the three methods on the same graph. For each method, display error bars corresponding to the standard deviation over the runs.
- Label the axes clearly (e.g., “Number of Layers” vs. “Computation Time (seconds)”) and include a legend indicating which curve corresponds to which method.
- Analyze the scaling behavior. Try analytically derive the scaling of the methods and compare it with the experimental results.
[15 points] We can use automatic differentiation not only to calculate necessary gradients but also for tuning hyperparameters of the algorithm like learning rate in gradient descent (with gradient descent 🤯). Suppose, we have the following function f(x) = \frac{1}{2}\Vert x\Vert^2, select a random point x_0 \in \mathbb{B}^{1000} = \{0 \leq x_i \leq 1 \mid \forall i\}. Consider 10 steps of the gradient descent starting from the point x_0: x_{k+1} = x_k - \alpha_k \nabla f(x_k) Your goal in this problem is to write the function, that takes 10 scalar values \alpha_i and return the result of the gradient descent on function L = f(x_{10}). And optimize this function using gradient descent on \alpha \in \mathbb{R}^{10}. Suppose that each of 10 components of \alpha is uniformly distributed on [0; 0.1]. \alpha_{k+1} = \alpha_k - \beta \frac{\partial L}{\partial \alpha} Choose any constant \beta and the number of steps you need. Describe the obtained results. How would you understand, that the obtained schedule (\alpha \in \mathbb{R}^{10}) becomes better than it was at the start? How do you check numerically local optimality in this problem?