Linear algebra basics

  1. [5 points] Sensitivity Analysis in Linear Systems Consider a nonsingular matrix A \in \mathbb{R}^{n \times n} and a vector b \in \mathbb{R}^n. Suppose that due to measurement or computational errors, the vector b is perturbed to \tilde{b} = b + \delta b.

    1. Derive an upper bound for the relative error in the solution x of the system Ax = b in terms of the condition number \kappa(A) and the relative error in b.
    2. Provide a concrete example using a 2 \times 2 matrix where \kappa(A) is large (say, \geq 100500).
  2. [5 points] Effect of Diagonal Scaling on Rank Let A \in \mathbb{R}^{n \times n} be a matrix with rank r. Suppose D \in \mathbb{R}^{n \times n} is a diagonal matrix. Determine the rank of the product DA. Explain your reasoning.

  3. [8 points] Unexpected SVD Compute the Singular Value Decomposition (SVD) of the following matrices:

    • A_1 = \begin{bmatrix} 2 \\ 2 \\ 8 \end{bmatrix}
    • A_2 = \begin{bmatrix} 0 & x \\ x & 0 \\ 0 & 0 \end{bmatrix}, where x is the sum of your birthdate numbers (day + month).
  4. [10 points] Effect of normalization on rank Assume we have a set of data points x^{(i)}\in\mathbb{R}^{n},\,i=1,\dots,m, and decide to represent this data as a matrix X = \begin{pmatrix} | & & | \\ x^{(1)} & \dots & x^{(m)} \\ | & & | \\ \end{pmatrix} \in \mathbb{R}^{n \times m}.

    We suppose that \text{rank}\,X = r.

    In the problem below, we ask you to find the rank of some matrix M related to X. In particular, you need to find relation between \text{rank}\,X = r and \text{rank}\,M, e.g., that the rank of M is always larger/smaller than the rank of X or that \text{rank}\,M = \text{rank}\,X \big / 35. Please support your answer with legitimate arguments and make the answer as accurate as possible.

    Note that border cases are possible depending on the structure of the matrix X. Make sure to cover them in your answer correctly.

    In applied statistics and machine learning, data is often normalized. One particularly popular strategy is to subtract the estimated mean \mu and divide by the square root of the estimated variance \sigma^2. i.e. x \rightarrow (x - \mu) \big / \sigma. After the normalization, we get a new matrix \begin{split} Y &:= \begin{pmatrix} | & & | \\ y^{(1)} & \dots & y^{(m)} \\ | & & | \\ \end{pmatrix},\\ y^{(i)} &:= \frac{x^{(i)} - \frac{1}{m}\sum_{j=1}^{m} x^{(j)}}{\sigma}. \end{split} What is the rank of Y if \text{rank} \; X = r? Here \sigma is a vector and the division is element-wise. The reason for this is that different features might have different scales. Specifically: \sigma_i = \sqrt{\frac{1}{m}\sum_{j=1}^{m} \left(x_i^{(j)}\right)^2 - \left(\frac{1}{m}\sum_{j=1}^{m} x_i^{(j)}\right)^2}.

  5. [20 points] Image Compression with Truncated SVD Explore image compression using Truncated Singular Value Decomposition (SVD). Understand how varying the number of singular values affects the quality of the compressed image. Implement a Python script to compress a grayscale image using Truncated SVD and visualize the compression quality.

    • Truncated SVD: Decomposes an image A into U, S, and V matrices. The compressed image is reconstructed using a subset of singular values.
    • Mathematical Representation: A \approx U_k \Sigma_k V_k^T
      • U_k and V_k are the first k columns of U and V, respectively.
      • \Sigma_k is a diagonal matrix with the top k singular values.
      • Relative Error: Measures the fidelity of the compressed image compared to the original. \text{Relative Error} = \frac{\| A - A_k \|}{\| A \|}
    import matplotlib.pyplot as plt
    import matplotlib.animation as animation
    import numpy as np
    from skimage import io, color
    import requests
    from io import BytesIO
    
    def download_image(url):
        response = requests.get(url)
        img = io.imread(BytesIO(response.content))
        return color.rgb2gray(img)  # Convert to grayscale
    
    def update_plot(i, img_plot, error_plot, U, S, V, original_img, errors, ranks, ax1, ax2):
        # Adjust rank based on the frame index
        if i < 70:
            rank = i + 1
        else:
            rank = 70 + (i - 69) * 10
    
        reconstructed_img = ... # YOUR CODE HERE 
    
        # Calculate relative error
        relative_error = ... # YOUR CODE HERE
        errors.append(relative_error)
        ranks.append(rank)
    
        # Update the image plot and title
        img_plot.set_data(reconstructed_img)
        ax1.set_title(f"Image compression with SVD\n Rank {rank}; Relative error {relative_error:.2f}")
    
        # Remove axis ticks and labels from the first subplot (ax1)
        ax1.set_xticks([])
        ax1.set_yticks([])
    
        # Update the error plot
        error_plot.set_data(ranks, errors)
        ax2.set_xlim(1, len(S))
        ax2.grid(linestyle=":")
        ax2.set_ylim(1e-4, 0.5)
        ax2.set_ylabel('Relative Error')
        ax2.set_xlabel('Rank')
        ax2.set_title('Relative Error over Rank')
        ax2.semilogy()
    
        # Set xticks to show rank numbers
        ax2.set_xticks(range(1, len(S)+1, max(len(S)//10, 1)))  # Adjust the step size as needed
        plt.tight_layout()
    
        return img_plot, error_plot
    
    
    def create_animation(image, filename='svd_animation.mp4'):
        U, S, V = np.linalg.svd(image, full_matrices=False)
        errors = []
        ranks = []
    
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(5, 8))
        img_plot = ax1.imshow(image, cmap='gray', animated=True)
        error_plot, = ax2.plot([], [], 'r-', animated=True)  # Initial empty plot for errors
    
        # Add watermark
        ax1.text(1, 1.02, '@fminxyz', transform=ax1.transAxes, color='gray', va='bottom', ha='right', fontsize=9)
    
        # Determine frames for the animation
        initial_frames = list(range(70))  # First 70 ranks
        subsequent_frames = list(range(70, len(S), 10))  # Every 10th rank after 70
        frames = initial_frames + subsequent_frames
    
        ani = animation.FuncAnimation(fig, update_plot, frames=len(frames), fargs=(img_plot, error_plot, U, S, V, image, errors, ranks, ax1, ax2), interval=50, blit=True)
        ani.save(filename, writer='ffmpeg', fps=8, dpi=300)
    
        # URL of the image
        url = ""
    
        # Download the image and create the animation
        image = download_image(url)
        create_animation(image)

Convergence rates

  1. [6 points] Determine (it means to prove the character of convergence if it is convergent) the convergence or divergence of a given sequences

    • r_{k} = \frac{1}{\sqrt{k+5}}.
    • r_{k} = 0.101^k.
    • r_{k} = 0.101^{2^k}.
  2. [8 points] Let the sequence \{r_k\} be defined by r_{k+1} = \begin{cases} \frac{1}{2}\,r_k, & \text{if } k \text{ is even}, \\ r_k^2, & \text{if } k \text{ is odd}, \end{cases} with initial value 0 < r_0 < 1. Prove that \{r_k\} converges to 0 and analyze its convergence rate. In your answer, determine whether the overall convergence is linear, superlinear, or quadratic.

  3. [6 points] Determine the following sequence \{r_k\} by convergence rate (linear, sublinear, superlinear). In the case of superlinear convergence, determine whether there is quadratic convergence. r_k = \dfrac{1}{k!}

  4. [8 points] Consider the recursive sequence defined by r_{k+1} = \lambda\,r_k + (1-\lambda)\,r_k^p,\quad k\ge0, where \lambda\in [0,1) and p>1. Which additional conditions on r_0 should be satisfied for the sequence to converge? Show that when \lambda>0 the sequence converges to 0 with a linear rate (with asymptotic constant \lambda), and when \lambda=0 determine the convergence rate in terms of p. In particular, for p=2 decide whether the convergence is quadratic.

Matrix calculus

  1. [6 points] Find the gradient \nabla f(x) and hessian f^{\prime\prime}(x), if f(x) = \frac{1}{2}\Vert A - xx^T\Vert ^2_F, A \in \mathbb{S}^n

  2. [6 points] Find the gradient \nabla f(x) and hessian f''(x), if f(x) = \dfrac{1}{2} \Vert Ax - b\Vert^2_2.

  3. [8 points] Find the gradient \nabla f(x) and hessian f''(x), if f(x) = \frac1m \sum\limits_{i=1}^m \log \left( 1 + \exp(a_i^{T}x) \right) + \frac{\mu}{2}\Vert x\Vert _2^2, \; a_i, x \in \mathbb R^n, \; \mu>0

  4. [8 points] Compute the gradient \nabla_A f(A) of the trace of the matrix exponential function f(A) = \text{tr}(e^A) with respect to A. Hint: Use the definition of the matrix exponential. Use the definition of the differential df = f(A + dA) - f(A) + o(\Vert dA \Vert) with the limit \Vert dA \Vert \to 0.

  5. [20 points] Principal Component Analysis through gradient calculation. Let there be a dataset \{x_i\}_{i=1}^N, x_i \in \mathbb{R}^D, which we want to transform into a dataset of reduced dimensionality d using projection onto a linear subspace defined by the matrix P \in \mathbb{R}^{D \times d}. The orthogonal projection of a vector x onto this subspace can be computed as P(P^TP)^{-1}P^Tx. To find the optimal matrix P, consider the following optimization problem: F(P) = \sum_{i=1}^N \|x_i - P(P^TP)^{-1}P^Tx_i\|^2 = N \cdot \text{tr}\left((I - P(P^TP)^{-1}P^T)^2 S\right) \to \min_{P \in \mathbb{R}^{D \times d}}, where S = \frac{1}{N} \sum_{i=1}^N x_i x_i^T is the sample covariance matrix for the normalized dataset.

    1. Find the gradient \nabla_P F(P), calculated for an arbitrary matrix P with orthogonal columns, i.e., P : P^T P = I.

      Hint: When calculating the differential dF(P), first treat P as an arbitrary matrix, and then use the orthogonality property of the columns of P in the resulting expression.

    2. Consider the eigendecomposition of the matrix S: S = Q \Lambda Q^T, where \Lambda is a diagonal matrix with eigenvalues on the diagonal, and Q = [q_1 | q_2 | \ldots | q_D] \in \mathbb{R}^{D \times D} is an orthogonal matrix consisting of eigenvectors q_i as columns. Prove the following:

      1. The gradient \nabla_P F(P) equals zero for any matrix P composed of d distinct eigenvectors q_i as its columns.
      2. The minimum value of F(P) is achieved for the matrix P composed of eigenvectors q_i corresponding to the largest eigenvalues of S.

Automatic differentiation and jax

  1. Benchmarking Hessian-Vector Product (HVP) Computation in a Neural Network via JAX [22 points]

    You are given a simple neural network model (an MLP with several hidden layers using a nonlinearity such as GELU). The model’s parameters are defined by the weights of its layers. Your task is to compare different approaches for computing the Hessian-vector product (HVP) with respect to the model’s loss and to study how the computation time scales as the model grows in size.

    Model and Loss Definition: [2/22 points] Here is the code for the model and loss definition. Write a method get_params() that returns the flattened vector of all model weights.

    import jax
    import jax.numpy as jnp
    import time
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from tqdm.auto import tqdm
    from jax.nn import gelu
    
    # Определение MLP модели
    class MLP:
        def __init__(self, key, layer_sizes):
            self.layer_sizes = layer_sizes
            keys = jax.random.split(key, len(layer_sizes) - 1)
            self.weights = [
                jax.random.normal(k, (layer_sizes[i], layer_sizes[i + 1]))
                for i, k in enumerate(keys)
                ]
    
        def forward(self, x):
            for w in self.weights[:-1]:
                x = gelu(jnp.dot(x, w))
            return jnp.dot(x, self.weights[-1])
    
        def get_params(self):
            ### YOUR CODE HERE ###
            return None

    Hessian and HVP Implementations: [2/22 points] Write a function

    # Функция для вычисления Гессиана
    def calculate_hessian(model, params):
        def loss_fn(p):
            x = jnp.ones((1, model.layer_sizes[0]))  # Заглушка входа
            return jnp.sum(model.forward(x))
    
        ### YOUR CODE HERE ###
        #hessian_fn =           
        return hessian_fn(params)

    that computes the full Hessian H of the loss function with respect to the model parameters using JAX’s automatic differentiation.

    Naive HVP via Full Hessian: [2/22 points] Write a function naive_hvp(hessian, vector) that, given a precomputed Hessian H and a vector v (of the same shape as the parameters), computes the Hessian-vector product using a straightforward matrix-vector multiplication.

    Efficient HVP Using Autograd: [4/22 points] Write a function python def hvp(f, x, v): return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x) that directly computes the HVP without explicitly forming the full Hessian. This leverages the reverse-mode differentiation capabilities of JAX.

    Timing Experiment: Consider a family of models with an increasing number of hidden layers.

    ns = np.linspace(50, 1000, 15, dtype=int)  # The number of hidden layers
    num_runs = 10  # The number of runs for averaging

    For each model configuration:

    • Generate the model and extract its parameter vector.
    • Generate a random vector v of the same dimension as the parameters.
    • Measure (do not forget to use .block_until_ready() to ensure accurate timing and proper synchronization) the following:
      1. Combined Time (Full Hessian + Naive HVP): The total time required to compute the full Hessian and then perform the matrix-vector multiplication.
      2. Naive HVP Time (Excluding Hessian Computation): The time required to perform the matrix-vector multiplication given a precomputed Hessian.
      3. Efficient HVP Time: The time required to compute the HVP using the autograd-based function.
    • Repeat each timing measurement for a fixed number of runs (e.g., 10 runs) and record both the mean and standard deviation of the computation times.

    Visualization and Analysis: [12/22 points]

    • Plot the timing results for the three methods on the same graph. For each method, display error bars corresponding to the standard deviation over the runs.
    • Label the axes clearly (e.g., “Number of Layers” vs. “Computation Time (seconds)”) and include a legend indicating which curve corresponds to which method.
    • Analyze the scaling behavior. Try analytically derive the scaling of the methods and compare it with the experimental results.
  2. [15 points] We can use automatic differentiation not only to calculate necessary gradients but also for tuning hyperparameters of the algorithm like learning rate in gradient descent (with gradient descent 🤯). Suppose, we have the following function f(x) = \frac{1}{2}\Vert x\Vert^2, select a random point x_0 \in \mathbb{B}^{1000} = \{0 \leq x_i \leq 1 \mid \forall i\}. Consider 10 steps of the gradient descent starting from the point x_0: x_{k+1} = x_k - \alpha_k \nabla f(x_k) Your goal in this problem is to write the function, that takes 10 scalar values \alpha_i and return the result of the gradient descent on function L = f(x_{10}). And optimize this function using gradient descent on \alpha \in \mathbb{R}^{10}. Suppose that each of 10 components of \alpha is uniformly distributed on [0; 0.1]. \alpha_{k+1} = \alpha_k - \beta \frac{\partial L}{\partial \alpha} Choose any constant \beta and the number of steps you need. Describe the obtained results. How would you understand, that the obtained schedule (\alpha \in \mathbb{R}^{10}) becomes better than it was at the start? How do you check numerically local optimality in this problem?