
Numerical Computation and Optimization

Numerical Computation

Most of the algorithms discussed in courses are general methods that work in theory, but making them actually work on a computer is difficult - numerical computation is all about how to get these methods to run efficiently and reliably on a machine, and what's typically done in industry

A good example is linear regression - a fair amount of theory is devoted to interpreting the coefficients and understanding the model, but when it comes to actually implementing it, there are many practical considerations such as numerical stability, convergence, and optimization technique that need to be taken into account before you can solve it for billions of examples (also, none of the underlying assumptions ever hold in real life)

Trying to shove a matrix with a billion rows and hundreds of thousands of features into a single machine to solve linear regression analytically is a recipe for disaster. Instead, techniques like mini-batch gradient descent, distributed computing, and specialized hardware (like GPUs) are often employed to make the problem tractable

To make things even worse, real numbers on machines only have a finite-precision representation (e.g., float32, float64), which introduces some instability on its own, and multiplying these approximate numbers can quickly lead to underflow or overflow, vanishing or exploding gradients, and other numeric issues that can arise during optimization, training, and inference

Overflow and Underflow

Overflow and underflow are two common issues that can arise during numerical computation, particularly when dealing with floating-point numbers. We can't represent an infinite number of real numbers with a finite number of bits, so there are approximation errors for almost all real numbers - and the error compounds as more arithmetic operations are performed, especially multiplication and division

  • Overflow occurs when a calculation produces a result that is larger than the maximum representable value for a given data type. For example, if you try to compute the exponential of a large number (e.g., $e^{1000}$), the result exceeds the maximum value that can be represented by a float64, leading to an overflow
  • Underflow, on the other hand, happens when a calculation produces a result that is smaller than the smallest representable positive value (but not actually zero), which gets rounded down to 0. One particular pitfall is then dividing by this 0 instead of by the small real number it approximates. This often occurs when dealing with very small numbers, such as probabilities - if you multiply a series of small probabilities together, the result may become so small that it is rounded down to zero, leading to underflow
    • Even just taking the inverse, ${1 \over e^{1000}}$, can lead to underflow

Representing a number that's over- or under-flowed can lead to inf, NaN, or 0 values in computations, which can propagate through calculations and lead to incorrect results or failures in algorithms that then require careful debugging to figure out where the bad approximation on a random Tuesday occurred

These look like straightforward problems in isolation, but in practice they can lead to significant inaccuracies, especially in iterative algorithms like gradient descent where small errors accumulate over time. In gradient descent you're multiplying small learning rates with small gradients, which can lead to underflow if not handled properly, while adding up large gradients across thousands of iterations can lead to overflow
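As a minimal NumPy sketch of both failure modes (the specific values are just illustrative):

```python
import numpy as np

# Overflow: exp of a large number exceeds float64's max (~1.8e308)
print(np.exp(1000.0))         # inf, plus a RuntimeWarning about overflow

# Underflow: exp of a very negative number rounds down to 0
print(np.exp(-1000.0))        # 0.0

# Products of many small probabilities underflow to 0...
probs = np.full(1000, 0.01)
print(np.prod(probs))         # 0.0

# ...so in practice we sum log-probabilities instead
print(np.sum(np.log(probs)))  # ~ -4605.2, perfectly representable
```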

A common example is the Softmax Function which is used to predict probabilities associated with a multinoulli distribution. The softmax function involves exponentiating the input values, which can lead to overflow for large input values. To mitigate this, a common technique is to subtract the maximum input value from all input values, $\bold{z} = \bold{x} - \max_i(\bold{x}_i)$, before applying the softmax function. This helps to keep the exponentiated values within a manageable range and prevents overflow, while also preserving the relative differences between the input values

  • This brings the largest item to 0, which removes the risk of overflow when exponentiating, since $e^0 = 1$ is the largest value now
  • This also ensures at least one term in the denominator is 1, preventing division by zero
  • Underflow is still an issue for Softmax, specifically when all input values are very negative and you then run $\log$ on the output probabilities - this can be mitigated by using the log-sum-exp trick, which is a numerically stable way to compute the logarithm of a sum of exponentials (both tricks are sketched after this list)
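A minimal sketch of a stabilized softmax and log-softmax, assuming NumPy and illustrative inputs:

```python
import numpy as np

def softmax(x):
    # Subtract the max so the largest exponent is e^0 = 1 (no overflow)
    # and the denominator is at least 1 (no division by zero)
    z = x - np.max(x)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def log_softmax(x):
    # log-sum-exp trick: log(sum(exp(x))) = max(x) + log(sum(exp(x - max(x))))
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax(x))      # ~[0.09, 0.24, 0.67]; a naive exp(x) would overflow to inf
print(log_softmax(x))  # stable log-probabilities, no log(0) even for very negative inputs
```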

The above is one example of an extremely common function, and pretty much each function in a low-level library has to deal with these numerical stability issues in some way, but common folk don't necessarily have to worry about it when using high-level libraries like TensorFlow or PyTorch since these libraries have already implemented these stability techniques under the hood

Poor Conditioning

Conditioning refers to how sensitive a function or algorithm is to small changes in its input. In numerical computation, poor conditioning can lead to significant errors in the output, even when the input changes only slightly. This is particularly problematic in optimization algorithms, where small perturbations in the input can lead to large changes in the output, making it difficult to converge to a solution

Obviously numerical precision plays a big role here, as small changes to an approximation of a real number can lead to drastically different results, so that it seems almost non-deterministic when running the same algorithm multiple times with slightly different initial conditions or random seeds

Consider the function $f(\bold{x}) = \bold{A}^{-1} \bold{x}$, where $\bold{A} \in \mathbb{R}^{n \times n}$ is a matrix. If $\bold{A}$ has an eigenvalue decomposition, its condition number is defined as the ratio of the largest eigenvalue magnitude to the smallest, $\kappa(\bold{A}) = {|\lambda_{max}| \over |\lambda_{min}|}$. If the condition number is large, it indicates that the function is poorly conditioned, meaning that small changes in the input can lead to large changes in the output. This sensitivity is a property of the matrix itself, not of rounding errors or numerical precision - it's something you simply have to deal with.

The inverse applied to an input, $\bold{A}^{-1} \bold{x}$, is a frequently used operation throughout numerical computation, and if $\bold{A}$ is poorly conditioned, then small changes in $\bold{x}$ can lead to large changes in the output. This can make it difficult to solve systems of linear equations or perform other operations that involve matrix inversion
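A small NumPy sketch with an illustrative, deliberately ill-conditioned diagonal matrix:

```python
import numpy as np

# A poorly conditioned 2x2 matrix: eigenvalues of wildly different magnitude
A = np.array([[1.0, 0.0],
              [0.0, 1e-8]])
print(np.linalg.cond(A))   # ~1e8

x = np.array([1.0, 1.0])
x_perturbed = x + np.array([0.0, 1e-6])   # tiny change in the input

# Solving A z = x is equivalent to computing A^{-1} x
z = np.linalg.solve(A, x)
z_perturbed = np.linalg.solve(A, x_perturbed)
print(z)            # [1.0, 1e8]
print(z_perturbed)  # [1.0, ~1.0001e8] -- a 1e-6 input change moved the output by ~100
```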

Gradient Based Optimization

Most ML / DL algorithms involve optimization, and most of them involve gradient descent or some variant of it

Optimization in this sense means minimizing or maximizing some objective function $f(\bold{x})$ with respect to its parameters $\bold{x}$ - this objective function is also known as the cost function, loss function, energy function, or error function depending on the context, but they all refer to the same thing - a function that quantifies how well a model is performing, and the goal of optimization is to find the parameters that minimize (or maximize) this function

The value that minimizes or maximizes this function is known as the Optimum and is referred to as $\bold{x}^*$ with a * superscript

$$\bold{x}^* = \arg \min_{\bold{x}} f(\bold{x})$$

Gradient descent is an iterative optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. So of course in practice, as you're moving around numbers typically between $[-1, 1]$ with small learning rates (e.g., $0.001$), you can easily run into underflow issues where the updates become so small that they get rounded down to zero, effectively halting the optimization process - this is known as the Vanishing Gradient problem

Gradient Descent

Quick Calculus Review

For a specific function $y = f(x)$ where both $x$ and $y$ are real numbers, the derivative $f'(x)$ represents the rate of change of $y$ with respect to $x$. In the context of optimization, the gradient (or derivative) is used to determine the direction in which to update the parameters in order to minimize the objective function

The rate of change of $y$ with respect to $x$ essentially tells us how much a one unit change in $x$ will affect the value of $y$

The derivative $f'(x)$ is a fundamental concept in calculus and is defined as the limit of the average rate of change of the function over an interval as the interval approaches zero:

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

In practice, the derivative gives the slope of $f(x)$ at the point $x$, and specifically is defined around an $\epsilon$-neighborhood of $x$: $f(x + \epsilon) \approx f(x) + \epsilon \cdot f'(x)$ for small $\epsilon$

So how does this help us? The derivative helps us to minimize a function by telling us how to alter $x$ to make a small improvement in some direction for $y$ - meaning if our $y$ is "wrong" in some direction, we can move it more towards the "right" direction by altering $x$ in some way, and we can mathematically figure this out using the derivative

Where the slope is 0, we know there's a min or max, and using a second-order derivative test can help us determine which one it is. The second derivative $f''(x)$ tells us about the curvature of the function at that point: if $f''(x) > 0$, we have a local minimum, and if $f''(x) < 0$, we have a local maximum

Global min / max is determined by evaluating the function at critical points and endpoints, and comparing their values to find the absolute minimum or maximum over the entire domain, but in practice gradient descent can converge to a local minimum rather than the global minimum, especially in non-convex optimization problems - below, we'd just tweak $x$ a bit to get closer to the bottom of the valley

$$f(x + \epsilon) \approx f(x) + \epsilon \cdot f'(x)$$

To

$$f(x - \epsilon \cdot \text{sign}(f'(x))) \approx f(x) - \epsilon \cdot |f'(x)|$$

The $\text{sign}(f'(x))$ function returns $1$ if the derivative is positive, $-1$ if it's negative, and $0$ if it's zero. This means that if the derivative is positive, we subtract a small value from $x$ to decrease $f(x)$, and if the derivative is negative, we add a small value to $x$ to also decrease $f(x)$

We know $f(x - \epsilon \cdot \text{sign}(f'(x)))$ is less than or equal to $f(x)$ since $\epsilon \cdot |f'(x)|$ is always non-negative, so by updating $x$ in the direction opposite to the sign of the derivative, we can decrease the value of the function - and if we reduce $f(x)$, we are reducing $y$
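A tiny sketch of this sign-based update on the illustrative function $f(x) = x^2$:

```python
# Minimize f(x) = x^2 by repeatedly stepping against the sign of f'(x) = 2x
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x, eps = 3.0, 0.1
for _ in range(100):
    sign = (f_prime(x) > 0) - (f_prime(x) < 0)   # sign(f'(x))
    x = x - eps * sign                           # step opposite the slope
print(x, f(x))   # ends up oscillating near 0; shrinking eps over time would converge exactly
```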

This all then happens multi-dimensionally, and so instead of computing the derivative with respect to a single $x$ input, we must compute the partial derivative with respect to each input dimension $x_i$ in the input vector $\bold{x}$ for $y$ - the partial derivative $\frac{\partial f}{\partial x_i}$ represents the rate of change of $f$ with respect to $x_i$, while keeping all other input dimensions constant

The Gradient is actually defined as the vector of all partial derivatives with respect to each input dimension, denoted $\nabla_x f(\bold{x})$ or $\bold{grad} f(\bold{x})$, and critical points occur where the gradient is zero, meaning all partial derivatives are zero simultaneously

The directional derivative in the direction $\bold{u}$ is the slope of the function $f$ in the direction of $\bold{u}$

This directional derivative is the derivative of $f(\bold{x} + \alpha \cdot \bold{u})$ with respect to $\alpha$, evaluated at $\alpha = 0$ - for small $\alpha$, $f(\bold{x} + \alpha \cdot \bold{u}) \approx f(\bold{x}) + \alpha \cdot \nabla_x f(\bold{x}) \cdot \bold{u}$

From this, and the chain rule, the directional derivative equates to $\frac{\partial f(\bold{x} + \alpha \cdot \bold{u})}{\partial \alpha}$

Which simplifies to the direction vector multiplied by the gradient, $\bold{u}^{T} \nabla_x f(\bold{x})$

Therefore, to minimize $f$ we would like to find the direction in which $f$ decreases the fastest, which is given by the negative gradient $-\nabla_x f(\bold{x})$ - why? Consider $\min_{\bold{u}} \bold{u}^{T} \nabla_x f(\bold{x})$

Equates to $\min_{\bold{u}} ||\bold{u}|| \cdot ||\nabla_x f(\bold{x})|| \cdot \cos(\theta)$

Where $\theta$ is the angle between $\bold{u}$ and $\nabla_x f(\bold{x})$. Most proofs take $\bold{u}$ to be a unit vector to make things easy, so $||\bold{u}||$ simplifies to $1$, and the gradient norm doesn't depend on $\bold{u}$

This ends up equating to $\min_{\bold{u}} \cos(\theta)$

This is minimized when $\cos(\theta)$ is at its minimum value, which is $-1$, which occurs when $\bold{u}$ points in the opposite direction of the gradient $\nabla_x f(\bold{x})$.

Therefore, the gradient points uphill (the direction of steepest ascent), and the negative gradient points downhill (the direction of steepest descent). This is why, in gradient descent algorithms, we update our parameters in the direction of the negative gradient to minimize the objective function

$$\bold{x}' = \bold{x} - \eta \cdot \nabla_x f(\bold{x})$$

Where $\eta$ is the learning rate, which controls the step size of each update - how do you choose this rate?

  • Constant rate: If it's too large you could overshoot the minimum, and if it's too small the convergence will be slow
  • Adaptive rate:
    • Line Search: Evaluating $f(\bold{x} - \epsilon \cdot \nabla_x f(\bold{x}))$ for different $\epsilon$ values to find the optimal step size at each iteration
      • This is a highly computationally expensive approach since it requires multiple evaluations of the objective function per iteration
    • Backtracking Line Search: A more efficient approach that starts with a large step size and reduces it until a sufficient decrease in the objective function is observed
    • Adaptive Methods: Techniques like AdaGrad, RMSProp, and Adam adjust the learning rate based on the history of gradients, allowing for more efficient convergence
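A sketch of backtracking line search on a toy quadratic (the function, constants, and starting point are all illustrative):

```python
import numpy as np

def f(x):
    # Illustrative quadratic bowl with different curvature per dimension
    return x[0] ** 2 + 10 * x[1] ** 2

def grad_f(x):
    return np.array([2 * x[0], 20 * x[1]])

def backtracking_step(x, eta0=1.0, beta=0.5, c=1e-4):
    # Start with a large step and shrink it until a sufficient decrease
    # in the objective (the Armijo condition) is observed
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - c * eta * g @ g:
        eta *= beta
    return x - eta * g

x = np.array([5.0, 5.0])
for _ in range(50):
    x = backtracking_step(x)
print(x, f(x))   # should land very close to the minimum at [0, 0]
```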

Jacobian and Hessian

Jacobian and Hessian are just extensions of Gradients - they collect the partial derivatives of multi-input (and vector-valued) functions into matrices

They make doing gradient optimizations more efficient by using matrix multiplication and vector multiplication

They're useful for storing all of the partial derivatives of a function whose inputs and outputs are both vectors

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function, while the Hessian is the matrix of all second-order partial derivatives

For a function $f: \mathbb{R}^m \rightarrow \mathbb{R}^n$, the Jacobian is an $n \times m$ matrix; the Hessian applies to a scalar-valued function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ and is an $n \times n$ matrix

$$J_{i, j} = \frac{\partial \bold{f}(\bold{x})_{i}}{\partial x_j} \ : \ \bold{J} \in \mathbb{R}^{n \times m}$$

$$H_{i, j} = \frac{\partial^2 f(\bold{x})}{\partial x_i \partial x_j} \ : \ \bold{H} \in \mathbb{R}^{n \times n}$$

The Hessian is a symmetric matrix as long as the second partial derivatives are continuous: differentiating with respect to $x_i$ and then $x_j$ gives the same result as $x_j$ and then $x_i$, as long as things are well defined

Therefore, the Hessian is real and symmetric - so we're able to decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors. The second derivative in a specific direction, represented by a unit vector $\bold{d}$, is given by $\bold{d}^{T} \bold{H} \bold{d}$; when $\bold{d}$ is an eigenvector, this equals the corresponding eigenvalue, i.e. the amount of curvature in that direction. The maximum eigenvalue determines the maximum second derivative (i.e. the maximum curvature)

This is helpful, as the eigenvalues tell us how well we can expect a gradient descent step to perform - the curvature in any step direction is bounded between the smallest and largest eigenvalues

So as before, first derivative tells us the direction of steepest ascent, while the second derivative tells us about the curvature of the function in that direction

How do we interpret curvature? Suppose there's a quadratic function (or a function that locally behaves as a quadratic function)

  • If the second derivative of this function is $0$, then there's no curvature and it's a perfectly straight line. At this point its value can be perfectly predicted using the gradient - if the gradient is $1$, then a downhill step of size $\eta$ will decrease the cost function by exactly $\eta$
  • If the second derivative is negative, the function curves downwards, so the cost function would actually decrease by more than $\eta$ (like a ball accelerating downhill)
  • Same thing for positive - it would actually decrease by less than $\eta$ (like a ball decelerating uphill)

Curvature

Therefore, the gradient (the Jacobian of a scalar-valued function) gives us the multi-dimensional steepest ascent direction, while the Hessian provides information about the curvature in that direction
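A small sketch of directional curvature $\bold{d}^{T} \bold{H} \bold{d}$ on an illustrative quadratic whose Hessian is constant:

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T H x, whose Hessian is the constant matrix H
H = np.array([[1.0, 0.0],
              [0.0, 100.0]])

eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)   # [1, 100]: the curvature along each eigenvector direction

# Directional second derivative d^T H d for unit vectors d
d = np.array([1.0, 0.0])
print(d @ H @ d)   # 1.0: gentle curvature along the first axis
d = np.array([0.0, 1.0])
print(d @ H @ d)   # 100.0: steep curvature along the second axis
```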

Taylor Series

Instead of computing the Hessian directly, we can use a Taylor series expansion to approximate the function around a point. The second-order Taylor series expansion of a function $f(\bold{x})$ around a point $\bold{x}_0$ is given by:

$$f(\bold{x}) \approx f(\bold{x}_0) + \nabla f(\bold{x}_0)^T (\bold{x} - \bold{x}_0) + \frac{1}{2} (\bold{x} - \bold{x}_0)^T \bold{H} (\bold{x} - \bold{x}_0)$$

Where $\nabla f(\bold{x}_0)$ is the gradient at $\bold{x}_0$ and $\bold{H}$ is the Hessian matrix at $\bold{x}_0$

With a learning rate of $\eta$ and gradient $\bold{g} = \nabla f(\bold{x}_0)$, we can use this Taylor series expansion to analyze a gradient descent step:

$$\bold{x}_{1} = \bold{x}_{0} - \eta \bold{g}$$

Which we can substitute to obtain

$$f(\bold{x}_{0} - \eta \bold{g}) \approx f(\bold{x}_{0}) - \eta \bold{g}^T \bold{g} + \frac{\eta^2}{2} \bold{g}^T \bold{H} \bold{g}$$

Which has 3 major portions:

  • The original value of the function
  • The expected improvement due to the slope of the function, i.e. the linear approximation (first-order term)
  • The correction we apply to account for the curvature, i.e. the quadratic approximation (second-order term)
    • If this part is too large we can actually move back uphill (go into the valley and back up)
    • If $\bold{g}^T \bold{H} \bold{g}$ is zero or negative, the Taylor series predicts that increasing $\eta$ forever will decrease $f$ forever

Altogether, this Taylor series expansion approximates what a gradient step will do within a localized area, reducing the overall complexity of the problem, but the approximation starts to degrade with a large enough $\eta$
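When $\bold{g}^T \bold{H} \bold{g}$ is positive, minimizing the quadratic approximation over $\eta$ gives an optimal step size $\eta^* = \frac{\bold{g}^T \bold{g}}{\bold{g}^T \bold{H} \bold{g}}$; a quick sketch on an illustrative quadratic (where the approximation happens to be exact):

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T H x, so grad f(x) = H x and the Hessian is H
H = np.array([[2.0, 0.0],
              [0.0, 50.0]])

def f(x):
    return 0.5 * x @ H @ x

x0 = np.array([1.0, 1.0])
g = H @ x0

# Second-order Taylor prediction of a step of size eta along -g:
# f(x0 - eta*g) ~= f(x0) - eta * g.g + 0.5 * eta^2 * g.H.g
def taylor_prediction(eta):
    return f(x0) - eta * g @ g + 0.5 * eta ** 2 * g @ H @ g

eta_star = (g @ g) / (g @ H @ g)   # step size minimizing the prediction
print(eta_star)                                                  # ~0.02
print(f(x0), taylor_prediction(eta_star), f(x0 - eta_star * g))  # 26.0, ~0.92, ~0.92
```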

So how to choose $\eta$ heuristically?

  • Line search and other computationally expensive steps are mentioned above
  • Theoretically, it mostly revolves around utilizing first and second derivative tests - however, since most of the time it's a multi-dimensional optimization problem there's a need to consider the Jacobian and Hessian

The eigendecomposition of the Hessian matrix can provide info on the curvature of the multi-dimensional space, such as whether a critical point is a local max, min, or saddle point. The eigenvalues provide the following info:

  • All positive = local min
  • All negative = local max
  • Mixed = saddle point
    • Saddle Point

Second Derivative Test

The condition number of the Hessian (ratio of the largest eigenvalue magnitude over the smallest) can also provide info on the sensitivity of the optimization problem to changes in the input space. A high condition number indicates that the problem is ill-conditioned, meaning that small changes in the input can lead to large changes in the output
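A quick sketch of the eigenvalue test on the illustrative function $f(x, y) = x^2 - y^2$, which has a critical point (a saddle) at the origin:

```python
import numpy as np

# Analytic Hessian of f(x, y) = x^2 - y^2 (constant everywhere)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)
print(eigvals)   # [-2, 2]

if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("saddle point")   # this branch fires here
else:
    print("inconclusive (some eigenvalues are zero)")
```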

Optimization Problems

Taking a look at Newton's method - it is an iterative optimization algorithm that uses the first and second derivatives of a function to find its local maxima and minima. The update rule for Newton's method is given by:

$$\bold{x}_{1} = \bold{x}_{0} - \eta \bold{H}^{-1} \nabla f(\bold{x}_{0})$$

Where $\bold{H}$ is the Hessian matrix at $\bold{x}_{0}$ and $\nabla f(\bold{x}_{0})$ is the gradient at $\bold{x}_{0}$. This update rule incorporates both the gradient and the curvature information provided by the Hessian, allowing for more informed steps towards the optimum, but it's the simplest second-order scheme in that it assumes the Hessian is constant over the optimization step

Near saddle points, the behavior of Newton's method can be problematic. Since the Hessian may have both positive and negative eigenvalues, the update step can lead to large changes in the optimization variable, potentially causing divergence or oscillations. In such cases, modifications to the basic Newton's method, such as adding a damping factor or using a trust region approach, may be necessary to ensure convergence
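A minimal sketch of Newton's method on an illustrative smooth function, with $\eta = 1$ and the analytic gradient and Hessian supplied by hand:

```python
import numpy as np

def f(x):
    # Illustrative smooth function with its minimum at [0, 0]
    return x[0] ** 2 + 5 * x[1] ** 2 + 0.1 * x[0] ** 4

def grad(x):
    return np.array([2 * x[0] + 0.4 * x[0] ** 3, 10 * x[1]])

def hess(x):
    return np.array([[2 + 1.2 * x[0] ** 2, 0.0],
                     [0.0, 10.0]])

x = np.array([3.0, 2.0])
for _ in range(10):
    # Newton step: solve H p = grad f rather than explicitly inverting H
    p = np.linalg.solve(hess(x), grad(x))
    x = x - p
print(x, f(x))   # converges to [0, 0] in a handful of iterations
```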

Optimization algorithms that use only the gradient, such as gradient descent, are known as First Order Optimization Algorithms

Optimization algorithms that also use the Hessian, such as Newton's Method, are known as Second Order Optimization Algorithms

In practice, most ML and Deep Learning algorithms tend to design optimization algorithms for limited families of functions, meaning there's no perfect one-size fits all solution

We can gain some guarantees (one size fits all) if we restrict attention to functions that have certain properties, such as:

  • Convexity: If the function is convex, any local minimum is also a global minimum, making optimization easier
    • Means the Hessian is positive semi-definite everywhere
    • I.e. they lack saddle points
  • Smoothness: If the function is smooth (i.e., has continuous derivatives), optimization algorithms can make more informed steps
  • Lipschitz Continuity: If the function satisfies a Lipschitz condition, we can bound the optimization error and ensure convergence
    • Lipschitz continuity implies that the function does not change too rapidly, which helps in ensuring that optimization steps are stable and converge to a solution
    • The below constraint basically says "don't let the gradients change too quickly"
    • It is defined via a Lipschitz constant $L$, which satisfies: $\forall \bold{x}, \forall \bold{y}, \ \|\nabla f(\bold{x}) - \nabla f(\bold{y})\| \leq L \|\bold{x} - \bold{y}\|$

Lipschitz continuity is a fairly weak constraint, while convex optimization is much stricter - it requires the function to be convex (and typically Lipschitz continuous as well)

Constrained Optimization

Constrained optimization refers to maximizing or minimizing an objective function inside some set $S$ instead of over the entire potential metric space

Points $\bold{x} \in S$ are known as feasible points, and the set of all feasible points is known as the feasible set

In terms of gradient descent, this usually means checking around some $\epsilon$-neighborhood of the current point to see if any of those points are in the feasible set, and then picking the one that results in the largest decrease in the objective function. We can do this in multiple different ways:

  • Can simply restrict ourselves into that neighborhood
    • In line search for example, we can check only points that are in the feasible set
  • Can project the point back into the feasible set after each update
    • For example, if we have a constraint that all elements of $\bold{x}$ must be non-negative, after each gradient descent update, we can set any negative elements of $\bold{x}$ to zero (sketched in the example after this list)
  • Can use Lagrange multipliers to incorporate the constraints into the objective function
    • This involves adding a term to the objective function that penalizes violations of the constraints, allowing us to optimize the modified objective function without explicitly checking the constraints at each step
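A minimal sketch of that projection approach (projected gradient descent), with an illustrative objective whose unconstrained minimum violates the non-negativity constraint:

```python
import numpy as np

def f(x):
    # Illustrative objective: the unconstrained minimum is at [-1, 2]
    return (x[0] + 1) ** 2 + (x[1] - 2) ** 2

def grad_f(x):
    return np.array([2 * (x[0] + 1), 2 * (x[1] - 2)])

x, eta = np.array([5.0, 5.0]), 0.1
for _ in range(200):
    x = x - eta * grad_f(x)   # ordinary gradient step
    x = np.maximum(x, 0.0)    # project back onto the feasible set {x : x >= 0}
print(x)   # ~[0, 2]: the negative coordinate gets clipped to the constraint boundary
```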

Talking more about the Lagrange example - the idea is to design a different, unconstrained optimization problem whose solution can be converted into a solution for the original

A quick example is trying to minimize $f(x)$ for $x \in \mathbb{R}^2$, with $x$ constrained to have an $L^2$ norm exactly equal to 1. Instead of directly solving that, we can instead minimize $g(\theta) = f([\cos\theta, \sin\theta]^T)$ with respect to $\theta$, and then return $x^* = [\cos\theta^*, \sin\theta^*]^T$ as the solution to the original problem

Finding the transformation above in practice requires creative thinking, but there's a more generalized approach using Lagrange multipliers

In doing so, we must first describe $S$ in terms of $m$ functions $g^{(i)}$ and $n$ functions $h^{(j)}$, which are known as equality constraints and inequality constraints respectively. At that point $S = \{\bold{x} \mid \forall i, g^{(i)}(\bold{x}) = 0; \ \forall j, h^{(j)}(\bold{x}) \leq 0 \}$

Suppose we want to minimize a function $f(\bold{x})$ subject to these constraints. We can introduce Lagrange multipliers $\lambda_i$ and $\alpha_j$ and define the generalized Lagrangian: $\mathcal{L}(\bold{x}, \bold{\lambda}, \bold{\alpha}) = f(\bold{x}) + \sum_{i} \lambda_i g^{(i)}(\bold{x}) + \sum_{j} \alpha_j h^{(j)}(\bold{x})$

The intuition here is that we can solve a constrained optimization problem using unconstrained optimization techniques, and as long as at least one feasible point exists, and $f(\bold{x})$ is not permitted to have value $\infty$, we can find a solution

Moreover, the two problems below have the same optimal objective function value and the same set of optimal points $\bold{x}$

$$\min_{\bold{x}} \max_{\lambda} \max_{\alpha, \alpha \geq 0} \mathcal{L}(\bold{x}, \lambda, \alpha) \quad \text{and} \quad \min_{\bold{x}} f(\bold{x})$$

Why? Because any time the constraints are satisfied we have

$$\max_{\lambda} \max_{\alpha, \alpha \geq 0} \mathcal{L}(\bold{x}, \lambda, \alpha) = f(\bold{x})$$

And any time the constraints are violated, it's equal to $\infty$

To find the optimal solution, we take the gradients of the Lagrangian with respect to both $\bold{x}$ and $\lambda$, and set them to zero:

$$\nabla_{\bold{x}} \mathcal{L}(\bold{x}, \lambda, \alpha) = \nabla f(\bold{x}) + \lambda \nabla g(\bold{x}) + \alpha \nabla h(\bold{x}) = 0$$

$$\frac{\partial \mathcal{L}(\bold{x}, \lambda, \alpha)}{\partial \lambda} = g(\bold{x}) = 0$$
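A small sketch of these stationarity conditions on an illustrative equality-constrained problem (minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$), solved symbolically with sympy:

```python
import sympy as sp

# Hypothetical example: minimize f = x^2 + y^2 subject to g = x + y - 1 = 0
x, y, lam = sp.symbols("x y lambda")
f = x**2 + y**2
g = x + y - 1

# Lagrangian: L = f + lambda * g
L = f + lam * g

# Set all partial derivatives of L (w.r.t. x, y, and lambda) to zero and solve
stationarity = [sp.diff(L, v) for v in (x, y, lam)]
print(sp.solve(stationarity, (x, y, lam), dict=True))
# [{x: 1/2, y: 1/2, lambda: -1}] -- the constrained minimum sits at (1/2, 1/2)
```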

Lagrange Intuition

The above never really sat well with me - it doesn't explain why it intuitively matters or helps anyone

Let's set it up - given $g(x, y) = c$, we want to maximize $f(x, y)$

Lagrange Intuition

Ideally, we could just set $f'(x, y) = 0$ and use some basic calculus to find the max! It'd be wonderful. But this doesn't take the constraint into account

So what mathematical setup can be produced so there's a way to maximize that function while also taking the constraint into account? The Lagrange Multipliers are specifically designed to do that!

Rewrite $g$: $g(x, y) = c \therefore g(x, y) - c = 0$

$0 \times \text{constant} = 0$: $\lambda \times (g(x, y) - c) = 0$

Adding $0$ to anything is simply that value: $f(x, y) = f(x, y) + \lambda \times (g(x, y) - c)$

At this point we're fairly close - there's a function that's equal to our original function, but it takes our constraint into account

The partial derivative of this new function with respect to $\lambda$ gives us intuition on how our function changes as the multiplier changes: $\frac{\partial \mathcal{L}}{\partial \lambda} = g(x, y) - c = 0$

It also equates out to $g(x, y) - c$, which we know is equal to $0$ - this means as long as this partial derivative is equal to $0$, we are satisfying our constraint!

What can we do with this? We can use it to find the optimal values of $x$ and $y$ that maximize our function while satisfying the constraint. We will use gradients, as they collect all partial derivatives into a vector, and we will set the gradient of our Lagrangian to zero: $\nabla_{x, y} \mathcal{L}(x, y, \lambda) = \nabla f(x, y) + \lambda \nabla g(x, y) = 0$

As we set this gradient to $0$ and solve for it, we guarantee that the constraint is met and that we solve for the optimal values of $x$ and $y$!

Next would be if the constraint is $g(x, y) < c$, where there's some constraint region instead of a strict line - in this scenario most of the above still holds, except our Lagrangian will need to account for the inequality constraint. This is typically done by introducing a slack variable $s \geq 0$ such that $g(x, y) + s^2 = c$

As long as $s^2$ is non-zero, the point strictly satisfies the constraint:

  • $s$ is non-zero
  • Therefore, $g(x, y) - c = -s^2 < 0$

$$g(x, y) < c \Leftrightarrow g(x, y) - c < 0$$

$s$ is known as the slack variable

The Lagrangian is then defined as: $\mathcal{L}(x, y, \lambda, s) = f(x, y) + \lambda (g(x, y) + s^2 - c)$

Example: Linear Least Squares