Interior Point Methods for Solving Linear Programs
- These notes are based on my semester research project in the Theoretical Computer Science Lab at EPFL, where I was supervised by Prof. Kapralov and Kshiteej Sheth.
- Having an idea of what Linear Programs and Convex Optimization are is recommended before reading.
1. Introduction & Problem Statement
Interior-Point Methods are approximation algorithms that solve linear programs. Seen as a framework, IPMs underlie the fastest known algorithm to date for the famous maximum-flow problem on graphs, and although they suffer from numerical stability problems in practice, they are the current best framework for optimizing general convex functions to high accuracy, in both theory and practice.
These notes first present the general framework of Interior-Point Methods and how it works; then the concept of barriers is formalized, leading to the Lee-Sidford barrier, which is used to achieve runtime improvements. The notes focus on understanding the theoretical concepts of the method rather than on the details of the runtime analysis. The concepts are in my opinion super interesting and very smart!
Given a primal variable $x \in \mathbb{R}^n$, an objective $c^\top x$ with $c \in \mathbb{R}^n$, and constraints $A^\top x \ge b$ with $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^m$, our goal is to solve
$$\min_{A^\top x \ge b} c^\top x$$
approximately, up to $\epsilon$-error. This is our constrained optimization problem, and the constraints define a convex subset of $\mathbb{R}^n$ which we call the polytope.
2. Framework Intuition
Unlike other methods for solving linear programs, like the Simplex Method, which walk along the border of the polytope, in the Interior-Point method we start inside the polytope, away from the border, and slowly move towards the optimum (which lies on the border) while always keeping a feasible point.
2.1 How We Do It
To stay away from the border we massage our constrained optimization problem into an unconstrained one. We introduce a convex function that we add to our objective, such that as we get closer to the border, the value of the function explodes. This penalizes points near the border and keeps us from leaving the polytope. Such a function is called a barrier, and one simple example of such a function is the log barrier $\phi(x) = -\sum_{i=1}^m \log(a_i^\top x - b_i)$, where the $a_i$ are the columns of $A$. We can now add to our objective function the term $t\,\phi(x)$, where $t > 0$ is a scalar: $\phi$ ensures that we stay in the feasible region, and $t$ acts as a regularization weight such that when $t$ is small, we are close enough to the original problem that we can get a satisfactory approximation.
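For concreteness, here is a minimal NumPy sketch of this log barrier and its derivatives, assuming the conventions above (constraints $A^\top x \ge b$, with the $a_i$ stored as the columns of `A`); the gradient and Hessian will be needed for the Newton steps of Section 3.1:

```python
import numpy as np

def log_barrier(A, b, x):
    """Value, gradient and Hessian of phi(x) = -sum_i log(a_i^T x - b_i),
    where the a_i are the columns of A (constraints A^T x >= b)."""
    s = A.T @ x - b                # slacks; strictly positive inside the polytope
    assert np.all(s > 0), "x is not strictly feasible"
    value = -np.sum(np.log(s))
    grad = -A @ (1.0 / s)          # -sum_i a_i / s_i
    hess = (A / s**2) @ A.T        # sum_i a_i a_i^T / s_i^2
    return value, grad, hess
```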
The modified problem statement is now
$$x_t^* = \arg\min_x f_t(x), \qquad f_t(x) := \frac{1}{t}\, c^\top x + \phi(x),$$
where we divided the objective $c^\top x + t\,\phi(x)$ by $t$; this does not change the minimizer, and it will be convenient later because $f_t$ then inherits the smoothness properties of $\phi$ for every $t$. We will look at solving this (family of) problems now. Note that this is a set of problems, because the problem changes as $t$ changes. The set of optimum points $\{x_t^* : t > 0\}$ is called the central path. And as said before, intuitively, $x_t^* \to x^*$ when $t \to 0$, where $x^*$ is the optimum of the original problem. So the idea of the algorithm is that we will follow the central path going in the direction of $t \to 0$ (thus decreasing $t$).
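A quick sanity check on this intuition: the first-order optimality condition for $f_t$ reads
$$\nabla f_t(x_t^*) = \frac{c}{t} + \nabla\phi(x_t^*) = 0 \quad\Longleftrightarrow\quad \nabla\phi(x_t^*) = -\frac{c}{t},$$
so as $t \to 0$ the barrier gradient at $x_t^*$ must blow up, which only happens near the border, where the optimum of the LP lies. This identity will also be the key step later, in the proof of Lemma 4.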
The algorithm proceeds iteratively as follows:
- Initialization: start at $t = t_0$, find an initial feasible point $x$ "close enough" to $x_{t_0}^*$.
- At each iteration, move $x$ "close enough" to $x_t^*$ for the current $t$, and then decrease $t$ multiplicatively: $t \leftarrow (1-h)\,t$ for some $h \in (0,1)$,
- return $x$ once $t$ is small enough (achieved $\epsilon$-error)
How we measure "closeness" will be defined more formally later. Achieving $\epsilon$-error depends on how small $t$ is.
From this general intuition, two main questions arise:
- How close should $x$ be to $x_t^*$ at each iteration? This will control the per-iteration cost.
- By how much can we decrease $t$ at each iteration? This will control the number of iterations.
2.2 Some More Intuition
Before going to the next section, I give more detail on the above two questions. Theorems and lemmas defined later will clarify this part, so if things don't get clearer while reading, you can continue to the next parts.
There's a region that $x$ has to lie in to be able to move closer to $x_t^*$. If $x$ is not in this region, it might be impossible (or too costly) to later move it closer to $x_t^*$. So when changing $t$ to $t' = (1-h)\,t$, we change the target to $x_{t'}^*$ and thus we change the region we're looking at: $x$ might be in the region of $x_t^*$ but not in the region of $x_{t'}^*$. That's why we first move $x$ closer to $x_t^*$; because of geometric dependencies between $x_t^*$ and $x_{t'}^*$, $x$ is then also close enough so that it lies in the region of $x_{t'}^*$. Also, if we decrease $t$ too fast ($h$ too high), then we might pay too much when moving closer to $x_{t'}^*$, because the region of $x_{t'}^*$ is even farther away.
These are subtleties that will be handled by lemmas in the more formal sections. But for now here are the two more detailed questions:
- At each iteration, how close should $x$ be to $x_t^*$ to lie inside the region of $x_{t'}^*$? This will control the per-iteration cost.
- By how much can we decrease $t$ at each iteration? This will control the number of iterations.
Finally, recall that we always want $x$ to be close to $x_t^*$ because as we decrease $t$, $x_t^*$ approaches $x^*$.
[Figure: following the central path. The dotted circles are the regions of $x_t^*$ and $x_{t'}^*$; the grey arrows represent the per-iteration cost of moving $x$.]
3. General Framework
As mentioned before, we start by softening
$$\min_{x \in K} c^\top x,$$
where $K$ is a convex subset of $\mathbb{R}^n$, into
$$\min_x f_t(x), \qquad f_t(x) = \frac{1}{t}\, c^\top x + \phi(x),$$
where $t > 0$ and $\phi$ is a convex function such that $\phi(x) \to \infty$ as $x \to \partial K$. We call $\phi$ a barrier function for $K$.
Definition 1: The central path is defined as
$$\left\{\, x_t^* = \arg\min_x f_t(x) \;:\; t > 0 \,\right\}.$$
The framework is as follows:
- Find a feasible $x$ close to $x_{t_0}^*$
- While $t$ is not small enough:
  - Move $x$ closer to $x_t^*$
  - $t \leftarrow (1-h)\,t$
The goal of IPMs is to start with a point close to the central path, and follow the path until we get a good approximation of the original problem. We do not present how to get the initial point, but there exist multiple methods for getting an initial point efficiently, one of which can be found in Reference 2.
Let's now show that this framework works!
3.1 Newton Step and Distance Measure
To move $x$ closer to $x_t^*$, we are using Newton steps:
$$x \;\leftarrow\; x - \left( \nabla^2 f_t(x) \right)^{-1} \nabla f_t(x).$$
Since Newton's method uses the Hessian and performs better when changes in $x$ don't change the Hessian too much, we are motivated to make the assumption that the Hessian is Lipschitz (a smoothness property). This is why we introduce the definition of self-concordance in the next subsection.
Note that the norm of the Newton step $\left( \nabla^2 f_t(x) \right)^{-1} \nabla f_t(x)$ is called the step size.
To study the performance of a Newton iteration, we need to define how we measure the distance of $x$ to $x_t^*$. Since the smaller the step size, the closer we are to the optimum (because $\nabla f_t(x)$ is small), we define the distance in terms of the size of the step in the Hessian norm.
Definition 2: Given a positive-definite matrix $M$, define $\|v\|_M = \sqrt{v^\top M v}$. The distance of $x$ to $x_t^*$ is defined as follows
$$\delta_t(x) = \left\| \left( \nabla^2 f_t(x) \right)^{-1} \nabla f_t(x) \right\|_{\nabla^2 f_t(x)}.$$
This (equivalently, inverse-Hessian, see below) norm is used to account for the number of Newton steps needed for $x$ to converge to $x_t^*$. This makes sense algorithmically because the algorithm is faster if fewer Newton steps are made.
Note that $\delta_t(x)$ is also equal to $\left\| \nabla f_t(x) \right\|_{\left( \nabla^2 f_t(x) \right)^{-1}}$ (this can be derived by a simple calculation). We will mainly use this form instead of the one given in the definition, but I really wanted to point out that the definition comes from a motivation to analyze Newton steps.
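The simple calculation, writing $g = \nabla f_t(x)$ and $H = \nabla^2 f_t(x)$:
$$\left\| H^{-1} g \right\|_{H} = \sqrt{(H^{-1}g)^\top H\, (H^{-1}g)} = \sqrt{g^\top H^{-1} g} = \| g \|_{H^{-1}}.$$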
Also note that $\delta_t(x) = 0$ if and only if $x = x_t^*$, since the gradient vanishes exactly at the optimum.
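A small helper putting Definition 2 into code for the log-barrier case, in the same NumPy setting as the sketch of Section 2.1 (using the second, inverse-Hessian form, which needs a single linear system solve):

```python
def newton_decrement(A, b, c, x, t):
    """delta_t(x) = ||grad f_t(x)||_{(hess f_t(x))^{-1}} for
    f_t(x) = c^T x / t + phi(x), phi the log barrier of A^T x >= b."""
    s = A.T @ x - b                   # slacks, must be positive
    grad = c / t - A @ (1.0 / s)      # grad f_t(x) = c/t + grad phi(x)
    hess = (A / s**2) @ A.T           # hess f_t(x) = hess phi(x)
    return float(np.sqrt(grad @ np.linalg.solve(hess, grad)))
```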
3.2 Self-Concordance: Enabling Convergence
Since we're using Newton steps, remember the convenient assumption that the Hessian should be Lipschitz. Let's define more formally the smoothness property we want.
Definition 3: A convex function $\phi$ is said to be self-concordant if $\forall x \in \operatorname{dom}\phi$, $\forall h \in \mathbb{R}^n$, we have
$$\left| D^3\phi(x)[h, h, h] \right| \;\le\; 2 \left( D^2\phi(x)[h, h] \right)^{3/2}.$$
$D^k\phi(x)[h_1, \dots, h_k]$ is the $k$-th directional derivative of $\phi$ at $x$ in the directions $h_1, \dots, h_k$.
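As a quick example (standard and easy to verify), the one-dimensional log barrier meets this definition with equality:
$$\phi(s) = -\log s: \qquad D^2\phi(s)[h, h] = \frac{h^2}{s^2}, \qquad D^3\phi(s)[h, h, h] = -\frac{2h^3}{s^3},$$
so $\left| D^3\phi(s)[h,h,h] \right| = 2|h|^3/s^3 = 2 \left( D^2\phi(s)[h,h] \right)^{3/2}$.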
Integrating the local inequality from the definition, we get the following lemma, which says that locally the Hessian of a self-concordant function doesn't change too fast:
Lemma 1: Let $\phi$ be a self-concordant function. Then $\forall x \in \operatorname{dom}\phi$, and $\forall y$ s.t. $r := \|y - x\|_{\nabla^2\phi(x)} < 1$,
$$(1 - r)^2 \, \nabla^2\phi(x) \;\preceq\; \nabla^2\phi(y) \;\preceq\; \frac{1}{(1 - r)^2} \, \nabla^2\phi(x).$$
Proof. See Appendix (Self-concordance is necessary to prove this lemma)
When we change $t$, we want the new distance from $x_{t'}^*$ to shrink as we take Newton steps, in order to make progress. This is what self-concordance enables. It says that locally the Hessian doesn't change too much, so it gives assurance that our Newton step won't diverge or make jumps and stop progressing.
The following lemma uses this, and says that the self-concordance condition on $\phi$ guarantees convergence when we try to move closer to $x_t^*$, provided that $x$ was already close enough. This lemma is one of the main tools that enable the algorithm to proceed iteratively and converge towards the optimum.
Lemma 2: Let $\phi$ be a self-concordant barrier function. Suppose that $\delta_t(x) < 1$. Then, after one Newton step of the form $x^+ = x - \left( \nabla^2 f_t(x) \right)^{-1} \nabla f_t(x)$, we have
$$\delta_t(x^+) \;\le\; \left( \frac{\delta_t(x)}{1 - \delta_t(x)} \right)^2,$$
and moving using the Newton method results in moving closer to $x_t^*$.
Proof. See Appendix (Lemma 1. is necessary to prove this lemma)
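To see the quadratic convergence concretely, iterating the bound starting from $\delta_t(x) = \frac14$ gives
$$\frac14 \;\to\; \left( \frac{1/4}{3/4} \right)^2 = \frac19 \;\to\; \left( \frac{1/9}{8/9} \right)^2 = \frac{1}{64} \;\to\; \left( \frac{1/64}{63/64} \right)^2 = \frac{1}{3969} \;\to\; \cdots$$
so a constant number of extra Newton steps at a fixed $t$ makes the distance negligible; we will use this in Section 3.4.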
3.3 Following the Central Path
Now, changing $t$ must be done carefully, because we want $h$ to be as big as possible (to minimize the number of iterations), while still maintaining our invariant of Lemma 2. The following lemma gives an upper bound on the amount that $\delta_t(x)$ can change as we change $t$, and suggests a maximum amount by which $t$ can change such that the premise of Lemma 2 still holds in the next iteration.
Lemma 3: Let $\phi$ be a self-concordant barrier function, and suppose $t$ is updated following $t' = (1-h)\,t$. Then we have that
$$\delta_{t'}(x) \;\le\; \delta_t(x) + \frac{h}{1-h} \left( \delta_t(x) + \left\| \nabla\phi(x) \right\|_{\left( \nabla^2\phi(x) \right)^{-1}} \right).$$
Proof. See Appendix
We can now work out how big $h$ can be. For this, let's make our framework more quantitative:
- Find a feasible $x$ such that $\delta_{t_0}(x) \le \frac14$
- While $t$ is not small enough:
  - $x \leftarrow x - \left( \nabla^2 f_t(x) \right)^{-1} \nabla f_t(x)$, $\; t \leftarrow (1-h)\,t$ (this decreases $t$ to $(1-h)t$)
Assume that $\delta_t(x) \le \frac14$ at the start of an iteration, and that $\left\| \nabla\phi(x) \right\|_{\left( \nabla^2\phi(x) \right)^{-1}} \le \sqrt{\nu}$ for all $x$ (this is the quantity appearing in Lemma 3; we will turn this bound into a definition below). Then, according to Lemma 2, after a single Newton step, $\delta_t(x) \le \left( \frac{1/4}{3/4} \right)^2 = \frac19$. To maintain $\delta_t(x) \le \frac14$ for the next iteration, we want
$$\frac19 + \frac{h}{1-h} \left( \frac19 + \sqrt{\nu} \right) \;\le\; \frac14.$$
So we need $h \le \frac{5}{9 + 36\sqrt{\nu}}$. Taking $h$ to be equal to this value maximizes the change of $t$; for simplicity we will use the slightly smaller choice $h = \frac{1}{9\sqrt{\nu}}$.
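Indeed, for $h = \frac{1}{9\sqrt{\nu}}$ (using the standard fact that any self-concordant barrier has $\nu \ge 1$):
$$\frac{h}{1-h} \left( \frac19 + \sqrt{\nu} \right) = \frac{1 + 9\sqrt{\nu}}{9 \left( 9\sqrt{\nu} - 1 \right)} \le \frac{5}{36} \;\Longleftrightarrow\; 36\left(1 + 9\sqrt{\nu}\right) \le 45\left(9\sqrt{\nu} - 1\right) \;\Longleftrightarrow\; \nu \ge 1.$$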
Notice that the amount by which we can change $t$ depends on $\left\| \nabla\phi(x) \right\|_{\left( \nabla^2\phi(x) \right)^{-1}}$. The lower this value, the fewer iterations we will have to do. We have now derived the second property of self-concordance, and we can fully define what $\nu$-self-concordance is.
Definition 4: Let $K$ be a convex subset of $\mathbb{R}^n$. We say that $\phi$ is a $\nu$-self-concordant barrier on $K$ if
- $\left| D^3\phi(x)[h, h, h] \right| \le 2 \left( D^2\phi(x)[h, h] \right)^{3/2}$ for all $x \in K$ and $h \in \mathbb{R}^n$ (Definition 3)
- $\nabla\phi(x)^\top \left( \nabla^2\phi(x) \right)^{-1} \nabla\phi(x) \le \nu$ for all $x \in K$
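As an example (a standard fact, assuming $\nabla^2\phi(x)$ is invertible, i.e. $A$ has full row rank), the log barrier from Section 2.1 is an $m$-self-concordant barrier. Writing $S = \operatorname{diag}(a_i^\top x - b_i)$ and $\mathbf{1}$ for the all-ones vector:
$$\nabla\phi(x) = -A S^{-1} \mathbf{1}, \qquad \nabla^2\phi(x) = A S^{-2} A^\top,$$
$$\nabla\phi(x)^\top \left( \nabla^2\phi(x) \right)^{-1} \nabla\phi(x) = \mathbf{1}^\top \Pi\, \mathbf{1} \le \|\mathbf{1}\|^2 = m, \qquad \Pi = S^{-1} A^\top \left( A S^{-2} A^\top \right)^{-1} A S^{-1},$$
since $\Pi$ is an orthogonal projection ($\Pi^2 = \Pi = \Pi^\top$). So the log barrier gives roughly $\sqrt{m}$ iterations in Theorem 1 below; the Lee-Sidford barrier improves the parameter to $\tilde{O}(n)$, the rank of the constraint matrix, which is where the $\sqrt{\mathrm{rank}}$ in Reference 1 comes from.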
3.4 Termination Condition
Finally, it remains to determine how small $t$ needs to be to achieve $\epsilon$-error. One can upper bound the error using the duality gap of the problem.
Lemma 4: Let $\phi$ be a $\nu$-self-concordant barrier on $K$, and let $x^*$ be the optimum of $\min_{x \in K} c^\top x$. We have that
$$c^\top x_t^* - c^\top x^* \;\le\; t\,\nu.$$
Proof. See Appendix
Our $x$ is a constant number of Newton iterations close to $x_t^*$ (it satisfies $\delta_t(x) \le \frac14$), so to have $c^\top x - c^\top x^* \le \epsilon$ only a constant number more iterations are needed. Hence, to get an $\epsilon$-error we can set $t = \epsilon/\nu$. We can now give the final version of our framework.
Final IPM framework
- Find a feasible $x$ such that $\delta_{t_0}(x) \le \frac14$
- While $t\,\nu > \epsilon$:
  - $x \leftarrow x - \left( \nabla^2 f_t(x) \right)^{-1} \nabla f_t(x)$ (by Lemma 2, $\delta_t(x) \le \frac19$)
  - $t \leftarrow \left( 1 - \frac{1}{9\sqrt{\nu}} \right) t$ (by Lemma 3, $\delta_t(x) \le \frac14$ again)
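Here is a compact NumPy sketch of this final loop in the log-barrier case ($\nu = m$), to make the framework concrete. It is a sketch under assumptions, not a robust implementation: it presumes a strictly feasible, well-centered starting point `x0` (with $\delta_{t_0}(x_0) \le \frac14$), which these notes do not cover finding.

```python
import numpy as np

def solve_lp_ipm(A, b, c, x0, t0=1.0, eps=1e-6):
    """Minimize c^T x over {x : A^T x >= b} with a log-barrier IPM.

    Follows the final framework above: one Newton step on
    f_t(x) = c^T x / t + phi(x), then shrink t, until t * nu <= eps.
    """
    nu = float(b.shape[0])             # log barrier: nu = m constraints
    h = 1.0 / (9.0 * np.sqrt(nu))      # step size from the Lemma 3 analysis
    x, t = x0.astype(float).copy(), float(t0)
    while t * nu > eps:                # Lemma 4: duality gap <= t * nu
        s = A.T @ x - b                # slacks, positive while x is interior
        grad = c / t - A @ (1.0 / s)   # grad f_t(x)
        hess = (A / s**2) @ A.T        # hess f_t(x) = hess phi(x)
        x -= np.linalg.solve(hess, grad)   # Newton step (Lemma 2)
        t *= 1.0 - h                   # follow the central path (Lemma 3)
    return x
```

Each iteration costs one $n \times n$ linear system solve; counting these solves is exactly the measure used in Reference 1.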
With this framework complete, we can now state a main IPM theorem regarding performance.
3.5 Main IPM Theorem
Theorem 1: Let $K$ be the polytope defined by the constraints of the linear program. Given a $\nu$-self-concordant barrier $\phi$ on $K$, we can solve $\min_{x \in K} c^\top x$ up to $\epsilon$-error in $O\!\left( \sqrt{\nu}\, \log\frac{t_0\,\nu}{\epsilon} \right)$ iterations.
Proof. We need to show that we reach $\epsilon$-error in $O\!\left( \sqrt{\nu}\, \log\frac{t_0\,\nu}{\epsilon} \right)$ Newton steps and updates of $t$.
We start at $t = t_0$, and at each iteration we multiply $t$ by a factor $\left( 1 - \frac{1}{9\sqrt{\nu}} \right)$. Let $k$ be the number of iterations. We want
$$t_0 \left( 1 - \frac{1}{9\sqrt{\nu}} \right)^{k} \nu \;\le\; \epsilon.$$
Solving this gives $k = O\!\left( \sqrt{\nu}\, \log\frac{t_0\,\nu}{\epsilon} \right)$.
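Spelling out the last step, using $1 - u \le e^{-u}$:
$$t_0\,\nu \left( 1 - \frac{1}{9\sqrt{\nu}} \right)^{k} \;\le\; t_0\,\nu\, e^{-k/(9\sqrt{\nu})} \;\le\; \epsilon \quad\Longleftarrow\quad k \;\ge\; 9\sqrt{\nu}\, \ln\frac{t_0\,\nu}{\epsilon}.$$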
Discussion
The performance of the IPM framework we've given depends on $\nu$, the self-concordance parameter of the barrier function we choose. This sort of modularizes the research on the subject and makes it possible to improve the performance by concentrating on finding a better (smaller $\nu$) barrier function.
This is the topic of another note, which concentrates on the Lee-Sidford barrier, the best barrier function found so far that is efficiently computable.
References
1. Y. T. Lee, A. Sidford. Solving Linear Programs with sqrt(rank) Linear System Solves.
2. Y. T. Lee. CSE 599: Interplay between Convex Optimization and Geometry (lecture notes).
Appendix
Proof Lemma 1.
Let $x, y \in \operatorname{dom}\phi$ with $r = \|y - x\|_{\nabla^2\phi(x)} < 1$, and let $x_s = x + s(y - x)$ for $s \in [0, 1]$. First take the direction $h = y - x$ and define $\psi(s) = h^\top \nabla^2\phi(x_s)\, h$. Then we have that $\psi'(s) = D^3\phi(x_s)[h, h, h]$. By self-concordance we have
$$|\psi'(s)| \;\le\; 2\, \psi(s)^{3/2}.$$
For $u(s) = \psi(s)^{-1/2}$, we have $u'(s) = -\frac{\psi'(s)}{2\, \psi(s)^{3/2}}$. Hence we have $|u'(s)| \le 1$. Integrating both bounds on $[0, s]$ we have
$$\psi(0)^{-1/2} - s \;\le\; \psi(s)^{-1/2} \;\le\; \psi(0)^{-1/2} + s.$$
Rearranging it gives, with $\psi(0)^{1/2} = r$,
$$\frac{r^2}{(1 + s\,r)^2} \;\le\; \psi(s) \;\le\; \frac{r^2}{(1 - s\,r)^2}.$$
For a general direction $h$, we can obtain, setting $\psi_h(s) = h^\top \nabla^2\phi(x_s)\, h$ and using the multi-directional consequence of self-concordance $\left| D^3\phi(x_s)[h, h, y - x] \right| \le 2\, \psi_h(s)\, \|y - x\|_{\nabla^2\phi(x_s)}$,
$$|\psi_h'(s)| \;\le\; 2\, \psi_h(s)\, \frac{r}{1 - s\,r},$$
where we bounded $\|y - x\|_{\nabla^2\phi(x_s)} = \psi(s)^{1/2} \le \frac{r}{1 - s\,r}$ using the first part. Rearranging gives
$$\left| \frac{d}{ds} \ln \psi_h(s) \right| \;\le\; \frac{2\,r}{1 - s\,r}.$$
Integrating from $0$ to $1$ gives $\left| \ln \frac{\psi_h(1)}{\psi_h(0)} \right| \le -2 \ln(1 - r)$, i.e. $(1 - r)^2\, \psi_h(0) \le \psi_h(1) \le \frac{\psi_h(0)}{(1 - r)^2}$ for every direction $h$, which is the result.
Proof Lemma 2.
Write $\delta = \delta_t(x)$, $H(\cdot) = \nabla^2 f_t(\cdot)$, and let $x^+ = x - H(x)^{-1} \nabla f_t(x)$, so that $\|x^+ - x\|_{H(x)} = \delta < 1$. Lemma 1 shows that
$$H(x^+) \;\succeq\; (1 - \delta)^2\, H(x),$$
and hence
$$\delta_t(x^+) = \left\| \nabla f_t(x^+) \right\|_{H(x^+)^{-1}} \;\le\; \frac{1}{1 - \delta} \left\| \nabla f_t(x^+) \right\|_{H(x)^{-1}}. \tag{*}$$
To bound $\left\| \nabla f_t(x^+) \right\|_{H(x)^{-1}}$, we calculate that, since $\nabla f_t(x) + H(x)(x^+ - x) = 0$,
$$\nabla f_t(x^+) = \nabla f_t(x^+) - \nabla f_t(x) - H(x)(x^+ - x) = \int_0^1 \left[ H(x_s) - H(x) \right] (x^+ - x)\, ds, \tag{**}$$
where $x_s = x + s(x^+ - x)$. For the first term in the bracket, we use Lemma 1 to get that
$$(1 - s\delta)^2\, H(x) \;\preceq\; H(x_s) \;\preceq\; \frac{1}{(1 - s\delta)^2}\, H(x).$$
Therefore we have
$$\left\| \left[ H(x_s) - H(x) \right] (x^+ - x) \right\|_{H(x)^{-1}} \;\le\; \left( \frac{1}{(1 - s\delta)^2} - 1 \right) \left\| x^+ - x \right\|_{H(x)}.$$
Putting it into (**) gives
$$\left\| \nabla f_t(x^+) \right\|_{H(x)^{-1}} \;\le\; \delta \int_0^1 \left( \frac{1}{(1 - s\delta)^2} - 1 \right) ds = \delta \left( \frac{1}{1 - \delta} - 1 \right) = \frac{\delta^2}{1 - \delta}.$$
Using this inequality with (*) gives the result.
Proof Lemma 3.
We have $\nabla f_{t'}(x) = \frac{c}{t'} + \nabla\phi(x) = \left( \frac{c}{t} + \nabla\phi(x) \right) + \left( \frac{1}{t'} - \frac{1}{t} \right) c$, while $\nabla^2 f_{t'}(x) = \nabla^2\phi(x) = \nabla^2 f_t(x)$. Taking the norm $\|\cdot\|_{\left( \nabla^2\phi(x) \right)^{-1}}$, using the triangle inequality and using that $\frac{1}{t'} - \frac{1}{t} = \frac{1}{t} \cdot \frac{h}{1-h}$ and $\|c\|_{\left( \nabla^2\phi(x) \right)^{-1}} \le t \left( \delta_t(x) + \left\| \nabla\phi(x) \right\|_{\left( \nabla^2\phi(x) \right)^{-1}} \right)$ (write $c = t \left( \frac{c}{t} + \nabla\phi(x) \right) - t\, \nabla\phi(x)$), we get
$$\delta_{t'}(x) \;\le\; \delta_t(x) + \frac{h}{1-h} \left( \delta_t(x) + \left\| \nabla\phi(x) \right\|_{\left( \nabla^2\phi(x) \right)^{-1}} \right).$$
Proof Lemma 4.
We state a first lemma that will help us in the proof.
Lemma: Let $\phi$ be a $\nu$-self-concordant barrier, and let $K$ be the polytope defined by the constraints of the LP. Then, for any $x, y \in K$,
$$\nabla\phi(x)^\top (y - x) \;\le\; \nu.$$
Proof. Let $x_s = x + s(y - x)$ where $s \in [0, 1]$, and define $g(s) = \nabla\phi(x_s)^\top (y - x)$, so that $g(0) = \nabla\phi(x)^\top (y - x)$ is the quantity we want to bound. Then
$$g'(s) = (y - x)^\top \nabla^2\phi(x_s)\, (y - x) \;\ge\; 0.$$
We also have that, by Cauchy-Schwarz and the second property of Definition 4,
$$g(s)^2 = \left( \nabla\phi(x_s)^\top (y - x) \right)^2 \;\le\; \left\| \nabla\phi(x_s) \right\|^2_{\left( \nabla^2\phi(x_s) \right)^{-1}} \left\| y - x \right\|^2_{\nabla^2\phi(x_s)} \;\le\; \nu\, g'(s).$$
Combining the two, we get $g'(s) \ge \frac{g(s)^2}{\nu}$.
- If $g(0) \le 0$, we are done because $g(0) = \nabla\phi(x)^\top (y - x) \le 0 \le \nu$ and our inequality is satisfied.
- If $g(0) > 0$, then $g$ is increasing in $s$, because $\phi$ is convex (and look at the monotonicity of $g$ given the sign of $g'$). Thus $g(s) \ge g(0) > 0$ for all $s$.
Furthermore, integrating the inequality $\frac{g'(s)}{g(s)^2} \ge \frac{1}{\nu}$, we have that $\frac{1}{g(0)} - \frac{1}{g(s)} \ge \frac{s}{\nu}$. Thus, $\frac{1}{g(0)} \ge \frac{s}{\nu}$ and $g(0) \le \frac{\nu}{s}$ for every $s \in (0, 1)$; letting $s \to 1$ gives the target inequality.
Now we can use this lemma to prove Lemma 4.
At the optimum of $f_t$ we have $\nabla f_t(x_t^*) = \frac{c}{t} + \nabla\phi(x_t^*) = 0$, i.e. $\nabla\phi(x_t^*) = -\frac{c}{t}$. Therefore we have
$$c^\top x_t^* - c^\top x^* = -t\, \nabla\phi(x_t^*)^\top (x_t^* - x^*) = t\, \nabla\phi(x_t^*)^\top (x^* - x_t^*) \;\le\; t\,\nu,$$
using the lemma above with $x = x_t^*$ and $y = x^*$, which completes the proof.