An introduction to first-order methods for linear programming, focusing on the Primal-Dual Hybrid Gradient (PDHG), the algorithm behind GPU-accelerated LP solvers, and its practical enhancements.

If you’ve been following the optimization community lately, you might have noticed something interesting happening: GPUs are finally coming for linear programming.
For decades, linear programming (LP) solvers have been dominated by two workhorses: the simplex method and interior-point methods (IPMs). These methods are mature, reliable, and backed by decades of theoretical and practical refinement. But they share a weakness: as problems grow to massive scale, their reliance on matrix factorizations and sequential operations becomes a bottleneck, especially on modern parallel hardware like GPUs.
In the past couple of years, we’ve seen an explosion of interest in GPU-accelerated LP solvers:
NVIDIA cuOpt (2024): NVIDIA released and open-sourced cuOpt, a GPU-accelerated optimization engine that excels in mixed integer linear programming (MILP), linear programming (LP), and vehicle routing problems (VRP). Their developer blog demonstrates substantial speedups on large-scale problems.
Gurobi Integration: The industry-leading commercial solver Gurobi adopted GPU acceleration based on similar first-order methods, recognizing the potential of this approach.
Open-Source Ecosystem: A vibrant ecosystem of open-source solvers emerged, including cuPDLPx and cuPDLP.jl, Google’s PDLP, and MPAX.
But here’s the interesting part: all of these solvers are built on first-order methods (FOMs), specifically variants of the Primal-Dual Hybrid Gradient (PDHG) algorithm. This represents a different philosophy from traditional LP solving, and it’s worth understanding why this approach is gaining traction.
The key insight is simple: GPUs excel at massively parallel operations, particularly matrix-vector multiplications. Modern GPUs can perform thousands of such operations simultaneously with enormous throughput. Traditional LP methods (simplex and IPMs) rely heavily on solving linear systems via matrix factorizations—operations that are inherently sequential and difficult to parallelize effectively.
First-order methods, on the other hand, are built around simple iterative updates involving matrix-vector products: $Ax$, $A^Ty$. No factorizations. No solving linear systems. Just multiply and update. This computational profile aligns beautifully with GPU architectures.
The tradeoff? First-order methods typically require more iterations to converge than their second-order counterparts. But when each iteration can be dramatically faster on a GPU, this iteration count penalty often becomes irrelevant—especially for massive problems where factorization costs grow prohibitively.
Let’s understand how we arrived at PDHG by starting from scratch. We’ll attempt to build a first-order method for LP, see where each approach fails, and understand how PDHG elegantly resolves these challenges.
Consider the standard form LP:
\[\begin{align} \min_{x \in \mathbb{R}^n} \quad & c^T x \\ \text{s.t.} \quad & Ax = b \\ & x \geq 0 \end{align}\]where \(A \in \mathbb{R}^{m \times n}\), \(b \in \mathbb{R}^m\), and \(c \in \mathbb{R}^n\). The dual problem is:
\[\begin{align} \max_{y \in \mathbb{R}^m} \quad & b^T y \\ \text{s.t.} \quad & A^T y \leq c \end{align}\]Our goal is to find a first-order method that:
- uses only cheap, parallel operations (matrix-vector products and vector arithmetic),
- avoids matrix factorizations and linear system solves, and
- provably converges to an optimal primal-dual pair.
Why start here? For unconstrained optimization, gradient descent is the canonical first-order method. When we have constraints, the natural extension is projected gradient descent: take a gradient step, then project back onto the feasible set.
The algorithm would be:
\[x^{k+1} = \text{proj}_{\mathcal{C}}(x^k - \eta c)\]where \(\mathcal{C} = \{x \in \mathbb{R}^n : Ax = b, x \geq 0\}\) is our feasible set and \(\eta > 0\) is the step-size.
What is projection? The projection of a point \(z\) onto a set \(\mathcal{C}\) is the closest point in \(\mathcal{C}\) to \(z\):
\[\text{proj}_{\mathcal{C}}(z) = \arg\min_{x \in \mathcal{C}} \|x - z\|^2\]Geometrically, it’s like “snapping” \(z\) to the nearest feasible point.
How is it computed? This depends on the set \(\mathcal{C}\):
- For the non-negative orthant \(\mathbb{R}^n_+\): \(\text{proj}(z) = \max(z, 0)\), computed componentwise.
- For a box \([l, u]\): clamp each coordinate to its bounds.
- For our feasible set \(\mathcal{C} = \{x : Ax = b, x \geq 0\}\): we must solve \(\min_x \|x - z\|^2\) subject to \(Ax = b\) and \(x \geq 0\).
This last problem is itself a quadratic program with equality and inequality constraints!
The Problem: Solving this QP requires either (a) solving a linear system involving \(A\) (and that only handles the equality constraints), or (b) running an iterative QP solver to handle the inequality \(x \geq 0\) as well.
Both options involve solving linear systems or running an optimization algorithm inside our optimization algorithm. For large-scale LP, computing the projection can be as expensive as solving the original problem! This defeats our goal of having a simple, GPU-friendly method.
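To make this concrete, here is a minimal numpy sketch (illustrative, not from any solver) contrasting the two kinds of projection. Even projecting onto the affine set \(\{x : Ax = b\}\) alone, which has the closed form \(z - A^T(AA^T)^{-1}(Az - b)\) when \(A\) has full row rank, already requires a linear system solve; adding \(x \geq 0\) turns it into a full QP.

import numpy as np

def proj_nonneg(z):
    # projection onto {x : x >= 0}: one componentwise max, cheap and parallel
    return np.maximum(z, 0.0)

def proj_affine(z, A, b):
    # projection onto {x : Ax = b} alone (A assumed full row rank):
    # already requires solving a linear system with A A^T
    lam = np.linalg.solve(A @ A.T, A @ z - b)
    return z - A.T @ lam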
Key Insight: We need to somehow handle the constraints separately—keeping the “easy” constraints (like \(x \geq 0\)) and dealing with the “hard” constraints (like \(Ax = b\)) differently.
Why move to a primal-dual approach? The key observation is that our constraints fall into two categories:
- Simple constraints (\(x \geq 0\)): projection is a componentwise max (cheap and perfectly parallel).
- Complex constraints (\(Ax = b\)): projection requires solving a linear system (expensive, exactly what we want to avoid).
What if we could keep the simple constraints explicit and handle the complex ones differently?
The solution: Use Lagrangian duality. Instead of projecting onto \(Ax = b\), we dualize this constraint by introducing dual variables \(y\) and work with the Lagrangian:
\[\mathcal{L}(x, y) = c^T x + y^T(b - Ax)\]Notice that:
- if \(Ax \neq b\), maximizing over \(y\) drives \(\mathcal{L}\) to \(+\infty\), so the inner maximization enforces the equality constraint automatically;
- the simple constraint \(x \geq 0\) stays explicit, where projection is cheap.
This gives us the primal-dual formulation:
\[\min_{x \geq 0} \max_{y} \mathcal{L}(x, y) = \min_{x \geq 0} \max_{y} \left[ c^T x - y^T Ax + b^T y \right]\]More compactly, with the indicator function \(\iota_{\mathbb{R}^n_+}(x)\) (which is 0 if \(x \geq 0\), infinity otherwise):
\[\min_{x} \max_{y} L(x, y) := c^T x + \iota_{\mathbb{R}^n_+}(x) - y^T A x + b^T y\]Why is this better?
- The only remaining constraint sets are simple: the non-negative orthant for \(x\), and all of \(\mathbb{R}^m\) for \(y\).
- Projections onto these sets are trivial: a componentwise max, or nothing at all.
- The coupling term \(y^T A x\) only ever requires the matrix-vector products \(Ax\) and \(A^T y\).
Gradient Descent-Ascent: For saddle point problems, the natural first-order method is gradient descent in \(x\) (minimization) and gradient ascent in \(y\) (maximization):
\[\begin{align} x^{k+1} &= \text{proj}_{\mathbb{R}^n_+}(x^k - \eta \nabla_x L(x^k, y^k)) \\ &= \text{proj}_{\mathbb{R}^n_+}(x^k - \eta(c - A^T y^k)) \\ &= \max(x^k + \eta A^T y^k - \eta c, 0) \\[1em] y^{k+1} &= y^k + \sigma \nabla_y L(x^k, y^k) \\ &= y^k + \sigma(b - A x^k) \\ &= y^k - \sigma A x^k + \sigma b \end{align}\]where \(\eta, \sigma > 0\) are primal and dual step-sizes. The projection is now just a max operation—exactly what we wanted!
Does it work? Let’s test on a simple example.
Consider the simple bilinear saddle point problem:
\[\min_{x \geq 0} \max_y (x - 3)y\]The unique saddle point is at \((x^*, y^*) = (3, 0)\). Starting from \((x^0, y^0) = (2, 2)\) with step-sizes \(\eta = \sigma = 0.2\), GDA gives:
# Simple GDA example showing divergence
def gda_update(x, y, eta=0.2, sigma=0.2):
    x_new = max(0, x - eta * y)      # gradient of L w.r.t. x is y
    y_new = y + sigma * (x - 3)      # gradient of L w.r.t. y is (x - 3)
    return x_new, y_new

# Run GDA from (2, 2)
x, y = 2.0, 2.0
for iteration in range(200):
    x, y = gda_update(x, y)
    if x > 10 or abs(y) > 10:  # Diverging!
        print(f"Diverged at iteration {iteration}")
        break
# Output: the divergence check eventually fires; the oscillation
# radius grows on every step
The iterates spiral outward rather than converging to \((3, 0)\). Plotting the trajectory shows ever-widening circles.
Why does GDA fail? The problem is that GDA updates each variable based on the gradient at the current point. For saddle point problems, this creates a “chasing” behavior where \(x\) and \(y\) oscillate and amplify each other’s errors rather than correcting them. Mathematically, the update matrix has eigenvalues outside the unit circle, causing divergence.
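We can check this concretely for the toy problem. Away from the boundary \(x = 0\), the GDA update is linear in the shifted coordinates \((u, y)\) with \(u = x - 3\), so stability is decided by the eigenvalues of a fixed \(2 \times 2\) matrix:

import numpy as np

eta = sigma = 0.2
# GDA on min_{x>=0} max_y (x-3)y, in coordinates u = x - 3:
#   u_new = u - eta*y,   y_new = y + sigma*u
M = np.array([[1.0,  -eta],
              [sigma, 1.0]])
print(np.abs(np.linalg.eigvals(M)))
# both moduli = sqrt(1 + eta*sigma) ~ 1.0198 > 1, so GDA spirals outward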
This is a fundamental limitation: GDA does not converge for bilinear saddle point problems (which includes all LPs!). We need something better.
Why does GDA fail, and what fixes it? GDA uses gradient information from the current iterate, which creates the instability we just saw. What if we could use gradient information from the next iterate instead?
This is the idea behind the Proximal Point Method. Instead of the explicit update: \(x^{k+1} = x^k - \eta \nabla f(x^k)\)
PPM solves an implicit equation: \(x^{k+1} = x^k - \eta \nabla f(x^{k+1})\)
For our saddle point problem, this becomes:
\[(x^{k+1}, y^{k+1}) = \arg\min_{x \geq 0} \max_y \left[ L(x, y) + \frac{1}{2\eta}\|x - x^k\|^2 - \frac{1}{2\sigma}\|y - y^k\|^2 \right]\]The quadratic penalty terms \(\frac{1}{2\eta}\|x - x^k\|^2\) and \(-\frac{1}{2\sigma}\|y - y^k\|^2\) act as regularizers, preventing the iterates from moving too far in one step.
Why this works: The implicit gradient creates a stabilizing effect. Geometrically, we’re not just following the gradient direction blindly—we’re finding a point that balances between:
- making progress on the saddle-point objective \(L(x, y)\), and
- staying close to the current iterate (the quadratic proximal terms).
This regularization fixes GDA’s divergence problem. In fact, on the same toy problem where GDA diverged, PPM would converge smoothly to the saddle point!
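For our two-variable toy problem we can actually afford the implicit update: the first-order conditions of the regularized saddle problem form a \(2 \times 2\) linear system with a closed-form solution (exactly the luxury we won’t have at scale). A quick sketch, skipping the \(x \geq 0\) clamp since the iterates stay positive here:

def ppm_update(x, y, eta=0.2, sigma=0.2):
    # Implicit optimality conditions of the proximal subproblem:
    #   x_new = x - eta * y_new,   y_new = y + sigma * (x_new - 3)
    # Substituting one into the other gives a closed form:
    x_new = (x - eta * y + 3 * eta * sigma) / (1 + eta * sigma)
    y_new = y + sigma * (x_new - 3)
    return x_new, y_new

x, y = 2.0, 2.0
for _ in range(200):
    x, y = ppm_update(x, y)
print(x, y)  # approaches (3.0, 0.0); each step contracts by 1/sqrt(1 + eta*sigma)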
The catch: This is an implicit update. To compute \((x^{k+1}, y^{k+1})\), we need to solve the saddle point problem shown above at every iteration. This subproblem is nearly as expensive as our original LP for large-scale problems—completely impractical.
The dilemma:
- GDA: explicit and cheap per iteration, but diverges.
- PPM: provably convergent, but each iteration is implicit and expensive.
Can we get the best of both worlds?
Yes! PDHG (also known as the Chambolle-Pock algorithm, after its originators) ingeniously combines explicit computation with convergence guarantees.
The key insight: What if we could approximate the implicit PPM update without actually solving it? PDHG achieves this through an extrapolation trick:
\[\begin{align} x^{k+1} &\leftarrow \text{proj}_{\mathbb{R}^n_+}(x^k + \eta A^T y^k - \eta c) \quad \text{(primal update, like GDA)} \\ y^{k+1} &\leftarrow y^k - \sigma A(\textcolor{red}{2x^{k+1} - x^k}) + \sigma b \quad \text{(dual update with extrapolation!)} \end{align}\]The primal update is identical to GDA. The magic is in the dual update: instead of using \(x^k\) or \(x^{k+1}\), we use \(\textcolor{red}{2x^{k+1} - x^k}\).
What is this extrapolated point?
Let’s break down \(2x^{k+1} - x^k\): \(2x^{k+1} - x^k = x^{k+1} + \underbrace{(x^{k+1} - x^k)}_{\text{displacement}}\)
This is the current iterate plus the displacement vector. Geometrically, it’s extrapolating along the direction of movement, predicting where the primal variable is “heading.”
Why does the extrapolation work?
Think about what we’re trying to fix from GDA. In GDA, the dual update at iteration \(k\) uses \(x^k\), which is “stale”—by the time we compute \(y^{k+1}\), we already have the newer \(x^{k+1}\). Using \(x^{k+1}\) directly would be better, but still not ideal.
The extrapolation \(2x^{k+1} - x^k\) goes further: it estimates what \(x^{k+2}\) might be. This implicitly approximates the “next iterate” idea from PPM, but without solving the implicit problem!
Multiple theoretical perspectives:
Operator theory: PDHG can be proven equivalent to a preconditioned proximal point method, inheriting PPM’s convergence guarantees
Predictor-corrector: The primal step predicts the next \(x\); the extrapolation corrects for this prediction in the dual update
Momentum interpretation: Rewriting as \(x^{k+1} + (x^{k+1} - x^k)\) shows it adds momentum to the iterate
Let’s verify this actually works on our toy problem:
def pdhg_update(x, y, eta=0.2, sigma=0.2):
    x_new = max(0, x - eta * y)
    y_new = y + sigma * (2*x_new - x - 3)  # Extrapolation!
    return x_new, y_new

# Run PDHG from the same starting point (2, 2) where GDA diverged
x, y = 2.0, 2.0
for iteration in range(100):
    x, y = pdhg_update(x, y)
    if iteration % 20 == 0:
        print(f"Iteration {iteration}: x = {x:.4f}, y = {y:.4f}")
# Result: converges to (3.0, 0.0)!
It works! With PDHG, the iterates spiral inward toward the saddle point \((3, 0)\). The extrapolation provides just enough stabilization to ensure convergence, giving us:
- explicit, cheap updates, like GDA, and
- guaranteed convergence, like PPM.
This is exactly what we needed!
Now we see why PDHG is perfect for GPU-accelerated LP solving. Each PDHG iteration requires only:
- one matrix-vector product with \(A\) and one with \(A^T\),
- a handful of vector additions and scalings, and
- a componentwise projection (a max with zero).
That’s it! No linear system solves. No matrix factorizations. No iterative subsolvers. Everything is:
- simple: a few vectorized operations per iteration,
- predictable: every iteration does the same work, and
- massively parallel.
This computational profile is perfectly suited for GPUs, which excel at:
- sparse and dense matrix-vector multiplication,
- elementwise vector operations, and
- high-throughput reductions (dot products, norms).
Summary so far: GDA is cheap but diverges; PPM converges but each step is impractically implicit; PDHG keeps GDA’s explicit, matvec-only updates while the extrapolation step recovers PPM-style stability.
One fascinating observed property of PDHG when applied to LP is its distinctive spiral ray behavior. When visualizing the trajectory of iterates in primal-dual space, you often see a spiral pattern.
Within phases where the active set of variables (which variables are at their bounds) remains constant, the behavior can be understood through a dynamical systems lens. The iterates exhibit two simultaneous behaviors:
- rotation around a center, which produces the spiral, and
- steady drift along a fixed direction, which produces the ray.
When the active set changes—when a variable hits its bound or leaves it—the algorithm transitions to a new spiral pattern with a different center and ray direction. This phase-by-phase behavior is characteristic of first-order primal-dual algorithms on LP and differs fundamentally from the edge-following behavior of simplex methods or the central-path-following behavior of interior-point methods.
While vanilla PDHG is elegant, it’s not ready for prime time. To build a practical, general-purpose LP solver, we need several enhancements. This is where PDLP (Primal-Dual hybrid gradient for Linear Programming) comes in.
PDLP works with a more general LP formulation than the standard form:
\[\begin{align} \min_x \quad & c^T x \\ \text{s.t.} \quad & Ax = b \\ & Gx \geq h \\ & l \leq x \leq u \end{align}\]where \(A \in \mathbb{R}^{m_1 \times n}\), \(G \in \mathbb{R}^{m_2 \times n}\), and \(l, u\) can include \(-\infty\) and \(\infty\) for unbounded variables.
This formulation avoids introducing auxiliary variables (which would increase problem size and potentially degrade convergence).
PDLP solves the primal-dual formulation:
\[\min_{x \in X} \max_{y \in Y} L(x, y) := c^T x - y^T K x + q^T y\]where:
- \(K^T = (A^T, G^T)\) stacks the constraint matrices and \(q^T = (b^T, h^T)\),
- \(X = \{x : l \leq x \leq u\}\) is a box, and
- \(Y = \mathbb{R}^{m_1} \times \mathbb{R}^{m_2}_+\): duals for the equalities are free, duals for the inequalities are non-negative.
The PDHG update becomes:
\[\begin{align} x^{k+1} &\leftarrow \text{proj}_X(x^k - \tau(c - K^T y^k)) \\ y^{k+1} &\leftarrow \text{proj}_Y(y^k + \sigma(q - K(2x^{k+1} - x^k))) \end{align}\]The step-sizes are parameterized as \(\tau = \eta/\omega\) and \(\sigma = \eta\omega\), where:
- \(\eta > 0\) sets the overall step-size scale, and
- \(\omega > 0\) is the primal weight, which balances progress between the primal and dual variables.
Both projections \(\text{proj}_X\) and \(\text{proj}_Y\) are trivial: just clamping to box constraints!
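As a minimal numpy sketch (the function names here are illustrative, not PDLP’s API):

import numpy as np

def proj_X(x, l, u):
    # box projection: clamp each coordinate to [l_i, u_i]
    # (entries of l and u may be -inf or +inf for free variables)
    return np.clip(x, l, u)

def proj_Y(y, m1):
    # duals for the m1 equality constraints (Ax = b) are free;
    # duals for the inequality constraints (Gx >= h) must be non-negative
    y = y.copy()
    y[m1:] = np.maximum(y[m1:], 0.0)
    return y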
Vanilla PDHG, while theoretically sound, isn’t immediately competitive with industrial solvers in practice. Empirical studies have shown that the base algorithm solves significantly fewer problem instances to desired accuracy levels compared to enhanced versions.
With practical enhancements, modern implementations of PDHG-based solvers (like PDLP and its variants) achieve dramatically better performance—often solving several times more instances to both moderate and high accuracy. Let’s understand what makes the difference.
First-order methods are notoriously sensitive to problem conditioning—the ratio between the largest and smallest singular values of the constraint matrix. Poorly conditioned problems can converge very slowly.
Diagonal preconditioning rescales the problem to improve conditioning:
\[\tilde{K} = D_1 K D_2\]where \(D_1\) and \(D_2\) are positive diagonal matrices chosen to make \(\tilde{K}\) “well-balanced” (having rows and columns with similar norms).
Several preconditioning strategies exist:
- Ruiz rescaling: iteratively scales rows and columns by the inverse square roots of their norms until they are approximately equilibrated.
- Pock-Chambolle scaling: a one-shot diagonal scaling derived from the row and column norms of \(K\).
- Combinations: PDLP, for example, applies several Ruiz iterations followed by a Pock-Chambolle rescaling.
This preprocessing step is computationally cheap (involving only row/column norm computations) but can dramatically improve convergence, especially for ill-conditioned problems.
# Simplified Ruiz rescaling (illustrative sketch; dense numpy for clarity)
import numpy as np

def ruiz_rescaling(K, num_iterations=10, eps=1e-10):
    K = K.copy()
    D1 = np.ones(K.shape[0])
    D2 = np.ones(K.shape[1])
    for _ in range(num_iterations):
        row_norms = np.sqrt(np.sum(K**2, axis=1))  # 2-norms (classic Ruiz uses inf-norms)
        col_norms = np.sqrt(np.sum(K**2, axis=0))
        # incremental factors: inverse square roots of the current norms
        r = 1.0 / np.sqrt(row_norms + eps)
        c = 1.0 / np.sqrt(col_norms + eps)
        K = r[:, None] * K * c[None, :]  # apply only the new factors
        D1 *= r
        D2 *= c
    return K, D1, D2
After solving the preconditioned problem, the original solution is recovered by applying the inverse scaling.
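Concretely, a sketch of the bookkeeping (continuing from the `ruiz_rescaling` sketch above, with `x_tilde`, `y_tilde` denoting the solution of the rescaled problem):

K_scaled, D1, D2 = ruiz_rescaling(K)
c_scaled = D2 * c   # objective of the rescaled problem
q_scaled = D1 * q   # right-hand sides of the rescaled problem
# ... solve the rescaled LP for (x_tilde, y_tilde) ...
x = D2 * x_tilde    # recover the original primal solution
y = D1 * y_tilde    # recover the original dual solution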
One of the most powerful enhancements is adaptive restarting. The core idea: periodically check progress, and when sufficient improvement is detected, restart the algorithm from the current iterate to begin a fresh “epoch.”
Why does this help? Many LPs exhibit two-stage convergence:
- a slow initial phase, while the iterates identify the right active set, followed by
- a fast linear convergence phase, once the iterates settle into a stable region.
Restarting exploits this by detecting when we’ve transitioned to the fast linear convergence regime and “resetting” to capitalize on it.
The restart condition typically monitors a progress metric—such as the fixed-point residual (how much the iterate moves in one PDHG step). When this metric decays by a sufficient factor (e.g., by a factor of $1/e$), the algorithm restarts:
\[\|z^{n,k} - \text{PDHG}(z^{n,k})\|_P \leq \frac{1}{e} \|z^{n,0} - \text{PDHG}(z^{n,0})\|_P\]where \(P\) is an appropriate norm for measuring PDHG progress, and \(n\) indexes restart epochs.
This adaptive scheme doesn’t require knowing problem-dependent constants (like sharpness parameters) that are difficult to estimate in practice. When combined with proper step-size selection, restarting achieves optimal complexity for first-order methods:
\[O\left(\frac{\|A\|}{\alpha} \log \frac{1}{\epsilon}\right)\]where \(\alpha\) represents a problem-dependent sharpness parameter. This complexity matches theoretical lower bounds for first-order methods on LP, making it essentially unimprovable from a worst-case perspective.
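A sketch of the loop structure (here `pdhg_step` is a stand-in for one PDHG iteration; PDLP-style implementations restart from the epoch’s average iterate):

import numpy as np

def restarted_pdhg(z0, pdhg_step, num_epochs=50, max_inner=1000):
    z = z0.copy()
    for _ in range(num_epochs):
        z_avg = z.copy()                          # running average for this epoch
        res0 = np.linalg.norm(z - pdhg_step(z))   # fixed-point residual at epoch start
        for k in range(1, max_inner + 1):
            z = pdhg_step(z)
            z_avg += (z - z_avg) / (k + 1)        # incremental average of iterates
            # restart once the residual has decayed by a factor 1/e
            # (a real solver would reuse the step instead of recomputing it)
            if np.linalg.norm(z - pdhg_step(z)) <= res0 / np.e:
                break
        z = z_avg                                 # restart from the average iterate
    return z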
Recent algorithmic variants incorporate Halpern-type acceleration, a classical technique from fixed-point theory. The Halpern scheme modifies the basic PDHG iteration by maintaining an “anchor” to the initial solution:
\[z^{k+1} = \frac{k+1}{k+2} \text{PDHG}(z^k) + \frac{1}{k+2} z^0\]This weighted average between the PDHG step and the initial point has several advantages:
- it yields optimal \(O(1/k)\) rates on the fixed-point residual for the last iterate, not just the ergodic average, and
- it composes naturally with restarts, since each restart simply resets the anchor \(z^0\).
Reflected Halpern PDHG enhances this further by applying the Halpern iteration to a reflected operator:
\[z^{k+1} = \frac{k+1}{k+2} (2\text{PDHG}(z^k) - z^k) + \frac{1}{k+2} z^0\]The reflection \(2\text{PDHG}(·) - I\) can be interpreted as taking a more aggressive step, which empirically and theoretically leads to faster convergence. The reflection technique has deep connections to operator theory and has been successfully applied to various optimization algorithms.
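Both variants fit in a few lines (a sketch, again with a generic `pdhg_step` operator):

def halpern_pdhg(z0, pdhg_step, num_iters=1000, reflected=False):
    z = z0.copy()
    for k in range(num_iters):
        t = pdhg_step(z)
        step = 2 * t - z if reflected else t   # reflected operator: 2*PDHG(.) - I
        # anchor: weighted average with the initial point z0
        z = (k + 1) / (k + 2) * step + 1 / (k + 2) * z0
    return z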
The primal weight \(\omega\) (recall \(\tau = \eta/\omega\) and \(\sigma = \eta\omega\)) balances the progress between primal and dual variables. If \(\omega\) is too large, the dual makes slow progress; if too small, the primal lags behind.
Modern implementations adaptively adjust \(\omega\) to balance primal and dual progress. The heuristic goal is to approximately equalize distances to optimality in both spaces:
\[\text{primal distance} \approx \text{dual distance}\]This adjustment typically happens at restart events. To avoid oscillations, the updates use exponential smoothing:
\[\omega_{\text{new}} = (1-\theta) \omega_{\text{old}} + \theta \omega_{\text{suggested}}\]where \(\theta \in [0,1]\) controls the smoothing rate and \(\omega_{\text{suggested}}\) is computed based on observed primal/dual progress.
While adaptive primal weight adjustment improves robustness across diverse problem instances, it’s not perfect—per-instance manual tuning can sometimes yield better performance. Determining the ideal primal weight remains an active research direction.
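A sketch of such an update, using the ratio of dual to primal movement since the last restart as \(\omega_{\text{suggested}}\) (one common heuristic; PDLP itself applies the smoothing in log-space):

def update_primal_weight(omega, delta_x, delta_y, theta=0.5, eps=1e-12):
    # delta_x, delta_y: distances moved in primal / dual space
    # since the last restart; aim to balance the two going forward
    omega_suggested = delta_y / max(delta_x, eps)
    # exponential smoothing to avoid oscillations
    return (1 - theta) * omega + theta * omega_suggested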
A practical solver must be able to detect when an LP is infeasible or unbounded. First-order methods naturally produce sequences that can encode infeasibility information.
Modern PDLP implementations monitor quantities related to the iterate differences and normalized cumulative iterates. If these quantities converge to non-zero limits with certain properties, they can provide:
The key insight is that the algorithm’s natural trajectory encodes this information—no special infeasibility detection procedure is needed. Recent algorithmic variants (particularly those using Halpern-type iterations with restarts) have been shown to detect infeasibility with strong theoretical guarantees.
The performance gains from GPU acceleration can be substantial, particularly as problem scale increases.
GPU implementations of PDHG-based solvers show significant advantages over CPU implementations, and the general trend is that the gap widens with scale: on small instances, kernel-launch and data-transfer overheads can dominate, while on large instances the GPU’s raw matrix-vector throughput takes over.
Multi-threaded CPU implementations can partially close the gap, but typically still lag behind GPU performance, especially on large-scale problems. The GPU’s massive parallelism in matrix-vector operations provides a fundamental architectural advantage that’s difficult to overcome with CPU threading alone.
Comparing GPU-accelerated first-order methods to traditional commercial solvers reveals an interesting performance profile: simplex and interior-point methods remain hard to beat on small-to-medium instances and at very high accuracy, while first-order methods pull ahead as problem size grows. The crossover point typically occurs when:
- the instance is large enough that factorizations become prohibitively expensive in time or memory, and
- moderate accuracy is acceptable for the application.
The promising results have driven rapid industry adoption: NVIDIA’s cuOpt, Gurobi’s GPU-accelerated solver, and the open-source projects mentioned earlier (cuPDLPx, cuPDLP.jl, Google’s PDLP, MPAX) are all built around PDHG-style first-order methods.
Now that we’ve seen PDHG works in practice, let’s understand why it works. The theory is elegant and provides intuition for the algorithmic choices.
To analyze convergence, we need to measure progress. For LP, there are several natural metrics:
1. KKT Error: Measures violation of optimality conditions:
\[\text{KKT}(z) = \max\{\|c - A^T y - s\|_\infty, \|Ax - b\|_\infty, |x^T s|, \|x^-\|_\infty, \|s^-\|_\infty\}\]where \(x^- = \min(x, 0)\) (componentwise negative part) and \(s\) denotes the dual slack variables.
2. Normalized Duality Gap: Scale-invariant measure of optimality:
\[\rho_r(z) = \frac{c^T x - b^T y}{r + |c^T x| + |b^T y|}\]3. Fixed-Point Residual: Movement after one PDHG iteration:
\[\|z^k - \text{PDHG}(z^k)\|_P\]where \(P\) is the natural metric for PDHG.
These metrics are related: if any goes to zero, so do the others (with explicit rate conversions).
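For reference, here is a sketch of the KKT error for the standard-form LP. Defining the dual slack as \(s = c - A^T y\) makes the dual-residual term vanish identically, leaving the other four:

import numpy as np

def kkt_error(x, y, A, b, c):
    s = c - A.T @ y  # dual slack; makes ||c - A^T y - s|| identically zero
    return max(
        np.linalg.norm(A @ x - b, np.inf),         # primal feasibility
        abs(x @ s),                                # complementarity
        np.linalg.norm(np.minimum(x, 0), np.inf),  # x >= 0 violation
        np.linalg.norm(np.minimum(s, 0), np.inf),  # s >= 0 violation
    )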
For vanilla PDHG with equal step-sizes \(\eta = \sigma < 1/\|A\|\) (so that \(\eta\sigma\|A\|^2 < 1\)), we have:
Theorem (Ergodic/Average Iterate): The average iterate \(\bar{z}^k = \frac{1}{k}\sum_{i=1}^k z^i\) satisfies:
\[\text{KKT}(\bar{z}^k) = O\left(\frac{1}{k}\right)\]This \(O(1/k)\) rate is standard for first-order methods on convex problems.
Intuition: The average smooths out oscillations in individual iterates. Think of it as a low-pass filter that removes high-frequency noise while preserving the signal (convergence to optimality).
The game-changer is linear convergence. Under a mild condition called “sharpness,” PDHG achieves exponential convergence.
Definition (Sharpness): An LP is \(\alpha\)-sharp if for all primal-dual pairs \(z\) near the optimal set \(Z^*\):
\[\text{dist}(z, Z^*) \leq \frac{1}{\alpha} \text{KKT}(z)\]Intuitively, sharpness means the KKT error tightly bounds the distance to optimality. In fact, every LP with a nonempty optimal set is \(\alpha\)-sharp for some \(\alpha > 0\) (a consequence of Hoffman’s error bound for polyhedral systems), though \(\alpha\) can be extremely small for degenerate or nearly degenerate instances.
Theorem (Local Linear Convergence): For an \(\alpha\)-sharp LP, the last iterate \(z^k\) satisfies:
\[\text{dist}(z^{k+1}, Z^*) \leq \left(1 - \frac{\alpha^2 \eta^2}{2}\right) \text{dist}(z^k, Z^*)\]With optimal step-size \(\eta = \Theta(1/\|A\|)\), this gives convergence rate:
\[O\left(\frac{\|A\|^2}{\alpha^2} \log \frac{1}{\epsilon}\right)\]The \(\log(1/\epsilon)\) is exponentially better than \(1/ \epsilon\) for vanilla methods!
Restarting achieves the optimal complexity for first-order methods on LP:
\[O\left(\frac{\|A\|}{\alpha} \log \frac{1}{\epsilon}\right)\]This matches the complexity lower bound proven for FOMs on LP. The square-root improvement over non-restarted PDHG (from \(\|A\|^2/\alpha^2\) down to \(\|A\|/\alpha\)) is massive for poorly conditioned problems!
Why does restarting help? It exploits the two-stage behavior: the averaged iterates make steady progress through the slow initial phase, and each restart lets the algorithm enter the fast linear phase from the best point found so far.
Let’s implement a simple version of PDHG to see how straightforward it is:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

class SimplePDHG:
    """
    Simplified PDHG solver for LP in standard form:
        min c'x  s.t.  Ax = b, x >= 0
    """
    def __init__(self, c, A, b, eta_factor=0.9):
        self.c = c
        self.A = csr_matrix(A)  # Use sparse matrix
        self.b = b
        self.n = len(c)
        self.m = len(b)
        # Set step-size: eta < 1/||A|| (largest singular value)
        A_norm = svds(self.A, k=1, return_singular_vectors=False)[0]
        self.eta = eta_factor / A_norm
        self.sigma = self.eta  # Equal step-sizes
    def solve(self, max_iter=10000, tol=1e-6):
        # Initialize at zero
        x = np.zeros(self.n)
        y = np.zeros(self.m)
        # Track convergence
        gaps = []
        for k in range(max_iter):
            # Store old x for extrapolation
            x_old = x.copy()
            # Primal update: x^{k+1} = proj_{R+}(x^k + eta*A'y^k - eta*c)
            x = x + self.eta * (self.A.T @ y - self.c)
            x = np.maximum(x, 0)  # Project to non-negative orthant
            # Dual update with extrapolation:
            #   y^{k+1} = y^k - sigma*A(2x^{k+1} - x^k) + sigma*b
            x_extrap = 2*x - x_old
            y = y - self.sigma * (self.A @ x_extrap - self.b)
            # Compute duality gap every 10 iterations
            if k % 10 == 0:
                primal_obj = self.c @ x
                dual_obj = self.b @ y
                gap = abs(primal_obj - dual_obj) / (1 + abs(primal_obj))
                gaps.append(gap)
                if gap < tol:
                    print(f"Converged in {k} iterations")
                    break
        return x, y, gaps
# Example: Solve a small LP
n, m = 100, 50
np.random.seed(42)
# Generate random LP: min c'x s.t. Ax = b, x >= 0
A = np.random.randn(m, n)
c = np.random.randn(n)
x_true = np.abs(np.random.randn(n))  # Make feasible
b = A @ x_true

solver = SimplePDHG(c, A, b)
x_sol, y_sol, gaps = solver.solve()

# Plot convergence
import matplotlib.pyplot as plt
plt.semilogy(gaps)
plt.xlabel('Iteration (x10)')
plt.ylabel('Relative duality gap')
plt.title('PDHG Convergence')
plt.grid(True)
plt.show()
This implementation is about 50 lines of code! The real production solvers (cuPDLP, PDLP) add the enhancements we discussed: presolve and diagonal preconditioning, adaptive step-sizes and restarts, dynamic primal weight updates, infeasibility detection, and carefully tuned GPU kernels.
But the core algorithm is exactly what we’ve shown.
Use GPU-accelerated first-order LP solvers when:
- the problem is very large (millions of variables or constraints),
- moderate accuracy is acceptable,
- the constraint matrix is sparse, so matrix-vector products are cheap, and
- a GPU is available.
Use traditional solvers (simplex/IPM) when:
- problems are small to medium and factorizations are affordable,
- you need very high accuracy or a vertex (basic) solution, or
- you rely on warm starts and repeated re-solves of similar models, a traditional strength of simplex.
Current limitations of first-order LP solvers:
- sensitivity to problem conditioning,
- typically lower achievable accuracy than IPMs within the same time budget, and
- no basic (vertex) solutions, which some downstream workflows expect.
Active research directions:
- extending PDHG-style methods to quadratic, conic, and more general convex problems,
- sharper convergence theory and better adaptive parameter schemes, and
- tighter integration with MILP solvers as fast LP-relaxation engines.
For deeper understanding, I highly recommend:
Books:
Papers:
Note: This blog post draws inspiration from the general algorithmic ideas and landscape of first-order methods for LP as surveyed in the literature, particularly the third paper above, while presenting concepts in an accessible manner with original explanations.
Code:
The rise of GPU-accelerated first-order methods for linear programming represents a genuine paradigm shift in large-scale optimization. By embracing the strengths of modern parallel hardware and accepting that many simple iterations can beat fewer complex ones, methods like PDHG and PDLP are solving previously intractable problems at unprecedented speeds.
We’ve seen how PDHG emerges naturally from trying to build a practical first-order method for LP saddle point problems, how practical enhancements transform it from a toy algorithm to an industrial-grade solver, and how elegant theory explains its remarkable empirical performance.
Just as GPUs revolutionized deep learning by efficiently handling parallel computations, they’re now reshaping optimization. And we’re still in the early days—like interior-point methods in the 1990s, there’s enormous potential for further improvements in both theory and practice.
The future of large-scale LP solving is parallel, and it’s already here.
Questions? Thoughts? Found this helpful? Let me know! And if you’re working on large-scale LP problems, give cuPDLP or PDLP a try—you might be surprised by the performance.