🟠 2d Swift-Hohenberg equation (non-local) on the GPU

Here we give an example where the continuation can be done entirely on the GPU, e.g. on a single NVIDIA V100.

Why this example?

This is not the simplest GPU example because this problem requires a preconditioned linear solver and a shift-invert eigensolver. On the other hand, it shows how to set up your own linear / eigen solvers.

We choose the 2d Swift-Hohenberg equation as an example and consider a larger grid. See 2d Swift-Hohenberg equation for more details. Solving the sparse linear problem in $v$

\[-(I+\Delta)^2 v+(l +2\nu u-3u^2)v = rhs\]

with a direct solver becomes prohibitive. If we turn to an iterative method instead, the conditioning of the Jacobian is not good enough for fast convergence, mainly because of the Laplacian operator. However, the above problem is equivalent to:

\[-v + L \cdot (d \cdot v) = L\cdot rhs\]

where

\[L := ((I+\Delta)^2 + I)^{-1}\]

is very well conditioned and

\[d := l+1+2\nu u-3u^2.\]
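
Indeed, adding and subtracting $v$ in the original equation gives

\[-\big((I+\Delta)^2+I\big)v + d\cdot v = rhs,\]

and applying $L$ to both sides yields the preconditioned form above.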

Hence, to solve the preconditioned equation, only a few GMRES iterations are required.

In effect, the preconditioned PDE is an example of a nonlocal problem.

Linear Algebra on the GPU

We plan to use KrylovKit on the GPU. We define the following types so that it is easy to switch to Float32, for example:

using Revise, CUDA

# this disables slow (scalar) operations and throws an error if one is used
CUDA.allowscalar(false)

# type used for the arrays, can be Float32 if the GPU requires it
TY = Float64

# use AF = Array{TY} instead to run the code on the CPU
AF = CuArray{TY}
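
If you want the same script to run on machines without a usable GPU, a small variant of the assignment above (a sketch relying on CUDA.functional() from CUDA.jl) is to pick the array type automatically:

# hedged sketch: select GPU arrays only when a functional GPU is detected
AF = CUDA.functional() ? CuArray{TY} : Array{TY}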

Computing the inverse of the differential operator

The issue now is to compute $L$, but this is easy using Fourier transforms.

This is why we slightly modify the previous example by considering periodic boundary conditions. Let us now show how to compute $L$. Although the code looks quite technical, it is based on two facts. First, the Fourier symbol associated with the differential operator $(I+\Delta)^2+I$, i.e. with $L^{-1}$, is

\[l_1 = 1+(1-k_x^2-k_y^2)^2\]

which is pre-computed in the composite type SHLinearOp. In the code below, the variable L stores this symbol, so applying $(I+\Delta)^2+I$ to u is as simple as real.(ifft( l1 .* fft(u) )), while the well-conditioned operator $L$ is obtained with L\u, i.e. real.(ifft( fft(u) ./ l1 )). However, in order to save memory on the GPU, we use in-place FFTs to reduce temporaries, which explains the following code.

using AbstractFFTs, FFTW, KrylovKit, Setfield, Parameters
using BifurcationKit, LinearAlgebra, Plots
const BK = BifurcationKit

# the following struct encodes the linear operator with Fourier symbol l1
# Making the linear operator a subtype of BK.AbstractLinearSolver is handy as it will be used
# in the Newton iterations.
struct SHLinearOp{Treal, Tcomp, Tl1, Tplan, Tiplan} <: BK.AbstractLinearSolver
	tmp_real::Treal         # temporary
	tmp_complex::Tcomp      # temporary
	l1::Tl1
	fftplan::Tplan
	ifftplan::Tiplan
end

# this is a constructor for the above struct
function SHLinearOp(Nx, lx, Ny, ly; AF = Array{TY})
	# AF is a type, it could be CuArray{TY} to run the following on GPU
	k1 = vcat(collect(0:Nx/2), collect(Nx/2+1:Nx-1) .- Nx)
	k2 = vcat(collect(0:Ny/2), collect(Ny/2+1:Ny-1) .- Ny)
	d2 = [(1-(pi/lx * kx)^2 - (pi/ly * ky)^2)^2 + 1. for kx in k1, ky in k2]
	tmpc = Complex.(AF(zeros(Nx, Ny)))
	return SHLinearOp(AF(zeros(Nx, Ny)), tmpc, AF(d2), plan_fft!(tmpc), plan_ifft!(tmpc))
end

import Base: *, \

# generic function to apply operator op to u
function apply(c::SHLinearOp, u, multiplier, op = *)
	c.tmp_complex .= Complex.(u)
	c.fftplan * c.tmp_complex
	c.tmp_complex .= op.(c.tmp_complex, multiplier)
	c.ifftplan * c.tmp_complex
	c.tmp_real .= real.(c.tmp_complex)
	return copy(c.tmp_real)
end

# action of L
*(c::SHLinearOp, u) = apply(c, u, c.l1)

# inverse of L
\(c::SHLinearOp, u) = apply(c, u, c.l1, /)
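
As a quick sanity check (not part of the original example), applying the operator and then its inverse should return the input up to round-off, since both act diagonally in Fourier space. A minimal sketch, assuming the definitions above (the names Ltest and utest are ours):

Ltest = SHLinearOp(64, 8pi, 64, 2*2pi/sqrt(3); AF = AF)
utest = AF(rand(TY, 64, 64))
# this error should be close to round-off level for TY = Float64
maximum(abs.(Ltest \ (Ltest * utest) .- utest))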

Before applying the Newton solver, we need to specify how to solve the linear equations arising in the Newton algorithm. The solver below returns a triplet (solution, flag, number of iterations), as expected from a BK.AbstractLinearSolver.

# inverse of the (possibly shifted) jacobian of the PDE
function (sh::SHLinearOp)(J, rhs; shift = 0., tol = 1e-9)
	u, l, ν = J
	udiag = l .+ 1 .+ 2ν .* u .- 3 .* u.^2 .- shift
	# solve the preconditioned system -v + L⋅(udiag⋅v) = L⋅rhs with GMRES
	res, info = KrylovKit.linsolve(du -> -du .+ sh \ (udiag .* du), sh \ rhs,
		tol = tol, maxiter = 6)
	return res, true, info.numops
end

Now that we have our operator, we can encode our functional:

function F_shfft(u, p)
	@unpack l, ν, L = p
	return -(L * u) .+ ((l+1) .* u .+ ν .* u.^2 .- u.^3)
end

Let us now show how to build our operator L and an initial guess sol0 using the structures defined above.

using LinearAlgebra, Plots

# to simplify plotting of the solution
plotsol(x; k...) = heatmap(reshape(Array(x), Nx, Ny)'; color=:viridis, k...)
plotsol!(x; k...) = heatmap!(reshape(Array(x), Nx, Ny)'; color=:viridis, k...)
# norm compatible with CUDA
norminf(x) = maximum(abs.(x))

Nx = 2^10
Ny = 2^10
lx = 8pi * 2
ly = 2*2pi/sqrt(3) * 2

X = -lx .+ 2lx/(Nx) * collect(0:Nx-1)
Y = -ly .+ 2ly/(Ny) * collect(0:Ny-1)

sol0 = [(cos(x) + cos(x/2) * cos(sqrt(3) * y/2)) for x in X, y in Y]
sol0 .= sol0 .- minimum(vec(sol0))
sol0 ./= maximum(vec(sol0))
sol0 = sol0 .- 0.25
sol0 .*= 1.7

L = SHLinearOp(Nx, lx, Ny, ly, AF = AF)

# the "jacobian" simply returns the data (u, l, ν) consumed by the linear solver SHLinearOp
J_shfft(u, p) = (u, p.l, p.ν)

# parameters of the PDE
par = (l = -0.15, ν = 1.3, L = L)

# Bifurcation Problem
prob = BK.BifurcationProblem(F_shfft, AF(sol0), par, (@lens _.l);
	J = J_shfft,
	plot_solution = (x, p; kwargs...) -> plotsol!(x; color=:viridis, kwargs...),
	record_from_solution = (x, p) -> norm(x))

Newton iterations and deflation

We are now ready to perform Newton iterations:

opt_new = NewtonPar(verbose = true, tol = 1e-6, max_iterations = 100, linsolver = L)
sol_hexa = @time newton(prob, opt_new, normN = norminf)
println("--> norm(sol) = ", maximum(abs.(sol_hexa.u)))
plotsol(sol_hexa.u)

You should see this:

┌─────────────────────────────────────────────────────┐
│ Newton step         residual     linear iterations  │
├─────────────┬──────────────────────┬────────────────┤
│       0     │       3.3758e-01     │        0       │
│       1     │       8.0152e+01     │       12       │
│       2     │       2.3716e+01     │       28       │
│       3     │       6.7353e+00     │       22       │
│       4     │       1.9498e+00     │       17       │
│       5     │       5.5893e-01     │       14       │
│       6     │       1.0998e-01     │       12       │
│       7     │       1.1381e-02     │       11       │
│       8     │       1.6393e-04     │       11       │
│       9     │       7.3277e-08     │       10       │
└─────────────┴──────────────────────┴────────────────┘
  0.317790 seconds (42.67 k allocations: 1.256 MiB)
--> norm(sol) = 1.26017611779702

Note that this is about 10x faster than Example 2, but for a problem almost 100x larger! (On a V100 GPU.)

The solution is:

We can also use the deflation technique (see DeflationOperator and DeflatedProblem for more information) on the GPU as follows:

deflationOp = DeflationOperator(2, dot, 1.0, [sol_hexa.u])

opt_new = @set opt_new.max_iterations = 250
outdef = @time newton(re_make(prob, u0 = AF(0.4 .* sol_hexa.u .* AF([exp(-(x+0lx)^2/25) for x in X, y in Y]))),
	deflationOp, opt_new, normN = norminf)
println("--> norm(sol) = ", norm(outdef.u))
plotsol(outdef.u) |> display
BK.converged(outdef) && push!(deflationOp, outdef.u)

and get:
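
To look for more patterns, one can keep perturbing the hexagon solution and calling the deflated Newton solver again: every converged solution pushed into deflationOp is then excluded from subsequent runs. Here is a minimal sketch along these lines (the amplitudes and the Gaussian window are arbitrary illustrative choices, not taken from the original example):

for amplitude in (0.3, 0.5)
	# hypothetical perturbations of the hexagon pattern
	u0 = amplitude .* sol_hexa.u .* AF([exp(-x^2/25) for x in X, y in Y])
	outdef = newton(re_make(prob, u0 = u0), deflationOp, opt_new, normN = norminf)
	# record the new solution so that the next run avoids it
	BK.converged(outdef) && push!(deflationOp, outdef.u)
end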

Computation of the branches

Finally, we can perform continuation of the branches on the GPU:

opts_cont = ContinuationPar(dsmin = 0.001, dsmax = 0.007, ds = -0.005,
	p_max = 0., p_min = -1.0, plot_every_step = 5, detect_bifurcation = 0,
	newton_options = setproperties(opt_new; tol = 1e-6, max_iterations = 15), max_steps = 100)

br = @time continuation(re_make(prob, u0 = deflationOp[1]),
	PALC(bls = BorderingBLS(solver = L, check_precision = false)),
	opts_cont;
	plot = true, verbosity = 3,
	normC = norminf)

We did not detail how to compute the eigenvalues on the GPU and detect the bifurcations. It is based on a simple shift-invert strategy; please have a look at examples/SH2d-fronts-cuda.jl. A sketch of such an eigensolver is given below.
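
For illustration only, here is a minimal sketch of what such a shift-invert eigensolver could look like. It is not the code from examples/SH2d-fronts-cuda.jl; it merely reuses the SHLinearOp solver defined above (which already accepts a shift) and assumes the BifurcationKit convention that an eigensolver is called as eig(J, nev) and returns the eigenvalues, the eigenvectors, a convergence flag and an iteration count. The name SHEigShiftInvert and the stored shift σ are ours.

# hedged sketch of a shift-invert eigensolver built on top of SHLinearOp
struct SHEigShiftInvert{T <: SHLinearOp} <: BK.AbstractEigenSolver
	sh::T          # the preconditioned linear solver defined above
	σ::Float64     # shift
end

function (eig::SHEigShiftInvert)(J, nev; kwargs...)
	u = J[1]  # current state, only used to size the starting vector
	# v ↦ (dF - σI)⁻¹ v, computed with the GMRES solver above
	A = v -> eig.sh(J, v; shift = eig.σ)[1]
	# largest-magnitude eigenvalues of the resolvent correspond to eigenvalues of dF closest to σ
	vals, vecs, info = KrylovKit.eigsolve(A, AF(rand(TY, size(u))), nev, :LM;
		tol = 1e-10, maxiter = 20, krylovdim = max(30, nev + 10))
	# undo the spectral transform λ = 1/μ + σ and sort by decreasing real part
	λ = 1 ./ vals .+ eig.σ
	ind = sortperm(λ, by = real, rev = true)
	return λ[ind], vecs[ind], true, info.numops
end

One would then typically pass eigsolver = SHEigShiftInvert(L, 0.1) to NewtonPar and set detect_bifurcation = 3 in ContinuationPar to turn bifurcation detection back on.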

We have the following information about the branch of hexagons

julia> br
Branch number of points: 67
Branch of Equilibrium
Bifurcation points:
 (ind_ev = index of the bifurcating eigenvalue e.g. `br.eig[idx].eigenvals[ind_ev]`)
- #  1,    nd at p ≈ -0.21522461 ∈ (-0.21528614, -0.21522461), |δp|=6e-05, [converged], δ = ( 3,  0), step =  24, eigenelements in eig[ 25], ind_ev =   3
- #  2,    nd at p ≈ -0.21469007 ∈ (-0.21479652, -0.21469007), |δp|=1e-04, [converged], δ = ( 2,  0), step =  25, eigenelements in eig[ 26], ind_ev =   5
- #  3,    nd at p ≈ -0.21216919 ∈ (-0.21264341, -0.21216919), |δp|=5e-04, [converged], δ = ( 2,  0), step =  27, eigenelements in eig[ 28], ind_ev =   7
- #  4,    nd at p ≈ -0.21052576 ∈ (-0.21110899, -0.21052576), |δp|=6e-04, [converged], δ = ( 2,  0), step =  28, eigenelements in eig[ 29], ind_ev =   9
- #  5,    nd at p ≈ -0.20630678 ∈ (-0.21052576, -0.20630678), |δp|=4e-03, [converged], δ = ( 8,  0), step =  29, eigenelements in eig[ 30], ind_ev =  17
- #  6,    nd at p ≈ -0.19896508 ∈ (-0.19897308, -0.19896508), |δp|=8e-06, [converged], δ = ( 6,  0), step =  30, eigenelements in eig[ 31], ind_ev =  23
- #  7,    nd at p ≈ -0.18621673 ∈ (-0.18748234, -0.18621673), |δp|=1e-03, [converged], δ = ( 2,  0), step =  33, eigenelements in eig[ 34], ind_ev =  25
- #  8,    nd at p ≈ -0.17258147 ∈ (-0.18096574, -0.17258147), |δp|=8e-03, [converged], δ = ( 4,  0), step =  35, eigenelements in eig[ 36], ind_ev =  29
- #  9,    nd at p ≈ -0.14951737 ∈ (-0.15113148, -0.14951737), |δp|=2e-03, [converged], δ = (-4,  0), step =  39, eigenelements in eig[ 40], ind_ev =  29
- # 10,    nd at p ≈ -0.14047758 ∈ (-0.14130979, -0.14047758), |δp|=8e-04, [converged], δ = (-2,  0), step =  41, eigenelements in eig[ 42], ind_ev =  25
- # 11,    nd at p ≈ -0.11304882 ∈ (-0.11315916, -0.11304882), |δp|=1e-04, [converged], δ = (-4,  0), step =  45, eigenelements in eig[ 46], ind_ev =  23
- # 12,    nd at p ≈ -0.09074623 ∈ (-0.09085968, -0.09074623), |δp|=1e-04, [converged], δ = (-6,  0), step =  49, eigenelements in eig[ 50], ind_ev =  19
- # 13,    nd at p ≈ -0.07062574 ∈ (-0.07246519, -0.07062574), |δp|=2e-03, [converged], δ = (-4,  0), step =  52, eigenelements in eig[ 53], ind_ev =  13
- # 14,    nd at p ≈ -0.06235903 ∈ (-0.06238787, -0.06235903), |δp|=3e-05, [converged], δ = (-2,  0), step =  54, eigenelements in eig[ 55], ind_ev =   9
- # 15,    nd at p ≈ -0.05358077 ∈ (-0.05404312, -0.05358077), |δp|=5e-04, [converged], δ = (-2,  0), step =  56, eigenelements in eig[ 57], ind_ev =   7
- # 16,    nd at p ≈ -0.02494422 ∈ (-0.02586444, -0.02494422), |δp|=9e-04, [converged], δ = (-2,  0), step =  60, eigenelements in eig[ 61], ind_ev =   5
- # 17,    nd at p ≈ -0.00484022 ∈ (-0.00665356, -0.00484022), |δp|=2e-03, [converged], δ = (-2,  0), step =  63, eigenelements in eig[ 64], ind_ev =   3
- # 18,    nd at p ≈ +0.00057801 ∈ (-0.00122418, +0.00057801), |δp|=2e-03, [converged], δ = ( 5,  0), step =  64, eigenelements in eig[ 65], ind_ev =   6
- # 19,    nd at p ≈ +0.00320921 ∈ (+0.00141327, +0.00320921), |δp|=2e-03, [converged], δ = (10,  0), step =  65, eigenelements in eig[ 66], ind_ev =  16
Fold points:
- #  1, fold at p ≈ -0.21528694 ∈ (-0.21528694, -0.21528694), |δp|=-1e+00, [    guess], δ = ( 0,  0), step =  24, eigenelements in eig[ 24], ind_ev =   0