๐ 2d Swift-Hohenberg equation (non-local) on the GPU
Here we give an example where the continuation can be done entirely on the GPU, e.g. on a single V100 NIVDIA.
This is not the simplest GPU example because we need a preconditioned linear solver and shift-invert eigen solver for this problem. On the other hand, you will be shown how to set up your own linear/eigen solver.
We choose the 2d Swift-Hohenberg as an example and consider a larger grid. See 2d Swift-Hohenberg equation for more details. Solving the sparse linear problem in $v$
\[-(I+\Delta)^2 v+(l +2\nu u-3u^2)v = rhs\]
with a direct solver becomes prohibitive. Looking for an iterative method, the conditioning of the jacobian is not good enough to have fast convergence, mainly because of the Laplacian operator. However, the above problem is equivalent to:
\[-v + L \cdot (d \cdot v) = L\cdot rhs\]
where
\[L := ((I+\Delta)^2 + I)^{-1}\]
is very well conditioned and
\[d := l+1+2\nu v-3v^2.\]
Hence, to solve the previous equation, only a few GMRES iterations are required.
In effect, the preconditioned PDE is an example of nonlocal problem.
Linear Algebra on the GPU
We plan to use KrylovKit
on the GPU. We define the following types so it is easier to switch to Float32
for example:
using Revise, CUDA
# this disable slow operations but errors if you use one of them
CUDA.allowscalar(false)
# type used for the arrays, can be Float32 if GPU requires it
TY = Float64
# put the AF = Array{TY} instead to make the code on the CPU
AF = CuArray{TY}
Computing the inverse of the differential operator
The issue now is to compute $L$ but this is easy using Fourier transforms.
Hence, that's why we slightly modify the previous Example by considering periodic boundary conditions. Let us now show how to compute $L$. Although the code looks quite technical, it is based on two facts. First, the Fourier transform symbol associated to $L$ is
\[l_1 = 1+(1-k_x^2-k_y^2)^2\]
which is pre-computed in the composite type SHLinearOp
. Then, the effect of L
on u
is as simple as real.(ifft( l1 .* fft(u) ))
and the inverse L\u
is real.(ifft( fft(u) ./ l1 ))
. However, in order to save memory on the GPU, we use inplace FFTs to reduce temporaries which explains the following code.
using AbstractFFTs, FFTW, KrylovKit
using BifurcationKit, LinearAlgebra, Plots
const BK = BifurcationKit
# the following struct encodes the operator L1
# Making the linear operator a subtype of BK.AbstractLinearSolver is handy as it will be used
# in the Newton iterations.
struct SHLinearOp{Treal, Tcomp, Tl1, Tplan, Tiplan} <: BK.AbstractLinearSolver
tmp_real::Treal # temporary
tmp_complex::Tcomp # temporary
l1::Tl1
fftplan::Tplan
ifftplan::Tiplan
end
# this is a constructor for the above struct
function SHLinearOp(Nx, lx, Ny, ly; AF = Array{TY})
# AF is a type, it could be CuArray{TY} to run the following on GPU
k1 = vcat(collect(0:Nx/2), collect(Nx/2+1:Nx-1) .- Nx)
k2 = vcat(collect(0:Ny/2), collect(Ny/2+1:Ny-1) .- Ny)
d2 = [(1-(pi/lx * kx)^2 - (pi/ly * ky)^2)^2 + 1. for kx in k1, ky in k2]
tmpc = Complex.(AF(zeros(Nx, Ny)))
return SHLinearOp(AF(zeros(Nx, Ny)), tmpc, AF(d2), plan_fft!(tmpc), plan_ifft!(tmpc))
end
import Base: *, \
# generic function to apply operator op to u
function apply(c::SHLinearOp, u, multiplier, op = *)
c.tmp_complex .= Complex.(u)
c.fftplan * c.tmp_complex
c.tmp_complex .= op.(c.tmp_complex, multiplier)
c.ifftplan * c.tmp_complex
c.tmp_real .= real.(c.tmp_complex)
return copy(c.tmp_real)
end
# action of L
*(c::SHLinearOp, u) = apply(c, u, c.l1)
# inverse of L
\(c::SHLinearOp, u) = apply(c, u, c.l1, /)
Before applying a Newton solver, we need to tell how to solve the linear equation arising in the Newton Algorithm.
# inverse of the jacobian of the PDE
function (sh::SHLinearOp)(J, rhs; shift = 0., tol = 1e-9)
u, l, ฮฝ = J
udiag = l .+ 1 .+ 2ฮฝ .* u .- 3 .* u.^2 .- shift
res, info = KrylovKit.linsolve( du -> -du .+ sh \ (udiag .* du), sh \ rhs,
tol, maxiter = 6)
return res, true, info.numops
end
Now that we have our operator L
, we can encode our functional:
function F_shfft(u, p)
(;l, ฮฝ, L) = p
return -(L * u) .+ ((l+1) .* u .+ ฮฝ .* u.^2 .- u.^3)
end
Let us now show how to build our operator L
and an initial guess sol0
using the above defined structures.
using LinearAlgebra, Plots
# to simplify plotting of the solution
plotsol(x; k...) = heatmap(reshape(Array(x), Nx, Ny)'; color=:viridis, k...)
plotsol!(x; k...) = heatmap!(reshape(Array(x), Nx, Ny)'; color=:viridis, k...)
norminf(x) = maximum(abs.(x))
# norm compatible with CUDA
norminf(x) = maximum(abs.(x))
Nx = 2^10
Ny = 2^10
lx = 8pi * 2
ly = 2*2pi/sqrt(3) * 2
X = -lx .+ 2lx/(Nx) * collect(0:Nx-1)
Y = -ly .+ 2ly/(Ny) * collect(0:Ny-1)
sol0 = [(cos(x) .+ cos(x/2) * cos(sqrt(3) * y/2) ) for x in X, y in Y]
sol0 .= sol0 .- minimum(vec(sol0))
sol0 ./= maximum(vec(sol0))
sol0 = sol0 .- 0.25
sol0 .*= 1.7
L = SHLinearOp(Nx, lx, Ny, ly; AF)
J_shfft(u, p) = (u, p.l, p.ฮฝ)
# parameters of the PDE
par = (l = -0.15, ฮฝ = 1.3, L = L)
# Bifurcation Problem
prob = BK.BifurcationProblem(F_shfft, AF(sol0), par, (@optic _.l) ;
J = J_shfft,
plot_solution = (x, p;kwargs...) -> plotsol!(x; color=:viridis, kwargs...),
record_from_solution = (x, p; k...) -> norm(x))
Newton iterations and deflation
We are now ready to perform Newton iterations:
opt_new = NewtonPar(verbose = true, tol = 1e-6, max_iterations = 100, linsolver = L)
sol_hexa = @time BK.solve(prob, Newton(), opt_new, normN = norminf)
println("--> norm(sol) = ", maximum(abs.(sol_hexa.u)))
plotsol(sol_hexa.u)
You should see this:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Newton step residual linear iterations โ
โโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโค
โ 0 โ 3.3758e-01 โ 0 โ
โ 1 โ 8.0152e+01 โ 12 โ
โ 2 โ 2.3716e+01 โ 28 โ
โ 3 โ 6.7353e+00 โ 22 โ
โ 4 โ 1.9498e+00 โ 17 โ
โ 5 โ 5.5893e-01 โ 14 โ
โ 6 โ 1.0998e-01 โ 12 โ
โ 7 โ 1.1381e-02 โ 11 โ
โ 8 โ 1.6393e-04 โ 11 โ
โ 9 โ 7.3277e-08 โ 10 โ
โโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโ
0.317790 seconds (42.67 k allocations: 1.256 MiB)
--> norm(sol) = 1.26017611779702
Note that this is about the 10x faster than Example 2 but for a problem almost 100x larger! (On a V100 GPU)
The solution is:
We can also use the deflation technique (see DeflationOperator
and DeflatedProblem
for more information) on the GPU as follows
deflationOp = DeflationOperator(2, dot, 1.0, [sol_hexa.u])
opt_new = @set opt_new.max_iterations = 250
outdef = @time BK.solve(re_make(prob, u0 = AF(0.4 .* sol_hexa.u .* AF([exp(-1(x+0lx)^2/25) for x in X, y in Y]))),
deflationOp, opt_new, normN = x-> maximum(abs.(x)))
println("--> norm(sol) = ", norm(outdef.u))
plotsol(outdef.u) |> display
BK.converged(outdef) && push!(deflationOp, outdef.u)
and get:
Computation of the branches
Finally, we can perform continuation of the branches on the GPU:
opts_cont = ContinuationPar(dsmin = 0.001, dsmax = 0.007, ds= -0.005,
p_max = 0., p_min = -1.0, plot_every_step = 5, detect_bifurcation = 3,
newton_options = setproperties(opt_new; tol = 1e-6, max_iterations = 15), max_steps = 100)
br = @time continuation(re_make(prob, u0 = deflationOp[1]),
PALC(bls = BorderingBLS(solver = L, check_precision = false)),
opts_cont;
plot = true, verbosity = 3,
normC = x -> maximum(abs.(x))
)
We did not detail how to compute the eigenvalues on the GPU and detect the bifurcations. It is based on a simple Shift-Invert strategy, please look at examples/SH2d-fronts-cuda.jl
.
We have the following information about the branch of hexagons
julia> br
Branch number of points: 67
Branch of Equilibrium
Bifurcation points:
(ind_ev = index of the bifurcating eigenvalue e.g. `br.eig[idx].eigenvals[ind_ev]`)
- # 1, nd at p โ -0.21522461 โ (-0.21528614, -0.21522461), |ฮดp|=6e-05, [converged], ฮด = ( 3, 0), step = 24, eigenelements in eig[ 25], ind_ev = 3
- # 2, nd at p โ -0.21469007 โ (-0.21479652, -0.21469007), |ฮดp|=1e-04, [converged], ฮด = ( 2, 0), step = 25, eigenelements in eig[ 26], ind_ev = 5
- # 3, nd at p โ -0.21216919 โ (-0.21264341, -0.21216919), |ฮดp|=5e-04, [converged], ฮด = ( 2, 0), step = 27, eigenelements in eig[ 28], ind_ev = 7
- # 4, nd at p โ -0.21052576 โ (-0.21110899, -0.21052576), |ฮดp|=6e-04, [converged], ฮด = ( 2, 0), step = 28, eigenelements in eig[ 29], ind_ev = 9
- # 5, nd at p โ -0.20630678 โ (-0.21052576, -0.20630678), |ฮดp|=4e-03, [converged], ฮด = ( 8, 0), step = 29, eigenelements in eig[ 30], ind_ev = 17
- # 6, nd at p โ -0.19896508 โ (-0.19897308, -0.19896508), |ฮดp|=8e-06, [converged], ฮด = ( 6, 0), step = 30, eigenelements in eig[ 31], ind_ev = 23
- # 7, nd at p โ -0.18621673 โ (-0.18748234, -0.18621673), |ฮดp|=1e-03, [converged], ฮด = ( 2, 0), step = 33, eigenelements in eig[ 34], ind_ev = 25
- # 8, nd at p โ -0.17258147 โ (-0.18096574, -0.17258147), |ฮดp|=8e-03, [converged], ฮด = ( 4, 0), step = 35, eigenelements in eig[ 36], ind_ev = 29
- # 9, nd at p โ -0.14951737 โ (-0.15113148, -0.14951737), |ฮดp|=2e-03, [converged], ฮด = (-4, 0), step = 39, eigenelements in eig[ 40], ind_ev = 29
- # 10, nd at p โ -0.14047758 โ (-0.14130979, -0.14047758), |ฮดp|=8e-04, [converged], ฮด = (-2, 0), step = 41, eigenelements in eig[ 42], ind_ev = 25
- # 11, nd at p โ -0.11304882 โ (-0.11315916, -0.11304882), |ฮดp|=1e-04, [converged], ฮด = (-4, 0), step = 45, eigenelements in eig[ 46], ind_ev = 23
- # 12, nd at p โ -0.09074623 โ (-0.09085968, -0.09074623), |ฮดp|=1e-04, [converged], ฮด = (-6, 0), step = 49, eigenelements in eig[ 50], ind_ev = 19
- # 13, nd at p โ -0.07062574 โ (-0.07246519, -0.07062574), |ฮดp|=2e-03, [converged], ฮด = (-4, 0), step = 52, eigenelements in eig[ 53], ind_ev = 13
- # 14, nd at p โ -0.06235903 โ (-0.06238787, -0.06235903), |ฮดp|=3e-05, [converged], ฮด = (-2, 0), step = 54, eigenelements in eig[ 55], ind_ev = 9
- # 15, nd at p โ -0.05358077 โ (-0.05404312, -0.05358077), |ฮดp|=5e-04, [converged], ฮด = (-2, 0), step = 56, eigenelements in eig[ 57], ind_ev = 7
- # 16, nd at p โ -0.02494422 โ (-0.02586444, -0.02494422), |ฮดp|=9e-04, [converged], ฮด = (-2, 0), step = 60, eigenelements in eig[ 61], ind_ev = 5
- # 17, nd at p โ -0.00484022 โ (-0.00665356, -0.00484022), |ฮดp|=2e-03, [converged], ฮด = (-2, 0), step = 63, eigenelements in eig[ 64], ind_ev = 3
- # 18, nd at p โ +0.00057801 โ (-0.00122418, +0.00057801), |ฮดp|=2e-03, [converged], ฮด = ( 5, 0), step = 64, eigenelements in eig[ 65], ind_ev = 6
- # 19, nd at p โ +0.00320921 โ (+0.00141327, +0.00320921), |ฮดp|=2e-03, [converged], ฮด = (10, 0), step = 65, eigenelements in eig[ 66], ind_ev = 16
Fold points:
- # 1, fold at p โ -0.21528694 โ (-0.21528694, -0.21528694), |ฮดp|=-1e+00, [ guess], ฮด = ( 0, 0), step = 24, eigenelements in eig[ 24], ind_ev = 0