Linear algebra can be parallelised within Magma for operations with large matrices over finite fields of small characteristic. There are two main parallelisation modes available: multi-threaded and GPU-based. A combination of the two (called 'hybrid') is also available, but it is still experimental.
Once parallelisation in one of the modes below is selected, the most critical base algorithm which is parallelised is matrix multiplication for sufficiently large matrices over the above fields (invoked directly by A*B). The most important other matrix algorithms which benefit directly from this are powering (A^e) and the computation of the echelon form (EchelonForm(A)), together with the many operations closely derived from it, including the inverse (A^-1) and the functions Rank, Determinant, Nullspace, Rowspace, Solution, IsConsistent and MinimalPolynomial(A: Al := "KellerGehrig") (which selects the asymptotically fast Keller-Gehrig algorithm for computing the minimal polynomial).
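As an illustrative sketch (the function calls are standard, but the field and dimension here are arbitrary choices, not taken from the timings below), the derived operations are used exactly as in serial code and pick up the parallel kernels automatically once a mode is selected:

> SetNthreads(4); // enable multi-threaded mode
> A := Random(MatrixRing(GF(3), 8000));
> time r := Rank(A); // reduces internally to echelonisation
> time N := Nullspace(A); // likewise derived from the echelon form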
Any other Magma algorithm which depends on large matrix computations over small finite fields is also automatically parallelised when any of the above parallelisation modes is selected and any of the above functions is called implicitly. For example, most of the important functions for computing with KG-modules, where K is a small finite field and G is a finite group, depend heavily on multiplication and echelonisation of matrices, and so automatically benefit from linear algebra parallelisation in large examples; in particular, the Meataxe (called by CompositionFactors, etc.) and the computation of sub- and quotient modules are greatly sped up in large examples when linear algebra parallelisation is enabled.
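A sketch of such a Meataxe computation follows (the group and field are arbitrary choices for illustration; this particular module is far too small to benefit, and the speedup only appears for modules of large dimension):

> SetNthreads(8); // any of the parallel modes will do
> G := AlternatingGroup(12);
> M := PermutationModule(G, GF(2)); // a KG-module for K = GF(2)
> time CF := CompositionFactors(M); // Meataxe; dominated by matrix echelonisation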
Multi-threaded linear algebra is supported in these cases:
This mode is selected by calling SetNthreads(NT); where NT is the number of threads desired. Note that since matrix operations are memory-intensive, the speedups obtained over the serial algorithms depend strongly on the memory speed of the computer, and the speedup is unlikely to be close to a factor of NT. Also, hyperthreading (setting NT to more than the number of physical cores) is usually not beneficial, because it increases contention between threads on memory access.
GPU-based linear algebra (with NVIDIA CUDA) is supported in these cases:
The hybrid mode combines multi-threaded and GPU operations at the same time (for matrices defined over the same types of finite field as above). It is selected by calling SetNthreads(NT); while running the CUDA executable (with the GPU mode enabled). This mode may not be beneficial compared with the plain POSIX-threads mode or the plain GPU mode (this depends strongly on the number of threads and the type of GPU(s)), so when running the CUDA executable it may be better to leave the number of threads at the default of 1.
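A hybrid run might thus look like the following sketch (assuming the CUDA executable is being run; the field, dimension and thread count are illustrative choices only):

> SetGPU(true); // GPU mode (enabled by default in the CUDA executable)
> SetNthreads(2); // add a small number of threads on top of the GPU
> X := Random(MatrixRing(GF(2), 50000));
> time P := X*X; // hybrid: threads and GPU(s) cooperate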
> SetNthreads(1);
> X := Random(MatrixRing(GF(5), 10000));
> time P := X*X; // single thread, normal CPU time
Time: 5.040
> time E,T := EchelonForm(X);
Time: 6.940
> SetNthreads(4);
> time P2 := X*X; // multi-threads, so real time shown by [r]
Time: 1.760[r]
> time E,T := EchelonForm(X);
Time: 2.660[r]
> assert P2 eq P;
> SetGPU(false);
> X := Random(MatrixRing(GF(5), 10000));
> time P := X*X; // no GPU
Time: 5.040
> time E,T := EchelonForm(X);
Time: 6.940
> SetGPU(true);
> time P2 := X*X; // 1 GPU
Time: 1.320
> time E,T := EchelonForm(X);
Time: 2.920
> assert P2 eq P;
The following input tests very large matrix multiplication over GF(2) to show the speedups obtained when two GPUs are available (CUDA executable only, with the GPU enabled by default; the hardware used here is a pair of NVIDIA V100 GPUs).
> K := GF(2);
> for i in [16 .. 19] do
>     n := 2^i;
>     printf "Dim: 2^%o = %o\n", i, n;
>     A := Random(MatrixRing(K, n));
>     B := Random(MatrixRing(K, n));
>     IndentPush();
>     "1 GPU:";
>     t0 := Cputime(); SetNGPUs(1); time P1 := A*B; t0 := Cputime(t0);
>     "2 GPUs:";
>     t1 := Cputime(); SetNGPUs(2); time P2 := A*B; t1 := Cputime(t1);
>     printf "Speedup: %.3o\n", t0/t1;
>     assert P1 eq P2;
>     IndentPop();
> end for;
Dim: 2^16 = 65536
    1 GPU:
    Time: 2.680
    2 GPUs:
    Time: 1.550
    Speedup: 1.729
Dim: 2^17 = 131072
    1 GPU:
    Time: 18.420
    2 GPUs:
    Time: 11.170
    Speedup: 1.649
Dim: 2^18 = 262144
    1 GPU:
    Time: 134.380
    2 GPUs:
    Time: 75.840
    Speedup: 1.772
Dim: 2^19 = 524288
    1 GPU:
    Time: 1085.930
    2 GPUs:
    Time: 571.670
    Speedup: 1.900
The following example constructs a permutation representation of the ATLAS group O8m2 and computes the dimensions of its absolutely irreducible modules over K = GF(17); the timings below compare the serial, multi-threaded and GPU modes for this computation.

> G := PermutationGroup(ATLASGroup("O8m2"));
> K := GF(17);
> L := AbsolutelyIrreducibleModules(G, K);
> [Dimension(M): M in L];
[ 1, 34, 51, 83, 204, 204, 357, 476, 476, 595, 714, 714, 1020, 1071, 1071,
1071, 1071, 1190, 1261, 1428, 2142, 2142, 2142, 2142, 2176, 2295, 2295, 2295,
2835, 2856, 2856, 3264, 4284, 4760, 5355 ]
With one thread only:
> SetNthreads(1); SetGPU(false);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 6503.600
With 24 threads:
> SetNthreads(24); SetGPU(false);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 2432.150[r]
With one thread and a GPU:
> SetNthreads(1); SetGPU(true);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 2584.590
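One could also try the experimental hybrid mode on the same computation (no timing for this combination is reported above, and, as noted earlier, it may well be slower than the pure thread or pure GPU runs):

> SetNthreads(24); SetGPU(true); // hybrid: CUDA executable required
> time L := AbsolutelyIrreducibleModules(G, K);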