Linear Algebra

Linear algebra can be parallelised within Magma for operations with large matrices over finite fields of small characteristic. There are two main parallelisation modes available: multi-threaded and GPU. A combination of these two modes (called 'hybrid') is also available but is still in an experimental phase.

Once parallelisation in one of the below modes is selected, the most critical base algorithm which is parallelised is matrix multiplication for sufficiently large matrices over the above fields (invoked directly by A*B). The other most important matrix algorithms which directly benefit from this are powering (A^e), the computation of the echelon form (EchelonForm(A)), and the many operations which are closely derived from that, including the inverse (A^-1) and the functions Rank, Determinant, Nullspace, Rowspace, Solution, IsConsistent and MinimalPolynomial(A: Al := "KellerGehrig") (which selects the asymptotically-fast Keller-Gehrig algorithm for minimal polynomial computation).
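
For instance, here is a minimal sketch of the kinds of calls that benefit (the field GF(7) and dimension 8000 are arbitrary illustrative choices; real speedups appear only for sufficiently large matrices):

> SetNthreads(4);
> A := Random(MatrixRing(GF(7), 8000));
> time P := A*A;        // parallel matrix multiplication
> time r := Rank(A);    // based on echelonisation, so also parallelised
> time N := Nullspace(A);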

Any other Magma algorithm which depends on large matrix computations over small finite fields will also be automatically parallelised when any of the above modes of parallelisation is selected and any of the above functions is implicitly called. For example, most of the important functions for computing with KG-modules, where K is a small finite field and G is a finite group, depend heavily on the multiplication and echelonisation of matrices, and so automatically benefit from linear algebra parallelisation in large examples; in particular, the Meataxe (called by CompositionFactors, etc.) and the computation of sub- and quotient modules are greatly sped up in large examples when linear algebra parallelisation is enabled.
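
As an illustrative sketch (the group Sym(8) and field GF(2) here are deliberately tiny choices; the benefit only appears for modules of large dimension):

> SetNthreads(4);
> G := Sym(8);
> M := PermutationModule(G, GF(2));
> CompositionFactors(M); // Meataxe; uses parallel linear algebra in large cases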

Multi-threaded Linear Algebra

Multi-threaded linear algebra is supported in these cases:

(i)
For matrices over GF(p), where p = 2, 3, 5, 7 (POSIX threads with split-word representation).
(ii)
For matrices over GF(p), where p is prime and 10 < p < 2^23.5 (POSIX threads with exact floating point doubles and parallel OpenBLAS library or Intel Math Kernel Library).
(iii)
For matrices over GF(p^k), where p < 2^23.5 and k > 1 (parallel Karatsuba combined with one of the above, depending on the size of p).

This POSIX threads mode will be selected for the above types of finite field when the global number of threads is set to a number greater than 1 before the relevant functions are called; as usual, one simply calls the procedure:

    SetNthreads(NT);
where NT is the number of threads desired. Note that since matrix operations are memory-intensive, the resulting speedups over the serial algorithms depend strongly on the memory speed of the computer, and it is unlikely that the speedup obtained will be close to NT. Also, hyperthreading (setting NT to more than the number of physical cores) is usually not beneficial, because it increases the contention between threads for memory access.
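
For example, a minimal sketch of selecting this mode for an extension field (the field GF(3^5) and dimension 4000 are arbitrary illustrative choices; GetNthreads returns the current global number of threads):

> SetNthreads(4);
> GetNthreads();
4
> A := Random(MatrixRing(GF(3^5), 4000));
> time P := A*A; // parallel Karatsuba combined with multi-threaded base arithmetic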

GPU-based Linear Algebra

GPU-based linear algebra (with NVIDIA CUDA) is supported in these cases:

(i)
For matrices over GF(p), where p = 2, 3, 5, 7 (CUDA code developed within Magma).
(ii)
For matrices over GF(p), where p is prime and 10 < p < 2^23.5 (CUBLAS library or Intel Math Kernel Library).
(iii)
For matrices over GF(p^k), where p < 2^23.5 and k > 1 (parallel Karatsuba combined with one of the above, depending on the size of p).

Since V2.26, multiple GPUs may also be selected for matrix multiplication over GF(p), where p = 2, 3, 5, 7. This can give a significant speedup for very large dimensions. See Section GPUs for information on how to select multiple GPUs.

Hybrid Linear Algebra

This combines multi-threaded and GPU operations at the same time (for matrices defined over the same types of finite field as above). This mode is selected by calling SetNthreads(NT) while running the CUDA executable (and with the GPU mode enabled). This mode may not be beneficial compared with the simple POSIX threads mode or the simple GPU mode (it depends strongly on the number of threads and the type of GPU(s)), so by default it may be better to keep the number of threads at 1 when running the CUDA executable.
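
A minimal sketch of enabling the hybrid mode (assuming the CUDA executable is being run; the thread count 8 and dimension 50000 are arbitrary choices which should be tuned per machine):

> SetGPU(true);   // ensure the GPU mode is enabled
> SetNthreads(8); // add POSIX threads on top of the GPU
> A := Random(MatrixRing(GF(2), 50000));
> time P := A*A; // hybrid multiplication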

Example Par_LinearAlgebraThreaded (H5E1)

The following input demonstrates simple use of POSIX threads when multiplying matrices (on an Intel 4-core i7-7700 CPU @ 3.60GHz):
> SetNthreads(1);
> X := Random(MatrixRing(GF(5), 10000));
> time P := X*X; // single thread, normal CPU time
Time: 5.040
> time E,T := EchelonForm(X);
Time: 6.940
> SetNthreads(4);
> time P2 := X*X; // multiple threads, so real time shown by [r]
Time: 1.760[r]
> time E,T := EchelonForm(X);
Time: 2.660[r]
> assert P2 eq P;

Example Par_LinearAlgebraGPU (H5E2)

The following input demonstrates simple use of one GPU when multiplying matrices (NVIDIA V100 GPU; available with CUDA executable only):
> SetGPU(false);
> X := Random(MatrixRing(GF(5), 10000));
> time P := X*X; // no GPU
Time: 5.040
> time E,T := EchelonForm(X);
Time: 6.940
> SetGPU(true);
> time P2 := X*X; // 1 GPU
Time: 1.320
> time E,T := EchelonForm(X);
Time: 2.920
> assert P2 eq P;

Example Par_LinearAlgebraDualGPU (H5E3)

The following input tests very large matrix multiplication over GF(2) to see the speedups obtained when two GPUs are available (CUDA executable only, with the GPU enabled by default; hardware used here: dual NVIDIA V100 GPUs).

> K := GF(2);
> for i in [16 .. 19] do
>    n := 2^i;
>    printf "Dim: 2^%o = %o n", i, n;
>    A := Random(MatrixRing(K, n));
>    B := Random(MatrixRing(K, n));
>    IndentPush();
>    "1 GPU:";
>    t0 := Cputime(); SetNGPUs(1); time P1 := A*B; t0 := Cputime(t0);
>    "2 GPUs:";
>    t1 := Cputime(); SetNGPUs(2); time P2 := A*B; t1 := Cputime(t1);
>    printf "Speedup: %.3o n", t0/t1;
>    assert P1 eq P2;
>    IndentPop();
> end for;
Dim: 2^16 = 65536
    1 GPU:
    Time: 2.680
    2 GPUs:
    Time: 1.550
    Speedup: 1.729
Dim: 2^17 = 131072
    1 GPU:
    Time: 18.420
    2 GPUs:
    Time: 11.170
    Speedup: 1.649
Dim: 2^18 = 262144
    1 GPU:
    Time: 134.380
    2 GPUs:
    Time: 75.840
    Speedup: 1.772
Dim: 2^19 = 524288
    1 GPU:
    Time: 1085.930
    2 GPUs:
    Time: 571.670
    Speedup: 1.900

Example Par_representations (H5E4)

In this example we compute the absolutely irreducible modular representations of a finite simple group over the finite field GF(17). The server used has dual Intel 3.2GHz Gold 6146 CPUs (24 cores in total) and an NVIDIA V100 GPU; the timings show the speedups obtained when multiple cores or the GPU are used.
> G := PermutationGroup(ATLASGroup("O8m2"));
> K := GF(17);
> L := AbsolutelyIrreducibleModules(G, K);
> [Dimension(M): M in L];
[ 1, 34, 51, 83, 204, 204, 357, 476, 476, 595, 714, 714, 1020,
1071, 1071, 1071, 1071, 1190, 1261, 1428, 2142, 2142, 2142, 2142,
2176, 2295, 2295, 2295, 2835, 2856, 2856, 3264, 4284, 4760, 5355 ]

With one thread only:

> SetNthreads(1); SetGPU(false);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 6503.600

With 24 threads:

> SetNthreads(24); SetGPU(false);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 2432.150[r]

With one thread and a GPU:

> SetNthreads(1); SetGPU(true);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 2584.590