An efficient hybrid tridiagonal divide-and-conquer algorithm on distributed memory architectures

In this paper, an efficient divide-and-conquer (DC) algorithm is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and hierarchically semiseparable (HSS) matrices, an important class of rank-structured matrices. Most of the DC algorithm's runtime is spent computing the eigenvectors via matrix-matrix multiplications (MMM). In our parallel hybrid DC (PHDC) algorithm, MMM is accelerated with HSS matrix techniques when the intermediate matrix is large. All HSS operations are performed with the STRUMPACK package. PHDC has been tested on many different matrices. Compared with the DC implementation in MKL, PHDC can be faster for some matrices with few deflations when using hundreds of processes, but the gains shrink as the number of processes grows. Comparisons of PHDC with ELPA (the Eigenvalue soLvers for Petascale Applications library) show a similar pattern. PHDC is usually slower than MKL and ELPA when using 300 or more processes on the Tianhe-2 supercomputer.


💡 Research Summary

The paper introduces a Parallel Hybrid Divide‑and‑Conquer (PHDC) algorithm for solving the symmetric tridiagonal eigenvalue problem on distributed‑memory systems. The authors start from Cuppen's classic divide‑and‑conquer (DC) approach, which is the default method in LAPACK and ScaLAPACK when eigenvectors are required. In the DC algorithm, the most expensive step is the multiplication of large dense matrices (the update of the eigenvector matrix Q). The authors observe that the intermediate matrix Q has a Cauchy‑like structure and that its off‑diagonal blocks are numerically low‑rank. Consequently, Q can be approximated efficiently by a Hierarchically Semi‑Separable (HSS) matrix.
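The low-rank observation is easy to verify numerically. The sketch below (an illustration, not code from the paper) builds a Cauchy-like matrix from interlaced node sets, mimicking the interlacing of the old eigenvalues d_j and the new eigenvalues λ_i in the DC secular equation, and measures the numerical rank of an off-diagonal block; the rank is far smaller than the block dimension:

```python
import numpy as np

np.random.seed(0)
n = 512

# Interlaced nodes: lam[i] lies strictly between d[i] and d[i+1], mimicking
# the eigenvalue interlacing of the DC secular equation (illustrative setup).
d = np.sort(np.random.rand(n))
gaps = np.diff(np.append(d, 1.0))
lam = d + np.random.rand(n) * gaps

# Cauchy-like kernel: C[i, j] = 1 / (d[j] - lam[i])
C = 1.0 / (d[None, :] - lam[:, None])

# Off-diagonal block: rows from the first half, columns from the second half
B = C[: n // 2, n // 2 :]

s = np.linalg.svd(B, compute_uv=False)
numerical_rank = int(np.sum(s > 1e-12 * s[0]))
print(f"block size: {B.shape}, numerical rank: {numerical_rank}")
```

The rapid singular-value decay of such blocks is exactly what makes an HSS approximation of Q accurate at modest rank.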

To exploit this property, the authors replace the ScaLAPACK PDGEMM calls inside the routines PDLAED1 and PDLAED2 with HSS matrix‑matrix multiplication routines from the STRUMPACK library. STRUMPACK provides a parallel randomized HSS construction algorithm (RandHSS) that combines Gaussian sampling with interpolative decomposition (ID). The construction cost is O(N r) flops, where N is the matrix size and r is the HSS rank (typically 50–100 for the matrices considered). Once Q is represented in HSS form, the multiplication Q·U (U being a block‑diagonal matrix of eigenvectors of the subproblems) can be performed in O(N² r) flops, a substantial reduction compared with the O(N³) cost of a conventional dense GEMM.
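The Gaussian-sampling idea underlying randomized HSS construction can be sketched with a plain randomized range finder. This is a simplified stand-in, not STRUMPACK's actual RandHSS code (which additionally uses interpolative decompositions and works hierarchically on blocks):

```python
import numpy as np

def randomized_lowrank(A, r, oversample=10, seed=42):
    """Gaussian-sampling compression: return Q, B with A ≈ Q @ B.

    Illustrates the sampling step of randomized low-rank approximation;
    r is the target rank and `oversample` extra samples improve robustness.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], r + oversample))
    Y = A @ omega              # sample the range of A
    Q, _ = np.linalg.qr(Y)     # orthonormal basis for the sampled range
    return Q, Q.T @ A          # A ≈ Q @ (Q^T A)

# Test matrix of exact rank 30 (sizes are illustrative)
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 30)) @ rng.standard_normal((30, 500))
Q, B = randomized_lowrank(A, 30)
err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
print(f"relative approximation error: {err:.2e}")
```

Applied blockwise, this is how a rank-r representation of Q is obtained from a small number of matrix-vector samples rather than from the full dense matrix.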

The algorithm includes several practical safeguards. Deflation (the removal of negligible components of the secular vector z or duplicate eigenvalues) is handled exactly as in ScaLAPACK, and when the size K of the deflated block is small the algorithm falls back to the standard PDGEMM to avoid unnecessary HSS overhead. Moreover, the authors avoid applying Gu’s permutation strategy when the HSS structure would be destroyed; instead they only permute Q when the resulting matrix remains low‑rank.
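The size-based fallback can be expressed as a simple dispatch. The sketch below is hypothetical (the crossover threshold `K_MIN` and the function names are illustrative assumptions, not values or APIs from the paper):

```python
import numpy as np

# Hypothetical crossover size: below it, a dense product is assumed cheaper
# than paying the HSS construction overhead. Illustrative value only.
K_MIN = 2000

def multiply_eigenvectors(Q, U, k, hss_multiply=None):
    """Dispatch mirroring the fallback described above: take the dense path
    (PDGEMM in the paper) when the deflated block size k is small, and the
    HSS-accelerated path otherwise."""
    if hss_multiply is None or k < K_MIN:
        return Q @ U            # dense path
    return hss_multiply(Q, U)   # HSS path (via STRUMPACK in the paper)

# Small deflated block: the dense path is taken
Q = np.eye(4)
U = np.arange(16.0).reshape(4, 4)
R = multiply_eigenvectors(Q, U, k=4)
```

Keeping the dense path for small K matters because the HSS construction cost is only amortized when the compressed multiplication saves enough flops.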

Experimental evaluation is performed on the Tianhe‑2 supercomputer using up to 600 MPI processes. Test matrices range from 2⁶ to 2⁹ in dimension, with varying numbers of deflations. The results show:

  • For matrices with few deflations and sizes ≥ 2⁸, PHDC outperforms the Intel MKL DC implementation by 1.3–1.8× when 200–300 processes are used.
  • Compared with ELPA, PHDC is slightly slower for small process counts but ELPA becomes faster as the process count exceeds 400, reflecting ELPA’s better scalability.
  • When more than ~300 processes are employed, the cost of HSS construction and the associated communication (global sampling vectors, ID reductions) dominates, causing PHDC to lose its advantage; PDGEMM‑based DC becomes faster.
  • Memory consumption grows by roughly 1.5–2× because STRUMPACK stores random samples, generators, and auxiliary buffers, but this overhead is acceptable on modern large‑scale systems.
  • Numerical accuracy remains comparable to the original DC algorithm (relative residuals ≤ 10⁻¹³).

The authors conclude that HSS‑accelerated DC is beneficial for large, well‑conditioned tridiagonal problems with few deflations, but its scalability is limited by the communication required for HSS construction and by the extra memory footprint. They suggest future work on asynchronous sampling, dynamic rank estimation, GPU‑accelerated HSS kernels, and hybrid strategies that combine ELPA’s block‑diagonalization with PHDC’s HSS compression. Overall, the paper demonstrates a viable path to reduce the dominant dense matrix multiplication in distributed DC eigenvalue solvers, while also highlighting the practical challenges that must be addressed for exascale deployments.

