Multi-threading with OpenMP#
OpenMP is a shared-memory application programming interface (API) whose features are based on prior efforts to facilitate shared-memory parallel programming. It is intended to be suitable for implementation on a broad range of SMP architectures, including uniprocessor computers.
OpenMP can be added to a sequential program in FORTRAN, C, or C++ to describe how the work is to be shared among threads executing on different processors or cores, and to order accesses to shared data as needed. The appropriate insertion of OpenMP features into a sequential program allows applications to benefit from shared-memory parallel architectures with minimal modification of the code. Because of these advantages, OpenMP has been added to XLiFE++ in order to exploit the considerable parallelism available in the heavy computation parts. In a typical finite element package, most of the total runtime is spent assembling the stiffness matrices and solving the linear systems.
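As a minimal illustration of this programming model (not XLiFE++ code), a sequential loop becomes parallel by inserting a single directive; the compiler and the runtime take care of creating the threads and sharing the iterations among them:
#include <vector>

int main()
{
  std::vector<double> v(1000000, 1.);
  long n = (long)v.size();
  // the directive below is the only change to the sequential code;
  // without OpenMP support it is simply ignored and the loop stays sequential
  #pragma omp parallel for
  for (long i = 0; i < n; i++) v[i] *= 2.;
  return 0;
}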
Sparse matrix-vector multiplication#
In a finite element package, after setting up the stiffness matrices, a linear system has to be solved, which can be done in two ways: (1) by direct methods or (2) by iterative methods. In direct methods, the factorization of the matrix accounts for most of the runtime, while in iterative methods the matrix-vector multiplication plays the central role. Since the matrices arising from the discretization are sparse, an appropriate storage format, a sparse format, is used. XLiFE++ provides several types of sparse matrix storage, which serve different purposes. Although every sparse storage can be used for parallelized sparse matrix-vector multiplication (SpMV), CSR (compressed storage row) is the best candidate among them.
The form of SpMV is \(y=Ax\), where \(A\) is a sparse matrix and \(x\) and \(y\) are dense vectors: \(x\) is the source vector and \(y\) is the destination vector.
In the following, the parallelization algorithms of the matrix-vector multiplication for CSR and CSC (compressed storage column) are described; the parallelized matrix-vector multiplication of the other sparse formats (CSDual, CSSym, SkylineDual, SkylineSym) is based on them.
SpMV of CSR#
Because CSR is a well-known format, only a brief reminder of its structure is given.
For an \(m\times n\) matrix \(A\) with \(nz\) nonzero elements, the CSR format stores three arrays:
\(values[nz]\) stores the value of each nonzero element of the matrix \(A\)
\(colIndex[nz]\) stores the column index of each element of the \(values[nz]\) array
\(rowPointer[m+1]\) stores the index of the first nonzero element of each row, with \(rowPointer[m] = nz\)
For instance, the following \(5\times 6\) matrix
\[
A = \begin{pmatrix}
11 & 12 & 0 & 14 & 0 & 16\\
0 & 22 & 23 & 0 & 0 & 26\\
31 & 32 & 33 & 0 & 0 & 0\\
0 & 0 & 43 & 44 & 0 & 0\\
51 & 0 & 53 & 54 & 55 & 0
\end{pmatrix}
\]
is stored in XLiFE++ by the three following std::vector:
\(values = (0\ 11\ 12\ 14\ 16\ \ 22\ 23\ 26\ \ 31\ 32\ 33\ \ 43\ 44\ \ 51\ 53\ 54\ 55)\)
\(colIndex = (0\ 1\ 3\ 5\ \ 1\ 2\ 5\ \ 0\ 1\ 2\ \ 2\ 3\ \ 0\ 2\ 3\ 4)\)
\(rowPointer = (0\ 4\ 7\ 10\ 12\ 16)\)
In XLiFE++, unlike the “standard” CSR format, the values array contains one extra element \(0\) at the first position.
Since the location information must be stored explicitly as well as the value of each nonzero element, extra communication time is needed to access these location data.
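To fix ideas, here is a minimal serial sketch of the CSR matrix-vector product written with plain std::vector (the names csrMultVector, Number and Real are hypothetical; the actual XLiFE++ implementation works with iterators). It accounts for the leading \(0\) of values: the \(k\)-th stored nonzero is values[k+1] and lies in column colIndex[k].
#include <vector>

typedef unsigned int Number;  // assumption: plain index type
typedef double Real;          // assumption: real-valued matrix

// y = A*x with A stored in CSR format (XLiFE++ convention: values has a leading 0)
void csrMultVector(const std::vector<Real>& values,       // size nz+1
                   const std::vector<Number>& colIndex,   // size nz
                   const std::vector<Number>& rowPointer, // size m+1
                   const std::vector<Real>& x,            // size n
                   std::vector<Real>& y)                  // size m
{
  Number m = rowPointer.size() - 1;
  for (Number i = 0; i < m; i++) {
    Real s = 0.;
    for (Number k = rowPointer[i]; k < rowPointer[i+1]; k++)
      s += values[k+1] * x[colIndex[k]];   // indirect access to x through colIndex
    y[i] = s;
  }
}
The indirect access to \(x\) through colIndex is exactly the extra cost mentioned above.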
As a matrix in CSR format is stored row by row, the simplest way to parallelize the matrix-vector multiplication with OpenMP is to assign each thread of the parallel region one row. However, considering load balancing, scheduling overhead and synchronization overhead, it is better to apply a row partitioning scheme: the matrix is partitioned into blocks of rows according to the number of threads.
// Assign each thread an approximately equal number of nonzero elements
// itpb/itpe are the begin/end iterators of the rowPointer array, itp the current position
Number nnzEachThread = std::floor(nnz/numThread);
// Target index of the last nonzero element of the current range
Number nnzRangeIdx = nnzEachThread;
// Lower and upper bound position (in rowPointer) for each thread
std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread);
std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread);
itp = itpb;
itThreadLower[0] = itpb;
for (int i = 0; i < numThread; i++) {
  itThreadLower[i] = itp;
  nnzRangeIdx = *itp + nnzEachThread;
  // find the row boundary closest to the target number of nonzero elements
  itpLower = std::lower_bound(itp, itpe, nnzRangeIdx);
  itpUpper = std::upper_bound(itpLower, itpe, nnzRangeIdx);
  itp = ((nnzRangeIdx - *(itpUpper-1)) < (*itpUpper - nnzRangeIdx)) ? itpUpper-1 : itpUpper;
  itThreadUpper[i] = itp;
}
// the last thread always takes the remaining rows
itThreadUpper[numThread-1] = itpe;
Each block is assigned to a thread and contains an approximately equal number of nonzero elements. By partitioning this way, each thread operates on its own part of the values, colIndex and rowPointer arrays. All threads access elements of the source vector \(x\); since these accesses are read-only, there is no invalidation traffic. One advantage of row partitioning is that each thread works on its own part of the destination vector \(y\); therefore, there is no need for synchronization among threads.
#pragma omp parallel for default(none) \
private(i, itr, iti, itie, itim, itpLower, itpUpper) \
shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb) \
schedule(dynamic,1)
for (i = 0; i < numThread; i++) {
  // bounds of the block of rows assigned to this iteration
  itpLower = *(itbLower+i);
  itpUpper = *(itbUpper+i);
  for (; itpLower != itpUpper; itpLower++) {
    itr = itrb + (itpLower - itpb);   // destination entry y[row]
    *itr *= 0;                        // reset it, keeping its type
    iti  = itib + *(itpLower);        // first column index of the row
    itie = itib + *(itpLower + 1);    // past-the-end column index of the row
    itim = itm  + *(itpLower);        // first value of the row
    while (iti != itie) {
      *itr += *(itim) * *(itvb + *iti);
      iti++; itim++;
    }
  }
}
However, this coarse-grained approach cannot always ensure load balancing for SpMV with a very large matrix (e.g. one with a million nonzero elements); a granularity factor is a solution to make it more fine-grained. Each thread’s block of rows is divided into smaller pieces, which not only makes the SpMV more fine-grained but also allows threads to take work from one another.
const Number GRANULARITY = 16;
numThread *= GRANULARITY;   // split the rows into numThread*GRANULARITY smaller blocks
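For instance (hypothetical numbers), with 8 hardware threads and GRANULARITY = 16, the rows are split into 128 blocks of roughly nnz/128 nonzero elements each; schedule(dynamic,1) then hands these blocks out one at a time, so a thread that finishes its current block early simply takes the next available one.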
See also
library=largeMatrix,
header=RowCsStorage.hpp,
implementation=RowCsStorage.cpp,
test=test_OpenMP.cpp,
header dep= CsStorage.hpp, config.h, utils.h
SpMV of CSC#
For an \(m\times n\) matrix \(A\) with \(nz\) nonzero elements, the CSC format stores three arrays:
\(values[nz]\) stores the value of each nonzero element of the matrix \(A\)
\(rowIndex[nz]\) stores the row index of each element of the \(values[nz]\) array
\(colPointer[n+1]\) stores the index of the first nonzero element of each column, with \(colPointer[n] = nz\)
For instance, the same \(5\times 6\) matrix as above is stored in XLiFE++ by the three following std::vector:
\(values = (0\ 11\ 31\ 51\ \ 12\ 22\ 32\ \ 23\ 33\ 43\ 53\ \ 14\ 44\ 54\ \ 55\ \ 16\ 26)\)
\(rowIndex = (0\ 2\ 4\ \ 0\ 1\ 2\ \ 1\ 2\ 3\ 4\ \ 0\ 3\ 4\ \ 4\ \ 0\ 1)\)
\(colPointer = (0\ 3\ 6\ 10\ 13\ 14\ 16)\)
In XLiFE++, unlike the “standard” CSC format, the \(values\) array contains one extra element \(0\) at the first position.
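As for CSR, a minimal serial sketch of the CSC matrix-vector product can be written with plain std::vector (hypothetical names again, not the actual XLiFE++ code). It shows that the contributions are scattered into the destination vector \(y\), which is the source of the difficulties discussed next.
#include <algorithm>
#include <vector>

typedef unsigned int Number;  // assumption: plain index type
typedef double Real;          // assumption: real-valued matrix

// y = A*x with A stored in CSC format (XLiFE++ convention: values has a leading 0)
void cscMultVector(const std::vector<Real>& values,       // size nz+1
                   const std::vector<Number>& rowIndex,   // size nz
                   const std::vector<Number>& colPointer, // size n+1
                   const std::vector<Real>& x,            // size n
                   std::vector<Real>& y)                  // size m
{
  std::fill(y.begin(), y.end(), 0.);
  Number n = colPointer.size() - 1;
  for (Number j = 0; j < n; j++)
    for (Number k = colPointer[j]; k < colPointer[j+1]; k++)
      y[rowIndex[k]] += values[k+1] * x[j];   // scatter into y
}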
Because the “column-wise” storage of CSC conflicts with the natural “row-wise” processing of the matrix-vector multiplication, the OpenMP parallelization of the CSC matrix-vector multiplication aims at limiting the effect of this “column-wise” characteristic.
Since a CSC matrix is stored column by column, the simplest way to parallelize the matrix-vector multiplication with OpenMP is to have each thread process one column at a time and contribute its result to the destination vector with a “reduction-like” operation. Obviously, this approach is not efficient: not only does each thread process non-adjacent columns, which causes cache misses when many threads are used, but load-balancing problems can also arise because of the differences in the number of nonzero elements per column. An improved approach is to divide the matrix into groups of columns, each of which has an approximately equal number of nonzero elements. Because all the columns in each group are adjacent, the effect of cache misses is limited. Moreover, load imbalance hardly occurs thanks to the equal number of nonzero elements processed by each thread.
The following code implements the approach described above.
// Lower and upper bound position (in colPointer) for each thread
// itpb/itpe are the begin/end iterators of the colPointer array, itp the current position
std::vector<std::vector<Number>::const_iterator> itThreadLower(numThread);
std::vector<std::vector<Number>::const_iterator> itThreadUpper(numThread);
std::vector<Number>::const_iterator itpLower, itpUpper;
// Smallest and largest row indices corresponding to each thread
std::vector<Number> sRowIdx(numThread);
std::vector<Number> lRowIdx(numThread);
itp = itpb;
itThreadLower[0] = itpb;
for (int i = 0; i < numThread; i++) {
  itThreadLower[i] = itp;
  nnzRangeIdx = *itp + nnzEachThread;
  // find the column boundary closest to the target number of nonzero elements
  itpLower = std::lower_bound(itp, itpe, nnzRangeIdx);
  itpUpper = std::upper_bound(itpLower, itpe, nnzRangeIdx);
  itp = ((nnzRangeIdx - *(itpUpper-1)) < (*itpUpper - nnzRangeIdx)) ? itpUpper-1 : itpUpper;
  itThreadUpper[i] = itp;
}
// the last thread always takes the remaining columns
itThreadUpper[numThread-1] = itpe;
std::vector<std::vector<Number>::const_iterator>::const_iterator itbLower, itbUpper;
itbLower = itThreadLower.begin();
itbUpper = itThreadUpper.begin();
The serial code above splits a CSC matrix into blocks of columns, each of which has nearly the same number of nonzero elements. After that, each thread processes one block of columns, stores the result in its own temporary vector and writes this vector into the destination vector with a “reduction-like” operation. Note that the temporary vector of each thread necessarily has the same size as the destination vector; this is the trade-off of this method.
After finishing its own part of the matrix-vector multiplication, each thread writes its result into the destination vector. Since array reduction is not supported in OpenMP for C++ (some projects have tried to make this feature available, but it does not seem ready for the near future), the only way, at the moment, is to use a critical section, which is clearly coarse-grained.
typedef typename IterationVectorTrait<ResIterator>::Type ResType;
#pragma omp parallel \
private(i, tid, itr, iti, itie, itim, itpLower, itpUpper, itv) \
shared(numThread, itbLower, itbUpper, itpb, itib, itm, itvb, itrb, nRows)
{
  tid = omp_get_thread_num();
  // temporary result vector, local to each thread, same size as the destination vector
  std::vector<ResType> resTemp(nRows);
  typename std::vector<ResType>::iterator itbResTemp = resTemp.begin(), iteResTemp = resTemp.end(), itResTemp = itbResTemp;
  #pragma omp for nowait schedule(dynamic,1)
  for (i = 0; i < numThread; i++) {
    // bounds of the block of columns assigned to this iteration
    itpLower = *(itbLower+i);
    itpUpper = *(itbUpper+i);
    for (; itpLower != itpUpper; itpLower++) {
      itv  = itvb + (itpLower - itpb);  // source entry x[column]
      iti  = itib + *(itpLower);        // first row index of the column
      itie = itib + *(itpLower + 1);    // past-the-end row index of the column
      itim = itm  + *(itpLower);        // first value of the column
      while (iti != itie) {
        itResTemp = itbResTemp + (*iti);
        *itResTemp += *itim * *(itv);
        iti++; itim++;
      }
    }
  }
  // "reduction-like" step: accumulate the thread-local vector into the destination vector
  #pragma omp critical (updateResult)
  {
    for (itResTemp = itbResTemp; itResTemp != iteResTemp; itResTemp++) {
      itr = (itrb + (itResTemp - itbResTemp));
      *itr += *itResTemp;
    }
  }
}
See also
library=largeMatrix,
header=ColCsStorage.hpp,
implementation=ColCsStorage.cpp,
test=test_OpenMP.cpp,
header dep= CsStorage.hpp, config.h, utils.h
Sparse matrix factorization#
As mentioned above, one way to solve a linear system is to use a direct method, which relies on a matrix factorization. First, the costly factorization is performed; second, the factorized matrix is used to solve the linear system \(Ax=b\). So far, XLiFE++ supports some popular factorization methods: LU, LDLt and LDL*; all of them have a parallelized version with OpenMP. Because of the factorization algorithm, only the skyline storage is suitable for factorization.
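For reference, once the factorization \(A=LU\) is computed, the system \(Ax=b\) is solved by two triangular solves:
\[
Ax = b \;\Longleftrightarrow\; LUx = b \;\Longleftrightarrow\;
\begin{cases} Ly = b & \text{(forward substitution)}\\ Ux = y & \text{(backward substitution)} \end{cases}
\]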
Because LDLt can be considered a special case of LU, only the parallelized LU factorization is described below.
The idea behind the multi-threaded factorization is simple: instead of performing the factorization element by element as in the serial version, the algorithm works block by block. At each iteration, the diagonal block is processed first, then the blocks of the row and of the column corresponding to this diagonal block.
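Schematically, viewing \(A\) as a grid of blocks, iteration \(k\) of a generic right-looking block LU factorization performs the following operations (this is only a sketch; the XLiFE++ skyline solvers shown below follow the same pattern of one diagonal block followed by the blocks of its row and column, restricted to the blocks that are actually stored):
\[
\begin{aligned}
A_{kk} &= L_{kk}U_{kk}, && \text{factorization of the diagonal block,}\\
U_{kj} &= L_{kk}^{-1}A_{kj},\ j>k, && \text{blocks of the }k\text{-th block row,}\\
L_{ik} &= A_{ik}U_{kk}^{-1},\ i>k, && \text{blocks of the }k\text{-th block column,}\\
A_{ij} &\leftarrow A_{ij} - L_{ik}U_{kj},\ i,j>k, && \text{update of the remaining blocks.}
\end{aligned}
\]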
First, the diagonal block is computed:
#pragma omp single
diagBlockSolverParallel(k*blockSize, blockSizeRow[k], it_rownx,
                        k*blockSize, blockSizeCol[k], it_colnx,
                        it_fd, it_fl, it_fu,
                        it_md, it_ml, it_mu);
After that, the blocks of the corresponding row and column can be computed at the same time:
#pragma omp for nowait
for (jj = k+1; jj < nbBlockCol; jj++) {
  // blocks of the k-th block row (upper part)
  #pragma omp task untied firstprivate(k, jj) \
  shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fl, it_fu, it_mu)
  upperBlockSolverParallel(k*blockSize, blockSizeRow[k], it_rownx,
                           jj*blockSize, blockSizeCol[jj], it_colnx,
                           it_fl, it_fu, it_mu);
}
#pragma omp for
for (ii = k+1; ii < nbBlockRow; ii++) {
  // blocks of the k-th block column (lower part)
  #pragma omp task untied firstprivate(k, ii) \
  shared(blockSize, blockSizeRow, blockSizeCol, it_rownx, it_colnx, it_fd, it_fl, it_fu, it_ml)
  lowerBlockSolverParallel(ii*blockSize, blockSizeRow[ii], it_rownx,
                           k*blockSize, blockSizeCol[k], it_colnx,
                           it_fd, it_fl, it_fu, it_ml);
}
In this code, we use the task directive, one of the more recent OpenMP features, supported only since OpenMP 3.0. By using this directive, we can take advantage of the taskification of the code and avoid potential load-balancing problems.
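For readers not familiar with this directive, the following minimal, self-contained example (not XLiFE++ code) shows how tasks let any idle thread of a parallel region pick up pending work:
#include <cstdio>

void processBlock(int b) { std::printf("processing block %d\n", b); }

int main()
{
  const int nbBlocks = 8;
  #pragma omp parallel
  {
    #pragma omp single       // a single thread creates the tasks...
    for (int b = 0; b < nbBlocks; b++) {
      #pragma omp task firstprivate(b)
      processBlock(b);       // ...and any idle thread executes them
    }
  }  // implicit barrier: all tasks are completed here
  return 0;
}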
One common problem when working with a blocking algorithm is the choice of the block size, which can change the performance of the program significantly; it is impossible to find a single optimal value for blockSize. After some experiments with typical sparse matrices of XLiFE++, the following simple formula is used to specify blockSize.
// number of blocks: about 5% of the smallest dimension
const Number BLOCKFACTOR = 0.05*std::min(nbOfRows(), nbOfColumns());
Number numBlockMin = BLOCKFACTOR;
// size of a (square) block
Number blockSize = std::floor(diagonalSize()/numBlockMin);
Number nbBlockRow = std::ceil(nbOfRows()/blockSize);
Number nbBlockCol = std::ceil(nbOfColumns()/blockSize);
Number numBlock = std::min(nbBlockRow, nbBlockCol);
// the last block of rows/columns absorbs the remainder
std::vector<Number> blockSizeRow(nbBlockRow, blockSize), blockSizeCol(nbBlockCol, blockSize);
blockSizeRow[nbBlockRow-1] = nbRows_ - (nbBlockRow-1)*blockSize;
blockSizeCol[nbBlockCol-1] = nbCols_ - (nbBlockCol-1)*blockSize;
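For instance, for a square matrix with 10000 rows and columns, these formulas give numBlockMin = 500 and blockSize = 20, so the factorization works on a 500 × 500 grid of 20 × 20 blocks, the last block row and block column absorbing any remainder when the division is not exact.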
Again, this value is only “good” for a certain class of sparse matrices; it may need to be changed in the future.