dpl - Dalsoft's Parallel Library - is a library of routines that provide parallel implementations of a number of inherently sequential algorithms. For the user's manual of this product click here.

In this article we describe the functions/algorithms implemented by the library and analyze their accuracy and performance.

functions/algorithms implemented by the library

dpl allows parallel execution of a number of sequential algorithms. As the supported algorithms are inherently sequential, significant mathematical transformations are necessary for their parallel implementation.

Basic Stencils

dpl offers the following functions that provide parallel implementation for the various serial stencils:

One-point stencils

dpl_stencil_1p

x[i] = b(i) + a(i)*x[i-STRIDE]

dpl_stencil_o_1p

x[i] = c(i)*x[i] + b(i) + a(i)*x[i-STRIDE]

dpl_stencil_div_1p

x[i] = b(i) + a(i)/x[i-STRIDE]

dpl_stencil_o_div_1p

x[i] = c(i)*x[i] + b(i) + a(i)/x[i-STRIDE]
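
For illustration, the sequential formula that dpl parallelizes for the one-point stencils can be written as the minimal serial loop below; the function name and signature are ours, not dpl's code, and we assume the first stride elements of x hold the initial values.

/* serial reference for x[i] = b(i) + a(i)*x[i-STRIDE]; the div variants
   replace the multiplication by a division */
void stencil_1p_serial( unsigned long n, unsigned long stride,
                        const double *a, const double *b, double *x )
{
    for ( unsigned long i = stride; i < n; i++ )
        x[i] = b[i] + a[i] * x[i - stride];
}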

Two-point stencils

dpl_stencil_2p

x[i] = a(i)*x[i-1] + b(i) + c(i)*x[i-2]

dpl_stencil_o_2p

x[i] = d(i)*x[i] + b(i) + a(i)*x[i-1] + c(i)*x[i-2]

Accumulative stencils

dpl_stencil_acc

acc = b(i) + a(i)*acc

dpl_stencil_div_acc

acc = b(i) + a(i)/acc

General stencil

dpl_stencil

x[i] = b(i) + ∑_{j<i} a(i,j)*x[j]
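
Again for illustration only, a serial reference for the general stencil may look as follows; we assume the coefficients a(i,j) are stored row-major in an n x n array, and the function name is ours, not dpl's code.

/* serial reference for x[i] = b(i) + sum over j<i of a(i,j)*x[j] */
void stencil_general_serial( unsigned long n, const double *a,
                             const double *b, double *x )
{
    for ( unsigned long i = 0; i < n; i++ )
    {
        double s = b[i];
        for ( unsigned long j = 0; j < i; j++ )
            s += a[i*n + j] * x[j];
        x[i] = s;
    }
}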

Parallel implementation of conditional functions

Plain conditional functions

dpl_stencil_min

x[i] = min( a(i)*x[i-1], b(i) )

dpl_stencil_max

x[i] = max( a(i)*x[i-1], b(i) )
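
For illustration, the sequential formula behind these conditional stencils is the trivial loop below; the name and signature are ours, we assume x[0] holds the initial value, and the min variant simply uses fmin instead of fmax.

#include <math.h>

/* serial reference for x[i] = max( a(i)*x[i-1], b(i) ) */
void stencil_max_serial( unsigned long n, const double *a, const double *b, double *x )
{
    for ( unsigned long i = 1; i < n; i++ )
        x[i] = fmax( a[i]*x[i-1], b[i] );
}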

Stochastic ordering

dpl's function dpl_max_st

int dpl_max_st(
unsigned long N,
const double *a,
const double *b,
double *x
)

performs parallel execution of the stochastic ordering of the input vectors a and b of size N, calculating the output vector x of size N as the solution of the following system of equations:

for every i = 0...N-1
∑( j=i...N-1:x[j] ) = max( ∑( j=i...N-1:a[j] ), ∑( j=i...N-1:b[j] ) )
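
For illustration only, a serial solution of this system can be obtained from suffix sums; the sketch below uses names of our choosing and is not dpl_max_st's parallel algorithm.

/* serial reference: make every suffix sum of x equal to the maximum of the
   corresponding suffix sums of a and b */
void max_st_serial( unsigned long N, const double *a, const double *b, double *x )
{
    double sa = 0.0, sb = 0.0, prev = 0.0;    /* prev = suffix sum of x starting at i+1 */
    for ( unsigned long i = N; i-- > 0; )     /* i = N-1 ... 0 */
    {
        sa += a[i];                           /* suffix sum of a starting at i */
        sb += b[i];                           /* suffix sum of b starting at i */
        double s = ( sa > sb ) ? sa : sb;     /* required suffix sum of x starting at i */
        x[i] = s - prev;
        prev = s;
    }
}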

Parallel implementation of the Gauss–Seidel method to solve a linear system of equations

dpl's function dpl_gs performs parallel execution of the Gauss-Seidel method to solve a linear system of equations ( see this for the description of the implemented Gauss-Seidel method ). The provided implementation consistently outperforms the serial implementation of the Gauss-Seidel method by a huge margin.
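
For reference, the kind of serial code dpl_gs is compared against is a textbook Gauss-Seidel sweep such as the one below; names, storage layout and convergence test are ours, not dpl's internal parallel algorithm.

#include <math.h>

/* textbook serial Gauss-Seidel iteration for A*x = b, A stored row-major n x n;
   returns 0 on convergence, 1 otherwise */
int gauss_seidel_serial( unsigned long n, const double *A, const double *b,
                         double *x, double tol, unsigned long max_iter )
{
    for ( unsigned long it = 0; it < max_iter; it++ )
    {
        double max_delta = 0.0;
        for ( unsigned long i = 0; i < n; i++ )
        {
            double s = b[i];
            for ( unsigned long j = 0; j < n; j++ )
                if ( j != i )
                    s -= A[i*n + j] * x[j];   /* x[j] already updated for j < i */
            double xi = s / A[i*n + i];
            double d  = fabs( xi - x[i] );
            if ( d > max_delta ) max_delta = d;
            x[i] = xi;
        }
        if ( max_delta < tol ) return 0;      /* converged */
    }
    return 1;                                 /* no convergence within max_iter sweeps */
}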

Parallel implementation of the general 2-D 5-point stencil

dpl's function dpl_p1p0_p0p1_p0p0_p0n1_n1p0

int dpl_p1p0_p0p1_p0p0_p0n1_n1p0(
unsigned long rows, unsigned long cols,
const double *ap1p0,
const double *ap0p1,
const double *ap0p0,
const double *ap0n1,
const double *an1p0,
const double *b,
double *x
)

performs parallel execution of the general 2-D 5-point stencil as follows:
for ( r = 1 ; r < rows - 1; r++ )
 {
  for ( c = 1 ; c < cols - 1 ; c++ )
   {
    x[r][c] = x[r-1][c]*ap1p0[r][c] +
              x[r][c-1]*ap0p1[r][c] +
              x[r][c]*ap0p0[r][c] +
              x[r][c+1]*ap0n1[r][c] +
              x[r+1][c]*an1p0[r][c] +
              b[r][c];
   }
 }

Implementation Note

dpl achieves parallel implementation of the supported functions by performing mathematical transformations on the underlying algorithm, producing a parallel algorithm; of course, when programming this algorithm ( in C ) care is taken with many aspects of computation such as cache utilization, load balancing, process synchronization etc.

Analysis of the performance and accuracy of the implemented functions

We performed a comprehensive study of the parallel implementation of the functions from dpl, analyzing the accuracy of the generated results and evaluating their performance; we compared this data to data collected while executing the corresponding serial code and/or library functions.
The results are listed below.
When comparing the timing of different kinds of code, the following formula was used:

1 - ( timeOfTheParallelCode / timeOfTheReferenceCode )

For example, if the reference code ran for 8.97 seconds and the parallel code ran for 3.31 seconds, an improvement of 63% was reported. Note that the improvement calculated in such a way never exceeds 100%. Also note that an improvement of 50% means that the parallel code is twice as fast as the reference code, and an improvement of 75% means that the parallel code is four times faster than the reference code.

While studying the basic stencils we simply executed the code on as many multi-core CPUs as we could get access to.

Studying the Gauss-Seidel method to solve a linear system of equations and the general 2-D 5-point stencil, as well as evaluating the use of the dpl library in benchmarks, was much more comprehensive, as we attempted to achieve many goals. The code was executed under the Linux operating system on the Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz with 16 threads, with different numbers of threads allowed - that was established using the standard OpenMP/bash command sequence:

OMP_NUM_THREADS=number_of_threads;export OMP_NUM_THREADS;

Such an approach ensures that the execution environments differ only by the number of threads utilized; the latency of instructions, memory access costs and sizes of caches are the same.
The thread counts we studied are 6, 8, and 10, which correspond to the thread counts available on commonly sold workstation CPUs.

The codes were executed on Microway's compute cluster; we are grateful for the HPC time provided by Microway, Inc.

Note that stencil computations are memory bound, thus stencil code efficiency is often impaired by high memory access costs. As of now, memory access remains the weak point of multi-core CPUs, therefore expectations of the improvements of parallel stencil code should be kept reasonable.
To achieve performance gains, code that uses dpl should be executed on a robust multi-core CPU. Running such code on cloud-based alternatives to physical hardware or on WSL didn't show much of an improvement.

Analysis of the accuracy and performance of the implemented basic stencils

Accuracy

Because dpl implements the supported algorithms differently from the way a sequential implementation of the corresponding formula does, it is necessary to verify:
how different the results generated by the parallel code are from the results generated by the corresponding sequential code

Also, due to the inexact nature of floating point calculation, the results generated by the sequential code may not be fully trusted. It is therefore necessary to verify:
how different the results generated by the parallel code are from the precise ( calculated with unlimited precision ) results of the sequential algorithm

The implemented verification process is almost identical to the one described in the chapter on testing dco parallelization of a stencil, found here.
The results of this verification show that, in most cases,
the difference between parallel and sequential results is acceptably small.

However, there are exceptions to this - the functions that use division:
dpl_stencil_div_1p, dpl_stencil_o_div_1p and dpl_stencil_div_acc.

Let us study such an exception - a case where the difference between the result generated by the dpl-provided parallel code and the result generated by the corresponding sequential code appears to be "big".

In most cases the results to be verified are vectors of double precision values.
We define norms that will help with the evaluation. For vectors x and x1 of size ( dimension ) n we define:

Maximal Relative Difference ( MaxRelDiff )
max( fabs( 2.*( x[i] - x1[i] ) )/( fabs( x[i] ) + fabs( x1[i] ) ) ), i=0,...,n-1

Vector Relative Difference ( VectorRelDiff )
|x - x1|/|x1|
where
|vector| = sqrt( ∑( vector[i]² ) )

Vector Normal Difference ( VectorNormDiff )
max( | x[i] - x1[i] | ), for i = 0,...,n-1
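
Straightforward implementations of these norms ( helper names are ours; the relative norms assume the denominators are nonzero ) may look as follows:

#include <math.h>

double max_rel_diff( unsigned long n, const double *x, const double *x1 )
{
    double m = 0.0;
    for ( unsigned long i = 0; i < n; i++ )
    {
        double d = fabs( 2.*( x[i] - x1[i] ) )/( fabs( x[i] ) + fabs( x1[i] ) );
        if ( d > m ) m = d;
    }
    return m;
}

double vector_rel_diff( unsigned long n, const double *x, const double *x1 )
{
    double num = 0.0, den = 0.0;
    for ( unsigned long i = 0; i < n; i++ )
    {
        num += ( x[i] - x1[i] )*( x[i] - x1[i] );
        den += x1[i]*x1[i];
    }
    return sqrt( num )/sqrt( den );
}

double vector_norm_diff( unsigned long n, const double *x, const double *x1 )
{
    double m = 0.0;
    for ( unsigned long i = 0; i < n; i++ )
    {
        double d = fabs( x[i] - x1[i] );
        if ( d > m ) m = d;
    }
    return m;
}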

Comprehensive evaluation of the function dpl_stencil_o_div_1p using a drt-based evaluation procedure ( see this for more information ) reported the following data for the worst case detected:

stencil_div_o_1p cmp_parallel_serial_loop
MaxRelDiff:        3.265649e-04
VectorRelDiff:     2.630082e-04
VectorNormDiff:    1.710950e-01
Happen in the test case 3804350 at the index 875 of 1000 with the stride 1
dplRslt=-523.83780351572954714
exRslt= -524.00889851128829378
Total 7119197, Attempted 7119197, Done 7119197 cases for 1000 seconds error count = 10152.
Random number generator seed used 1596510236.

The data seems disappointing: the result derived by using the dpl parallel library ( dplRslt ) agrees with the result generated by the sequential code ( exRslt ) in fewer than 4 decimal places. Also, all the norms show a low level of agreement.
Let us investigate this case further.
First, we created a serial implementation of the function dpl_stencil_o_div_1p using the routines from Dalsoft's High Precision ( dhp ) library ( see this for details ) - the result produced by this implementation we will call preciseRslt; we assume that the preciseRslt, generated utilizing floating point numbers with a 1024-bit mantissa, is fully accurate when expressed as a 64-bit double precision value.
Second, taking advantage of the functionality supported by drt, we recreated the case to be studied ( establishing the seed used and skipping 3804350 cases to reach the case where the problem was observed ) and performed the following testing on it: comparing the result derived by using the dpl parallel library - dplRslt - with the preciseRslt produced the following data:
stencil_div_o_1p cmp_parallel_serialhp
Dim.errs.chsums 1000 0 0 -6777.462714921427505
                         -6774.995656373027487
MaxRelDiff     3.925580826263917e-04
VectorRelDiff  3.161913979263145e-04
VectorNormDiff 2.055964094892033e-01

Comparing the result generated by the sequential code - exRslt - with the preciseRslt produced the following data:
stencil_div_o_1p cmp_serial_serialhp
Dim.errs.chsums 1000 0 0 -6779.515773068097587
                         -6774.995656373027487
MaxRelDiff     7.191229955145954e-04
VectorRelDiff  5.793222860496758e-04
VectorNormDiff 3.766914050479500e-01

The following table summarizes the above results.
Each row represents the norm being used.
The first column contains values for these norms when comparing the result derived by using dpl parallel library - dplRslt - with the preciseRslt.
The second column contains values for these norms when comparing results generated by the sequential code - exRslt - with the preciseRslt.

                  dplRslt vs preciseRslt      exRslt vs preciseRslt
MaxRelDiff        3.925580826263917e-04       7.191229955145954e-04
VectorRelDiff     3.161913979263145e-04       5.793222860496758e-04
VectorNormDiff    2.055964094892033e-01       3.766914050479500e-01

Analyzing the data from this table, it becomes clear that preciseRslt is closer to dplRslt than to exRslt - all the norms for dplRslt vs preciseRslt are smaller than the corresponding norms for exRslt vs preciseRslt.
The problem we observed ( low precision ) seems to be due to the fact that the function dpl_stencil_o_div_1p, which performs division, is inherently unstable and thus may not be used on arbitrary ( randomly generated ) input.

Analysis of the performance of the implemented basic stencils

Reasons for the problematic performance

The goal of this project is, by using parallelization, to achieve a performance gain over the performance of the serial code. That is not always achievable. In this section we provide the reasons for that and offer guidelines for obtaining the desired performance gain.

Invoking and executing parallel code causes significant performance penalties. In order to achieve performance gains ( in spite of these performance penalties ) the supported functions are executed in parallel mode only if certain conditions are met. These conditions depend on the size of the problem ( number of elements to be processed ) and the characteristics of the CPU that will execute the parallel code ( e.g. the number of threads available ). The restrictions listed below are not exact and in certain cases may not suffice to achieve performance gains.

The following table establishes the conditions for the dpl implemented parallel code to be faster than the corresponding serial code: for every implemented function, the minimal size of the problem ( number of elements to be processed ) is listed for each number of threads available. Note that the effect of the STRIDE parameter on the execution time of the parallel code may be detrimental; the values listed in the table are for STRIDE equal to 1.
Each row represents a basic stencil provided by dpl. Each column lists the number of threads available. The value in the table is the minimal size of the problem ( number of elements to be processed ) for the parallel code to be faster than the corresponding sequential code - dpl checks for these conditions and, if they are not fulfilled, the parallel execution is aborted and an error is returned.

                                              number of threads available
                                                2      3      4      5     >5

dpl_stencil_1p                                850   1000    900    900   1125
    x[i] = b(i) + a(i)*x[i-1]
dpl_stencil_o_1p                             1250   1250    900    900    950
    x[i] = c(i)*x[i] + b(i) + a(i)*x[i-1]
dpl_stencil_div_1p                            750    600    500    600    600
    x[i] = b(i) + a(i)/x[i-1]
dpl_stencil_o_div_1p                          650    600    500    500    650
    x[i] = c(i)*x[i] + b(i) + a(i)/x[i-1]
dpl_stencil_acc                              1250   1125   1250   1250   1450
    acc = b(i) + a(i)*acc
dpl_stencil_div_acc                           300    400    450    450    550
    acc = b(i) + a(i)/acc
dpl_stencil_2p                                (1)   4000   2500   1750   2250
    x[i] = a(i)*x[i-1] + b(i) + c(i)*x[i-2]
dpl_stencil_o_2p                              (1)   2000   1000   1250   1000
    x[i] = d(i)*x[i] + b(i) + a(i)*x[i-1] + c(i)*x[i-2]
dpl_stencil                                   150    175    200    200    250
    x[i] = b(i) + ∑_{j<i} a(i,j)*x[j]

(1)  for the specified number of cores the dpl generated parallel code will always be slower than the corresponding serial code

Timing

The following table presents the results of comparing the execution time achieved by the parallel version with the execution time achieved by the sequential version of the same algorithm.
Each row represents a basic stencil provided by dpl. Each column lists the CPU used in the comparison. The value in the table is the improvement of the execution time of the parallel code over the execution time of the sequential code achieved on the specified CPU - see this for the explanation of how these values were calculated.

All codes were executed under the Linux operating system on four different CPUs:
  1. vintage: Intel(R) Core(TM) Duo E8600 @ 3.33GHz CPU with 2 threads
  2. mainstream: Intel(R) Core(TM) i5-9400 @ 2.90GHz CPU with 6 threads
  3. advanced: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz with 8 threads
  4. extreme: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz with 16 threads
It was attempted to run the code under the same conditions on a system with the minimal possible load. Every code segment was executed 3 times, with the execution time used for the calculation of the improvement being neither the best nor the worst.

The size of the problem ( number of elements to be processed ) was set to 10000 for all the functions except dpl_stencil - for this function the size of the problem ( number of elements to be processed ) was set to 1000.
The same arguments were used when executing the sequential and the parallel codes. The arguments were double precision random numbers in the range [-1.5;1.5] generated using Dalsoft's Random Testing ( drt ) package - see this for more information.

                                       Core(TM)      Core(TM)     Xeon(R)          Xeon(R)
                                       Duo E8600     i5-9400      Silver 4110(1)   Gold 6226R(1)

dpl_stencil_1p                           54%           51%           71%              73%
    x[i] = b(i) + a(i)*x[i-1]
dpl_stencil_o_1p                         30%           47%           58%              74%
    x[i] = c(i)*x[i] + b(i) + a(i)*x[i-1]
dpl_stencil_div_1p                       23%           58%           76%              82%
    x[i] = b(i) + a(i)/x[i-1]
dpl_stencil_o_div_1p                     24%           52%           69%              83%
    x[i] = c(i)*x[i] + b(i) + a(i)/x[i-1]
dpl_stencil_acc                          26%           53%           72%              71%
    acc = b(i) + a(i)*acc
dpl_stencil_div_acc                      64%           68%           84%              86%
    acc = b(i) + a(i)/acc
dpl_stencil_2p                           (2)           41%           66%              72%
    x[i] = a(i)*x[i-1] + b(i) + c(i)*x[i-2]
dpl_stencil_o_2p                         (2)           41%           66%              72%
    x[i] = d(i)*x[i] + b(i) + a(i)*x[i-1] + c(i)*x[i-2]
dpl_stencil                              43%           73%           63%              60%
    x[i] = b(i) + ∑_{j<i} a(i,j)*x[j]

(1)  the benchmarks were executed on Microway's compute cluster; we are grateful for the HPC time provided by Microway, Inc.
(2)  the dpl generated parallel code will always be slower than the corresponding serial code

Analysis of the parallel implementation of conditional functions

Analysis of the parallel implementation of the plain conditional functions

Accuracy

The results of this verification show that, on the same inputs, the differences between the results generated by dpl_stencil_min and dpl_stencil_max and the results generated by the serial implementation of the corresponding function are acceptably small.

Timing

The algorithm used by dpl to implement the plain conditional functions is very complicated, and the execution time of the parallel functions dpl_stencil_min and dpl_stencil_max depends on the values of the a vector and, to a lesser extent, on the values of the b vector.
The following table presents the improvement of the parallel function dpl_stencil_max over the serial version of the same function. Each row represents the number of elements being used; the vectors a and b consist of randomly generated double precision values; the generation of random values was done using Dalsoft's Random Testing ( drt ) package - see this for more information.

number of elements    improvement
10000                 54%
100000                84%
1000000               88%

However, there are distributions of a and b whose processing produces less spectacular improvements, although no case in which the serial implementation is faster than the corresponding parallel version was observed.

Analysis of the parallel implementation of the stochastic ordering

Accuracy

The results of this verification show that, on the same inputs, the differences between the results generated by dpl_max_st and the results generated by the serial implementation of this algorithm are acceptably small.

Timing

The following table presents the improvement of the parallel function dpl_max_st over the serial version of the same function. Each row represents the number of elements being used; the vectors a and b consist of double precision values generated using Dalsoft's Random Testing ( drt ) package - see this for more information.

number of elements    improvement
10000                 56%
100000                63%
1000000               54%

Analysis of the parallel implementation of the Gauss-Seidel method to solve a linear system of equations

The following are the results of comparing the dpl_gs function with the serial implementation of the Gauss-Seidel method and with a library routine that solves a linear system of equations. For every case in this study the arguments used were double precision random numbers generated using Dalsoft's Random Testing ( drt ) package ( see this for more information ); a strictly diagonally dominant matrix was created using the described random number generator.

Comparison of the performance and accuracy of the parallel implementation of the Gauss-Seidel method to the serial implementation of the Gauss-Seidel method to solve a linear system of equations

Accuracy

The results of this verification show that, on the same inputs, the differences between the results generated by dpl_gs and the results generated by the serial implementation of the Gauss-Seidel method to solve a linear system of equations are acceptably small.

Timing

The following graph presents results of the performance comparison between parallel and serial implementations of the Gauss-Seidel method to solve a linear system of equations:

Comparison of the performance and accuracy of the parallel implementation of the Gauss-Seidel method to the performance and accuracy of a library routine that solves a linear system of equations

Here we compare the dpl_gs function from the dpl library to the function LAPACKE_dgesv from LAPACK under OpenBLAS, using the LAPACKE C interface ( to LAPACK ).
Both of these functions solve the linear system Ax=b, where
A is the input matrix
b is the input right-hand vector
x is the solution of the linear system Ax=b

Every call to dpl_gs solves one linear system, specified by the input arguments, utilizing the available threads.
LAPACKE_dgesv allows, for a given matrix, solving many linear systems with different right-hand vectors - the input matrix is transformed into an optimal representation and every linear system is solved "quickly" using this optimal matrix.
For the comparison to be fair ( especially when monitoring execution time ) we utilize the following procedure ( a sketch is given after the list ):
  1. pick a number nrh of right-hand vectors to process
  2. generate nrh vectors
  3. run nrh iterations, performing dpl_gs for every right-hand vector generated
  4. perform LAPACKE_dgesv for all the right-hand vectors generated
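
The sketch below illustrates step 4 using the standard LAPACKE C interface; the helper name and buffer layout are ours, and the per-vector dpl_gs calls of step 3 are only indicated by a comment because the exact dpl_gs signature is given in the dpl user's manual and not reproduced here.

#include <stdlib.h>
#include <lapacke.h>

/* step 4: one LAPACKE_dgesv call factors A once and solves all nrh right-hand
   vectors; A is n x n row-major and is overwritten by its LU factors, B is
   n x nrh ( right-hand vectors stored as columns ) and is overwritten by the
   solutions.  Step 3 would instead call dpl_gs once per right-hand vector. */
lapack_int solve_all_with_dgesv( lapack_int n, lapack_int nrh, double *A, double *B )
{
    lapack_int *ipiv = malloc( (size_t)n * sizeof *ipiv );
    lapack_int info = LAPACKE_dgesv( LAPACK_ROW_MAJOR, n, nrh, A, n, ipiv, B, nrh );
    free( ipiv );
    return info;                              /* 0 on success */
}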

Accuracy

The results of this verification show that, on the same input, the differences between dpl_gs generated results and LAPACKE_dgesv generated results are acceptably small.

Timing

The execution time of processing nrh cases by dpl_gs, as reported by our test suite, is equal to

Tsetup_nrh + nrh*Tgs

where Tsetup_nrh is the time spent setting up the nrh cases and Tgs is the time needed by dpl_gs to solve one linear system. We assume that the execution time of processing nrh cases by LAPACKE_dgesv, as reported by our test suite, is equal to

Tsetup_nrh + Tdgesv_preprocess + nrh*Tdgesv

where Tdgesv_preprocess is the one-time cost of transforming the input matrix into its optimal representation and Tdgesv is the time needed to solve one right-hand vector using that representation. Under such assumptions, provided that Tgs > Tdgesv, for any size of the matrix there is an nrh such that solving nrh equations by LAPACKE_dgesv will be faster than solving the same nrh equations by dpl_gs.
Here we calculate the break-even point nrhEven - the number of cases for which, for a given size, the execution times of processing nrhEven cases by dpl_gs and by LAPACKE_dgesv are the same.
For every size of the problem we studied, the following procedure was utilized:
  1. pick a reasonably large number of equations nrh to be solved - 100 in this study
  2. run the benchmark without executing either dpl_gs or LAPACKE_dgesv - the execution time reported will be equal to Tsetup_nrh
  3. run the benchmark executing dpl_gs - the execution time reported will be equal to Ttotal_nrh_gs = Tsetup_nrh + nrh*Tgs
  4. run the benchmark executing LAPACKE_dgesv - the execution time reported will be equal to Ttotal_nrh_dgesv = Tsetup_nrh + Tdgesv_preprocess + nrh*Tdgesv
  5. double the number of equations and run the benchmark without executing either dpl_gs or LAPACKE_dgesv - the execution time reported will be equal to Tsetup_2*nrh
  6. double the number of equations and run the benchmark executing LAPACKE_dgesv - the execution time reported will be equal to Ttotal_2*nrh_dgesv = Tsetup_2*nrh + Tdgesv_preprocess + 2*nrh*Tdgesv
To summarize, after applying the above procedure, we calculated

Tgs = ( Ttotal_nrh_gs - Tsetup_nrh ) / nrh
Tdgesv = ( ( Ttotal_2*nrh_dgesv - Tsetup_2*nrh ) - ( Ttotal_nrh_dgesv - Tsetup_nrh ) ) / nrh
Tdgesv_preprocess = ( Ttotal_nrh_dgesv - Tsetup_nrh ) - nrh*Tdgesv

and we are looking for nrhEven such that

nrhEven*Tgs = Tdgesv_preprocess + nrhEven*Tdgesv

which, finally, gives

nrhEven = Tdgesv_preprocess / ( Tgs - Tdgesv )
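
The same calculation expressed in C, with variable names of our choosing, for the times reported by the benchmark runs of steps 2-6:

/* break-even number of right-hand vectors between dpl_gs and LAPACKE_dgesv */
double break_even_nrh( double nrh,
                       double Tsetup_nrh,  double Ttotal_nrh_gs, double Ttotal_nrh_dgesv,
                       double Tsetup_2nrh, double Ttotal_2nrh_dgesv )
{
    double Tgs    = ( Ttotal_nrh_gs - Tsetup_nrh )/nrh;                /* per-system cost of dpl_gs */
    double Tdgesv = ( ( Ttotal_2nrh_dgesv - Tsetup_2nrh )
                    - ( Ttotal_nrh_dgesv  - Tsetup_nrh ) )/nrh;        /* per-right-hand-vector cost of dgesv */
    double Tprep  = ( Ttotal_nrh_dgesv - Tsetup_nrh ) - nrh*Tdgesv;    /* one-time dgesv preprocessing cost */
    return Tprep/( Tgs - Tdgesv );  /* nrhEven such that nrhEven*Tgs == Tprep + nrhEven*Tdgesv */
}
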
The following graph shows the break-even points for the number of right-hand vectors being processed ( nrh ) when, for a given size, the execution times of processing nrh cases by dpl_gs and by LAPACKE_dgesv are the same. Therefore, if it is necessary, for a given matrix size, to solve more than that number of equations, LAPACKE_dgesv will be faster; otherwise dpl_gs will be faster.

Analysis of the parallel implementation of the general 2-D 5-point stencil

The following are the results of the comparison of the dpl_p1p0_p0p1_p0p0_p0n1_n1p0 function with the serial implementation of this function.

Accuracy

The results of this verification show that, on the same inputs, the differences between dpl_p1p0_p0p1_p0p0_p0n1_n1p0 generated results and results generated by the serial implementation of this algorithm are acceptably small.

Timing

The following graph presents the results of the performance comparison between the parallel and serial implementations of the general 2-D 5-point stencil:

As can be noticed from that graph, the performance gain begins to deteriorate as the size of the problem increases - somewhere approaching 5000. We ran the comparison for even larger numbers of elements and the problem only got worse.
The following table summarizes these results.
Each row represents the number of elements being used.
Each column represents the number of threads being utilized.

number of elements    6 threads    8 threads    10 threads    16 threads
20000                  22.38%       24.07%        24.72%        25.49%
30000                   6.81%        7.52%         7.63%         8.17%
40000                   7.04%        7.75%         7.89%         8.29%

The code reads from 7 input arrays and at some point, it seems to us, the CPU is not capable of keeping up. To verify this presumption we created a simplified version of the 2-D 5-point stencil ( actually, a very useful one )
for ( r = 1 ; r < rows - 1; r++ )
 {
  for ( c = 1 ; c < cols - 1 ; c++ )
   {
    x[r][c] = x[r-1][c]*0.25 +
              x[r][c-1]*0.25 +
              x[r][c] +
              x[r][c+1]*0.25 +
              x[r+1][c]*0.25 +
              b[r][c];
   }
 }
and implemented it using the dpl_p1p0_p0p1_p0p0_p0n1_n1p0 routine from the dpl library. The newly created routine, as implemented, reads from 4 input arrays ( compared to the 7 input arrays of the original code ).
The following graph presents the results of the performance comparison between the parallel and serial implementations of the simplified 2-D 5-point stencil:

From that graph it is clear that the newly created routine holds its ground better than the original one.
Even when we ran the task for larger numbers of elements, the problem was not as pronounced - the following table summarizes these results.
Each row represents the number of elements being used.
Each column represents the number of threads being available.

number of elements    6 threads    8 threads    10 threads    16 threads
20000                  43.08%       49.97%        52.41%        54.7%
30000                  36.64%       43.29%        45.71%        47.85%
40000                  30.15%       37.15%        39.21%        41.18%

Using the dpl library in benchmarks

Here are the results of applying routines from the dpl library in order to optimize benchmarks and/or applications.

Polybench

Here we study a benchmark from the PolyBench benchmark suite - see this for information about this benchmark suite: disclaimer, copyright and how to download.

adi

We used the following code segment from the adi benchmark of the PolyBench benchmark suite:
for ( i1 = 0; i1 < _PB_N; i1++ )
  for ( i2 = 1; i2 < _PB_N; i2++ )
   {
    X[i1][i2] = X[i1][i2] - X[i1][i2-1] * A[i1][i2] / B[i1][i2-1];
    B[i1][i2] = B[i1][i2] - A[i1][i2] * A[i1][i2] / B[i1][i2-1];
   }
The code was slightly modified to allow the use of the routines from the dpl library.
The following graph presents the results of the performance comparison between the modified and the original codes.

Availability of the product and ports to different systems

The current version of dpl is available on x86 Linux systems and on Windows.
However, the code, being written in C, can be ported to any sane multi-core architecture running under a system that provides a decent C compiler with OpenMP support. Please contact us if you are interested in porting dpl to your system.