dpl - the Dalsoft Parallel Library - is a library of routines that provide parallel implementations of inherently sequential functions and algorithms. For the user's manual of this product click here.

In this web page we will describe the functions/algorithms implemented by the library, analyze their performance and accuracy, and discuss the availability of the product and ports to different systems.

Functions/algorithms implemented by the library

dpl allows parallel execution of a number of sequential functions/algorithms. As these functions/algorithms are inherently sequential, parallelizing them required significant mathematical transformations of the underlying algorithms.

Stencils

dpl offers the following functions that provide parallel implementations of various serial stencils:

One-point stencils

dpl_stencil_1p

x[i] = b(i) + a(i)*x[i-STRIDE]

dpl_stencil_o_1p

x[i] = c(i)*x[i] + b(i) + a(i)*x[i-STRIDE]

dpl_stencil_div_1p

x[i] = b(i) + a(i)/x[i-STRIDE]

dpl_stencil_o_div_1p

x[i] = c(i)*x[i] + b(i) + a(i)/x[i-STRIDE]

Two-point stencil

dpl_stencil_2p

x[i] = a(i)*x[i-1] + b(i) + c(i)*x[i-2]

Accumulative stencils

dpl_stencil_acc

acc = b(i) + a(i)*acc

dpl_stencil_div_acc

acc = b(i) + a(i)/acc

General stencil

dpl_stencil

x[i] = b(i) + ∑_{j<i} a(i,j)*x[j]

Parallel implementation of the Gauss–Seidel method to solve a linear system of equations

dpl's function dpl_gs performs parallel execution of the Gauss-Seidel method to solve a linear system of equations ( see this for the description of the implemented Gauss-Seidel method ). The provided implementation consistently outperforms the serial implementation of the Gauss-Seidel method by a huge margin.

Implementation Note

dpl provides parallel implementations of the supported functions by performing mathematical transformations on the underlying algorithm, producing a parallel algorithm; of course, when programming this algorithm ( in C ), care is taken of many aspects of computation, such as cache utilization, load balancing, process synchronization etc.

Analysis of the performance of the implemented functions/algorithms

The goal of this project is to use parallelization to achieve a performance gain over serial code. That is not always achievable. In this section we explain why and offer guidelines for obtaining a performance gain.

Reasons for problematic performance

Invoking and executing parallel code incurs a significant performance penalty. To avoid this penalty, the supported functions are executed in parallel mode only if certain conditions are met. These conditions depend on the size of the problem ( number of elements to be processed ) and on the characteristics of the CPU that will execute the parallel code ( e.g. the number of threads available ). The restrictions listed below are not exact and in certain cases may not suffice to achieve performance gains.

The following table establishes the conditions for the parallel code to be faster than the corresponding serial code: for every implemented function, the minimal size of the problem ( number of elements to be processed ) is listed for each number of threads available. Note that the effect of the STRIDE parameter on the execution time of the parallel code may be detrimental; the values listed in the table are for STRIDE equal to 1.
Each row represents a function provided by dpl. Each column lists the number of threads available. The value in the table is the minimal size of the problem ( number of elements to be processed ) for the parallel code to be faster than the corresponding sequential code - dpl checks for these conditions and, if they are not fulfilled, the parallel execution is aborted and an error is returned.

Function              Computation                                  2     3     4     5    >5

dpl_stencil_1p        x[i] = b(i) + a(i)*x[i-1]                 9000  5000  2000  2000  2000
dpl_stencil_o_1p      x[i] = c(i)*x[i] + b(i) + a(i)*x[i-1]     9000  3000  2000  2000  2000
dpl_stencil_div_1p    x[i] = b(i) + a(i)/x[i-1]                 2000  2000  1000  1000  1000
dpl_stencil_o_div_1p  x[i] = c(i)*x[i] + b(i) + a(i)/x[i-1]     1500  1000  1000  1000  1000
dpl_stencil_acc       acc = b(i) + a(i)*acc                     5000  3000  2000  2000  2000
dpl_stencil_div_acc   acc = b(i) + a(i)/acc                     1000  1000  1000  1000  1000
dpl_stencil_2p        x[i] = a(i)*x[i-1] + b(i) + c(i)*x[i-2]    (1)  6000  4000  3000  3000
dpl_stencil           x[i] = b(i) + ∑_{j<i} a(i,j)*x[j]          200   200   200   200   200
dpl_gs                                                           175   150   150   150   150

(1)  for the specified number of threads the dpl generated parallel code will always be slower than the corresponding serial code

Results of parallelization

The following table compares the execution time achieved by the parallel version with the execution time achieved by the sequential version of the same algorithm.
Each row represents a function provided by dpl. Each column lists the CPU used in the comparison. The value in the table is the improvement of the execution time of the parallel code over the execution time of the sequential code achieved on the specified CPU. The reported value was calculated as follows:

1 - ( timeOfTheParallelCode / timeOfTheSequentialCode )

For example, if the sequential code ran for 8.97 seconds and the parallel code ran for 3.31 seconds, an improvement of 63% is reported. Note that improvement calculated this way never exceeds 100%. Also note that an improvement of 50% means that the parallel code is twice as fast as the sequential code.
All benchmarks were executed under the Linux operating system on four different CPUs:
  1. vintage: Intel(R) Core(TM) Duo E8600 @ 3.33GHz CPU with 2 cores/2 threads
  2. mainstream: Intel(R) Core(TM) i5-9400 @ 2.90GHz CPU with 6 cores/6 threads
  3. advanced: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz with 8 cores/16 threads
  4. extreme: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz with 16 cores/32 threads
We attempted to run the benchmarks under the same conditions on a system with the minimal possible load. Every benchmark was executed 3 times, with the reported time being neither the best nor the worst of the three.

The size of the problem ( number of elements to be processed ) was set to 10000 for all functions except dpl_stencil and dpl_gs; for these functions the size of the problem was set to 1000.
The same arguments were used when executing the sequential and parallel codes. The arguments were double precision random numbers in the range [-1.5;1.5] generated using the Dalsoft Random Testing ( drt ) package - see this for more information. For dpl_gs, a strictly diagonally dominant matrix was created using the described random number generator.

Note that stencil computations are memory bound, so stencil code efficiency is often impaired by high memory access costs. As of now, memory access remains the weak point of multi-core CPUs; expectations for the improvements of the parallel stencil code should therefore be kept reasonable.

Function              Computation                              Duo E8600  i5-9400  Silver 4110  Gold 6226R

dpl_stencil_1p        x[i] = b(i) + a(i)*x[i-1]                     7%      51%       71%          73%
dpl_stencil_o_1p      x[i] = c(i)*x[i] + b(i) + a(i)*x[i-1]         3%      47%       58%          74%
dpl_stencil_div_1p    x[i] = b(i) + a(i)/x[i-1]                    21%      58%       76%          82%
dpl_stencil_o_div_1p  x[i] = c(i)*x[i] + b(i) + a(i)/x[i-1]        16%      52%       69%          83%
dpl_stencil_acc       acc = b(i) + a(i)*acc                        27%      53%       72%          71%
dpl_stencil_div_acc   acc = b(i) + a(i)/acc                        65%      68%       84%          86%
dpl_stencil_2p        x[i] = a(i)*x[i-1] + b(i) + c(i)*x[i-2]      (1)      41%       66%          72%
dpl_stencil           x[i] = b(i) + ∑_{j<i} a(i,j)*x[j]            45%      73%       63%          60%
dpl_gs                                                              2%      63%       78%          64%

(1) the dpl generated parallel code will always be slower than the corresponding serial code

Analysis of the accuracy of results generated by the implemented functions/algorithms

As dpl implements the supported algorithms differently from a sequential implementation of the corresponding formula, it is necessary to verify:
how different the results generated by the parallel code are from the results generated by the corresponding sequential code.

Also, due to the inexact nature of floating point calculations, the results generated by the sequential code may not be fully trusted. It is therefore also necessary to verify:
how different the results generated by the parallel code are from the precise ( calculated with unlimited precision ) results of the sequential algorithm.

The implemented verification process is almost identical to the one described in the chapter "testing dco parallelization of a stencil" found here.
The results of this verification show that, in most cases,
the difference between parallel and sequential results is acceptably small.

However, there are exceptions to this - the functions that use division:
dpl_stencil_div_1p, dpl_stencil_o_div_1p and dpl_stencil_div_acc.

Let us study such an exception - a case where the difference between the result generated by the dpl derived parallel code and the result generated by the corresponding sequential code is large.

In most cases the results to be verified are vectors of double precision values.
We define norms that will help with the evaluation. For vectors x and x1 of size ( dimension ) n we define:

Maximal Relative Difference ( MaxRelDiff )
max( fabs( 2.*( x[i] - x1[i] ) )/( fabs( x[i] ) + fabs( x1[i] ) ) ), for i = 0,...,n-1

Vector Relative Difference ( VectorRelDiff )
|x - x1|/|x1|
where
|vector| = √( ∑ ( vector[i] )² )

Vector Normal Difference ( VectorNormDiff )
max( | x[i] - x1[i] | ), for i = 0,...,n-1


Comprehensive evaluation of the function dpl_stencil_o_div_1p using drt based evaluation procedure ( see the reference mentioned above ) reported the following data for the worst case detected:

stencil_div_o_1p cmp_parallel_serial_loop
MaxRelDiff:        3.265649e-04
VectorRelDiff:     2.630082e-04
VectorNormDiff:    1.710950e-01
Happen in the test case 3804350 at the index 875 of 1000 with the stride 1
myRslt= -523.83780351572954714
exRslt= -524.00889851128829378
Total 7119197, Attempted 7119197, Done 7119197 cases for 1000 seconds error count = 10152.
Random number generator seed used 1596510236.

The data seems disappointing: the result derived using the dpl parallel library ( myRslt ) agrees with the result generated by the sequential code ( exRslt ) in fewer than 4 decimal places. Also, all the norms show a low level of agreement.
Let us investigate this case further.
First, we created a serial implementation of the function dpl_stencil_o_div_1p using the routines from the Dalsoft High Precision ( dhp ) library ( see this for details ); we will call the result produced by this implementation preciseRslt, and we assume that preciseRslt, generated with a 1024-bit mantissa, is fully accurate when expressed as a 64-bit double precision value.
Second, utilizing drt-supported functionality, we recreated the case to be studied ( setting the recorded seed and skipping 3804350 cases to reach the case where the problem was observed ) and performed the following testing on it. Comparing the result derived using the dpl parallel library - myRslt - with preciseRslt generated the following data:
stencil_div_o_1p cmp_parallel_serialhp
Dim.errs.chsums 1000 0 0 -6777.462714921427505
                         -6774.995656373027487
MaxRelDiff     3.925580826263917e-04
VectorRelDiff  3.161913979263145e-04
VectorNormDiff 2.055964094892033e-01

Comparing the result generated by the sequential code - exRslt - with preciseRslt generated the following data:
stencil_div_o_1p cmp_serial_serialhp
Dim.errs.chsums 1000 0 0 -6779.515773068097587
                         -6774.995656373027487
MaxRelDiff     7.191229955145954e-04
VectorRelDiff  5.793222860496758e-04
VectorNormDiff 3.766914050479500e-01

The following table summarizes the above results.
Each row represents the norm being used.
The first column contains the values of these norms when comparing the result derived using the dpl parallel library - myRslt - with preciseRslt.
The second column contains the values of these norms when comparing the result generated by the sequential code - exRslt - with preciseRslt.

                myRslt vs preciseRslt   exRslt vs preciseRslt
MaxRelDiff      3.925580826263917e-04   7.191229955145954e-04
VectorRelDiff   3.161913979263145e-04   5.793222860496758e-04
VectorNormDiff  2.055964094892033e-01   3.766914050479500e-01

Analyzing the data from this table, it becomes clear that preciseRslt is closer to myRslt than to exRslt - all the norms for myRslt vs preciseRslt are smaller than the corresponding norms for exRslt vs preciseRslt.
The problem we observed ( low precision ) seems to be due to the fact that the computation performed by dpl_stencil_o_div_1p, which involves division, is inherently unstable numerically and thus may not be usable on arbitrary ( randomly generated ) input.

Availability of the product and ports to different systems

The current version of dpl is available on x86 Linux systems and Windows.
However, the code, being written in C, can be ported to any sane architecture running under a system that provides a decent C compiler with OpenMP support.