The benchmark is a plasma particle-in-cell (PIC) code based on the General Concurrent PIC algorithm [2]. The Fortran 77 codes have been well benchmarked [1]. The Fortran 90 and C++ [3,4,5] versions were derived from the original Fortran 77 codes.
Machine | Language | Compiler | Particles | Time (sec) |
One-Dimensional Program | ||||
RS/6000 | Fortran 77 | IBM xlf | 450,000 | 245.49 |
RS/6000 | Fortran 90 | IBM xlf90 | 450,000 | 364.25 |
RS/6000 | C++ | IBM xlC | 450,000 | 508.00 |
Two-Dimensional Program | ||||
RS/6000 | Fortran 90 | IBM xlf90 | 327,680 | 526.71 |
RS/6000 | Fortran 77 | IBM xlf | 327,680 | 549.23 |
RS/6000 | C++ | IBM xlC | 327,680 | 667.00 |
Calls to functions that access private data, made without inlining, contributed to the Fortran 90 overhead in the one-dimensional program. A different object model with better abstractions allowed the Fortran 90 program to outperform the Fortran 77 and C++ versions in the two-dimensional case, as seen in the graph below.
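The accessor overhead described above can be illustrated with a minimal C++ sketch. The class `Species`, its member `vx_`, and the function `kinetic_sum` are hypothetical names chosen for illustration; they are not taken from the benchmark codes, and the sketch shows only the general pattern of reaching private data through a function call inside a particle loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class for illustration; not part of the benchmark codes.
class Species {
public:
    explicit Species(std::size_t n) : vx_(n, 0.0) {}

    // Defined in the class body, so it is implicitly inline and the
    // compiler is free to replace the call with a direct load.
    double vx(std::size_t i) const { return vx_[i]; }

    void set_vx(std::size_t i, double v) {
        assert(i < vx_.size());
        vx_[i] = v;
    }

    std::size_t size() const { return vx_.size(); }

private:
    std::vector<double> vx_;  // private data, reachable only through accessors
};

// One accessor call per particle per use: if the calls are not inlined,
// every iteration pays call overhead on top of the arithmetic.
double kinetic_sum(const Species& s) {
    double sum = 0.0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        sum += 0.5 * s.vx(i) * s.vx(i);
    }
    return sum;
}
```

Whether a given compiler actually inlines such an accessor depends on the compiler and its optimization settings, so the cost of this pattern has to be measured rather than assumed.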
Machine | PEs | Language | Compiler | Particles | Time (sec) |
Two-Dimensional Program | |||||
SP2 | 32 | Fortran 77 | IBM xlf | 3,571,712 | 159.08 |
SP2 | 32 | Fortran 90 | IBM xlf90 | 3,571,712 | 202.88 |
SP2 | 32 | C++ | IBM xlC | 3,571,712 | 359.00 |
Two-Dimensional Program | |||||
SP2 | 4 | Fortran 77 | IBM xlf | 327,680 | 114.31 |
SP2 | 4 | Fortran 90 | IBM xlf90 | 327,680 | 117.49 |
SP2 | 4 | C++ | IBM xlC | 327,680 | 249.00 |
Much more extensive performance comparisons are available in the publications, including comparisons among various machines and compilers from additional vendors. A plot of the 32 processor experiment is shown below.
Machine | PEs | Language | Compiler | Particles | Time (sec) |
Three-Dimensional Program | |||||
SP2 | 32 | Fortran 77 | IBM xlf90 | 7,962,624 | 1548.71 |
SP2 | 32 | Fortran 77 | IBM xlf | 7,962,624 | 1550.14 |
SP2 | 32 | Fortran 90 | IBM xlf90 | 7,962,624 | 1339.91 |
SP2 | 32 | C++ | IBM xlC | 7,962,624 | 2797.00 |
The Fortran 90 version outperformed the Fortran 77 versions because of improved cache utilization of the field components. The Fortran 90 (and C++) versions encapsulate the components in a single derived type, whereas the Fortran 77 version stores the field elements in separate arrays.
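The two storage layouts contrasted above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the benchmark's actual data structures: `FieldPoint`, `FieldArrays`, and the two gather functions are hypothetical names, and the gather loop stands in for the step where each particle reads all field components at its grid cell.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; not taken from the benchmark codes.

// Layout A: the three components of one grid point sit side by side,
// analogous to an array of a Fortran 90 derived type.
struct FieldPoint {
    double x, y, z;
};

// Layout B: one array per component, analogous to separate Fortran 77 arrays.
struct FieldArrays {
    std::vector<double> x, y, z;
    explicit FieldArrays(std::size_t n) : x(n), y(n), z(n) {}
};

// Gather with layout A: each lookup reads three adjacent doubles,
// which usually fall in the same cache line.
double gather_interleaved(const std::vector<FieldPoint>& f,
                          const std::vector<std::size_t>& cells) {
    double sum = 0.0;
    for (std::size_t c : cells) {
        assert(c < f.size());
        sum += f[c].x + f[c].y + f[c].z;
    }
    return sum;
}

// Gather with layout B: each lookup reads from three separate arrays,
// so a scattered access pattern can touch three distinct cache lines.
double gather_separate(const FieldArrays& f,
                       const std::vector<std::size_t>& cells) {
    double sum = 0.0;
    for (std::size_t c : cells) {
        assert(c < f.x.size());
        sum += f.x[c] + f.y[c] + f.z[c];
    }
    return sum;
}
```

Both functions return the same value for the same data; only the memory access pattern differs. How much the interleaved layout helps in practice depends on the cache line size and on how scattered the particle-to-cell indices are.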
The most aggressive optimizations produced the fastest timings; these are the timings reported in the table. The KAI C++ compiler with K3 -O3 --abstract_pointer spent over 2 hours in compilation, whereas the IBM F90 compiler with -O3 -qlanglvl=90std -qstrict -qalias=noaryovrlp needed 5 minutes. (The KAI compiler generated faster executables than the IBM xlC C++ compiler.)
3D Parallel Plasma PIC Experiments - CPU Times for Various Compilers (KAI C++, IBM F90, and IBM F77 with IBM MPI)