| Item Barcode | Call Number | Material Type | Item Category |
|---|---|---|---|
| 30000010329028 | QA76.73.F25 R844 2014 | Open Access Book | Book |
Summary
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience, and cover the basics along with best practices for efficient GPU computing using CUDA Fortran.
To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance. All of this is done in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code in comparison.
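The workflow described above — mark a routine as a GPU kernel, move data to the device, and launch the kernel with an execution configuration — can be sketched as a minimal CUDA Fortran program. This is an illustrative sketch in the spirit of the book's first example, not a listing from the book itself, and it assumes the NVIDIA HPC SDK `nvfortran` compiler (compile with `nvfortran -cuda`):

```fortran
! Minimal CUDA Fortran sketch: each GPU thread increments one array element.
module increment_m
contains
  attributes(global) subroutine increment(a, b)
    integer, intent(inout) :: a(:)
    integer, value :: b
    integer :: i
    ! Map this thread's global index to one array element
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= size(a)) a(i) = a(i) + b
  end subroutine increment
end module increment_m

program main
  use cudafor
  use increment_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n)
  integer, device :: a_d(n)   ! device-resident copy of the array

  a = 1
  a_d = a                      ! host-to-device transfer via assignment
  ! Launch with 64 threads per block, enough blocks to cover n elements
  call increment<<<ceiling(real(n)/64), 64>>>(a_d, 3)
  a = a_d                      ! device-to-host transfer
  if (all(a == 4)) print *, 'Test passed'
end program main
```

Note that the host and device arrays share the same Fortran array syntax; the `device` attribute and the chevron launch configuration are the only CUDA-specific additions, which is the point the summary makes about staying in Fortran.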
Author Notes
Gregory Ruetsch: Senior Applied Engineer, NVIDIA
Massimiliano Fatica: Manager, Tesla HPC Group, NVIDIA
Table of Contents
Acknowledgments | p. xi |
Preface | p. xiii |
Part I CUDA Fortran Programming | |
Chapter 1 Introduction | p. 3 |
1.1 A Brief History of GPU Computing | p. 3 |
1.2 Parallel Computation | p. 5 |
1.3 Basic Concepts | p. 5 |
1.3.1 A First CUDA Fortran Program | p. 6 |
1.3.2 Extending to Larger Arrays | p. 9 |
1.3.3 Multidimensional Arrays | p. 12 |
1.4 Determining CUDA Hardware Features and Limits | p. 13 |
1.4.1 Single and Double Precision | p. 21 |
1.5 Error Handling | p. 23 |
1.6 Compiling CUDA Fortran Code | p. 24 |
1.6.1 Separate Compilation | p. 27 |
Chapter 2 Performance Measurement and Metrics | p. 31 |
2.1 Measuring Kernel Execution Time | p. 31 |
2.1.1 Host-Device Synchronization and CPU Timers | p. 32 |
2.1.2 Timing via CUDA Events | p. 32 |
2.1.3 Command Line Profiler | p. 34 |
2.1.4 The nvprof Profiling Tool | p. 35 |
2.2 Instruction, Bandwidth, and Latency Bound Kernels | p. 36 |
2.3 Memory Bandwidth | p. 39 |
2.3.1 Theoretical Peak Bandwidth | p. 39 |
2.3.2 Effective Bandwidth | p. 41 |
2.3.3 Actual Data Throughput vs. Effective Bandwidth | p. 42 |
Chapter 3 Optimization | p. 43 |
3.1 Transfers between Host and Device | p. 44 |
3.1.1 Pinned Memory | p. 45 |
3.1.2 Batching Small Data Transfers | p. 49 |
3.1.3 Asynchronous Data Transfers (Advanced Topic) | p. 52 |
3.2 Device Memory | p. 61 |
3.2.1 Declaring Data in Device Code | p. 62 |
3.2.2 Coalesced Access to Global Memory | p. 63 |
3.2.3 Texture Memory | p. 74 |
3.2.4 Local Memory | p. 79 |
3.2.5 Constant Memory | p. 82 |
3.3 On-Chip Memory | p. 85 |
3.3.1 L1 Cache | p. 85 |
3.3.2 Registers | p. 86 |
3.3.3 Shared Memory | p. 87 |
3.4 Memory Optimization Example: Matrix Transpose | p. 93 |
3.4.1 Partition Camping (Advanced Topic) | p. 99 |
3.5 Execution Configuration | p. 102 |
3.5.1 Thread-Level Parallelism | p. 102 |
3.5.2 Instruction-Level Parallelism | p. 105 |
3.6 Instruction Optimization | p. 107 |
3.6.1 Device Intrinsics | p. 108 |
3.6.2 Compiler Options | p. 108 |
3.6.3 Divergent Warps | p. 109 |
3.7 Kernel Loop Directives | p. 110 |
3.7.1 Reductions in CUF Kernels | p. 113 |
3.7.2 Streams in CUF Kernels | p. 113 |
3.7.3 Instruction-Level Parallelism in CUF Kernels | p. 114 |
Chapter 4 Multi-GPU Programming | p. 115 |
4.1 CUDA Multi-GPU Features | p. 115 |
4.1.1 Peer-to-Peer Communication | p. 117 |
4.1.2 Peer-to-Peer Direct Transfers | p. 121 |
4.1.3 Peer-to-Peer Transpose | p. 131 |
4.2 Multi-GPU Programming with MPI | p. 140 |
4.2.1 Assigning Devices to MPI Ranks | p. 141 |
4.2.2 MPI Transpose | p. 147 |
4.2.3 GPU-Aware MPI Transpose | p. 149 |
Part II Case Studies | |
Chapter 5 Monte Carlo Method | p. 155 |
5.1 CURAND | p. 156 |
5.2 Computing π with CUF Kernels | p. 161 |
5.2.1 IEEE-754 Precision (Advanced Topic) | p. 164 |
5.3 Computing π with Reduction Kernels | p. 168 |
5.3.1 Reductions with Atomic Locks (Advanced Topic) | p. 173 |
5.4 Accuracy of Summation | p. 174 |
5.5 Option Pricing | p. 180 |
Chapter 6 Finite Difference Method | p. 189 |
6.1 Nine-Point 1D Finite Difference Stencil | p. 189 |
6.1.1 Data Reuse and Shared Memory | p. 190 |
6.1.2 The x-Derivative Kernel | p. 191 |
6.1.3 Derivatives in y and z | p. 196 |
6.1.4 Nonuniform Grids | p. 200 |
6.2 2D Laplace Equation | p. 204 |
Chapter 7 Applications of Fast Fourier Transform | p. 211 |
7.1 CUFFT | p. 211 |
7.2 Spectral Derivatives | p. 219 |
7.3 Convolution | p. 222 |
7.4 Poisson Solver | p. 229 |
Part III Appendices | |
Appendix A Tesla Specifications | p. 237 |
Appendix B System and Environment Management | p. 241 |
B.1 Environment Variables | p. 241 |
B.1.1 General | p. 241 |
B.1.2 Command Line Profiler | p. 242 |
B.1.3 Just-in-Time Compilation | p. 242 |
B.2 nvidia-smi System Management Interface | p. 242 |
B.2.1 Enabling and Disabling ECC | p. 243 |
B.2.2 Compute Mode | p. 245 |
B.2.3 Persistence Mode | p. 246 |
Appendix C Calling CUDA C from CUDA Fortran | p. 249 |
C.1 Calling CUDA C Libraries | p. 249 |
C.2 Calling User-Written CUDA C Code | p. 252 |
Appendix D Source Code | p. 255 |
D.1 Texture Memory | p. 255 |
D.2 Matrix Transpose | p. 259 |
D.3 Thread- and Instruction-Level Parallelism | p. 267 |
D.4 Multi-GPU Programming | p. 271 |
D.4.1 Peer-to-Peer Transpose | p. 272 |
D.4.2 MPI Transpose with Host MPI Transfers | p. 279 |
D.4.3 MPI Transpose with Device MPI Transfers | p. 284 |
D.5 Finite Difference Code | p. 289 |
D.6 Spectral Poisson Solver | p. 310 |
References | p. 317 |
Index | p. 319 |