Title:
CUDA Fortran for scientists and engineers : best practices for efficient CUDA Fortran programming
Personal Author:
Ruetsch, Gregory
Publication Information:
Amsterdam ; Boston : Morgan Kaufmann, an imprint of Elsevier, 2014
Physical Description:
xiii, 323 pages : illustrations ; 24 cm.
ISBN:
9780124169708
Added Author:
Fatica, Massimiliano
Available:

Item Barcode: 30000010329028
Call Number: QA76.73.F25 R844 2014
Material Type: Open Access Book
Item Category 1: Book
Summary

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience, and cover the basics along with best practices for efficient GPU computing using CUDA Fortran.

To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage data movement and parallelism and to optimize performance. All of this is done in Fortran, without rewriting in another language. Each concept is illustrated with working examples, so you can immediately compare the performance of your own code against them.
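
To give a flavor of this workflow, here is a minimal sketch of a first CUDA Fortran program in the spirit of the book's opening increment example (Section 1.3.1); the module and variable names are illustrative, not the book's verbatim code. A subroutine marked attributes(global) runs as a kernel on the GPU, plain array assignment moves data between host and device arrays, and the chevron syntax specifies the launch configuration:

    module simpleOps_m
    contains
      ! Kernel: each thread increments one element of the array
      attributes(global) subroutine increment(a, b)
        implicit none
        integer, intent(inout) :: a(:)
        integer, value :: b
        integer :: i, n
        ! Global thread index across all thread blocks
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        n = size(a)
        if (i <= n) a(i) = a(i) + b
      end subroutine increment
    end module simpleOps_m

    program incrementTest
      use cudafor
      use simpleOps_m
      implicit none
      integer, parameter :: n = 1024*1024, tPB = 256
      integer, allocatable :: a(:)
      integer, device, allocatable :: a_d(:)   ! device-resident array

      allocate(a(n), a_d(n))
      a = 1
      a_d = a                                  ! host-to-device copy via assignment
      call increment<<<ceiling(real(n)/tPB), tPB>>>(a_d, 3)
      a = a_d                                  ! device-to-host copy
      if (all(a == 4)) then
         print *, 'Test passed'
      else
         print *, 'Test failed'
      end if
      deallocate(a, a_d)
    end program incrementTest

With the PGI (now NVIDIA HPC SDK) compilers, a file with the .cuf extension is compiled as CUDA Fortran, e.g. pgfortran increment.cuf.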


Author Notes

Gregory Ruetsch: Senior Applied Engineer, NVIDIA
Massimiliano Fatica: Manager, Tesla HPC Group, NVIDIA


Table of Contents

Acknowledgments p. xi
Preface p. xiii
Part I CUDA Fortran Programming
Chapter 1 Introduction p. 3
1.1 A Brief History of GPU Computing p. 3
1.2 Parallel Computation p. 5
1.3 Basic Concepts p. 5
1.3.1 A First CUDA Fortran Program p. 6
1.3.2 Extending to Larger Arrays p. 9
1.3.3 Multidimensional Arrays p. 12
1.4 Determining CUDA Hardware Features and Limits p. 13
1.4.1 Single and Double Precision p. 21
1.5 Error Handling p. 23
1.6 Compiling CUDA Fortran Code p. 24
1.6.1 Separate Compilation p. 27
Chapter 2 Performance Measurement and Metrics p. 31
2.1 Measuring Kernel Execution Time p. 31
2.1.1 Host-Device Synchronization and CPU Timers p. 32
2.1.2 Timing via CUDA Events p. 32
2.1.3 Command Line Profiler p. 34
2.1.4 The nvprof Profiling Tool p. 35
2.2 Instruction, Bandwidth, and Latency Bound Kernels p. 36
2.3 Memory Bandwidth p. 39
2.3.1 Theoretical Peak Bandwidth p. 39
2.3.2 Effective Bandwidth p. 41
2.3.3 Actual Data Throughput vs. Effective Bandwidth p. 42
Chapter 3 Optimization p. 43
3.1 Transfers between Host and Device p. 44
3.1.1 Pinned Memory p. 45
3.1.2 Batching Small Data Transfers p. 49
3.1.3 Asynchronous Data Transfers (Advanced Topic) p. 52
3.2 Device Memory p. 61
3.2.1 Declaring Data in Device Code p. 62
3.2.2 Coalesced Access to Global Memory p. 63
3.2.3 Texture Memory p. 74
3.2.4 Local Memory p. 79
3.2.5 Constant Memory p. 82
3.3 On-Chip Memory p. 85
3.3.1 L1 Cache p. 85
3.3.2 Registers p. 86
3.3.3 Shared Memory p. 87
3.4 Memory Optimization Example: Matrix Transpose p. 93
3.4.1 Partition Camping (Advanced Topic) p. 99
3.5 Execution Configuration p. 102
3.5.1 Thread-Level Parallelism p. 102
3.5.2 Instruction-Level Parallelism p. 105
3.6 Instruction Optimization p. 107
3.6.1 Device Intrinsics p. 108
3.6.2 Compiler Options p. 108
3.6.3 Divergent Warps p. 109
3.7 Kernel Loop Directives p. 110
3.7.1 Reductions in CUF Kernels p. 113
3.7.2 Streams in CUF Kernels p. 113
3.7.3 Instruction-Level Parallelism in CUF Kernels p. 114
Chapter 4 Multi-GPU Programming p. 115
4.1 CUDA Multi-GPU Features p. 115
4.1.1 Peer-to-Peer Communication p. 117
4.1.2 Peer-to-Peer Direct Transfers p. 121
4.1.3 Peer-to-Peer Transpose p. 131
4.2 Multi-GPU Programming with MPI p. 140
4.2.1 Assigning Devices to MPI Ranks p. 141
4.2.2 MPI Transpose p. 147
4.2.3 GPU-Aware MPI Transpose p. 149
Part II Case Studies
Chapter 5 Monte Carlo Method p. 155
5.1 CURAND p. 156
5.2 Computing π with CUF Kernels p. 161
5.2.1 IEEE-754 Precision (Advanced Topic) p. 164
5.3 Computing π with Reduction Kernels p. 168
5.3.1 Reductions with Atomic Locks (Advanced Topic) p. 173
5.4 Accuracy of Summation p. 174
5.5 Option Pricing p. 180
Chapter 6 Finite Difference Method p. 189
6.1 Nine-Point 1D Finite Difference Stencil p. 189
6.1.1 Data Reuse and Shared Memory p. 190
6.1.2 The x-Derivative Kernel p. 191
6.1.3 Derivatives in y and z p. 196
6.1.4 Nonuniform Grids p. 200
6.2 2D Laplace Equation p. 204
Chapter 7 Applications of Fast Fourier Transform p. 211
7.1 CUFFT p. 211
7.2 Spectral Derivatives p. 219
7.3 Convolution p. 222
7.4 Poisson Solver p. 229
Part III Appendices
Appendix A Tesla Specifications p. 237
Appendix B System and Environment Management p. 241
B.1 Environment Variables p. 241
B.1.1 General p. 241
B.1.2 Command Line Profiler p. 242
B.1.3 Just-in-Time Compilation p. 242
B.2 nvidia-smi System Management Interface p. 242
B.2.1 Enabling and Disabling ECC p. 243
B.2.2 Compute Mode p. 245
B.2.3 Persistence Mode p. 246
Appendix C Calling CUDA C from CUDA Fortran p. 249
C.1 Calling CUDA C Libraries p. 249
C.2 Calling User-Written CUDA C Code p. 252
Appendix D Source Code p. 255
D.1 Texture Memory p. 255
D.2 Matrix Transpose p. 259
D.3 Thread- and Instruction-Level Parallelism p. 267
D.4 Multi-GPU Programming p. 271
D.4.1 Peer-to-Peer Transpose p. 272
D.4.2 MPI Transpose with Host MPI Transfers p. 279
D.4.3 MPI Transpose with Device MPI Transfers p. 284
D.5 Finite Difference Code p. 289
D.6 Spectral Poisson Solver p. 310
References p. 317
Index p. 319