| Item Barcode | Call Number | Material Type | Item Category |
|---|---|---|---|
| 30000010329028 | QA76.73.F25 R844 2014 | Open Access Book | Book |
Summary
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience, and cover the basics along with best practices for efficient GPU computing using CUDA Fortran.
To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance. All of this is done in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code in comparison.
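The workflow described above — mark a routine as a GPU kernel, move data to the device, and launch the kernel with an execution configuration — can be sketched as a minimal CUDA Fortran program. This is an illustrative sketch in the spirit of the book's first example, not a listing from the book itself, and it assumes the NVIDIA HPC SDK `nvfortran` compiler (compile with `nvfortran -cuda`):

```fortran
! Minimal CUDA Fortran sketch: each GPU thread increments one array element.
module increment_m
contains
  attributes(global) subroutine increment(a, b)
    integer, intent(inout) :: a(:)
    integer, value :: b
    integer :: i
    ! Map this thread's global index to one array element
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= size(a)) a(i) = a(i) + b
  end subroutine increment
end module increment_m

program main
  use cudafor
  use increment_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n)
  integer, device :: a_d(n)   ! device-resident copy of the array

  a = 1
  a_d = a                      ! host-to-device transfer via assignment
  ! Launch with 64 threads per block, enough blocks to cover n elements
  call increment<<<ceiling(real(n)/64), 64>>>(a_d, 3)
  a = a_d                      ! device-to-host transfer
  if (all(a == 4)) print *, 'Test passed'
end program main
```

Note that the host and device arrays share the same Fortran array syntax; the `device` attribute and the chevron launch configuration are the only CUDA-specific additions, which is the point the summary makes about staying in Fortran.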
Author Notes
Gregory Ruetsch: Senior Applied Engineer, NVIDIA
Massimiliano Fatica: Manager, Tesla HPC Group, NVIDIA
Table of Contents
Acknowledgments | p. xi |
Preface | p. xiii |
Part I CUDA Fortran Programming | |
Chapter 1 Introduction | p. 3 |
1.1 A Brief History of GPU Computing | p. 3 |
1.2 Parallel Computation | p. 5 |
1.3 Basic Concepts | p. 5 |
1.3.1 A First CUDA Fortran Program | p. 6 |
1.3.2 Extending to Larger Arrays | p. 9 |
1.3.3 Multidimensional Arrays | p. 12 |
1.4 Determining CUDA Hardware Features and Limits | p. 13 |
1.4.1 Single and Double Precision | p. 21 |
1.5 Error Handling | p. 23 |
1.6 Compiling CUDA Fortran Code | p. 24 |
1.6.1 Separate Compilation | p. 27 |
Chapter 2 Performance Measurement and Metrics | p. 31 |
2.1 Measuring Kernel Execution Time | p. 31 |
2.1.1 Host-Device Synchronization and CPU Timers | p. 32 |
2.1.2 Timing via CUDA Events | p. 32 |
2.1.3 Command Line Profiler | p. 34 |
2.1.4 The nvprof Profiling Tool | p. 35 |
2.2 Instruction, Bandwidth, and Latency Bound Kernels | p. 36 |
2.3 Memory Bandwidth | p. 39 |
2.3.1 Theoretical Peak Bandwidth | p. 39 |
2.3.2 Effective Bandwidth | p. 41 |
2.3.3 Actual Data Throughput vs. Effective Bandwidth | p. 42 |
Chapter 3 Optimization | p. 43 |
3.1 Transfers between Host and Device | p. 44 |
3.1.1 Pinned Memory | p. 45 |
3.1.2 Batching Small Data Transfers | p. 49 |
3.1.3 Asynchronous Data Transfers (Advanced Topic) | p. 52 |
3.2 Device Memory | p. 61 |
3.2.1 Declaring Data in Device Code | p. 62 |
3.2.2 Coalesced Access to Global Memory | p. 63 |
3.2.3 Texture Memory | p. 74 |
3.2.4 Local Memory | p. 79 |
3.2.5 Constant Memory | p. 82 |
3.3 On-Chip Memory | p. 85 |
3.3.1 L1 Cache | p. 85 |
3.3.2 Registers | p. 86 |
3.3.3 Shared Memory | p. 87 |
3.4 Memory Optimization Example: Matrix Transpose | p. 93 |
3.4.1 Partition Camping (Advanced Topic) | p. 99 |
3.5 Execution Configuration | p. 102 |
3.5.1 Thread-Level Parallelism | p. 102 |
3.5.2 Instruction-Level Parallelism | p. 105 |
3.6 Instruction Optimization | p. 107 |
3.6.1 Device Intrinsics | p. 108 |
3.6.2 Compiler Options | p. 108 |
3.6.3 Divergent Warps | p. 109 |
3.7 Kernel Loop Directives | p. 110 |
3.7.1 Reductions in CUF Kernels | p. 113 |
3.7.2 Streams in CUF Kernels | p. 113 |
3.7.3 Instruction-Level Parallelism in CUF Kernels | p. 114 |
Chapter 4 Multi-GPU Programming | p. 115 |
4.1 CUDA Multi-GPU Features | p. 115 |
4.1.1 Peer-to-Peer Communication | p. 117 |
4.1.2 Peer-to-Peer Direct Transfers | p. 121 |
4.1.3 Peer-to-Peer Transpose | p. 131 |
4.2 Multi-GPU Programming with MPI | p. 140 |
4.2.1 Assigning Devices to MPI Ranks | p. 141 |
4.2.2 MPI Transpose | p. 147 |
4.2.3 GPU-Aware MPI Transpose | p. 149 |
Part II Case Studies | |
Chapter 5 Monte Carlo Method | p. 155 |
5.1 CURAND | p. 156 |
5.2 Computing π with CUF Kernels | p. 161 |
5.2.1 IEEE-754 Precision (Advanced Topic) | p. 164 |
5.3 Computing π with Reduction Kernels | p. 168 |
5.3.1 Reductions with Atomic Locks (Advanced Topic) | p. 173 |
5.4 Accuracy of Summation | p. 174 |
5.5 Option Pricing | p. 180 |
Chapter 6 Finite Difference Method | p. 189 |
6.1 Nine-Point 1D Finite Difference Stencil | p. 189 |
6.1.1 Data Reuse and Shared Memory | p. 190 |
6.1.2 The x-Derivative Kernel | p. 191 |
6.1.3 Derivatives in y and z | p. 196 |
6.1.4 Nonuniform Grids | p. 200 |
6.2 2D Laplace Equation | p. 204 |
Chapter 7 Applications of Fast Fourier Transform | p. 211 |
7.1 CUFFT | p. 211 |
7.2 Spectral Derivatives | p. 219 |
7.3 Convolution | p. 222 |
7.4 Poisson Solver | p. 229 |
Part III Appendices | |
Appendix A Tesla Specifications | p. 237 |
Appendix B System and Environment Management | p. 241 |
B.1 Environment Variables | p. 241 |
B.1.1 General | p. 241 |
B.1.2 Command Line Profiler | p. 242 |
B.1.3 Just-in-Time Compilation | p. 242 |
B.2 nvidia-smi System Management Interface | p. 242 |
B.2.1 Enabling and Disabling ECC | p. 243 |
B.2.2 Compute Mode | p. 245 |
B.2.3 Persistence Mode | p. 246 |
Appendix C Calling CUDA C from CUDA Fortran | p. 249 |
C.1 Calling CUDA C Libraries | p. 249 |
C.2 Calling User-Written CUDA C Code | p. 252 |
Appendix D Source Code | p. 255 |
D.1 Texture Memory | p. 255 |
D.2 Matrix Transpose | p. 259 |
D.3 Thread- and Instruction-Level Parallelism | p. 267 |
D.4 Multi-GPU Programming | p. 271 |
D.4.1 Peer-to-Peer Transpose | p. 272 |
D.4.2 MPI Transpose with Host MPI Transfers | p. 279 |
D.4.3 MPI Transpose with Device MPI Transfers | p. 284 |
D.5 Finite Difference Code | p. 289 |
D.6 Spectral Poisson Solver | p. 310 |
References | p. 317 |
Index | p. 319 |