Introduction to HPC with MPI for Data Science (Undergraduate Topics in Computer Science)

Frank Nielsen

Product Description

This gentle introduction to High Performance Computing (HPC) for Data Science using the Message Passing Interface (MPI) standard has been designed as a first course for undergraduates on parallel programming on distributed memory models, and requires only basic programming notions.

The book is divided into two parts: the first covers high performance computing using C++ with the Message Passing Interface (MPI) standard, and the second covers high-performance data analytics on computer clusters.

The first part describes the fundamental notions of blocking versus non-blocking point-to-point communications, global communications (like broadcast or scatter) and collaborative computations (reduce), together with the Amdahl and Gustafson speed-up laws, before addressing parallel sorting and parallel linear algebra on computer clusters. The common ring, torus and hypercube topologies of clusters are then explained, and global communication procedures on these topologies are studied. This first part closes with the MapReduce (MR) model of computation, which is well suited to processing big data using the MPI framework.
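As a rough illustration of the two speed-up laws named above, the following C++ sketch evaluates their standard textbook formulas. It is not taken from the book: the function names `amdahl_speedup` and `gustafson_speedup` and their parameters are hypothetical, chosen here only for illustration.

```cpp
#include <cassert>

// Amdahl's law: speed-up for a FIXED problem size.
// parallel_fraction: share of the sequential run time that can be
//                    parallelized, in [0, 1] (assumed to be known).
// num_procs:         number of processes, >= 1.
double amdahl_speedup(double parallel_fraction, int num_procs) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(num_procs >= 1);
    // The serial part is unchanged; the parallel part is divided by num_procs.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / num_procs);
}

// Gustafson's law: SCALED speed-up, where the problem size grows with
// the number of processes so that the parallel run time stays constant.
// parallel_fraction: share of the parallel run time spent in the
//                    parallelizable part, in [0, 1].
double gustafson_speedup(double parallel_fraction, int num_procs) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(num_procs >= 1);
    // The serial part counts once; the parallel part counts num_procs times.
    return (1.0 - parallel_fraction) + parallel_fraction * num_procs;
}
```

With a parallel fraction of 0.9 and 10 processes, `amdahl_speedup` gives roughly 5.26 and can never exceed 1 / (1 - 0.9) = 10 however many processes are added, whereas `gustafson_speedup` gives 9.1 and keeps growing linearly with the number of processes.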

The second part focuses on high-performance data analytics. Flat and hierarchical clustering algorithms are introduced for data exploration, along with how to program these algorithms on computer clusters, followed by machine learning classification and an introduction to graph analytics. This part closes with a concise introduction to data core-sets, which let big data problems be reduced to tiny data problems.

Exercises are included at the end of each chapter so that students can practice the concepts learned, and a final section contains an overall exam that lets them evaluate how well they have assimilated the material covered in the book.
