Data-Intensive Workflow Management: For Clouds and Data-Intensive and Scalable Computing Environments

de Oliveira, Daniel C. M., Liu, Ji, Pacitti, Esther

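  • Publisher: Morgan & Claypool
  • Publication date: 2019-05-13
  • Price: $2,830
  • VIP price: $2,689 (5% off)
  • Language: English
  • Pages: 180
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 1681735571
  • ISBN-13: 9781681735573
  • Related category: Spark
  • Overseas special-order book (requires separate checkout)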

Product description

Workflows may be defined as abstractions used to model the coherent flow of activities in the context of an in silico scientific experiment.

They are employed in many domains of science, such as bioinformatics, astronomy, and engineering. Such workflows usually comprise a considerable number of activities and activations (i.e., tasks associated with activities) and may take a long time to execute. Because they must continuously store and process data efficiently (which makes them data-intensive workflows), they are run in high-performance computing environments combined with parallelization techniques. At the beginning of the 2010s, cloud technologies emerged as a promising environment for running scientific workflows. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines.

More recently, Data-Intensive Scalable Computing (DISC) frameworks (e.g., Apache Spark and Hadoop) and environments have emerged and are being used to execute data-intensive workflows. DISC environments are composed of processors and disks in large commodity computing clusters connected by high-speed communication switches and networks. The main advantage of DISC frameworks is that they support efficient in-memory data management for large-scale applications, such as data-intensive workflows. However, executing workflows in cloud and DISC environments raises many challenges, such as scheduling workflow activities and activations, managing produced data, and collecting provenance data.

Several existing approaches address these challenges. As a result, there is a real need to understand how to manage these workflows and the various big data platforms that have been developed and introduced. To that end, this book helps researchers understand how linking workflow management with Data-Intensive Scalable Computing can support the understanding and analysis of scientific big data.

In this book, we aim to identify and distill the body of work on workflow management in clouds and DISC environments. We start by discussing the basic principles of data-intensive scientific workflows. Next, we present two workflows that are executed in single-site and multi-site clouds, taking advantage of provenance. Afterward, we turn to workflow management in DISC environments and present, in detail, solutions that enable the optimized execution of workflows using frameworks such as Apache Spark and its extensions.
