Genomics in the Cloud: GATK, Spark, and Docker

Brian D. O'Connor, Geraldine Van der Auwera

  • Publisher: O'Reilly
  • Publication date: 2020-05-12
  • List price: $2,800
  • Sale price: $2,660 (5% off)
  • Language: English
  • Pages: 475
  • Binding: Paperback
  • ISBN: 1491975199
  • ISBN-13: 9781491975190
  • Related categories: Docker, Spark
  • Ships immediately (stock: 1)

Product description

Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host 50+ petabytes—or 52.4 million gigabytes—of genomic data, and they’re turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that data in the cloud?

With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Brian O’Connor of the UC Santa Cruz Genomics Institute and Geraldine Van der Auwera, longtime custodian of the GATK user community, guide you through the process. You’ll learn by working with real data and genomics algorithms from the field.

This book takes you through:

  • Essential genomics and computing technology background
  • Basic cloud computing operations
  • Getting started with GATK
  • Three major GATK best practices for variant discovery pipelines
  • Automating analysis with scripted workflows using WDL and Cromwell
  • Scaling up workflow execution in the cloud, including parallelization and cost optimization
  • Interactive analysis in the cloud using Jupyter notebooks
  • Secure collaboration and computational reproducibility using Terra

About the authors

Dr. Geraldine A. Van der Auwera is the Director of Outreach and Communication for the Data Sciences Platform (DSP) at the Broad Institute of MIT and Harvard. As part of her outreach role, she serves as an educator and advocate for researchers who use DSP software and services including GATK, the Broad's industry-leading toolkit for variant discovery analysis; the Cromwell/WDL workflow management system; and Terra.bio, a cloud-based analysis platform that integrates computational resources, methods repository and data management in a user-friendly environment. Van der Auwera was originally trained as a microbiologist, earning her Ph.D. in Biological Engineering from the Université catholique de Louvain (UCL) in Belgium in 2007, then surviving a 4-year postdoctoral stint at Harvard Medical School. She joined the Broad Institute in 2012 to become Benevolent Dictator For Life of the GATK user community, leaving behind the bench and pipette work forever.

Dr. Brian O’Connor is the Technical Director of the UCSC Genomics Institute Analysis Core. There he focuses on the development and deployment of large-scale, cloud-based systems for analyzing genomics data. This includes the Toil workflow execution platform, which is designed to run genomic pipelines on a wide range of cloud environments including AWS, Azure, Google and OpenStack, and ADAM, a distributed genomics platform developed in collaboration with UC Berkeley. He is also the co-chair of the Containers and Workflows task team of the Global Alliance for Genomics and Health (GA4GH) where he works on tool and workflow container standards. Brian recently joined UCSC from the Ontario Institute for Cancer Research (OICR) where his previous projects included leading the technical implementation of cloud-based analysis systems for the PanCancer Analysis of Whole Genomes (PCAWG) effort, the creation of the Dockstore project (http://dockstore.org), and the development of the International Cancer Genome Consortium’s Data Portal (http://dcc.icgc.org).
