Serverless ETL and Analytics with AWS Glue: Your comprehensive reference guide to learning about AWS Glue and its features

Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama

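  • Publisher: Packt Publishing
  • Publication date: 2022-08-30
  • List price: $1,960
  • VIP price: $1,862 (5% off)
  • Language: English
  • Pages: 434
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 1800564988
  • ISBN-13: 9781800564985
  • Related categories: Amazon Web Services, Serverless
  • Availability: ordered from the supplier after purchase (about 3 to 4 weeks)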

Product Description

Build efficient data lakes that can scale to virtually unlimited size using AWS Glue

Key Features

- Learn to work with AWS Glue to overcome typical implementation challenges in data lakes
- Create and manage serverless ETL pipelines that can scale to manage big data
- Written by AWS Glue community members, this practical guide shows you how to implement AWS Glue in no time

Book Description

Organizations these days have gravitated toward services such as AWS Glue that undertake undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems along with helping you learn about data processing, data integration, and building data lakes.

Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and how to shift left on quality using automated regression tests. You'll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, and get to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you'll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you understand various performance tuning, troubleshooting, and monitoring options.

By the end of this AWS book, you'll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.

What you will learn

- Apply various AWS Glue features to manage and create data lakes
- Use Glue DataBrew and Glue Studio for data preparation
- Optimize data layout in cloud storage to accelerate analytics workloads
- Manage metadata including database, table, and schema definitions
- Secure your data through access control, encryption, auditing, and networking
- Monitor AWS Glue jobs to detect delays and loss of data
- Integrate Spark ML and SageMaker with AWS Glue to create machine learning models

Who this book is for

This book is for ETL developers, data engineers, and data analysts who want to understand how AWS Glue can help them solve their business problems. Basic knowledge of AWS data services is assumed.

About the Authors

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Subramanya Vajiraya is a Big Data Cloud Engineer at AWS Sydney specializing in AWS Glue. He obtained his Bachelor of Engineering degree, specializing in Information Science & Engineering, from NMAM Institute of Technology, Nitte, KA, India (Visvesvaraya Technological University, Belgaum) in 2015, and his Master of Information Technology degree, specializing in Internetworking, from the University of New South Wales, Sydney, Australia in 2017. He is passionate about helping customers solve challenging technical issues related to their ETL workloads and implementing scalable data integration and analytics pipelines on AWS.

Noritaka Sekiyama is a Senior Big Data Architect on the AWS Glue and AWS Lake Formation team. He has 11 years of experience working in the software industry. Based in Tokyo, Japan, he is responsible for implementing software artifacts, building libraries, troubleshooting complex issues and helping guide customer architectures.

Tomohiro Tanaka is a senior cloud support engineer at AWS. He works to help customers solve their issues and build data lakes across AWS Glue, AWS IoT, and big data technologies such as Apache Spark, Hadoop, and Iceberg.

Albert Quiroga works as a senior solutions architect at Amazon, where he is helping to design and architect one of the largest data lakes in the world. Prior to that, he spent four years working at AWS, where he specialized in big data technologies such as EMR and Athena, and where he became an expert on AWS Glue. Albert has worked with several Fortune 500 companies on some of the largest data lakes in the world and has helped to launch and develop features for several AWS services.

Ishan Gaur has more than 13 years of IT experience in software development and data engineering, building distributed systems and highly scalable ETL pipelines using Apache Spark, Scala, and various ETL tools such as Ab Initio and Datastage. He currently works at AWS as a senior big data cloud engineer and is an SME of AWS Glue. He is responsible for helping customers to build out large, scalable distributed systems and implement them in AWS cloud environments using various big data services, including EMR, Glue, and Athena, as well as other technologies, such as Apache Spark, Hadoop, and Hive.

Table of Contents

1. Data Management – Introduction and Concepts
2. Introduction to Important AWS Glue Features
3. Data Ingestion
4. Data Preparation
5. Designing Data Layouts
6. Data Management
7. Metadata Management
8. Data Security
9. Data Sharing
10. Data Pipeline Management
11. Monitoring
12. Tuning, Debugging, and Troubleshooting
13. Data Analysis
14. Machine Learning Integration
15. Architecting Data Lakes for Real-World Scenarios and Edge Cases
