Data Ingestion with Python Cookbook: A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process

Esppenchutz, Gláucia

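  • Publisher: Packt Publishing
  • Publication date: 2023-05-31
  • List price: $1,590
  • VIP price: $1,511 (5% off)
  • Language: English
  • Pages: 414
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 183763260X
  • ISBN-13: 9781837632602
  • Related categories: Python programming language
  • Overseas special-order book (requires separate checkout)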

Product Description

Deploy your data ingestion pipeline, orchestrate, and monitor efficiently to prevent loss of data and quality

Purchase of the print or Kindle book includes a free PDF eBook


Key Features:

  • Harness best practices to create a Python and PySpark data ingestion pipeline
  • Seamlessly automate and orchestrate your data pipelines using Apache Airflow
  • Build a monitoring framework by integrating the concept of data observability into your pipelines


Book Description:

Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open source tools on the market to answer commonly asked questions and overcome challenges.

You'll be introduced to designing and working with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you'll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation.

By the end of the book, you'll have a fully automated setup that lets you start ingesting and monitoring your data pipeline with little effort and that integrates cleanly with the subsequent stages of the ETL process.


What You Will Learn:

  • Implement data observability using monitoring tools
  • Automate your data ingestion pipeline
  • Read analytical and partitioned data, whether or not it is schema-based
  • Debug and prevent data loss through efficient data monitoring and logging
  • Establish data access policies using a data governance framework
  • Construct a data orchestration framework to improve data quality


Who this book is for:

This book is for data engineers and data enthusiasts seeking a comprehensive understanding of the data ingestion process using popular tools in the open source community. For more advanced learners, this book takes on the theoretical pillars of data governance while providing practical examples of real-world scenarios commonly encountered by data engineers.
