In-Memory Analytics with Apache Arrow: Perform fast and efficient data analytics on both flat and hierarchical structured data

Topol, Matthew

  • Publisher: Packt Publishing
  • Publication date: 2022-06-24
  • List price: $1,690
  • Sale price: $1,521 (10% off)
  • Language: English
  • Pages: 392
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1801071039
  • ISBN-13: 9781801071031
  • Related categories: Data Science
  • Ships immediately (stock: 1)

Product Description

Process tabular data and build high-performance query engines on modern CPUs and GPUs using Apache Arrow, a standardized language-independent memory format, for optimal performance

Key Features

• Learn about Apache Arrow's data types and interoperability with pandas and Parquet
• Work with Apache Arrow Flight RPC, Compute, and Dataset APIs to produce and consume tabular data
• Reviewed, contributed to, and supported by Dremio, the co-creator of Apache Arrow

Book Description

Apache Arrow is designed to accelerate analytics and allow the exchange of data across big data systems easily.

In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format, then shows Arrow's versatility and benefits through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, and working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and become well versed in the relationships between Arrow, Parquet, Feather, Protobuf, FlatBuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn how Dremio uses Apache Arrow to enhance SQL analytics and discover how Arrow can be used in browser-based web apps. Finally, you'll get to grips with the upcoming features of Arrow to help you stay ahead of the curve.

By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.

What you will learn

• Use Apache Arrow libraries to access data files both locally and in the cloud
• Understand the zero-copy elements of the Apache Arrow format
• Improve read performance by memory-mapping files with Apache Arrow
• Produce or consume Apache Arrow data efficiently using a C API
• Use the Apache Arrow Compute APIs to perform complex operations
• Create Arrow Flight servers and clients for transferring data quickly
• Build the Arrow libraries locally and contribute back to the community
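
As a rough illustration of the memory-mapping idea in the list above, the sketch below uses only Python's standard library rather than Arrow itself. It stores a column of 64-bit integers contiguously in a file, maps the file, and sums a slice through a `memoryview` without first copying the column's bytes out of the mapping. The raw file layout and the helper names `write_int64_column` and `sum_slice_mmap` are assumptions made for this example; they are not taken from the book. With the `pyarrow` package, the analogous entry points are `pyarrow.memory_map` and `pyarrow.ipc.open_file`.

```python
import array
import mmap
import os
import tempfile


def write_int64_column(path, values):
    """Write values as one contiguous buffer of native-endian 64-bit integers.

    Hypothetical helper: this is a bare fixed-width buffer, not the Arrow IPC format.
    """
    with open(path, "wb") as f:
        array.array("q", values).tofile(f)


def sum_slice_mmap(path, start, stop):
    """Sum elements [start, stop) of the column through a memory-mapped view.

    The operating system pages the file in on demand; slicing the memoryview
    creates a new view over the same bytes instead of copying them.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Every view must be released before the mapping can be closed,
            # so each one is managed by a `with` block.
            with memoryview(mapped) as raw, raw.cast("q") as column:
                with column[start:stop] as window:
                    return sum(window)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "column.bin")
    write_int64_column(path, range(10))
    print(sum_slice_mmap(path, 2, 5))  # 2 + 3 + 4 = 9
```

The fixed-width layout is what makes the slice cheap: element `i` always starts at byte offset `8 * i`, so no parsing is needed to locate a value.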

Who this book is for

This book is for developers, data analysts, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. It will also be useful to engineers who build utilities for data analytics and query engines, or who otherwise work with tabular data, regardless of programming language. Some familiarity with basic data analysis concepts will help you get the most out of this book but isn't required. Code examples are provided in C++, Go, and Python.

Table of Contents

1. Getting Started with Apache Arrow
2. Working with Key Arrow Specifications
3. Data Science with Apache Arrow
4. Format and Memory Handling
5. Crossing the Language Barrier with the Arrow C Data API
6. Leveraging the Arrow Compute APIs
7. Using the Arrow Datasets API
8. Exploring Apache Arrow Flight RPC
9. Powered By Apache Arrow
10. How to Leave Your Mark on Arrow
11. Future Development and Plans
