Data Analytics with Spark Using Python (Addison-Wesley Data & Analytics Series)

Jeffrey Aven

Product Description

Solve Data Analytics Problems with Spark, PySpark, and Related Open Source Tools

Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all you need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.

Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples that use the popular and intuitive PySpark development environment. This guide’s focus on Python makes it accessible to a broad audience of data professionals, analysts, and developers, even those with little Hadoop or Spark experience.

Aven’s broad coverage ranges from basic to advanced Spark programming, and from Spark SQL to machine learning. You’ll learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get you up to speed, and extensive hands-on exercises prepare you to solve real problems.

Coverage includes:
• Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
• Create Spark clusters using various deployment modes
• Control and optimize the operation of Spark clusters and applications
• Master Spark Core RDD API programming techniques
• Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
• Efficiently integrate Spark with both SQL and nonrelational data stores
• Perform stream processing and messaging with Spark Streaming and Apache Kafka
• Implement predictive modeling with SparkR and Spark MLlib
 
