Learning Spark: Lightning-Fast Data Analytics 2nd Edition

Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee

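  • Publisher: O'Reilly
  • Publication date: 2020-08-25
  • List price: $2,700
  • Sale price: $2,295 (15% off)
  • Language: English
  • Pages: 300
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1492050040
  • ISBN-13: 9781492050049
  • Related categories: Spark, Data Science
  • Related translation: Spark快速大數據分析 第2版 (Simplified Chinese edition)
  • Ships immediately (fewer than 3 in stock)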

Product description

Data is bigger, arrives faster, and comes in a variety of formats--and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

  • Learn the high-level Structured APIs in Python, SQL, Scala, or Java
  • Understand Spark operations and the Spark SQL engine
  • Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow

About the authors

Jules S. Damji is a senior developer advocate at Databricks and an MLflow contributor. He is a hands-on developer with over 20 years of experience and has worked as a software engineer at leading companies such as Sun Microsystems, Netscape, @Home, Loudcloud/Opsware, Verisign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a B.Sc. and an M.Sc. in computer science and an MA in political advocacy and communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.

Brooke Wenig is a machine learning practice lead at Databricks. She leads a team of data scientists who develop large-scale machine learning pipelines for customers, and she teaches courses on distributed machine learning best practices. Previously, she was a principal data science consultant at Databricks. She holds an M.S. in computer science from UCLA with a focus on distributed machine learning.

Tathagata Das is a staff software engineer at Databricks, an Apache Spark committer, and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and is currently one of the core developers of Structured Streaming and Delta Lake. Tathagata holds an M.S. in computer science from UC Berkeley.

Denny Lee is a staff developer advocate at Databricks who has been working with Apache Spark since 0.6. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has an M.S. in biomedical informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise healthcare customers.
