Learning Spark: Lightning-Fast Big Data Analysis (Paperback)

Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia


Product Description

The Web is getting faster, and the data it delivers is getting bigger. How can you handle everything efficiently? This book introduces Spark, an open source cluster computing system that makes data analytics fast to run and fast to write. You’ll learn how to run programs faster, using primitives for in-memory cluster computing. With Spark, your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce.

Written by the developers of Spark, this book will have you up and running in no time. You’ll learn how to express MapReduce jobs with just a few simple lines of Spark code, instead of spending extra time and effort working with Hadoop’s raw Java API.

  • Quickly dive into Spark capabilities such as collect, count, reduce, and save
  • Use one programming paradigm instead of mixing and matching tools such as Hive, Hadoop, Mahout, and S4/Storm
  • Learn how to run interactive, iterative, and incremental analyses
  • Integrate with Scala to manipulate distributed datasets like local collections
  • Tackle partitioning issues, data locality, default hash partitioning, user-defined partitioners, and custom serialization
  • Use other languages through pipe() to achieve the equivalent of Hadoop Streaming
