Learning Hadoop 2

Garry Turkington, Gabriele Modena

  • Publisher: Packt Publishing
  • Publication date: 2015-01-31
  • List price: $2,110
  • Member price: $2,005 (5% off)
  • Language: English
  • Pages: 316
  • Binding: Paperback
  • ISBN: 1783285516
  • ISBN-13: 9781783285518
  • Related categories: Hadoop
  • Imported title, ordered from overseas (requires separate checkout)
    No stock available


Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2

About This Book

  • Construct state-of-the-art applications using higher-level interfaces and tools beyond the traditional MapReduce approach
  • Use the unique features of Hadoop 2 to model and analyze Twitter's global stream of user-generated data
  • Develop a prototype on a local cluster and deploy to the cloud (Amazon Web Services)

Who This Book Is For

If you are a system or application developer interested in learning how to solve practical problems using the Hadoop framework, then this book is ideal for you. You are expected to be familiar with the Unix/Linux command-line interface and have some experience with the Java programming language. Familiarity with Hadoop would be a plus.

What You Will Learn

  • Write distributed applications using the MapReduce framework
  • Go beyond MapReduce and process data in real time with Samza and iteratively with Spark
  • Familiarize yourself with data mining approaches that work with very large datasets
  • Prototype applications on a VM and deploy them to a local cluster or to a cloud infrastructure (Amazon Web Services)
  • Conduct batch and real-time data analysis using SQL-like tools
  • Build data processing flows using Apache Pig and see how it enables the easy incorporation of custom functionality
  • Define and orchestrate complex workflows and pipelines with Apache Oozie
  • Manage your data lifecycle and changes over time

In Detail

This book introduces you to the world of building data-processing applications with the wide variety of tools supported by Hadoop 2. Starting with the core components of the framework, HDFS and YARN, this book will guide you through how to build applications using a variety of approaches.

You will learn how YARN completely changes the relationship between MapReduce and Hadoop and allows the latter to support more varied processing approaches and a broader array of applications. These include real-time processing with Apache Samza and iterative computation with Apache Spark. Next up, we discuss Apache Pig and the dataflow data model it provides. You will discover how to use Pig to analyze a Twitter dataset.

With this book, you will be able to make your life easier by using tools such as Apache Hive, Apache Oozie, Hadoop Streaming, Apache Crunch, and Kite SDK. The last part of this book discusses the likely future direction of major Hadoop components and how to get involved with the Hadoop community.

