Data Analytics with Hadoop: An Introduction for Data Scientists

Benjamin Bengfort, Jenny Kim

  • Publisher: O'Reilly
  • Publication date: 2016-07-12
  • List price: $1,220
  • Sale price: $610 (50% off)
  • Language: English
  • Pages: 288
  • Binding: Paperback
  • ISBN: 1491913703
  • ISBN-13: 9781491913703
  • Related categories: Hadoop, Data Science
  • In stock, ships immediately (stock = 1)

Product Description

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is well suited to the job. Instead of the deployment, operations, or software development usually associated with distributed computing, you’ll focus on the particular analyses you can build, the data warehousing techniques that Hadoop provides, and the higher-order data workflows this framework can produce.

Data scientists and analysts will learn how to apply a wide range of techniques, from writing MapReduce and Spark applications with Python to advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle, and actually require, huge amounts of data.
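
To give a sense of what writing a Spark application with Python looks like, here is a minimal PySpark word count, the canonical first distributed job. This is a sketch in the spirit of the book, not an excerpt from it; the application name and HDFS input path are placeholders.

    # A minimal PySpark word count (RDD API), runnable with spark-submit.
    # The input path below is a placeholder; point it at any text file.
    from operator import add
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("wordcount")
    sc = SparkContext(conf=conf)

    counts = (
        sc.textFile("hdfs:///data/corpus.txt")   # one record per line
          .flatMap(lambda line: line.split())    # split each line into words
          .map(lambda word: (word, 1))           # emit (word, 1) pairs
          .reduceByKey(add)                      # sum the counts per word
    )

    # Collect the ten most frequent words back to the driver and print them.
    for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, count)

    sc.stop()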

  • Understand core concepts behind Hadoop and cluster computing
  • Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
  • Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase
  • Use Sqoop to ingest data from relational databases and Apache Flume to ingest streaming log data
  • Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
  • Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark’s MLlib (see the sketch after this list)
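
As a taste of the last bullet, here is a minimal sketch of collaborative filtering with MLlib's alternating least squares (ALS) recommender. The ratings file and its "user,item,rating" layout are assumptions made purely for illustration, not material from the book.

    # A sketch of collaborative filtering with MLlib's ALS recommender.
    # The ratings path and its "user,item,rating" CSV layout are assumed.
    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-recommender")

    # Parse each CSV line into an MLlib Rating(user, product, rating).
    ratings = (
        sc.textFile("hdfs:///data/ratings.csv")
          .map(lambda line: line.split(","))
          .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))
    )

    # Factor the user-item matrix into 10 latent features over 10 iterations.
    model = ALS.train(ratings, rank=10, iterations=10)

    # Recommend five products for a hypothetical user with id 42.
    for rec in model.recommendProducts(42, 5):
        print(rec.product, rec.rating)

    sc.stop()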
