Big Data Analytics with Spark and Hadoop

Venkat Ankam



Key Features

  • This book is based on the latest versions of Apache Spark (2.0) and Hadoop (2.7), integrated with the most commonly used tools.
  • Learn all the Spark stack components, including the latest topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR.
  • Integrate with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark Connector, GraphFrames, H2O, and Hivemall.

Book Description

Big Data Analytics aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components — Spark Core, Spark SQL, DataFrames, Datasets, conventional Spark Streaming, Structured Streaming, MLlib, and GraphX — and the Hadoop core components — HDFS, MapReduce, and YARN — are explored in depth with implementation examples on Spark + Hadoop clusters.

The industry is moving away from MapReduce toward Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building big data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark.

Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the dataflow tool Apache NiFi, to analyze and visualize data.

What you will learn

  • Discover and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, along with the wide variety of tools used with Spark and Hadoop
  • Understand all the Hadoop and Spark ecosystem components
  • Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
  • See batch and real-time data analytics using Spark Core, Spark SQL, and conventional and Structured Streaming
  • Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

About the Author

Venkat Ankam has over 18 years of IT experience, including more than 5 years in big data technologies, working with customers to design and develop scalable big data applications. Having worked with multiple clients globally, he has tremendous experience in big data analytics using Hadoop and Spark.

He is a Cloudera Certified Hadoop Developer and Administrator and also a Databricks Certified Spark Developer. He is the founder and presenter of a few Hadoop and Spark meetup groups globally and loves to share knowledge with the community.

Venkat has delivered hundreds of training sessions, presentations, and white papers in the big data sphere. While this is his first attempt at writing a book, many more books are in the pipeline.

Table of Contents

  1. Big Data Analytics at a 10,000-Foot View
  2. Getting Started with Apache Hadoop and Apache Spark
  3. Deep Dive into Apache Spark
  4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
  5. Real-Time Analytics with Spark Streaming and Structured Streaming
  6. Notebooks and Dataflows with Spark and Hadoop
  7. Machine Learning with Spark and Hadoop
  8. Building Recommendation Systems with Spark and Mahout
  9. Graph Analytics with GraphX
  10. Interactive Analytics with SparkR
