Apache Spark 2.x Machine Learning Cookbook

Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei


Simplify machine learning model implementations with Spark

About This Book

  • Solve the day-to-day problems of data science with Spark
  • This unique cookbook consists of exciting and intuitive numerical recipes
  • Optimize your work by acquiring, cleaning, analyzing, predicting, and visualizing your data

Who This Book Is For

This book is for Scala developers who have a fairly good exposure to and understanding of machine learning techniques but lack practical experience implementing them with Spark. A solid knowledge of machine learning algorithms is assumed, as well as hands-on experience implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and ecosystem.

What You Will Learn

  • Get to know how Scala and Spark go hand in hand when developing ML systems
  • Build a recommendation engine that scales with Spark
  • Find out how to build unsupervised clustering systems to group data in Spark
  • Build machine learning systems with decision tree and ensemble models in Spark
  • Deal with the curse of high-dimensionality in big data using Spark
  • Implement text analytics for search engines in Spark
  • Implement streaming machine learning systems using Spark

In Detail

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning algorithms enable a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks.

This book begins with a quick overview of setting up the IDEs needed to run the code examples covered in the various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We progress by exploring the various Spark APIs and implementing ML algorithms to develop classification systems, recommendation engines, text analytics, clustering, and learning systems. In the final chapters, we'll focus on building high-end applications and explain various unsupervised methodologies, along with the challenges to tackle when implementing big data ML systems.

Style and approach

This book is packed with intuitive recipes supported by line-by-line explanations to help you understand how to optimize your workflow and resolve problems when working with complex data modeling tasks and predictive algorithms. This is a valuable resource for data scientists and those working on large-scale data projects.

