Mastering Spark for Data Science

Andrew Morgan, Antoine Amend, David George, Matthew Hallett

Product Description

Unlock the complexities of lightning-fast data science

About This Book

  • Develop and apply advanced analytical techniques with Spark
  • Learn how to tell a compelling story in data science using Spark's ecosystem
  • Explore data at scale and work with cutting-edge data science methods

Who This Book Is For

This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, and who are looking for a challenge and want to learn cutting-edge techniques. It assumes a working knowledge of data science, common machine learning methods, and popular data science tools, and that you have previously run proof-of-concept studies and built prototypes.

What You Will Learn

  • Learn the design patterns that integrate Spark into industrialized data science pipelines
  • Understand how commercial data scientists design scalable, reusable code for data science services
  • Get a grasp of new cutting-edge data science methods so you can study trends and causality
  • Find out how to use Spark as a universal ingestion engine and as a web scraper
  • Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
  • Get to know the best practices for Extended Exploratory Data Analysis, as commonly performed in commercial data science teams
  • Grasp advanced Spark concepts, as well as solution design patterns and integration architectures
  • Demonstrate powerful data science pipelines
  • Get detailed guidance on how to run Spark in production

In Detail

The purpose of data science is to transform the world using data, and this goal is mainly achieved by disrupting and changing real processes in real industries. To operate at this level, you need to be able to build data science solutions of substance: ones that solve real problems and run reliably enough for people to trust and act on. Spark has emerged as the big data platform of choice for data scientists.

This book dives deep into Spark to deliver production-grade data science solutions that are innovative, disruptive, and reliable enough to be trusted. We demonstrate the process by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. We use the core Spark APIs and take a deep dive into advanced libraries, including Spark SQL, visual streaming, MLlib, and more.

We introduce advanced techniques and methods to help you build data science solutions, and show you how to construct commercial-grade data products. Using a sequence of tutorials that deliver a working news intelligence service, we explain advanced Spark architectures, unveil sophisticated data science methods, demonstrate how to work with geographic data in Spark, and explain how to tune Spark algorithms so they scale linearly.
