Mastering Spark for Data Science
Andrew Morgan, Antoine Amend, David George, Matthew Hallett
Unlock the complexities of lightning-fast data science
About This Book
- Develop and apply advanced analytical techniques with Spark
- Learn how to tell a compelling story in data science using Spark's ecosystem
- Explore data at scale and work with cutting-edge data science methods
Who This Book Is For
This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, who are looking for a challenge and want to learn cutting-edge techniques. It assumes a working knowledge of data science, common machine learning methods, and popular data science tools, and that you have previously run proof-of-concept studies and built prototypes.
What You Will Learn
- Learn the design patterns that integrate Spark into industrialized data science pipelines
- Understand how commercial data scientists design scalable and reusable code for data science services
- Get a grasp of new cutting-edge data science methods so you can study trends and causality
- Find out how to use Spark as a universal ingestion engine and as a web scraper
- Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
- Get to know the best practices for performing Extended Exploratory Data Analysis, a technique commonly used in commercial data science teams
- Grasp advanced Spark concepts, as well as solution design patterns and integration architectures
- Demonstrate powerful data science pipelines
- Get detailed guidance on how to run Spark in production
The purpose of data science is to transform the world using data, and this goal is mainly achieved by disrupting and changing real processes in real industries. To operate at this level, you need to be able to build data science solutions of substance: ones that solve real problems and that run reliably enough for people to trust and act on. Spark has emerged as the big data platform of choice for data scientists.
This book takes a deep dive into Spark to deliver production-grade data science solutions that are innovative, disruptive, and reliable enough to be trusted. We demonstrate the process by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. We use the core Spark APIs and take a deep dive into advanced libraries, including Spark SQL, visual streaming, MLlib, and more.
We introduce advanced techniques and methods to help you build data science solutions, and show you how to construct commercial-grade data products. Using a sequence of tutorials that deliver a working news intelligence service, we explain advanced Spark architectures, unveil sophisticated data science methods, demonstrate how to work with geographic data in Spark, and explain how to tune Spark algorithms so they scale linearly.