Big Data Science & Analytics: A Hands-On Approach (Express Import)

Arshdeep Bahga, Vijay Madisetti

  • Publisher: VPT
  • Publication date: 2016-04-15
  • List price: $2,859
  • VIP price: $2,716 (5% off)
  • Language: English
  • Pages: 544
  • Binding: Hardcover
  • ISBN: 0996025545
  • ISBN-13: 9780996025546
  • Related categories: Big Data, Data Science
  • Ships immediately (in stock: 1)

Product Description

Data and information are the fuel of this new age, in which powerful analytics algorithms burn this fuel to generate decisions that are expected to create a smarter and more efficient world for all of us to live in. This new area of technology has been defined as Big Data Science and Analytics, and the industrial and academic communities are recognizing it as a competitive technology that can generate significant new wealth and opportunity.

Big data is defined as collections of datasets whose volume, velocity or variety is so large that it is difficult to store, manage, process and analyze the data using traditional databases and data processing tools. Big data science and analytics deals with the collection, storage, processing and analysis of massive-scale data. Industry surveys, by Gartner and e-Skills for instance, predict that there will be over 2 million job openings for engineers and scientists trained in the area of data science and analytics alone, and that the job market in this area is growing at a rate of 150 percent year over year.

We have written this textbook, as part of our expanding "A Hands-On Approach"(TM) series, to meet this need at colleges and universities, and also for big data service providers who may be interested in offering a broader perspective of this emerging field to accompany their customer and developer training programs. The typical reader is expected to have completed a couple of courses in programming using traditional high-level languages at the college level, and to be either a senior or a beginning graduate student in one of the science, technology, engineering or mathematics (STEM) fields. An accompanying website for this book contains additional support for instruction and learning (www.big-data-analytics-book.com).

The book is organized into three main parts, comprising a total of twelve chapters. Part I provides an introduction to big data, applications of big data, and big data science and analytics patterns and architectures. A novel data science and analytics application system design methodology is proposed, and its realization through the use of open-source big data frameworks is described. This methodology describes big data analytics applications as realizations of the proposed Alpha, Beta, Gamma and Delta models, which comprise tools and frameworks for collecting and ingesting data from various sources into the big data analytics infrastructure, distributed filesystems and non-relational (NoSQL) databases for data storage, and processing frameworks for batch and real-time analytics. This new methodology forms the pedagogical foundation of this book.

Part II introduces the reader to various tools and frameworks for big data analytics, and the architectural and programming aspects of these frameworks, with examples in Python. We describe Publish-Subscribe messaging frameworks (Kafka & Kinesis), Source-Sink connectors (Flume), Database Connectors (Sqoop), Messaging Queues (RabbitMQ, ZeroMQ, RestMQ, Amazon SQS) and custom REST, WebSocket and MQTT-based connectors. The reader is introduced to data storage, batch and real-time analysis, and interactive querying frameworks including HDFS, Hadoop, MapReduce, YARN, Pig, Oozie, Spark, Solr, HBase, Storm, Spark Streaming, Spark SQL, Hive, Amazon Redshift and Google BigQuery. Also described are serving databases (MySQL, Amazon DynamoDB, Cassandra, MongoDB) and the Django Python web framework.

Part III introduces the reader to various machine learning algorithms with examples using the Spark MLlib and H2O frameworks, and visualizations using frameworks such as Lightning, Pygal and Seaborn.
