Building the Data Lakehouse (構建數據湖倉)

Bill Inmon, Mary Levins, and Ranjeet Srivastava (USA), authors; Jing'an District, Shanghai

  • Publisher: Tsinghua University Press
  • Publication date: 2023-03-01
  • List price: $408
  • Sale price: $347 (15% off)
  • Language: Simplified Chinese
  • ISBN: 730262447X
  • ISBN-13: 9787302624479



Product Description

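Of all the new elements in the data lakehouse, the most important is the analytical infrastructure that supports data analysis and machine learning. The analytical infrastructure includes many familiar components as well as some concepts that may still be unfamiliar or new to readers, such as metadata, data lineage, measures of data volume, the history of data creation, and descriptions of data transformations.

The second new element of the data lakehouse is the identification and use of a universal connector. A universal connector allows data from all the different sources to be combined and compared. Without it, relating the different data in the lakehouse is very difficult (in practice, nearly impossible). With it, any type of data can be related.

The data lakehouse makes possible a degree of data analysis and machine learning that was impractical or impossible by any other means. As with any other architecture, however, we need to understand the architecture of the data lakehouse and its capabilities so that we can build analytics blueprints and plans on top of it.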

Table of Contents

Introduction

Chapter 1: Evolution to the Data Lakehouse

1. The evolution of technology
2. All the data in the organization
3. Where is the business value?
4. The data lake
5. Challenges of current data architectures
6. The emergence of the data lakehouse

Chapter 2: Data Scientists and End Users

1. The data lake
2. The analytical infrastructure
3. Different audiences
4. Different analytical tools
5. Different purposes of analysis
6. Different analytical approaches
7. Different types of data

Chapter 3: Different Types of Data in the Data Lakehouse

1. Types of data
2. Different volumes of data
3. Relating data across the different types of data
4. Segmenting data according to probability of access
5. Relating data in the analog and IoT environments
6. The analytical infrastructure

Chapter 4: The Open Lakehouse Environment

1. The evolution of open systems
2. Innovation that keeps pace with the times
3. The unstructured lakehouse built on open, standard file formats
4. Open-source data lakehouse software
5. The data lakehouse provides open APIs beyond SQL
6. The data lakehouse supports open data sharing
7. The data lakehouse supports open data exploration
8. The data lakehouse simplifies data discovery with open data catalogs
9. The data lakehouse leverages cloud-native architecture
10. Evolving to the open data lakehouse

Chapter 5: Machine Learning and the Data Lakehouse

1. Machine learning
2. What does machine learning need from a lakehouse?
3. Mining new value from data
4. Resolving this dilemma
5. The problem of unstructured data
6. The importance of open source
7. Taking advantage of cloud elasticity
8. Designing "MLOps" for the data platform
9. Example: classifying chest X-rays with machine learning
10. The evolution of the unstructured component of the data lakehouse

Chapter 6: The Analytical Infrastructure in the Data Lakehouse

1. Metadata
2. The data model
3. Data quality
4. ETL
5. Textual ETL
6. Taxonomies
7. Volume of data
8. Data lineage
9. KPIs
10. Granularity of data
11. Transactions
12. Keys
13. Schedule of processing
14. Summarized data
15. Minimum requirements

Chapter 7: Blending Data in the Data Lakehouse

1. The lakehouse and the data lakehouse
2. The origins of data
3. Different types of analysis
4. Common identifiers
5. Structured identifiers
6. Repetitive data
7. Identifiers in the textual environment
8. Combining textual data and structured data
9. The importance of matching

Chapter 8: Types of Analysis Across the Data Lakehouse Architecture

1. Known queries
2. Heuristic analysis

Chapter 9: Data Lakehouse Housekeeping

1. Data integration and interoperability
2. Master data and reference data for the data lakehouse
3. Privacy, confidentiality, and data protection in the data lakehouse
4. Future-proofed data in the data lakehouse
5. The five phases of future-proofing data
6. Routine maintenance of the data lakehouse

Chapter 10: Visualization

1. Turning data into information
2. What is data visualization, and why is it important?
3. The differences between data visualization, data analysis, and data interpretation
4. The advantages of data visualization

Chapter 11: Data Lineage in the Data Lakehouse Architecture

1. The chain of calculations
2. Selection of data
3. Algorithmic differences
4. Lineage for textual data
5. Lineage for other unstructured environments
6. Data lineage

Chapter 12: Probability of Access in the Data Lakehouse Architecture

1. Efficient arrangement of data
2. Probability of access
3. Different types of data in the data lakehouse
4. Relative differences in data volume
5. Advantages of segmenting data
6. Using bulk storage
7. Additional indexes

Chapter 13: Crossing the Chasm

1. Merging data
2. Different kinds of data
3. Different business needs
4. Crossing the chasm

Chapter 14: Massive Data Volumes in the Data Lakehouse

1. Distribution of massive data volumes
2. High-performance, bulk data storage
3. Additional indexes and summaries
4. Periodic filtering of data
5. Data tokenization
6. Separating text and databases
7. Archival storage
8. Monitoring activity
9. Parallel processing

Chapter 15: Data Governance and the Data Lakehouse

1. The purpose of data governance
2. Data lifecycle management
3. Data quality management
4. The importance of metadata management
5. Data governance over time
6. Types of data governance
7. Data governance across the data lakehouse
8. Data governance considerations

Chapter 16: The Modern Data Warehouse

1. The spread of applications
2. Information silos
3. Complex network environments
4. The data warehouse
5. Definition of the data warehouse
6. Historical data
7. The relational model
8. The native form of data
9. The need for integrated data
10. Times change
11. Today's world
12. Different volumes of data
13. The relationship between data and the business
14. Bringing data into the data warehouse
15. The modern data warehouse
16. When do we no longer need a data warehouse?
17. The data lake
18. The data warehouse as a foundation
19. The data stack