Kafka Troubleshooting in Production: Stabilizing Kafka Clusters in the Cloud and On-Premises

Eldor, Elad

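  • Publisher: Apress
  • Publication date: 2023-11-30
  • List price: $1,500
  • VIP price: $1,425 (95% of list price)
  • Language: English
  • Pages: 216
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1484294890
  • ISBN-13: 9781484294895
  • Related categories: Message Queue
  • Overseas special-order title (requires separate checkout)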

Product Description

This book provides Kafka administrators, site reliability engineers, and DataOps and DevOps practitioners with a list of real production issues that can occur in Kafka clusters and how to solve them. The production issues covered are assembled into a comprehensive troubleshooting guide for those engineers who are responsible for the stability and performance of Kafka clusters in production, whether those clusters are deployed in the cloud or on-premises. This book teaches you how to detect and troubleshoot the issues, and eventually how to prevent them.
Kafka stability is hard to achieve, especially in high-throughput environments, and the purpose of this book is not only to make troubleshooting easier, but also to prevent production issues from occurring in the first place. The guidance in this book is drawn from the author's years of experience in helping clients and internal customers diagnose and resolve knotty production problems and stabilize their Kafka environments. The book is organized into recipe-style troubleshooting checklists that field engineers can easily follow when under pressure to fix an unstable cluster. This is the book you will want by your side when the stakes are high and your job is on the line.
What You Will Learn
  • Monitor and resolve production issues in your Kafka clusters
  • Provision Kafka clusters with the lowest costs and still handle the required loads
  • Perform root cause analyses of issues affecting your Kafka clusters
  • Know the ways in which your Kafka cluster can affect its consumers and producers
  • Prevent or minimize data loss and delays in data streaming
  • Forestall production issues through an understanding of common failure points
  • Create checklists for troubleshooting your Kafka clusters when problems occur
Who This Book Is For
Site reliability engineers tasked with maintaining the stability of Kafka clusters, Kafka administrators who troubleshoot production issues around Kafka, DevOps and DataOps experts who are involved with provisioning Kafka (whether on-premises or in the cloud), and developers of Kafka consumers and producers who wish to learn more about Kafka.

About the Author

Elad Eldor is a DataOps team leader in the Grow division of Unity (formerly ironSource), working on handling stability issues, improving performance, and reducing the cost of high-scale Kafka, Druid, Presto, and Spark clusters on AWS. He has 12 years of experience as a backend software engineer and six years handling DataOps of big data Linux-based clusters.

Prior to working at Unity, Elad was a Site Reliability Engineer (SRE) at Cognyte, where he developed big data applications and handled the reliability and scalability of Spark and Kafka clusters in production. His main interests are performance tuning and cost reduction of big data clusters.
