Manish Kumar, Chanchal Singh

Building Data Streaming Applications with Apache Kafka

Designing and deploying enterprise messaging queues

FOR SALE IN INDIA ONLY

BIRMINGHAM - MUMBAI

Building Data Streaming Applications with Apache Kafka

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2017

Production reference: 1170817

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78728-398-5 www.packtpub.com

Credits

Authors: Manish Kumar, Chanchal Singh
Copy Editor: Manisha Sinha
Reviewer: Anshul Joshi
Project Coordinator: Manthan Patel
Commissioning Editor: Amey Varangaonkar
Proofreader: Safis Editing
Acquisition Editor: Tushar Gupta
Indexer: Tejal Daruwale Soni
Content Development Editor: Tejas Limkar
Graphics: Tania Dutta
Technical Editor: Dinesh Chaudhary
Production Coordinator: Deepika Naik

About the Authors

Manish Kumar is a Technical Architect at DataMetica Solution Pvt. Ltd. He has approximately 11 years of experience in data management, working as a Data Architect and Product Architect. He has extensive experience in building effective ETL pipelines, implementing security over Hadoop, and providing the best possible solutions to Data Science problems. Before joining the world of big data, he worked as a Tech Lead for Sears Holding, India. He is a regular speaker on big data concepts such as Hadoop and Hadoop Security at various events. Manish has a Bachelor's degree in Information Technology.

I would like to thank my parents, Dr. N.K. Singh and Mrs. Rambha Singh, for their support and blessings; my wife, Mrs. Swati Singh, for keeping me healthy and happy; and my adorable son, Master Lakshya Singh, for teaching me how to enjoy the small things in life. I would like to extend my gratitude to Mr. Prashant Jaiswal, whose mentorship and friendship will remain gems of my life, and Chanchal Singh, my esteemed friend, for standing by me in times of trouble and happiness. This note would be incomplete if I did not mention Mr. Anand Deshpande, Mr. Parashuram Bastawade, Mr. Niraj Kumar, Mr. Rajiv Gupta, and Dr. Phil Shelley for giving me exciting career opportunities and showing trust in me, no matter how adverse the situation was.

Chanchal Singh is a Software Engineer at DataMetica Solution Pvt. Ltd. He has over three years' experience in product development and architecture design, working as a Product Developer, Data Engineer, and Team Lead. He has a lot of experience with different technologies such as Hadoop, Spark, Storm, Kafka, Hive, Pig, Flume, Java, Spring, and many more. He believes in sharing knowledge and motivating others to innovate. He is the co-organizer of the Big Data Meetup - Pune Chapter. He has been recognized for putting innovative ideas into organizations. He has a Bachelor's degree in Information Technology from the University of Mumbai and a Master's degree in Computer Application from Amity University. He was also part of the Entrepreneur Cell in IIT Mumbai.

I would like to thank my parents, Mr. Parasnath Singh and Mrs. Usha Singh, for showering their blessings on me and for their loving support. I am eternally grateful to my love, Ms. Jyoti, for being with me in every situation and encouraging me. I would also like to express my gratitude to all the mentors I've had over the years. Special thanks to Mr. Abhijeet Shingate, who helped me as a mentor and guided me in the right direction during the initial phase of my career. I am highly indebted to Mr. Manish Kumar, without whom writing this book would have been challenging, for always enlightening me and sharing his knowledge with me. I would like to extend my sincere thanks by mentioning a few great personalities: Mr. Rajiv Gupta, Mr. Niraj Kumar, Mr. Parashuram Bastawade, and Dr. Phil Shelley, for giving me ample opportunities to explore solutions for real customer problems and believing in me.

About the Reviewer

Anshul Joshi is a Data Scientist with experience in recommendation systems, predictive modeling, neural networks, and high performance computing. His research interests are deep learning, artificial intelligence, computational physics, and biology. Most of the time, he can be caught exploring GitHub or trying anything new that he can get his hands on. He blogs at https://anshuljoshi.com/.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

• Fully searchable across every book published by Packt

• Copy and paste, print, and bookmark content

• On demand and accessible via a web browser

Customer Feedback Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787283984. If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

Chapter 1: Introduction to Messaging Systems
  Understanding the principles of messaging systems
  Understanding messaging systems
  Peeking into a point-to-point messaging system
  Publish-subscribe messaging system
  Advance Queuing Messaging Protocol
  Using messaging systems in big data streaming applications
  Summary

Chapter 2: Introducing Kafka the Distributed Messaging Platform
  Kafka origins
  Kafka's architecture
  Message topics
  Message partitions
  Replication and replicated logs
  Message producers
  Message consumers
  Role of Zookeeper
  Summary

Chapter 3: Deep Dive into Kafka Producers
  Kafka producer internals
  Kafka Producer APIs
  Producer object and ProducerRecord object
  Custom partition
  Additional producer configuration
  Java Kafka producer example
  Common messaging publishing patterns
  Best practices
  Summary

Chapter 4: Deep Dive into Kafka Consumers
  Kafka consumer internals
  Understanding the responsibilities of Kafka consumers
  Kafka consumer APIs
  Consumer configuration
  Subscription and polling
  Committing and polling
  Additional configuration
  Java Kafka consumer
  Scala Kafka consumer
  Rebalance listeners
  Common message consuming patterns
  Best practices
  Summary

Chapter 5: Building Spark Streaming Applications with Kafka
  Introduction to Spark
  Spark architecture
  Pillars of Spark
  The Spark ecosystem
  Spark Streaming
  Receiver-based integration
  Disadvantages of receiver-based approach
  Java example for receiver-based integration
  Scala example for receiver-based integration
  Direct approach
  Java example for direct approach
  Scala example for direct approach
  Use case log processing - fraud IP detection
  Maven
  Producer
  Property reader
  Producer code
  Fraud IP lookup
  Expose hive table
  Streaming code
  Summary

Chapter 6: Building Storm Applications with Kafka
  Introduction to Apache Storm
  Storm cluster architecture
  The concept of a Storm application
  Introduction to Apache Heron
  Heron architecture
  Heron topology architecture
  Integrating Apache Kafka with Apache Storm - Java
  Example
  Integrating Apache Kafka with Apache Storm - Scala
  Use case – log processing in Storm, Kafka, Hive
  Producer
  Producer code
  Fraud IP lookup
  Running the project
  Summary

Chapter 7: Using Kafka with Confluent Platform
  Introduction to Confluent Platform
  Deep driving into Confluent architecture
  Understanding Kafka Connect and Kafka Stream
  Kafka Streams
  Playing with Avro using Schema Registry
  Moving Kafka data to HDFS
  Camus
  Running Camus
  Gobblin
  Gobblin architecture
  Kafka Connect
  Flume
  Summary

Chapter 8: Building ETL Pipelines Using Kafka
  Considerations for using Kafka in ETL pipelines
  Introducing Kafka Connect
  Deep dive into Kafka Connect
  Introductory examples of using Kafka Connect
  Kafka Connect common use cases
  Summary

Chapter 9: Building Streaming Applications Using Kafka Streams
  Introduction to Kafka Streams
  Using Kafka in Stream processing
  Kafka Stream - lightweight Stream processing library
  Kafka Stream architecture
  Integrated framework advantages
  Understanding tables and Streams together
  Maven dependency
  Kafka Stream word count
  KTable
  Use case example of Kafka Streams
  Maven dependency of Kafka Streams
  Property reader
  IP record producer
  IP lookup service
  Fraud detection application
  Summary

Chapter 10: Kafka Cluster Deployment
  Kafka cluster internals
  Role of Zookeeper
  Replication
  Metadata request processing
  Producer request processing
  Consumer request processing
  Capacity planning
  Capacity planning goals
  Replication factor
  Memory
  Hard drives
  Network
  CPU
  Single cluster deployment
  Multicluster deployment
  Decommissioning brokers
  Data migration
  Summary

Chapter 11: Using Kafka in Big Data Applications
  Managing high volumes in Kafka
  Appropriate hardware choices
  Producer read and consumer write choices
  Kafka message delivery semantics
  At least once delivery
  At most once delivery
  Exactly once delivery
  Big data and Kafka common usage patterns
  Kafka and data governance
  Alerting and monitoring
  Useful Kafka matrices
  Producer matrices
  Broker matrices
  Consumer metrics
  Summary

Chapter 12: Securing Kafka
  An overview of securing Kafka
  Wire encryption using SSL
  Steps to enable SSL in Kafka
  Configuring SSL for Kafka Broker
  Configuring SSL for Kafka clients
  Kerberos SASL for authentication
  Steps to enable SASL/GSSAPI - in Kafka
  Configuring SASL for Kafka broker
  Configuring SASL for Kafka client - producer and consumer
  Understanding ACL and authorization
  Common ACL operations
  List ACLs
  Understanding Zookeeper authentication
  Apache Ranger for authorization
  Adding Kafka Service to Ranger
  Adding policies
  Best practices
  Summary

Chapter 13: Streaming Application Design Considerations
  Latency and throughput
  Data and state persistence
  Data sources
  External data lookups
  Data formats
  Data serialization
  Level of parallelism
  Out-of-order events
  Message processing semantics
  Summary

Index

Preface

Apache Kafka is a popular distributed streaming platform that acts as a messaging queue or an enterprise messaging system. It lets you publish and subscribe to a stream of records and process them in a fault-tolerant way as they occur. This book is a comprehensive guide to designing and architecting enterprise-grade streaming applications using Apache Kafka and other big data tools. It includes best practices for building such applications and tackles some common challenges, such as how to use Kafka efficiently to handle high data volumes with ease. This book first introduces the types of messaging systems and then provides a thorough introduction to Apache Kafka and its internal details. The second part of the book takes you through designing streaming applications using various frameworks and tools such as Apache Spark, Apache Storm, and more. Once you grasp the basics, we will take you through more advanced concepts in Apache Kafka such as capacity planning and security. By the end of this book, you will have all the information you need to be comfortable with using Apache Kafka and to design efficient streaming data applications with it.

What this book covers

Chapter 1, Introduction to Messaging System, introduces concepts of messaging systems. It covers an overview of messaging systems and their enterprise needs. It further emphasizes the different ways of using messaging systems such as point to point or publish/subscribe. It introduces AMQP as well.

Chapter 2, Introducing Kafka - The Distributed Messaging Platform, introduces distributed messaging platforms such as Kafka. It covers the Kafka architecture and touches upon its internal components. It further explores the role and importance of each Kafka component and how they contribute towards the low latency, reliability, and scalability of Kafka messaging systems.

Chapter 3, Deep Dive into Kafka Producers, is about how to publish messages to Kafka systems. It further covers Kafka Producer APIs and their usage. It showcases examples of using Kafka Producer APIs with the Java and Scala programming languages. It takes a deep dive into Producer message flows and some common patterns for producing messages to Kafka topics. It walks through some performance optimization techniques for Kafka Producers.

Building Data Streaming Applications with Apache Kafka

Apache Kafka is a popular distributed streaming platform that acts as a messaging queue or an enterprise messaging system. It lets you publish and subscribe to a stream of records, and process them in a fault-tolerant way as they occur. This book is a comprehensive guide to designing and architecting enterprise-grade streaming applications using Apache Kafka and other big data tools. It includes best practices for building such applications, and tackles some common challenges, such as how to use Kafka efficiently and handle high data volumes with ease. This book first introduces the types of messaging systems and then provides a thorough introduction to Apache Kafka and its internal details. The second part of the book takes you through designing streaming applications using various frameworks and tools such as Apache Spark, Apache Storm, and more. Once you grasp the basics, we will take you through more advanced concepts in Apache Kafka, such as capacity planning and security.

Things you will learn:

• Learn the basics of Apache Kafka from scratch

• Use the basic building blocks of a streaming application

• Design effective streaming applications with Kafka using Spark, Storm, and Heron

• Understand the importance of a low-latency, high-throughput, and fault-tolerant messaging system

• Plan capacity effectively when deploying your Kafka application

• Understand and implement the best security practices

By the end of this book, you will have all the information you need to be comfortable with using Apache Kafka, and to design efficient streaming data applications with it.

www.packtpub.com
