Snowflake Cookbook
Techniques for building modern cloud data warehousing solutions
Hamid Mahmood Qureshi | Hammad Sharif FOR SALE IN INDIA ONLY
BIRMINGHAM—MUMBAI
Snowflake Cookbook
Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh
Publishing Product Manager: Ali Abidi
Commissioning Editor: Sunith Shetty
Acquisition Editor: Ali Abidi
Senior Editor: Roshan Kumar
Content Development Editors: Athikho Rishana, Sean Lobo
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Priyanka Dhadke
Production Designer: Vijay Kamble

First published: February 2021
Production reference: 1230221

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-80056-061-1
www.packt.com
To my father, whose authoring of countless books was an inspiration. To my mother, who dedicated her life to her children's education and well-being. – Hamid Qureshi
To my dad and mom for unlimited prayers and (according to my siblings, a bit extra) love. I cannot thank and appreciate you enough. To my wife and the mother of my children for her support and encouragement throughout this and other treks made by us. – Hammad Sharif
Contributors

About the authors

Hamid Qureshi is a senior cloud and data warehouse professional with almost two decades of total experience, having architected, designed, and led the implementation of several data warehouse and business intelligence solutions. He has extensive experience and certifications across various data analytics platforms, ranging from Teradata, Oracle, and Hadoop to modern, cloud-based tools such as Snowflake. Having worked extensively with traditional technologies, combined with his knowledge of modern platforms, he has accumulated substantial practical expertise in data warehousing and analytics in Snowflake, which he has subsequently captured in his publications.

I want to thank the people who have helped me on this journey: my co-author Hammad, our technical reviewer, Hassaan, the Packt team, and my loving wife and children for their support throughout this journey.

Hammad Sharif is an experienced data architect with more than a decade of experience in the information domain, covering governance, warehousing, data lakes, streaming data, and machine learning. He has worked with a leading data warehouse vendor for a decade as part of a professional services organization, advising customers in telco, retail, life sciences, and financial industries located in Asia, Europe, and Australia during presales and post-sales implementation cycles. Hammad holds an MSc. in computer science and has published conference papers in the domains of machine learning, sensor networks, software engineering, and remote sensing.

I would like to first and foremost thank my loving wife and children for their patience and encouragement throughout the long process of writing this book. I'd also like to thank Hamid for inviting me to be his partner in crime and for his patience, my publishing team for their guidance, and the reviewers for helping improve this work.
About the reviewers

Hassaan Sajid has around 12 years of experience in data warehousing and business intelligence in the retail, telecommunications, banking, insurance, and government sectors. He has worked with various clients in Australia, UAE, Pakistan, Saudi Arabia, and the USA in multiple BI/data warehousing roles, including BI architect, BI developer, ETL developer, data modeler, operations analyst, data analyst, and technical trainer. He holds a master's degree in BI and is a professional Scrum Master. He is also certified in Snowflake, MicroStrategy, Tableau, Power BI, and Teradata. His hobbies include reading, traveling, and photography.

Buvaneswaran Matheswaran has a bachelor's degree in electronics and communication engineering from the Government College of Technology, Coimbatore, India. He had the opportunity to work on Snowflake in its very early stages and has more than 4 years of Snowflake experience. He has done lots of work and research on Snowflake as an enterprise admin. He has worked mainly in retail- and Consumer Product Goods (CPG)-based Fortune 500 companies. He is immensely passionate about cloud technologies, data security, performance tuning, and cost optimization. This is the first time he has done a technical review for a book, and he enjoyed the experience immensely. He has learned a lot as a user and also shared his experience as a veteran Snowflake admin.

Daan Bakboord is a self-employed data and analytics consultant from the Netherlands. His passion is collecting, processing, storing, and presenting data. He has a simple motto: a customer must be able to make decisions based on facts and within the right context. DaAnalytics is his personal (online) label. He provides data and analytics services, having been active in Oracle Analytics since the mid-2000s. Since the end of 2017, his primary focus has been in the area of cloud analytics. Focused on Snowflake and its ecosystem, he is Snowflake Core Pro certified and, thanks to his contributions to the community, has been recognized as a Snowflake Data Hero. He is also Managing Partner Data and Analytics at Pong, a professional services provider that focuses on data-related challenges.
Table of Contents

Each recipe is organized into Getting ready, How to do it…, How it works…, and, where applicable, There's more… sections.

Preface

1. Getting Started with Snowflake
  Technical requirements
  Creating a new Snowflake instance
  Creating a tailored multi-cluster virtual warehouse
  Using the Snowflake WebUI and executing a query
  Using SnowSQL to connect to Snowflake
  Connecting to Snowflake with JDBC
  Creating a new account admin user and understanding built-in roles

2. Managing the Data Life Cycle
  Technical requirements
  Managing a database
  Managing a schema
  Managing tables
  Managing external tables and stages
  Managing views in Snowflake

3. Loading and Extracting Data into and out of Snowflake
  Technical requirements
  Configuring Snowflake access to private S3 buckets
  Loading delimited bulk data into Snowflake from cloud storage
  Loading delimited bulk data into Snowflake from your local machine
  Loading Parquet files into Snowflake
  Making sense of JSON semi-structured data and transforming to a relational view
  Processing newline-delimited JSON (or NDJSON) into a Snowflake table
  Processing near real-time data into a Snowflake table using Snowpipe
  Extracting data from Snowflake

4. Building Data Pipelines in Snowflake
  Technical requirements
  Creating and scheduling a task
  Conjugating pipelines through a task tree
  Querying and viewing the task history
  Exploring the concept of streams to capture table-level changes
  Combining the concept of streams and tasks to build pipelines that process changed data on a schedule
  Converting data types and Snowflake's failure management
  Managing context using different utility functions

5. Data Protection and Security in Snowflake
  Technical requirements
  Setting up custom roles and completing the role hierarchy
  Configuring and assigning a default role to a user
  Delineating user management from security and role management
  Configuring custom roles for managing access to highly secure data
  Setting up development, testing, pre-production, and production database hierarchies and roles
  Safeguarding the ACCOUNTADMIN role and users in the ACCOUNTADMIN role

6. Performance and Cost Optimization
  Technical requirements
  Examining table schemas and deriving an optimal structure for a table
  Identifying query plans and bottlenecks
  Weeding out inefficient queries through analysis
  Identifying and reducing unnecessary Fail-safe and Time Travel storage usage
  Projections in Snowflake for performance
  Reviewing query plans to modify table clustering
  Optimizing virtual warehouse scale

7. Secure Data Sharing
  Technical requirements
  Sharing a table with another Snowflake account
  Sharing data through a view with another Snowflake account
  Sharing a complete database with another Snowflake account and setting up future objects to be shareable
  Creating reader accounts and configuring them for non-Snowflake sharing
  Keeping costs in check when sharing data with non-Snowflake users

8. Back to the Future with Time Travel
  Technical requirements
  Using Time Travel to return to the state of data at a particular time
  Using Time Travel to recover from the accidental loss of table data
  Identifying dropped databases, tables, and other objects and restoring them using Time Travel
  Using Time Travel in conjunction with cloning to improve debugging
  Using cloning to set up new environments based on the production environment rapidly

9. Advanced SQL Techniques
  Technical requirements
  Managing timestamp data
  Shredding date data to extract Calendar information
  Unique counts and Snowflake
  Managing transactions in Snowflake
  Ordered analytics over window frames
  Generating sequences in Snowflake

10. Extending Snowflake Capabilities
  Technical requirements
  Creating a Scalar user-defined function using SQL
  Creating a Table user-defined function using SQL
  Creating a Scalar user-defined function using JavaScript
  Creating a Table user-defined function using JavaScript
  Connecting Snowflake with Apache Spark
  Using Apache Spark to prepare data for storage on Snowflake

Other Books You May Enjoy
  Why subscribe?

Index
Preface

Understanding an analytics technology is important before embarking on delivering data analytics solutions, particularly in the cloud. This book introduces Snowflake tools and techniques you can use to tame challenges associated with data management, warehousing, and analytics. The cloud provides a quick onboarding mechanism, but novice users who lack the knowledge to use Snowflake efficiently to build and maintain a data warehouse can run up higher bills by relying on trial and error. This book provides a practical introduction and guidance for those who have used other technologies for analytics and data warehousing, either on-premises or in the cloud, and who are keen to transfer their skills to the new technology.

The book presents practical examples of tasks typically involved in data warehousing and analytics in a simple way, supported by code. It takes you through the user interface and management console offered by Snowflake and shows how to get started by creating an account. It also takes you through examples of how to load data and how to deliver analytics using different Snowflake capabilities, and touches on extending the capabilities of Snowflake using stored procedures and user-defined functions. The book also touches on integrating Snowflake with Java and Apache Spark to allow it to coexist with a data lake.

By the end of this book, you will be able to build applications on Snowflake that can serve as the building blocks of a larger solution, alongside security, governance, the data life cycle, and the distribution of data on Snowflake.
Who this book is for

The book acts as a reference for users who want to learn about Snowflake using a practical approach. The recipe-based approach allows the different personas in data management to pick and choose what they want to learn, as and when required. The recipes are independent and start by helping you to understand the environment. The recipes require basic SQL and data warehousing knowledge.
Snowflake Cookbook
Snowflake is a unique cloud-based data warehousing platform built from scratch to tackle data management on the cloud. This book introduces Snowflake’s unique architecture, which places it at the forefront of cloud data warehouses. We will explore the compute model available with Snowflake and how Snowflake allows extensive scaling through virtual warehouses. You will learn how to configure a virtual warehouse for optimizing cost and performance. You will explore the data ecosystem and discover how Snowflake integrates with other technologies for staging and loading data. As you progress through the chapters, you will leverage Snowflake’s capabilities to process a series of SQL statements using tasks to build data pipelines and find out how you can create modern data solutions and pipelines designed to provide high performance and scalability. You will also get to grips with creating role hierarchies, adding custom roles, and setting default roles for users before covering advanced topics such as data sharing, cloning, and performance optimization. By the end of this Snowflake book, you will be well-versed in Snowflake’s architecture for building modern analytical solutions and understand best practices for solving commonly faced problems using practical recipes.
Things you will learn:
• Data warehousing techniques aligned with Snowflake’s cloud architecture
• Broad skills for data warehouse designers to cover the Snowflake ecosystem and tooling
• Transfer skills from on-premises data warehousing to the Snowflake cloud analytics platform
• Optimize performance and costs associated with a Snowflake solution
• Stage data on object stores and load it into Snowflake
• Secure data and share it efficiently for access in a controlled manner
• Manage transactions and extend Snowflake using stored procedures
• Extend cloud data applications using the Spark Connector