Librería Portfolio

LEARNING SPARK 2E

Title: LEARNING SPARK 2E
Author: DAMJI, J.
Publisher: O'REILLY
Year of publication: 2020
Subject: INTERNET PROGRAMMING
ISBN: 978-1-4920-5004-9
Pages: 400
Price: 67,50 €

Sinopsis

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

Learn Python, SQL, Scala, or Java high-level Structured APIs
Understand Spark operations and SQL Engine
Inspect, tune, and debug Spark operations with Spark configurations and Spark UI
Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
Perform analytics on batch and streaming data using Structured Streaming
Build reliable data pipelines with open source Delta Lake and Spark
Develop machine learning pipelines with MLlib and productionize models using MLflow



Table of contents

Foreword
Preface
Who This Book Is For
How the Book Is Organized
How to Use the Code Examples
Software and Configuration Used
Conventions Used in This Book
Using Code Examples
O'Reilly Online Learning
How to Contact Us
Acknowledgments
1. Introduction to Apache Spark: A Unified Analytics Engine
The Genesis of Spark
Big Data and Distributed Computing at Google
Hadoop at Yahoo!
Spark's Early Years at AMPLab
What Is Apache Spark?
Speed
Ease of Use
Modularity
Extensibility
Unified Analytics
Apache Spark Components as a Unified Stack
Apache Spark's Distributed Execution
The Developer's Experience
Who Uses Spark, and for What?
Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
Step 1: Downloading Apache Spark
Spark's Directories and Files
Step 2: Using the Scala or PySpark Shell
Using the Local Machine
Step 3: Understanding Spark Application Concepts
Spark Application and SparkSession
Spark Jobs
Spark Stages
Spark Tasks
Transformations, Actions, and Lazy Evaluation
Narrow and Wide Transformations
The Spark UI
Your First Standalone Application
Counting M&Ms for the Cookie Monster
Building Standalone Applications in Scala
Summary
3. Apache Spark's Structured APIs
Spark: What's Underneath an RDD?
Structuring Spark
Key Merits and Benefits
The DataFrame API
Spark's Basic Data Types
Spark's Structured and Complex Data Types
Schemas and Creating DataFrames
Columns and Expressions
Rows
Common DataFrame Operations
End-to-End DataFrame Example
The Dataset API
Typed Objects, Untyped Objects, and Generic Rows
Creating Datasets
Dataset Operations
End-to-End Dataset Example
DataFrames Versus Datasets
When to Use RDDs
Spark SQL and the Underlying Engine
The Catalyst Optimizer
Summary
4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications
Basic Query Examples
SQL Tables and Views
Managed Versus Unmanaged Tables
Creating SQL Databases and Tables
Creating Views
Viewing the Metadata
Caching SQL Tables
Reading Tables into DataFrames
Data Sources for DataFrames and SQL Tables
DataFrameReader
DataFrameWriter
Parquet
JSON
CSV
Avro
ORC
Images
Binary Files
Summary
5. Spark SQL and DataFrames: Interacting with External Data Sources
Spark SQL and Apache Hive
User-Defined Functions
Querying with the Spark SQL Shell, Beeline, and Tableau
Using the Spark SQL Shell
Working with Beeline
Working with Tableau
External Data Sources
JDBC and SQL Databases
PostgreSQL
MySQL
Azure Cosmos DB
MS SQL Server
Other External Sources
Higher-Order Functions in DataFrames and Spark SQL
Option 1: Explode and Collect
Option 2: User-Defined Function
Built-in Functions for Complex Data Types
Higher-Order Functions
Common DataFrames and Spark SQL Operations
Unions
Joins
Windowing
Modifications
Summary
6. Spark SQL and Datasets
Single API for Java and Scala
Scala Case Classes and JavaBeans for Datasets
Working with Datasets
Creating Sample Data
Transforming Sample Data
Memory Management for Datasets and DataFrames
Dataset Encoders
Spark's Internal Format Versus Java Object Format
Serialization and Deserialization (SerDe)
Costs of Using Datasets
Strategies to Mitigate Costs
Summary
7. Optimizing and Tuning Spark Applications
Optimizing and Tuning Spark for Efficiency
Viewing and Setting Apache Spark Configurations
Scaling Spark for Large Workloads
Caching and Persistence of Data
DataFrame.cache()
DataFrame.persist()
When to Cache and Persist
When Not to Cache and Persist
A Family of Spark Joins
Broadcast Hash Join
Shuffle Sort Merge Join
Inspecting the Spark UI
Journey Through the Spark UI Tabs
Summary
8. Structured Streaming
Evolution of the Apache Spark Stream Processing Engine
The Advent of Micro-Batch Stream Processing
Lessons Learned from Spark Streaming (DStreams)
The Philosophy of Structured Streaming
The Programming Model of Structured Streaming
The Fundamentals of a Structured Streaming Query
Five Steps to Define a Streaming Query
Under the Hood of an Active Streaming Query
Recovering from Failures with Exactly-Once Guarantees
Monitoring an Active Query
Streaming Data Sources and Sinks
Files
Apache Kafka
Custom Streaming Sources and Sinks
Data