Video description
"Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide."
Jonathan Sharley, Pandora Media
Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort start, you can download a preconfigured virtual machine that is ready for you to try the book's code.
Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and many other upgrades.
Inside:
- Updated for Spark 2.0
- Real-life case studies
- Spark DevOps with Docker
- Examples in Scala, and online in Java and Python
Made for experienced programmers with some background in big data or machine learning.
Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community.
"Must-have! Speed up your learning of Spark as a distributed computing framework."
Robert Ormandi, Yahoo!
"An easy-to-follow, step-by-step guide."
Gaurav Bhardwaj, 3Pillar Global
"An ambitiously comprehensive overview of Spark and its diverse ecosystem."
Jonathan Miller, Optensity
NARRATED BY KYLE JACKSON AND MARK THOMAS
Table of Contents
PART 1: FIRST STEPS
Chapter 1. Introduction to Apache Spark
Chapter 1. What Spark brings to the table
Chapter 1. Spark components
Chapter 1. Spark program flow
Chapter 1. Setting up the spark-in-action VM
Chapter 2. Spark fundamentals
Chapter 2. Using the VM’s Hadoop installation
Chapter 2. Using Spark shell and writing your first Spark program
Chapter 2. Basic RDD actions and transformations
Chapter 2. Using the distinct and flatMap transformations
Chapter 2. Obtaining RDD’s elements with the sample, take, and takeSample operations
Chapter 2. Double RDD functions
Chapter 3. Writing Spark applications
Chapter 3. Developing the application
Chapter 3. Running the application from Eclipse
Chapter 3. Broadcast variables
Chapter 3. Submitting the application
Chapter 3. Using spark-submit
Chapter 4. The Spark API in depth
Chapter 4. Basic pair RDD functions
Chapter 4. Using the flatMapValues transformation to add values to keys
Chapter 4. Understanding data partitioning and reducing data shuffling
Chapter 4. Understanding and avoiding unnecessary shuffling
Chapter 4. Repartitioning RDDs
Chapter 4. Joining, sorting, and grouping data
Chapter 4. Joining data
Chapter 4. Sorting data
Chapter 4. Grouping data
Chapter 4. Understanding RDD dependencies
Chapter 4. Using accumulators and broadcast variables to communicate with Spark executors
Chapter 4. Sending data to executors using broadcast variables
PART 2: MEET THE SPARK FAMILY
Chapter 5. Sparkling queries with Spark SQL
Chapter 5. Creating DataFrames from RDDs
Chapter 5. Creating a DataFrame from an RDD of tuples
Chapter 5. DataFrame API basics
Chapter 5. Using SQL functions to perform calculations on data
Chapter 5. Working with missing values
Chapter 5. Grouping and joining data
Chapter 5. Beyond DataFrames: introducing DataSets
Chapter 5. Table catalog and Hive metastore
Chapter 5. Executing SQL queries
Chapter 5. Saving and loading DataFrame data
Chapter 5. Saving data
Chapter 5. Catalyst optimizer
Chapter 6. Ingesting data with Spark Streaming
Chapter 6. Creating a discretized stream
Chapter 6. Saving the results to a file
Chapter 6. Saving the computation state over time
Chapter 6. Specifying the checkpointing directory
Chapter 6. Using window operations for time-limited calculations
Chapter 6. Using external data sources
Chapter 6. Changing the streaming application to use Kafka
Chapter 6. Performance of Spark Streaming jobs
Chapter 6. Structured Streaming
Chapter 7. Getting smart with MLlib
Chapter 7. Classification of machine-learning algorithms
Chapter 7. Linear algebra in Spark
Chapter 7. Distributed matrices
Chapter 7. Linear regression
Chapter 7. Expanding the model to multiple linear regression
Chapter 7. Analyzing and preparing the data
Chapter 7. Fitting and using a linear regression model
Chapter 7. Tweaking the algorithm
Chapter 7. Plotting residual plots
Chapter 7. Optimizing linear regression
Chapter 8. ML: classification and clustering
Chapter 8. Logistic regression
Chapter 8. Preparing data to use logistic regression in Spark
Chapter 8. Training the model
Chapter 8. Performing k-fold cross-validation
Chapter 8. Decision trees and random forests
Chapter 8. Decision trees
Chapter 8. Random forests
Chapter 8. Using k-means clustering
Chapter 8. K-means clustering
Chapter 8. Summary
Chapter 9. Connecting the dots with GraphX
Chapter 9. Transforming graphs
Chapter 9. Graph algorithms
Chapter 9. Implementing the A* search algorithm
Chapter 9. Implementing the A* algorithm
Chapter 9. Summary
PART 3: SPARK OPS
Chapter 10. Running Spark
Chapter 10. Job and resource scheduling
Chapter 10. Data-locality considerations
Chapter 10. Configuring Spark
Chapter 10. Spark web UI
Chapter 10. Running Spark on the local machine
Chapter 11. Running on a Spark standalone cluster
Chapter 11. Starting the standalone cluster
Chapter 11. Viewing Spark processes
Chapter 11. Standalone cluster web UI
Chapter 11. Specifying extra classpath entries and files
Chapter 11. Spark History Server and event logging
Chapter 11. Creating an EC2 standalone cluster
Chapter 11. Using the EC2 cluster
Chapter 12. Running on YARN and Mesos
Chapter 12. Resource scheduling in YARN
Chapter 12. Configuring Spark on YARN
Chapter 12. Configuring resources for Spark jobs
Chapter 12. Finding logs on YARN
Chapter 12. Running Spark on Mesos
Chapter 12. Installing and configuring Mesos
Chapter 12. Mesos resource scheduling
Chapter 12. Running Spark with Docker
PART 4: BRINGING IT TOGETHER
Chapter 13. Case study: real-time dashboard
Chapter 13. Running the application
Chapter 13. Starting the application manually
Chapter 13. Understanding the source code
Chapter 13. The StreamingLogAnalyzer project
Chapter 14. Deep learning on Spark with H2O
Chapter 14. Using H2O with Spark
Chapter 14. Performing regression with H2O’s deep learning
Chapter 14. Building and evaluating a deep-learning model using the Sparkling Water API
Chapter 14. Performing classification with H2O’s deep learning