Video description
6+ Hours of Video Instruction
Overview
Data Engineering Foundations Part 1: Using Spark, Hive, and Hadoop
Scalable Tools LiveLessons
provides over six hours of video introducing you to the Apache Hadoop big
data ecosystem. The tutorial includes background information and
demonstrates the core components of data engineering and scalability,
including Apache PySpark, Hadoop, the Hadoop Distributed File System (HDFS),
MapReduce, Hive, and the
Zeppelin web notebook. It also covers the use of basic Linux command line
analytic tools. All lesson examples and open-source software used in these
LiveLessons are freely available on a companion virtual machine that
enables continued exploration of the lesson examples.
About the Instructor
Doug Eadline, Ph.D., began his career as an analytical chemist with an
interest in computer methods. Starting with the first Beowulf how-to
document, Doug has written instructional documents covering many aspects of
Linux HPC, Hadoop, and analytics computing. Doug currently serves as editor
of the ClusterMonkey.net website and lead architect for
Limulus-Computing.com, a maker of desk-side cluster appliances. Previously
he was editor of ClusterWorld Magazine and senior HPC editor for Linux
Magazine. He is also a writer and consultant to the scalable HPC/analytics
industry. His recent video tutorials and books include the Hadoop
Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick-Start Guide
(Addison-Wesley), High Performance Computing for Dummies (Wiley), and
Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).
Skill Level
● Beginner
● Intermediate
Learn How To
● Understand basic data engineering concepts
● Understand Apache Hadoop, MapReduce, and Spark operation
● Understand scalable systems
● Use Linux command line analytic tools
● Use Apache Zeppelin web notebooks with different tools
● Use Apache Hadoop and the Hadoop Distributed File System
● Use Apache Hadoop MapReduce with Python
● Use the Apache Hive Scalable Database
● Use Apache PySpark with MapReduce
● Use Apache PySpark with dataframes and Hive tables
Who Should Take This Course
● Users, developers, and administrators interested in learning the
fundamental aspects and operations of data engineering and scalable systems
Course Requirements
● Basic understanding of programming and application development
● A working knowledge of Linux systems, command line, and tools
● Familiarity with Python, SQL, and the Bash shell
Table of Contents
Introduction
Lesson 1: Background Concepts
Learning objectives:
1.1 Understand big data and data analytics concepts
1.2 Understand Hadoop as a big data platform
1.3 Understand Hadoop MapReduce basics
1.4 Understand Spark language basics
Lesson 2: Working with Scalable Systems
Learning objectives:
2.1 Understand scalable concepts
2.2 Emulate scalable systems
2.3 Use Linux command line analytics tools
2.4 Use the Zeppelin web notebook
Lesson 3: Using the Hadoop HDFS File System
Learning objectives:
3.1 Understand HDFS basics
3.2 Use HDFS command line tools
3.3 Use the HDFS web interface
Lesson 4: Using Hadoop MapReduce
Learning objectives:
4.1 Understand the MapReduce paradigm and platform
4.2 Understand parallel MapReduce
4.3 Run MapReduce examples
4.4 Use the streaming interface
4.5 Use the MapReduce (YARN) web interface
Lesson 5: Using the Hive Scalable Database
Learning objectives:
5.1 Run a Hive SQL example using the command line
5.2 Run a Hive example using a Zeppelin notebook
Lesson 6: Using Apache PySpark
Learning objectives:
6.1 Understand Spark language basics
6.2 Understand SparkSession and Context
6.3 Use PySpark for MapReduce programming
6.4 Run a PySpark example using a Zeppelin notebook
Summary
Lesson Descriptions
Lesson 1: Background Concepts
In Lesson 1, Doug introduces you to the important concepts you need to know
to understand the big data, Hadoop, and Spark ecosystem. He begins with a
description of big data and big data analytics concepts and then presents
Hadoop as a big data platform. He then turns to the basics of Hadoop
MapReduce and the Spark language to finish up the lesson.
Lesson 2: Working with Scalable Systems
In Lesson 2, Doug introduces you to working with scalable systems. The
lesson starts with Doug covering scalable computing concepts and then turns
to a freely available Linux-based virtual machine that runs on most laptop
and desktop systems. Using this virtual machine, you can run most of the
examples in the lessons. Doug also uses the virtual machine to demonstrate
some of the Linux command line analytic tools and to introduce the Zeppelin
web notebook.
Lesson 3: Using the Hadoop HDFS File System
Doug explains the Hadoop Distributed File System (HDFS) in Lesson 3. He
also presents a quick-start on how to use the HDFS command line tools.
Finally, he finishes up the lesson by explaining how to use the HDFS web
interface.
Lesson 4: Using Hadoop MapReduce
In this lesson, Doug explains and demonstrates how to use Hadoop MapReduce.
He begins with an explanation of the MapReduce algorithm and how it
operates in a clustered parallel environment. Doug then demonstrates how to
run MapReduce examples and use the Hadoop streaming interface on your local
machine. He concludes the lesson by demonstrating Hadoop performance using
a four-node Hadoop cluster and the web-based MapReduce jobs interface.
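As a rough illustration of the streaming model the lesson demonstrates (a
minimal sketch, not code taken from the video), a streaming word count can
be written as two small Python scripts that read standard input and emit
tab-separated key-value pairs; Hadoop sorts the mapper output by key before
it reaches the reducer:

    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by word, so counts can be
    # accumulated until the word changes
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word == current:
            count += int(n)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(n)
    if current is not None:
        print(f"{current}\t{count}")

Scripts like these are passed to the hadoop-streaming JAR with the -mapper
and -reducer options; they can also be tested locally with an ordinary Unix
pipeline (mapper, then sort, then reducer).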
Lesson 5: Using the Hive Scalable Database
In Lesson 5, Doug introduces the Hive scalable database. Based on Hadoop
MapReduce, Hive is used to derive a new feature from an existing dataset.
This important data engineering process is demonstrated from both the
command line and the Zeppelin web notebook.
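The lesson itself works from the Hive command line and a Zeppelin notebook;
as a Python-flavored sketch of the same feature-derivation idea, the same
HiveQL can be run through PySpark's Hive support (the trips table and its
columns below are hypothetical examples, not data from the course):

    from pyspark.sql import SparkSession

    # Hive support lets Spark read and create Hive tables
    spark = (SparkSession.builder
             .appName("hive-feature")
             .enableHiveSupport()
             .getOrCreate())

    # Derive a new feature (average speed) from existing columns
    spark.sql("""
        CREATE TABLE trips_enriched AS
        SELECT *, distance / duration AS avg_speed
        FROM trips
    """)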
Lesson 6: Using Apache PySpark
In the final lesson of Part 1, Doug introduces PySpark. Based on the
underlying Spark language, PySpark enables Python programmers to learn
scalable data engineering. Before the hands-on lessons, Doug provides a
solid introduction to Spark and PySpark operations. This background
includes using the Spark web interface and shows how to manage a
SparkSession and a SparkContext for distributed operation. Examples of
MapReduce programming and DataFrame operations are presented from both the
command line and a Zeppelin notebook. The lesson concludes with the
operations needed to transfer data to and from PySpark and Hive database
tables.
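A minimal sketch of the kinds of operations the lesson covers (the input
file name is a placeholder, not an asset from the course): a SparkSession
is the single entry point, its SparkContext drives MapReduce-style RDD
work, and the result converts to a DataFrame for SQL-like queries:

    from pyspark.sql import SparkSession

    # The SparkSession is the entry point; the SparkContext that
    # manages distributed execution hangs off of it
    spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()
    sc = spark.sparkContext

    # MapReduce-style word count using RDD transformations
    counts = (sc.textFile("war-and-peace.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))

    # The same data as a DataFrame, queried with column expressions
    df = counts.toDF(["word", "count"])
    df.orderBy(df["count"].desc()).show(5)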
About Pearson Video Training
Pearson publishes expert-led video tutorials covering a wide selection of
technology topics designed to teach you the skills you need to succeed.
These professional and personal technology videos feature world-leading
author instructors published by your trusted technology brands:
Addison-Wesley, Cisco Press, Pearson IT Certification, Prentice Hall, Sams,
and Que. Topics include IT certification, network security, programming,
web development, mobile development, data analytics, and more. Learn more
about Pearson video training at http://www.informit.com/video.