Hortonworks HDP Developer Quick Start

Name: training4it.com
Address: 9913 Shelbyville Rd #200, Louisville, KY, 40223
Telephone: 502.265.3057

Retail Price: $2,800.00

Next Date: Request Date

Course Days: 4

About this Course

This 4-day training course is designed for developers who need to create applications that analyze Big Data stored in Apache Hadoop using Apache Pig and Apache Hive, and who need to develop applications on Apache Spark.

Topics include an essential understanding of HDP and its capabilities; Hadoop, YARN, HDFS, and MapReduce/Tez; data ingestion; using Pig and Hive to perform data analytics on Big Data; and an introduction to Spark Core, Spark SQL, Apache Zeppelin, and additional Spark features.

Audience Profile

Developers and data engineers who need to understand and develop applications on HDP.

Prerequisites

Students should be familiar with programming principles and have experience in software development. Knowledge of SQL and light scripting is also helpful. No prior Hadoop knowledge is required.

Course Outline

Day 1: An Introduction to Apache Hadoop and HDFS

Describe the Case for Hadoop
Describe the Trends of Volume, Velocity and Variety
Discuss the Importance of Open Enterprise Hadoop
Describe the Hadoop Ecosystem Frameworks Across the Following Five Architectural Categories:
- Data Management
- Data Access
- Data Governance & Integration
- Security
- Operations
Describe the Function and Purpose of the Hadoop Distributed File System (HDFS)
List the Major Architectural Components of HDFS and their Interactions
Describe Data Ingestion
Describe Batch/Bulk Ingestion Options
Describe the Streaming Framework Alternatives
Describe the Purpose and Function of MapReduce
Describe the Purpose and Components of YARN
Describe the Major Architectural Components of YARN and their Interactions
Define the Purpose and Function of Apache Pig
Work with the Grunt Shell
Work with Pig Latin Relation Names and Field Names
Describe the Pig Data Types and Schema

Day 1 Labs and Demonstrations

Starting an HDP Cluster
Using HDFS Commands
Demonstration: Understanding Apache Pig
Getting Started with Apache Pig
Exploring Data with Pig

Day 2: Advanced Apache Pig Programming

Demonstrate Common Operators Such as:
- Order by
- Case
- Distinct
- Parallel
- Foreach
Understand how Hive Tables are Defined and Implemented
Use Hive to Explore and Analyze Data Sets
Explain and Use the Various Hive File Formats
Create and Populate a Hive Table that Uses ORC File Formats
Use Hive to Run SQL-like Queries to Perform Data Analysis
Use Hive to Join Datasets Using a Variety of Techniques
Write Efficient Hive Queries
Explain the Uses and Purpose of HCatalog
Use HCatalog with Pig and Hive

Day 2 Labs and Demonstrations

Splitting a Dataset
Joining Datasets
Preparing Data for Apache Hive
Understanding Apache Hive Tables
Demonstration: Understanding Partitions and Skew
Analyzing Big Data with Apache Hive
Demonstration: Computing Ngrams
Joining Datasets in Apache Hive
Computing Ngrams of Emails in Avro Format
Using HCatalog with Apache Pig

Day 3: Advanced Apache Hive Programming and an Introduction to Apache Spark

Describe How to Perform a Multi-Table/File Insert
Define and Use Views
Define and Use Clauses and Windows
List the Hive File Formats Including:
- Text Files
- SequenceFile
- RCFile
- ORC File
Define Hive Optimization
Use Apache Zeppelin to Work with Spark
Describe the Purpose and Benefits of Spark
Define Spark REPLs and Application Architecture
Explain the Purpose and Function of RDDs
Explain Spark Programming Basics
Define and Use Basic Spark Transformations
Define and Use Basic Spark Actions
Invoke Functions for Multiple RDDs, Create Named Functions and Use Numeric Operations

Day 3 Labs

Advanced Apache Hive Programming
Introduction to Apache Spark REPLs and Apache Zeppelin
Creating and Manipulating RDDs
Creating and Manipulating Pair RDDs

Day 4: Working with Pair RDDs and Building YARN Applications

Define and Create Pair RDDs
Perform Common Operations on Pair RDDs
Name the Various Components of Spark SQL and Explain their Purpose
Describe the Relationship Between DataFrames, Tables and Contexts
Use Various Methods to Create and Save DataFrames and Tables
Understand Caching, Persisting and the Different Storage Levels
Describe and Implement Checkpointing
Create an Application to Submit to the Cluster
Describe Client vs Cluster Submission with YARN
Submit an Application to the Cluster
List and Set Important Configuration Items

Day 4 Labs

Creating and Saving DataFrames and Tables
Working with DataFrames
Building and Submitting Applications to YARN

Sorry! It looks like we haven’t updated our dates for the class you selected yet. There’s a quick way to find out. Contact us at 502.265.3057 or email info@training4it.com
