Phase 1: Hadoop Fundamentals with a Single-Node Setup (Day 1)
Laying the foundation
Introduction to Hadoop and Spark
- Ecosystem
- Big Data Overview
- Key Roles in Big Data Project
- Key Business Use Cases
- Hadoop and Spark Logical Architecture
- Typical Big Data Project Pipeline
Basic Concepts of HDFS
- HDFS Overview
- Physical Architectures of HDFS
- Hands-on with the Hadoop Distributed File System (see the sketch below)
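As a preview of the hands-on above, here is a minimal sketch that drives common `hdfs dfs` commands from Python. It assumes a working single-node install with `hdfs` on the PATH; the file and directory names are placeholders.

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and print its output."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

# Create a working directory, upload a local file, then inspect it.
hdfs("-mkdir", "-p", "/user/training/demo")        # placeholder path
hdfs("-put", "-f", "data.csv", "/user/training/demo/")
hdfs("-ls", "/user/training/demo")
hdfs("-cat", "/user/training/demo/data.csv")
```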
Hadoop Ecosystem
Introduction to Sqoop
- What is Sqoop?
- Importing a single table / importing all tables
- Sqoop Job, Eval, and Codegen
- Listing databases and tables
Hadoop Hands-on
- Running HDFS commands
- Running Sqoop import and export (see the sketch after this list)
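A sketch of what the Sqoop hands-on typically looks like, again shelling out from Python. The JDBC URL, credentials, and table names are all hypothetical; adjust them to your environment.

```python
import subprocess

def sqoop(*args):
    """Run a Sqoop command; assumes Sqoop and a JDBC driver are installed."""
    subprocess.run(["sqoop", *args], check=True)

# Import a single MySQL table into HDFS (all names are placeholders).
sqoop("import",
      "--connect", "jdbc:mysql://dbhost/retail_db",
      "--username", "training", "--password", "training",
      "--table", "orders",
      "--target-dir", "/user/training/orders",
      "--num-mappers", "2")

# Export processed results from HDFS back to the database.
sqoop("export",
      "--connect", "jdbc:mysql://dbhost/retail_db",
      "--username", "training", "--password", "training",
      "--table", "order_summary",
      "--export-dir", "/user/training/order_summary")
```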
Introduction to Spark
- Spark Overview
- Detailed discussion on “Why Spark”
- Quick Recap of MapReduce
- Spark vs MapReduce
- Why Python for Spark
- Just Enough Python for Spark (see the sketch after this list)
- Understanding CDH Spark vs. Apache Spark
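Since the course uses Python for Spark, "Just Enough Python" amounts to a handful of constructs the Spark API leans on constantly. A minimal illustration:

```python
# Lambdas, comprehensions, and tuples are the bread and butter of the RDD API.
words = ["hadoop", "spark", "hive", "kafka"]

# A lambda: an anonymous one-expression function, used heavily in map/filter.
upper = list(map(lambda w: w.upper(), words))

# A list comprehension: the same kind of transformation, written Pythonically.
lengths = [len(w) for w in words]

# Key-value tuples: the shape Spark's pair-RDD operations expect.
pairs = [(w, 1) for w in words]

print(upper, lengths, pairs)
```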
Phase 2: Hadoop Development (Day 2)
Become a pro developer with Spark and the Hive data warehouse
Spark Core Framework and API
- High-level Spark Architecture
- Roles of the Executor, Driver, SparkSession, etc.
- Resilient Distributed Datasets
- Basic operations in the Spark Core API, i.e. actions and transformations (see the sketch after this list)
- Using the Spark REPL for interactive data analysis
- Hands-on Exercises
- Integrating with Hive
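A minimal sketch of transformations vs. actions, written much as one might type it into the PySpark REPL (SparkSession setup is included so it also runs as a script; the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: nothing executes until an action is called.
evens   = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions trigger execution and return results to the driver.
print(squares.collect())   # [4, 16, 36, 64, 100]
print(squares.count())     # 5

spark.stop()
```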
Delving Deeper into the Spark API
- Pair RDDs
- Implementing MapReduce algorithms using Spark (see the sketch after this list)
- Ways to create Pair RDDs
- JSON processing, with a code example
- XML Processing
- Joins
- Playing with Regular Expressions
- Log File Processing using Regular Expressions
- Hands-on Exercises
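Tying several of these topics together, a sketch of log-file processing with regular expressions and a pair RDD. The log format and sample lines are invented for illustration:

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-pairs").getOrCreate()
sc = spark.sparkContext

# Toy access-log lines; in the exercise these would come from HDFS.
logs = sc.parallelize([
    '10.0.0.1 - - "GET /index.html" 200',
    '10.0.0.2 - - "GET /missing" 404',
    '10.0.0.1 - - "GET /about.html" 200',
])

pattern = re.compile(r'^(\S+) .* (\d{3})$')

# Map each line to a (status_code, 1) pair, then reduce by key -- the
# classic MapReduce word-count shape expressed in Spark.
status_counts = (logs
    .map(lambda line: pattern.match(line))
    .filter(lambda m: m is not None)
    .map(lambda m: (m.group(2), 1))
    .reduceByKey(lambda a, b: a + b))

print(status_counts.collect())   # e.g. [('200', 2), ('404', 1)]

spark.stop()
```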
Executing a Spark Application
- Writing a standalone Spark application (see the skeleton after this list)
- Commands to execute and configure Spark applications in various modes
- Discussion on Application, Job, Stage, Executor, and Task
- Interpreting RDD Metadata/Lineage/DAG
- Controlling degree of Parallelism in Spark Job
- Physical execution of a Spark application
- Discussion: how is Spark better than MapReduce?
- Hands-on Exercises
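A skeleton of a standalone application, with an example spark-submit invocation in the comments. The master URL, input path, and partition count are placeholders:

```python
# wordcount.py -- submit with, e.g.:
#   spark-submit --master local[2] wordcount.py hdfs:///user/training/input.txt
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    sc = spark.sparkContext

    # Degree of parallelism can be influenced explicitly via minPartitions.
    lines = sc.textFile(sys.argv[1], minPartitions=4)

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, n in counts.take(10):
        print(word, n)

    spark.stop()
```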
Phase 3: Hadoop with Spark DataFrames and Spark SQL (Days 3 and 4)
Spark SQL
- DataFrames in Depth
- Creating DataFrames (see the sketch after this list)
- Discussion on different file formats: ORC, SequenceFile, Avro, and Parquet
- DataFrame internals that make it fast – the Catalyst optimizer and Tungsten
- Loading data into Spark from external sources such as relational databases
- Saving DataFrames to external sinks such as HDFS and an RDBMS
- SQL features of DataFrames
- Data formats – text formats such as CSV, JSON, and XML; binary formats such as Parquet and ORC
- UDFs in Spark DataFrames
- When to use Hive UDFs, and when not to
- CDC (change data capture) use cases
- Spark optimization techniques for joins
- Integration with Teradata: a use case
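A sketch covering several of the bullets above: creating a DataFrame from CSV, applying a Python UDF, and saving to Parquet. Paths and column names are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("df-basics").getOrCreate()

# Load a CSV with a header into a DataFrame (path is a placeholder).
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/orders.csv"))

# A Python UDF -- note that built-in functions are preferred when available,
# since UDFs are opaque to the Catalyst optimizer.
tier = udf(lambda amount: "big" if amount > 100 else "small", StringType())

result = (orders
          .filter(col("status") == "COMPLETE")
          .withColumn("tier", tier(col("amount"))))

result.show(5)

# Persist in a columnar binary format.
result.write.mode("overwrite").parquet("/data/orders_enriched")

spark.stop()
```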
Understanding Hive
- Hive as a Data Warehouse
- Creating Tables for Analysis of data
- Techniques of Loading Data into Tables
- Difference between Internal and External Tables
- Understanding Hive Data Types
- Joining and Unioning Datasets
- Join Optimizations
- Partitions and Bucketing
- Running a Spark SQL Application
- DataFrames on JSON files
- DataFrames on Hive tables
- Querying operations on DataFrames
- Writing HiveQL queries for data retrieval (see the sketch after this list)
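A sketch of Hive integration from Spark SQL, including a partitioned managed table. It assumes a reachable Hive metastore; the database objects are placeholders:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() wires Spark SQL to the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# A managed (internal) table, partitioned by country -- dropping it deletes
# the data; an EXTERNAL table would leave the underlying files in place.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id INT,
        amount   DOUBLE
    )
    PARTITIONED BY (country STRING)
    STORED AS PARQUET
""")

# DataFrames and HiveQL are interchangeable views of the same tables.
spark.sql("""
    SELECT country, SUM(amount) AS total
    FROM sales
    GROUP BY country
    ORDER BY total DESC
""").show()

spark.stop()
```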
Phase 4: NoSQL and Cluster Walkthrough (Day 5)
Getting to know Kafka and Spark Streaming
Introduction to Kafka
- Kafka Overview
- Salient Features of Kafka
- Topics, Brokers and Partitions
- Kafka Use cases
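To make topics, brokers, and partitions concrete, a minimal producer/consumer sketch using the third-party kafka-python package (assumed installed, with a broker at a placeholder address):

```python
from kafka import KafkaProducer, KafkaConsumer

# A topic is a named, partitioned log; keyed messages with the same key
# land in the same partition, preserving their relative order.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"user-42", value=b'{"item": "book"}')
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```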
Kafka Connect and Spark Streaming
- Kafka Connect
- Hands-on Exercise
Structured Streaming
- Structured Streaming Overview
- How is it better than Kafka streaming?
- Hands-on exercises integrating with Kafka using Structured Streaming (see the sketch below)
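A sketch of the Kafka-integration exercise with Structured Streaming. It assumes the spark-sql-kafka connector package is supplied at submit time (the version must match your Spark build); broker and topic names are placeholders:

```python
# Submit with the Kafka connector, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 stream.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# A streaming DataFrame backed by a Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")
          .load())

# Kafka delivers key/value as binary; cast to strings for inspection.
decoded = events.select(
    col("key").cast("string"),
    col("value").cast("string"),
)

# Continuously print incoming micro-batches to the console.
query = (decoded.writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()
```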