Course Outline:
DURATION: 3 Days
Day 1:
- Introduction to Apache Spark.
- Understanding Spark Core components and their uses.
- The Spark shell and the PySpark shell.
- Working with the Spark shell and the PySpark shell.
- Understanding and using RDDs.
- Writing Python code with multiple functions.
- Introduction to HDFS in the big data world.
- HDFS architecture.
- HDFS usage.
Day 2:
- Spark and the Hadoop ecosystem.
- Spark and MapReduce.
- RDD operations.
- Deploying a multi-node Spark cluster.
- Spark standalone cluster and web UI.
- RDD partitions and HDFS data locality.
- DataFrames and their operations.
- Writing PySpark code in Python to implement map vs. flatMap.
- Understanding and using transformations and actions on RDDs.
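As a preview of the map vs. flatMap topic above, here is a minimal sketch in plain Python (no Spark installation assumed). The sample `lines` data is made up for illustration. In PySpark, the corresponding RDD calls would be `rdd.map(...)` and `rdd.flatMap(...)`; this sketch only mimics their semantics on a local list.

```python
from itertools import chain

# Hypothetical sample input: two lines of text.
lines = ["hello spark", "hello hdfs"]

# map: exactly one output element per input element.
# Each line becomes one list of words, so the result is a list of lists.
mapped = list(map(lambda line: line.split(" "), lines))

# flatMap: each input element may yield several outputs,
# and all of them are flattened into a single sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

The difference to notice is the shape of the result: `mapped` keeps one entry per input line, while `flat_mapped` is a single flat list of words.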
Day 3:
- Introduction to Spark SQL and Spark Streaming.
- Loading data with Spark SQL and querying it.
- Real-time data streaming with the Spark Streaming framework.
- Building and running a Spark application.
- Logging in Spark.
- Streaming overview.
- Sliding window operations.
- Live chat count or real-time Twitter count with Spark.
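To give a feel for the sliding window topic above, the following is a rough sketch in plain Python rather than Spark Streaming itself. It assumes the stream arrives as a list of micro-batches of words. The function name `sliding_window_counts`, its parameters, and the sample data are hypothetical and chosen only for illustration. Spark Streaming's windowed operations are likewise configured with a window length and a slide interval.

```python
from collections import Counter

def sliding_window_counts(batches, window_length, slide_interval):
    """Count words over the last `window_length` batches,
    emitting one result every `slide_interval` batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        # The window covers at most `window_length` batches ending at `end`.
        start = max(0, end - window_length)
        window = batches[start:end]
        counts = Counter(word for batch in window for word in batch)
        results.append(dict(counts))
    return results

# Hypothetical micro-batches, e.g. words seen in a live chat.
batches = [["hi", "spark"], ["hi"], ["spark", "spark"], ["bye"]]
window_counts = sliding_window_counts(batches, window_length=3, slide_interval=2)
print(window_counts)
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive results overlap: the second batch contributes to both emitted counts.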