Spark and Scala by Akkem Sreenivasulu
–| It is a Big Data processing framework.
–| Spark applications can be implemented using Java, Scala, Python and R.
–| It can process data from any filesystem.
–| It is an in-memory (RAM) processing framework.
–| It is from “Apache”
Before Spark, we already had Big Data processing frameworks
such as MapReduce, Pig, etc.
MapReduce vs Spark:-
1. MapReduce can be implemented using Java, Python, etc.
Spark can be implemented using Java, Scala, Python and R.
2. In Hadoop, along with MapReduce, we have
Pig — Scripting Framework — It is very simple and easy and very less
Hive — is like SQL
— HQL — Hive Query Language
Flume — It is Streaming Framework — Performs only streaming.
In Spark we have
Spark Core — RDD Programming — Java/Python/Scala
Spark SQL — DataFrames, Tables, Datasets
— DSL — Domain Specific Language or Native SQL Queries
SQL: select * from df;
Spark Streaming — It is Streaming — Performs streaming + Live Analytics
Spark MLlib — It is a Machine Learning Library
Spark GraphX — Graph Data Processing.
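The two Spark SQL query styles listed above (DSL and native SQL) can be sketched as follows. This is only a minimal sketch, assuming Spark runs in local mode on one machine; the object name SqlStylesSketch, the sample rows and the column names name and age are hypothetical and not part of the course notes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlStylesSketch {
  // Hypothetical sample data; the column names are assumptions for illustration.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ravi", 31), ("Anu", 27)).toDF("name", "age")
  }

  // DSL style: the query is built from DataFrame method calls.
  def viaDsl(df: DataFrame): DataFrame =
    df.select("name", "age")

  // Native SQL style: register the DataFrame as a temporary view named "df",
  // then run the SQL text from the notes against that view.
  def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("df")
    spark.sql("select * from df")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlStylesSketch")
      .master("local[*]") // local mode is an assumption made for this sketch
      .getOrCreate()

    val df = sampleDf(spark)
    viaDsl(df).show()
    viaSql(spark, df).show()

    spark.stop()
  }
}
```

Both calls return a DataFrame with the same rows; which style to use is mostly a matter of preference.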
3. MapReduce is tightly coupled with the HDFS filesystem.
Spark can process data from any filesystem.
4. MapReduce uses both disk and memory for processing, and writes intermediate results to disk.
Spark by default processes data in memory.
Spark can be up to 100 times faster than MapReduce for in-memory processing.
Spark can be up to 10 times faster than MapReduce for disk-based processing.
Before Spark, MapReduce was considered the fastest processing framework.
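The in-memory processing described in point 4 can be illustrated with a small sketch. It assumes a SparkSession is already available; the object name CacheSketch and the method sumAndCount are hypothetical. The computed data is kept in memory with persist, so the second action reuses it instead of recomputing it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Returns (sum of squares of 1..n, number of elements); assumes n >= 1.
  def sumAndCount(spark: SparkSession, n: Int): (Long, Long) = {
    val squares = spark.sparkContext
      .parallelize(1 to n)
      .map(x => x.toLong * x)
      .persist(StorageLevel.MEMORY_ONLY) // keep the computed partitions in RAM

    val total = squares.reduce(_ + _) // first action: computes and caches the data
    val count = squares.count()       // second action: served from the cached data

    squares.unpersist()               // release the memory when done
    (total, count)
  }
}
```

For example, CacheSketch.sumAndCount(spark, 4) returns (30, 4).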
5. In Hadoop we have
MapReduce, Pig, Hive, Flume, Sqoop, etc., but we cannot combine
all of these in a single application to meet a requirement.
In Spark we can combine Spark Core, Spark SQL, Spark Streaming,
Spark MLlib and Spark GraphX in a single application.
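As a rough illustration of point 5, the sketch below combines two of the components, Spark Core (RDD) and Spark SQL, in one application. It is a sketch under assumptions: the object name CombinedSketch, the method wordCounts and the view name words are hypothetical, and Spark Streaming, MLlib and GraphX are left out to keep it short.

```scala
import org.apache.spark.sql.SparkSession

object CombinedSketch {
  // Counts words using Spark Core for splitting and Spark SQL for aggregation.
  def wordCounts(spark: SparkSession, lines: Seq[String]): Map[String, Long] = {
    import spark.implicits._

    // Spark Core: build an RDD and split each line into words.
    val words = spark.sparkContext
      .parallelize(lines)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)

    // Spark SQL: turn the RDD into a DataFrame and query it with native SQL.
    words.toDF("word").createOrReplaceTempView("words")
    spark.sql("select word, count(*) as n from words group by word")
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap
  }
}
```

The same SparkSession serves both steps, so no data has to leave the application between the RDD part and the SQL part.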
6. MapReduce can run only on the YARN environment.
YARN is the runtime environment in Hadoop:
ResourceManager — Master — Takes request
NodeManager — Slave — Process Requests
Spark can run on multiple environments:
Spark Standalone Cluster (Apache)
YARN Cluster (Apache)
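The choice of environment is usually expressed through the master URL given to Spark. The sketch below shows one possible way to select it. The object name MasterSketch, the method masterUrl and the host name master-host are hypothetical; the local mode fallback is an assumption added so the sketch can run on a single machine, and 7077 is the default port of a standalone master.

```scala
import org.apache.spark.sql.SparkSession

object MasterSketch {
  // Maps an environment name to a Spark master URL.
  // "master-host" is a hypothetical host name.
  def masterUrl(env: String): String = env match {
    case "standalone" => "spark://master-host:7077" // Spark Standalone Cluster
    case "yarn"       => "yarn"                     // YARN Cluster (needs Hadoop configuration)
    case _            => "local[*]"                 // local mode, assumed as a fallback
  }

  def main(args: Array[String]): Unit = {
    val env = args.headOption.getOrElse("local")
    val spark = SparkSession.builder()
      .appName("MasterSketch")
      .master(masterUrl(env))
      .getOrCreate()

    println(spark.sparkContext.master) // prints the master URL in use
    spark.stop()
  }
}
```

In practice the master is often passed on the command line with spark-submit --master instead of being set in code.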
Topics we are going to discuss as part of this course:
1. Spark Foundation
3. Spark Core
4. Spark SQL
5. Spark Streaming
6. Spark Integrations
Different filesystems and file formats such as HDFS, CSV, JSON, XML, etc.
NoSQL — Cassandra and HBase
Duration: minimum 35 days, maximum 40 days (35 to 40 hours)
Spark Machine Learning Developer:
Duration: 35 to 40 hours