What is Apache Spark?
2 min read
In simple terms, Apache spark is an in-memory unified parallel compute engine.
Most of the operations in apache spark happen in memory and there is very less disk IO operation giving rise to faster data transformation and computation, unlike Hadoop Mapreduce where Disk I/O was happening after every transformation resulting in bad performance because of slow-running MapReduce tasks.
In Hadoop MapReduce, if we need SQL-like queries to run then we need to learn HIVE, for data cleaning - PIG, for NoSQL - HBASE, Mahout for ML-related requirements, STORM for stream data processing, and OOZIE for scheduling jobs.
But In Apache Spark, you only need to learn spark and everything else is available in spark itself hence it is a unified tool for solving most of the real-world Big Data problems.
Parallel compute Engine,
Spark is a distributed computing engine meaning it can be used for processing large amounts of data parallelly on multiple machines. It only needs two components 1. Distributed storage such as - HDFS, S3, Cassandra, HBase, etc 2. a resource manager like YARN, Mesos, Kubernetes, etc., and based on the requirement spark will do the processing on multiple machines parallelly where data is present.
Basic properties of spark
The basic unit of storage in apache spark is RDD(resilient distributed datasets).
you can think of the distributed in-memory collection. while loading data into an rdd each block of data of a file is brought into memory and becomes part of the rdd distributed over multiple machines.
Spark transformations are lazy.
spark transformations such as map, flatMap, mapValues, reduceByKey, groupByKey, etc. are not executed unless we hit an action statement like collect countByValues, etc while running a spark program.
While encountering a transformation an entry to a Lineage graph (think it like an execution plan) is created and on encountering an action it gets executed. This is to maintain the optimizations that can be done in an execution plan before executing.
RDDs are immutable.
rdd are immutable to maintain the resiliency of rdd so that whenever there is a failure in an rdd we can recreate it using its parent rdd and the transformation used to create the failed rdd by looking into the execution plan.