There are mainly 3 components in the Spark UI. Jobs: a Spark application can have multiple jobs based on the number of actions (#jobs = #actions) in the...
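As a rough illustration of the #jobs = #actions rule, here is a minimal sketch (the app name, `local[*]` master, and sample data are placeholders of mine, not from the post): running it and opening the Spark UI at http://localhost:4040 should show one job per action.

```scala
import org.apache.spark.sql.SparkSession

object JobsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jobs-equal-actions")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000)

    // Transformations are lazy; no job is triggered here.
    val doubled = rdd.map(_ * 2)

    // Each action triggers one job, so the Spark UI shows
    // two jobs for this run.
    println(doubled.count()) // action #1 -> job #1
    println(doubled.first()) // action #2 -> job #2

    spark.stop()
  }
}
```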
RDDs (Resilient Distributed Datasets) are the basic unit of storage in Spark. You can think of an RDD as a collection distributed over multiple...
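A small sketch of that mental model, assuming a local master and made-up data: a plain Scala collection becomes an RDD split into partitions that could each live on a different machine.

```scala
import org.apache.spark.sql.SparkSession

object RddDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-as-distributed-collection")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A local Scala collection...
    val names = Seq("alice", "bob", "carol", "dave")

    // ...becomes an RDD: the same collection, split into
    // partitions that can be processed on different machines.
    val namesRdd = sc.parallelize(names, numSlices = 2)

    println(namesRdd.getNumPartitions) // 2
    namesRdd.map(_.toUpperCase).collect().foreach(println)

    spark.stop()
  }
}
```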
When we talk about Spark on top of Hadoop, it's generally Hadoop core with the Spark compute engine instead of MapReduce, i.e. (HDFS, Spark, YARN). Spark...
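To make the swap concrete, here is a sketch of the classic MapReduce word count expressed in Spark, reading from and writing to HDFS (the `namenode` host and paths are hypothetical; in this setup the app would typically be launched with `spark-submit --master yarn`, letting YARN schedule the executors):

```scala
import org.apache.spark.sql.SparkSession

object HdfsOnYarnDemo {
  def main(args: Array[String]): Unit = {
    // HDFS stays the storage layer, YARN the resource manager;
    // Spark simply replaces MapReduce as the compute engine.
    val spark = SparkSession.builder()
      .appName("spark-instead-of-mapreduce")
      .getOrCreate()
    import spark.implicits._

    // Placeholder HDFS paths.
    val lines = spark.read.textFile("hdfs://namenode:8020/logs/input")

    // Word count: the canonical MapReduce job, in Spark.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.write.csv("hdfs://namenode:8020/logs/output")
    spark.stop()
  }
}
```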
Sometimes in a Spark application, we need to share small data across all the machines for processing. For example, if you want to filter some set of...
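This is what broadcast variables are for. A minimal sketch of the filtering example (the blocklist and event names are invented for illustration): the small set is shipped to each executor once, rather than with every task.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-filter")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Small lookup set we want on every executor exactly once.
    val blockedUsers = Set("spammer1", "spammer2")
    val blocked = sc.broadcast(blockedUsers)

    val events = sc.parallelize(Seq("alice", "spammer1", "bob", "spammer2"))

    // Each task reads the broadcast value locally.
    val clean = events.filter(user => !blocked.value.contains(user))
    clean.collect().foreach(println) // alice, bob

    spark.stop()
  }
}
```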
In simple terms, Apache Spark is an in-memory unified parallel compute engine. In-memory: most of the operations in Apache Spark happen in memory and...
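One way to see the in-memory part in action, sketched with placeholder data: persist a computed RDD in RAM, and subsequent actions reuse the cached copy instead of recomputing the lineage from scratch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("in-memory-compute")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val squares = sc.parallelize(1 to 1000000)
      .map(n => n.toLong * n)
      .persist(StorageLevel.MEMORY_ONLY) // keep results in RAM

    // The first action computes and caches; the second reuses
    // the in-memory copy instead of recomputing.
    println(squares.count())
    println(squares.sum())

    spark.stop()
  }
}
```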
We cannot use an analytical storage system for transactional requirements and vice versa. But have you ever wondered why that is? Transactional vs...
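A toy sketch of the underlying layout difference (the `Order` record and sample values are invented): row-oriented storage keeps each record's fields together, which suits fetching or updating one record, while column-oriented storage keeps each column together, which suits scanning one column across all records.

```scala
object RowVsColumn {
  // Row-oriented: each record is contiguous, so a transactional
  // lookup or update touches one record in one place.
  case class Order(id: Int, customer: String, amount: Double)
  val rowStore: Array[Order] = Array(
    Order(1, "alice", 10.0),
    Order(2, "bob", 25.0),
    Order(3, "carol", 40.0)
  )

  // Column-oriented: each column is contiguous, so an aggregate
  // like SUM(amount) scans one tight array and skips the rest.
  val ids: Array[Int] = Array(1, 2, 3)
  val customers: Array[String] = Array("alice", "bob", "carol")
  val amounts: Array[Double] = Array(10.0, 25.0, 40.0)

  def main(args: Array[String]): Unit = {
    // Transactional access: one full record by key.
    println(rowStore.find(_.id == 2))
    // Analytical access: one column across all records.
    println(amounts.sum)
  }
}
```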