Shared variables in Spark

Sometimes in a Spark application we need to share a small amount of data with all the machines doing the processing. For example, you may want to filter a small set of words out of a large dataset residing in a data lake, or simply count how many blank lines are present in the whole dataset. In such cases, shared variables in Spark can be used with ease.

There are mainly two types of shared variables in Spark:

1. Broadcast variable

A broadcast variable is shared with all the machines where executors are running, so that each executor can read it at run time.
You can think of a broadcast variable as the small table in a Hive map-side join.

Usage -

from pyspark import SparkContext 

sc = SparkContext("local[*]","application_name")
sc.setLogLevel("ERROR")

input_rdd = sc.textFile("File Path")  # lazily creates an RDD from the file; nothing is read yet
filter_words_set = sc.broadcast({"apple", "banana", "cherry"})  # ship the small set to every executor once

required_column = input_rdd.map(lambda x: x.split(",")[0])  # keep only the first comma-separated column
filtered_data = required_column.filter(lambda x: x not in filter_words_set.value)  # read the broadcast value on each executor

filtered_data.saveAsTextFile("save directory path")
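
If you want to try the same pattern without a file in a data lake, here is a minimal, self-contained sketch of it run as its own script. The sample rows, the set of fruits, and the variable names are made up purely for illustration:

from pyspark import SparkContext

sc = SparkContext("local[*]", "broadcast_demo")
sc.setLogLevel("ERROR")

# made-up rows: first column is a fruit name, second is a quantity
rows = sc.parallelize(["apple,10", "mango,5", "banana,7", "grape,3"])

stop_fruits = sc.broadcast({"apple", "banana", "cherry"})  # small set shared with every executor

kept = rows.map(lambda x: x.split(",")[0]) \
           .filter(lambda x: x not in stop_fruits.value)

print(kept.collect())  # ['mango', 'grape']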

2. Accumulator
Spark follows a master-slave architecture: the master is called the driver and the slaves are the executors. An accumulator is like a counter variable that lives on the driver machine; every executor can add to it, but the executors cannot read its value. Only the driver can read the accumulated result.
Usage -

from pyspark import SparkContext

sc = SparkContext("local[*]","accumulator")
sc.setLogLevel("ERROR")

input_rdd = sc.textFile("file path")

my_accumulator = sc.accumulator(0)  # counter that lives on the driver, starting at 0

# each executor adds 1 for every blank line it encounters
input_rdd.foreach(lambda x: my_accumulator.add(1) if x == "" else None)

print(my_accumulator.value)  # only the driver can read the accumulated value
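
One caveat worth knowing: accumulator updates made inside a transformation such as map are lazy and can be re-applied if a task is retried, so Spark only guarantees that each update is applied exactly once when it happens inside an action like foreach (as in the snippet above). The sketch below, with made-up data and names, just illustrates the laziness part:

from pyspark import SparkContext

sc = SparkContext("local[*]", "accumulator_laziness")
sc.setLogLevel("ERROR")

lines = sc.parallelize(["a", "", "b", "", ""])
blank_count = sc.accumulator(0)

def tag(x):
    if x == "":
        blank_count.add(1)    # update made inside a transformation
    return x

tagged = lines.map(tag)       # map is lazy, so nothing has run yet
print(blank_count.value)      # 0

tagged.count()                # an action finally triggers the map
print(blank_count.value)      # 3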