Optimizing the data reads in Spark: Part 1 — PySpark 2.4+

Akshay Raju
3 min read · Jan 15, 2021

Number of jobs created while reading a CSV file?

Have you ever wondered about, or paid closer attention to, the number of jobs that get created while reading data in the Spark UI?

Do these jobs really matter?

In this post we'll see the significance of these jobs and ways to optimize them.

What are Spark jobs and why are they important?

The code submitted to the Spark driver is internally converted into multiple jobs, wherein each job is broken into a sequence of stages (forming a DAG) and each stage comprises multiple tasks.

Job → Stages → Tasks

A Spark job is created whenever an action is triggered, such as count(), show(), collect(), etc. Therefore, the overall runtime of the application depends on when all of these jobs complete.

As you can see, each job has a certain number of stages and tasks to complete. Hence, limiting the number of jobs will obviously reduce the total runtime of the application.
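As a quick illustration (a minimal sketch; it reuses the sample file path from the examples below), transformations are lazy and don't create jobs on their own, while each action does:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("job-count-demo").getOrCreate()

    df = spark.read.format("csv").options(header=True).load("file:///data/testfiles/cars.csv")

    # Transformations are lazy -- no new job appears in the Spark UI for these:
    makers = df.select("Maker").distinct()

    # Actions trigger jobs -- each call below shows up as a separate job in the Spark UI:
    makers.show()
    makers.count()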

Reading a sample CSV file:

Consider reading a sample Cars dataset present at a local path, with the following columns:

  • Maker : StringType
  • Price : IntegerType

Case 1: Reading the file with 'inferSchema' and 'header' set to True

  • df = spark.read.format('csv').options(inferSchema=True, header=True).load("file:///data/testfiles/cars.csv")
(Spark UI screenshot)

Number of jobs created is 2 → one job for accessing the file and reading the first line as the header, and another job for inferring the schema of the input file.

Case 2: Reading the file with 'header' set to True

  • df = spark.read.format('csv').options(header=True).load("file:///data/testfiles/cars.csv")
(Spark UI screenshot)

Number of jobs created is 1 → one job for accessing the file and reading the first line as the header. Remember that the schema will be StringType for both columns (Maker & Price).
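You can verify this from the DataFrame itself; a small sketch (column names taken from the dataset description above):

    df = spark.read.format("csv").options(header=True).load("file:///data/testfiles/cars.csv")
    df.printSchema()
    # Without inferSchema or an explicit schema, every column defaults to string:
    # root
    #  |-- Maker: string (nullable = true)
    #  |-- Price: string (nullable = true)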

Can we limit the number of jobs ?

Certainly, and the trick is to provide a 'schema' while reading the CSV file as a DataFrame.

The schema defines the header columns, their datatypes, and optionally whether null values are allowed in the input file.
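For example, here is a minimal sketch of what such a schema (the 'myschema' used below) could look like for the Cars dataset, assuming the two columns listed earlier:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Schema for the Cars dataset: Maker (string) and Price (integer), both nullable
    myschema = StructType([
        StructField("Maker", StringType(), nullable=True),
        StructField("Price", IntegerType(), nullable=True),
    ])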

Case 3: Reading the file with a 'schema'

  • df = spark.read.format('csv').schema(myschema).load("file:///data/testfiles/cars.csv")
  • Number of jobs created is 0 → since Spark already has the metadata of the input file, it doesn't create any jobs until an action is triggered (see the sketch below).
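Putting it together (a sketch using the 'myschema' defined above): the read itself produces no job, and the first job appears only when an action runs:

    df = spark.read.format("csv").schema(myschema).load("file:///data/testfiles/cars.csv")
    # No job so far -- Spark already knows the schema, so nothing has to be read yet.

    df.show(5)  # the first job appears in the Spark UI only when this action runs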

I hope you now understand how to optimize the number of jobs in Spark while reading input data from a source. Feel free to drop your comments, thoughts, or queries below.


Akshay Raju

Data Engineer | Open Source Contributor | Big Data Developer | AI and ML Enthusiast