Introducing pyspark_xray: a diagnostic tool that enables local debugging of PySpark applications using VSCode or PyCharm

Overview

pyspark_xray is a diagnostic tool, in the form of Python library, for pyspark developers to debug and troubleshoot PySpark applications locally, specifically it enables local debugging of PySpark RDD or DataFrame transformation functions that runs on slave nodes.

The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base of a pyspark application. For the part of debugging Spark application code locally, pyspark_xray specifically provides capability of locally debugging Spark application code that runs on slave nodes, the missing of this capability is an unfilled gap for Spark application developers right now.

Problem

For developers, it’s very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.

If you develop PySpark applications, you know that PySpark application code is made up of two categories:

  • code that runs on master node
  • code that runs on worker/slave nodes

While code on master node can be accessed by a debugger locally, code on slave nodes is like a blackbox and not accessible locally by debugger.

Plenty tutorials on web have covered steps of debugging PySpark code that runs on master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found, most people refer to this part of code either as a blackbox or no need to do debugging.

Spark code that runs on slave nodes includes but is not limited to: lambda functions that are passed as input parameter to RDD transformation functions.

Solution

pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on master node, but also code that runs on slave nodes, using PyCharm and other popular IDE such as VSCode.

This library achieves these capabilties by using the following techniques:

  • wrapper functions of Spark code on slave nodes, check out the section below to learn more details
  • practice of sampling input data under local debugging mode in order to fit the application into memory of your standalone local PC/Mac
  • For exmple, say your production input data size has 1 million rows, which obviously cannot fit into one standalone PC/Mac’s memory, in order to use pyspark_xray, you may take 100 sample rows as input data as input to debug your application locally using pyspark_xray
  • usage of a flag to auto-detect local mode, CONST_BOOL_LOCAL_MODE from pyspark_xray’s const.py auto-detects whether local mode is on or off based on current OS, with values:
  • True: if current OS is Mac or Windows
  • False: otherwise

in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base.

Wrapper Functions For Code on Slave Nodes

pyspark_xray introduces these wrapper functions (defined in utils_m.py) of Spark native functions to enable local debugging of code that runs on slave nodes:

wrapper_rdd_map for RDD.map

RDD.map function returns a new RDD by applying a function to each element of the RDD.

wrapper_rdd_map wraps RDD.map function so that it runs the original function when CONST_BOOL_LOCAL_MODE is False, and runs locally debuggable re-written version when CONST_BOOL_LOCAL_MODE is True.

Check out demo_app for a demo of this wrapper, refer to demo apps section to learn how to setup.

wrapper_rdd_mapvalues for RDD.mapValues

RDD.mapValues function passes each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning.

wrapper_rdd_mapvalues wraps RDD.mapValues function so that it runs the original function when CONST_BOOL_LOCAL_MODE is False, and runs locally debuggable re-written version when CONST_BOOL_LOCAL_MODE is True.

Like RDD.mapValues, wrapper_rdd_mapvalues wrapper is used on key value pair type of RDDs.

Check out demo_app for a demo of this wrapper, refer to demo apps section to learn how to setup.

wrapper_sdf_mapinpandas for DataFrame.mapInPandas

DataFrame.mapInPandas function maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.

wrapper_sdf_mapinpandas wraps Spark DataFrame.mapInPandas function so that it runs the original function when CONST_BOOL_LOCAL_MODE is False, and runs locally debuggable re-written version when CONST_BOOL_LOCAL_MODE is True.

Check out demo_app02 for a demo of this wrapper, refer to demo apps section to learn how to setup.

Demo Apps

  • demo_app demonstrates a sample use case of wrapper_rdd_map and wrapper_rdd_mapvalues
  • demo_app02 demonstrates a sample use case of wrapper_sdf_mapinpandas

The demo_app folder contains a simple PySpark application that takes data of 11 loans into a Spark RDD, then calculates interest amount for each loan by passing calc_mthly_payment and calc_interest lambda functions respectively to 2 types of RDD mapValues transformation functions to calculate monthly payment amount and total interest amount for each loan respectively, see input and output below

Code Structure

  • const.py, defines a couple of variables shared by multiple modules
  • driver.py, main entry point of Spark application, creates SparkConf and SparkSession objects and triggers Spark application
  • main.py, the backbone of Spark application, runs on master node, read data into RDD, then perform RDD transformations, then print output
  • utils_m.py, utility functions that run on master node, mainly for data ingestion purpose
  • utils_s.py, lambda functions that are passed as parameters to RDD transformation functions, run on slave nodes

Dependencies

as of Febuary 2021

  • pyspark_xray (this package)
  • spark: v3.0.1
  • pyspark: v3.0.1
  • java: v1.8.0
  • PyCharm: Community v2020.3

Preparation

  • Open command line, kick off java command, if you get an error, then download and install java (version 1.8.0_221 as of April 2020)
  • If you don’t have it, download and install PyCharm Community edition (version 2020.1 as of April 2020)
  • If you don’t have it, download and install Anaconda Python 3.7 runtime
  • Download and install spark latest Pre-built for Apache Hadoop (spark-2.4.5-bin-hadoop2.7 as of April 2020, 200+MB size) locally
  • Windows:
  • if you don’t have unzip tool, please download and install 7zip, a free tool to zip/unzip files
  • extract contents of spark tgz file to c:\spark-x.x.x-bin-hadoopx.x folder
  • follow the steps in this tutorial
  • install winutils.exe into c:\spark-x.x.x-bin-hadoopx.x\bin folder, without this executable, you will run into error when writing engine output
  • Mac:
  • extract contents of spark tgz file to \Users\[USERNAME]\spark-x.x.x-bin-hadoopx.x folder
  • install pyspark by pip install pyspark or conda install pyspark, make sure version of pyspark match the version of spark you use

Run Configuration

You run Spark application on a cluster from command line by issuing spark-submit command which submit a Spark job to the cluster. But from PyCharm or other IDE on a local laptop or PC, spark-submit cannot be used to kick off a Spark job. Instead, follow these steps to set up a Run Configuration of pyspark_xray’s demo_app on PyCharm

  • Set Environment Variables:
  • set HADOOP_HOME value to C:\spark-x.x.x-bin-hadoop2.7
  • set SPARK_HOME value to C:\spark-x.x.x-bin-hadoop2.7
  • use Github Desktop or other git tools to clone pyspark_xray from Github
  • PyCharm > Open pyspark_xray as project
  • Open PyCharm > Run > Edit Configurations > Defaults > Python and enter the following values:
  • Environment variables (Windows): PYTHONUNBUFFERED=1;PYSPARK_PYTHON=python;PYTHONPATH=$SPARK_HOME/python;PYSPARK_SUBMIT_ARGS=pyspark-shell;
  • Open PyCharm > Run > Edit Configurations, create a new Python configuration, point the script to the path of driver.py of pyspark_xray > demo_app (see screen shot below)

Debug Locally

In main.py, after loan data is ingested into RDD, two types of RDD transformation functions are called one after the other to demonstrate difference of debugging capability between pyspark_xray’s RDD transformation wrappers vs native RDD transformation functions.

At first, native RDD mapValues transformation function is called with calc_mthly_interest as lambda function parameter

Then pyspark_xray’s wrapper function of RDD mapValues transformation function, i.e. wrapper_mapvalues function, is called with calc_interest as lambda function parameter

Correspondingly, break points are set within calc_mthly_payment and calc_interest lambda functions respectively in utils_s.py. NOTE: these are break points that were not stoppable before adopting pyspark_xray.

Now start debugging demo_app and the break point set in calc_mthly_payment function will be skipped, but break point in calc_interest function will be stopped, see below. The reason is because calc_interest lamdba function was passed to pyspark_xray’s wrapper function of RDD mapValues transformation, while calc_mthly_payment function was passed to original RDD mapValues transformation.

References

PySpark Resources:

Master Data Engineer at Capital One