The dangers of untrusted Spark SQL input in a shared environment

How to deal with untrusted Spark SQL

10 Mar 2021 • 2 min read

Apache Spark is a powerful analytics engine with a rich library of built-in SQL functions. Because some of those functions reach well beyond querying data, it is usually a bad idea to let users run arbitrary SQL against a shared Spark cluster.

On the surface, Spark SQL may look similar to other SQL dialects such as those of MySQL or PostgreSQL, but it is quite different and is capable of doing a lot of damage in the hands of an attacker. In the worst-case scenario, a malicious user could execute arbitrary Java code and shell commands (remote code execution).

Protecting your system against these exploits is different from how it is usually done for other SQL dialects. Any Spark function that accepts Spark SQL as an argument is a potential target.

Remote code execution is the holy grail of exploits, and with Spark's default settings it can be as easy as adding a jar containing malicious code to the classpath by calling ADD JAR '/home/someuser/some-malicious-code.jar';. The ADD JAR command also accepts a URL as its argument, so an attacker could pull in any public jar from the internet.

For example, to add commons-io:

ADD JAR 'https://repo1.maven.org/maven2/commons-io/commons-io/2.8.0/commons-io-2.8.0.jar';

Spark SQL also provides a way to add arbitrary files to the environment via the ADD FILE command. Such a file could be an executable shell script.

ADD FILE '<some file from the internet>'

Spark SQL lets users call Java methods via reflection, so with the malicious jar or shell script on the classpath, an attacker could wreak havoc by executing it.

For example, either of the following statements returns the value of the PATH environment variable.

SELECT java_method('java.lang.System', 'getenv', 'PATH');

SELECT reflect('java.lang.System', 'getenv', 'PATH');

The LIST JAR and LIST FILE commands would give an attacker a list of the jars and files that have been added to the session.

So which specific functions do we need to be careful with? Basically, any function that accepts Spark SQL as an argument, for example org.apache.spark.sql.SparkSession.sql() or org.apache.spark.sql.functions.expr(). If you are passing user input to these functions on a shared cluster, you are taking a risk.
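
To make this concrete, here is a rough sketch of a screen that a caller could run on user text before handing it to sql() or expr(). It only looks for the statements and functions named in this article. The class name SparkSqlDenyList and its methods are hypothetical and not part of Spark, and the pattern list is illustrative and almost certainly incomplete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical deny-list screen for user-supplied Spark SQL; not a complete defence. */
final class SparkSqlDenyList {

    // Constructs named in this article. Illustrative only: anything not listed passes.
    private static final List<Pattern> RISKY = Arrays.asList(
        Pattern.compile("\\bADD\\s+(JAR|FILE)S?\\b", Pattern.CASE_INSENSITIVE),
        Pattern.compile("\\bLIST\\s+(JAR|FILE)S?\\b", Pattern.CASE_INSENSITIVE),
        Pattern.compile("\\b(java_method|reflect)\\s*\\(", Pattern.CASE_INSENSITIVE)
    );

    /** Returns the regex source of every deny-list entry found in the input. */
    static List<String> findRiskyConstructs(String sql) {
        List<String> hits = new ArrayList<>();
        for (Pattern p : RISKY) {
            if (p.matcher(sql).find()) {
                hits.add(p.pattern());
            }
        }
        return hits;
    }

    /** True only when none of the listed constructs appear in the input. */
    static boolean looksSafe(String sql) {
        return findRiskyConstructs(sql).isEmpty();
    }
}
```

A caller would reject the input whenever looksSafe returns false. Passing this check says nothing about constructs the list does not know about, so it should be treated as a tripwire rather than a security boundary.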

Another consideration is the Hadoop installation where Spark keeps temporary files. An attacker could execute the following and load every CSV file in another user's directory.

SELECT * FROM csv.`/user/notmyuser/*.csv`

Therefore, it is essential that each user's Spark jobs run as a separate Hadoop user.

In summary, executing Spark SQL from user input on a shared Spark cluster is risky. What can be done to make it safe(r)?

Sanitizing user input and trying to catch malicious statements would be a quick but less than ideal solution. There is always a risk that sanitization fails, or that a new release of Spark adds an exploitable function that the sanitizing code does not check for.

If physically separating Spark clusters is not an option, then virtualization, where each user has their own set of virtual instances with their own Spark installation, would be a solution, but it is hard to implement.

It is imperative that Spark jobs are separated by Hadoop user. This can be done by setting HADOOP_USER_NAME when a Spark job is launched, either as an environment variable or as a Java system property on the driver. For example, launching a Spark process via SparkLauncher:

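import org.apache.spark.launcher.SparkLauncher;
SparkLauncher launcher = new SparkLauncher();
// Hadoop falls back to the HADOOP_USER_NAME system property when the environment variable is unset
launcher.setConf("spark.driver.extraJavaOptions", "-DHADOOP_USER_NAME=<hadoop user name>");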

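Launching jobs this way means that Hadoop applies its file permission checks per user, so, provided directory permissions are set accordingly, users cannot load files from outside their own Hadoop directories.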
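
One detail worth guarding when building that option: the value of spark.driver.extraJavaOptions is a whitespace-separated list of JVM options, so a user name containing a space could smuggle in extra options. The sketch below validates the name before building the string. The class HadoopUserOption and its naming policy (a lowercase letter or underscore followed by up to 31 lowercase letters, digits, underscores or hyphens) are assumptions made for illustration, so adjust the pattern to whatever your cluster actually allows.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that builds the JVM option passed to SparkLauncher above. */
final class HadoopUserOption {

    // Assumed naming policy; adapt it to the user names your cluster really uses.
    private static final Pattern VALID_USER = Pattern.compile("[a-z_][a-z0-9_-]{0,31}");

    /** Returns "-DHADOOP_USER_NAME=<user>", rejecting names that could inject extra JVM options. */
    static String forUser(String hadoopUser) {
        if (hadoopUser == null || !VALID_USER.matcher(hadoopUser).matches()) {
            throw new IllegalArgumentException("Invalid Hadoop user name");
        }
        return "-DHADOOP_USER_NAME=" + hadoopUser;
    }
}
```

The result of HadoopUserOption.forUser(name) would then be passed as the second argument of launcher.setConf("spark.driver.extraJavaOptions", ...).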

All of the above is applicable only to shared standalone Spark installations with Hadoop. Running Spark with Hive and other technologies introduces additional security risks which I will cover in another article.