Spark on Kubernetes the Simple Way! :)

Run your PySpark job on k8s!

Josue Luzardo Gebrim
9 min read · Mar 15, 2023

This text summarizes what Spark on Kubernetes is, what it is for, and how to use it, with some tips along the way.

Apache Spark?

Apache Spark is a distributed processing engine that lets you process large datasets across a cluster of computers. It was started at the University of California, Berkeley in 2009, released as open source in 2010, and donated to the Apache Software Foundation in 2013.

Some of its main features include:

  • Support for different data sources such as HDFS, HBase, Cassandra, S3, among others.
  • Ability to process data both in batches and as streams, in near real time.
  • Support for different programming languages such as Java, Scala, Python, and R.
  • Ability to process data in memory, which can significantly speed up the processing of large volumes of data.
  • Extensible API, allowing the addition of new modules and functionalities.
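To make a couple of these features concrete, here is a minimal PySpark sketch of in-memory processing from Python. It is only an illustration under assumptions: the function names `session_options` and `category_totals`, the sample column names, and the `local[2]` master are hypothetical choices made for this example, not something prescribed by Spark or by this article. It also assumes PySpark and a Java runtime are installed wherever `category_totals` is called.

```python
from typing import Dict, Iterable, Tuple


def session_options(master: str = "local[2]",
                    app_name: str = "spark-features-sketch") -> Dict[str, str]:
    """Return Spark settings as plain key/value pairs.

    `spark.master` and `spark.app.name` are standard Spark properties;
    the default values are assumptions made for this sketch.
    """
    return {"spark.master": master, "spark.app.name": app_name}


def category_totals(rows: Iterable[Tuple[str, float]],
                    options: Dict[str, str]) -> Dict[str, float]:
    """Sum `amount` per `category`, marking the DataFrame for in-memory caching."""
    # Imported here so session_options() stays usable without PySpark installed.
    from pyspark.sql import SparkSession, functions as F

    builder = SparkSession.builder
    for key, value in options.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        # Hypothetical two-column dataset built from in-process Python tuples.
        df = spark.createDataFrame(list(rows), schema=["category", "amount"])
        df.cache()  # ask Spark to keep the data in memory once it is first computed
        result = (
            df.groupBy("category")
              .agg(F.sum("amount").alias("total"))
              .collect()
        )
        return {row["category"]: row["total"] for row in result}
    finally:
        spark.stop()
```

A call such as `category_totals([("books", 12.5), ("games", 30.0), ("books", 7.5)], session_options())` would run the aggregation on a local two-thread master. Keeping the settings in a plain dictionary is just a design choice for this sketch, so that the master can later be swapped for a cluster without touching the processing code.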

Spark owes much of its popularity in the data space to its advantages over other large-scale data processing tools. Some of the reasons include:

  • Speed: Spark was designed to be fast, especially for large-scale data…

Josue Luzardo Gebrim

As an engineer working with platforms, ecosystems, and data solutions, I share my opinions and what little I know here from time to time.