Photo by William Chiesurin on Unsplash

Dask: From Scratch to Scalable Analytics in Python! :)

A Set of Practical, Powerful, and Sexy Libraries for Working with Machine and Deep Learning!

Josue Luzardo Gebrim
8 min read · Jan 16, 2022


Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.

Dask is a set of flexible libraries for parallel computing in Python consisting of two parts:

  • Dynamic Task Scheduling: It’s like Airflow, Luigi, Celery, or Make but optimized for interactive computing workloads.
  • Custom types for “Big Data”: parallel arrays, dataframes, and lists that extend standard interfaces such as NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
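To make the relationship between the two parts concrete, here is a minimal sketch in plain Python. It is a toy and not Dask's implementation: the helpers `run_graph`, `chunked_sum_graph`, and `add_all` are hypothetical names invented for this illustration. The graph layout (a dict mapping a key to a tuple of a function and its arguments) is loosely modelled on how Dask describes task graphs, and the "collection" is simply a function that splits data into chunks and emits such a graph for a scheduler to run.

```python
from concurrent.futures import ThreadPoolExecutor


def dependencies(task, graph):
    """Return the graph keys that a task tuple refers to."""
    _, *args = task
    return [a for a in args if isinstance(a, str) and a in graph]


def run_graph(graph, target, max_workers=4):
    """Toy scheduler: run tasks in parallel waves until `target` is computed."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in results:
            # A task is ready once every task it depends on has a result.
            ready = [
                key
                for key, task in graph.items()
                if key not in results
                and all(dep in results for dep in dependencies(task, graph))
            ]
            if not ready:
                raise ValueError("cycle or missing dependency in task graph")
            futures = {}
            for key in ready:
                func, *args = graph[key]
                # Swap graph keys for the results computed in earlier waves.
                resolved = [
                    results[a] if isinstance(a, str) and a in graph else a
                    for a in args
                ]
                futures[key] = pool.submit(func, *resolved)
            for key, future in futures.items():
                results[key] = future.result()
    return results[target]


def add_all(*parts):
    """Combine the partial sums into one total."""
    return sum(parts)


def chunked_sum_graph(data, chunk_size):
    """Toy 'collection': describe a sum over `data` as one task per chunk."""
    graph = {}
    partial_keys = []
    for start in range(0, len(data), chunk_size):
        key = f"sum-{start // chunk_size}"
        graph[key] = (sum, data[start:start + chunk_size])
        partial_keys.append(key)
    graph["total"] = (add_all, *partial_keys)
    return graph


data = list(range(1, 101))
graph = chunked_sum_graph(data, chunk_size=25)
total = run_graph(graph, "total")
print(total)  # 5050
```

The four chunk sums do not depend on each other, so the scheduler can run them at the same time; only the final `total` task has to wait. Dask's real schedulers are far more sophisticated, but the split between "a collection that builds a graph" and "a scheduler that executes it" is the same idea.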

Beyond these two parts, Dask integrates closely with data science frameworks and other libraries and offers customized interfaces that make it easier to use. It is also an open-source project with a large maintainer community and a vast ecosystem of integrations and “daughter” libraries.

Dask is constantly evolving and emphasizes the following virtues:

  • Familiar: provides parallelized NumPy array and Pandas DataFrame objects
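As a small illustration of the "familiar" point, the sketch below computes the same mean with NumPy and with `dask.array`. It assumes `numpy` and `dask` are installed; the array size and the chunk size are arbitrary values chosen for the example, not recommendations.

```python
import numpy as np
import dask.array as da

# Plain NumPy: the whole array lives in memory and the mean is computed eagerly.
values = np.arange(1_000_000, dtype="float64")
np_mean = values.mean()

# Dask: the same data split into four chunks of 250,000 elements each.
chunked = da.from_array(values, chunks=250_000)

# Same method name as NumPy, but this only builds a task graph.
lazy_mean = chunked.mean()

# compute() hands the graph to a scheduler and returns a concrete result.
dask_mean = lazy_mean.compute()

print(np_mean, dask_mean)
```

The method call is identical in both cases; the visible differences are the chunking step and the explicit `compute()`. `dask.dataframe` follows the same pattern for Pandas-style code.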


Written by Josue Luzardo Gebrim

As a platform engineer working with ecosystems and data solutions, I share my opinions and what little I know here from time to time.