Dask: From Scratch to Scalable Analytics in Python! :)
A Set of Practical, Powerful, and Sexy Libraries for Working with Machine and Deep Learning!
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.
Dask is a set of flexible libraries for parallel computing in Python, consisting of two parts:
- Dynamic Task Scheduling: It’s like Airflow, Luigi, Celery, or Make, but optimized for interactive computing workloads.
- Custom types for “Big Data”: collections such as parallel arrays, dataframes, and lists that extend standard interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
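To make the relationship between the two parts concrete, here is a minimal sketch in plain Python (standard library only, no Dask required). It is a toy illustration, not Dask's actual scheduler: the graph is a dictionary of tasks, loosely modeled on the dictionary-of-tasks convention Dask documents for its graphs, and the helper names `is_task`, `run_graph`, and `total` are hypothetical. A "collection" (a list split into chunks) is turned into small tasks, and a scheduler runs each task as soon as its inputs are ready.

```python
from concurrent.futures import ThreadPoolExecutor


def is_task(value):
    """A task is a tuple whose first element is callable: (func, arg1, ...)."""
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])


def run_graph(graph, target, max_workers=4):
    """Toy scheduler: run every task once all of its inputs are available."""
    # Plain values (non-tasks) count as already computed.
    done = {k: v for k, v in graph.items() if not is_task(v)}
    pending = {k: v for k, v in graph.items() if is_task(v)}

    def resolve(arg):
        # Replace a graph key with its computed result; pass literals through.
        return done[arg] if isinstance(arg, str) and arg in graph else arg

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in done:
            # A task is ready when none of its arguments is still pending.
            ready = [
                key for key, task in pending.items()
                if not any(isinstance(a, str) and a in pending for a in task[1:])
            ]
            if not ready:
                raise ValueError("cycle or missing dependency in graph")
            # Independent tasks in the same wave run in parallel threads.
            futures = {
                key: pool.submit(pending[key][0], *map(resolve, pending[key][1:]))
                for key in ready
            }
            for key, future in futures.items():
                done[key] = future.result()
                del pending[key]
    return done[target]


def total(*parts):
    """Combine the partial sums into the final answer."""
    return sum(parts)


# A "big" collection split into chunks, as a parallel array would be.
data = list(range(10))
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

# Build the task graph: one partial sum per chunk, then one combining task.
graph = {}
for i, chunk in enumerate(chunks):
    graph[f"chunk-{i}"] = chunk
    graph[f"sum-{i}"] = (sum, f"chunk-{i}")
graph["total"] = (total, *[f"sum-{i}" for i in range(len(chunks))])

result = run_graph(graph, "total")
print(result)  # 45
```

The per-chunk sums do not depend on each other, so they can run at the same time; only the final `total` task has to wait. Dask's collections build graphs of this general shape for you, and its schedulers handle the execution far more efficiently than this sketch.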
Beyond these two parts, Dask integrates closely with data science frameworks and other libraries, and offers customized interfaces that make it easier to use. It is also an open-source project with a large maintainer community and a vast ecosystem of integrations and “daughter” libraries.
Dask is constantly evolving and emphasizes the following virtues:
- Familiar: provides parallelized NumPy array and Pandas DataFrame objects