Python for Big Data - OARC Workshop
From Balamurugan Desinghu
In recent years, Python has become one of the top programming languages for data analysis because of inherent advantages such as simplicity, readability, and portability. However, Python is slow compared to C or Fortran, and its memory management is less efficient. These limitations in speed and memory management may not be significant when analyzing small datasets, but they become bottlenecks when analyzing big datasets.
To address the challenges associated with big data analytics, the Python community has developed and tested several techniques. In this workshop, we will go through some of these techniques, including vectorization, parallelization, just-in-time compilation, and distributed task execution. We will do hands-on exercises that address the questions listed under Objectives.
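As a taste of the first technique, vectorization, here is a minimal sketch that computes a sum of squares twice: once with an explicit Python loop and once with a single NumPy call. It assumes NumPy is installed. The function names are hypothetical and chosen only for illustration; this is not taken from the workshop materials.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Sum of squares using an explicit Python loop (one element at a time)."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Sum of squares using one NumPy call that runs in compiled code."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


# Both versions return the same value; on large arrays the vectorized
# version is typically much faster because the loop runs inside NumPy.
sample = [1.0, 2.0, 3.0]
loop_result = sum_of_squares_loop(sample)
vectorized_result = sum_of_squares_vectorized(sample)
```

Timing the two functions on a large array (for example with the standard `timeit` module) is a simple way to see the difference on your own machine.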
Objectives
How to speed up data analysis?
What to do when the dataset size exceeds the available physical memory?
How to distribute workloads when doing machine learning on big datasets?
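To make the second question concrete, one common approach when data does not fit in memory is to process it in chunks and keep only running totals. The sketch below uses the Python standard library only and assumes input with one number per line. The helper `chunked_mean` and its `chunk_size` parameter are hypothetical names for illustration, not part of the workshop materials.

```python
import io


def chunked_mean(lines, chunk_size=1000):
    """Mean of one number per line, holding at most chunk_size values in memory."""
    total = 0.0
    count = 0
    chunk = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk.append(float(line))
        if len(chunk) >= chunk_size:
            # Fold the full chunk into the running totals, then discard it.
            total += sum(chunk)
            count += len(chunk)
            chunk.clear()
    # Fold in whatever is left in the final, partially filled chunk.
    total += sum(chunk)
    count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


# An in-memory stand-in for a large file; a real file object opened with
# open(path) can be passed in the same way because it also yields lines.
fake_file = io.StringIO("\n".join(str(i) for i in range(1, 11)))
result = chunked_mean(fake_file, chunk_size=3)
```

Only one chunk is held in memory at a time, so peak memory use depends on `chunk_size` rather than on the size of the input.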
What is needed: Laptop or desktop with an Internet connection
Duration: 3 hours
Programming Platform: Amarel Cluster
Prerequisite: Basic laptop usage. Basic knowledge of Python is helpful for doing the hands-on session.
Slides and materials: Will be provided in the workshop