This repository contains content for the Big Data Analytics with Python course. In its latest iteration, the course was taught at The African Institute for Mathematical Sciences (AIMS), Rwanda in 2022 and 2023 as part of the Master of Science in Mathematical Sciences (Data Science stream) programme. For more details about this Master's programme, please check the AIMS website.
This course is a practical guide to working with large-scale datasets. Its principal aim is to introduce participants to the ecosystem of technologies for working with such datasets, including technologies for data storage, data processing, and building machine learning models, in as practical a way as possible, using Python as the programming language. For more details about the course content, refer to this outline; the main modules taught in the course are presented below.
- Module 1: Big data basics. See the lecture slides here.
- Module 2: Functional programming and distributed data processing. See the lecture slides here and the corresponding notebook here.
- Module 3: Data gathering from the Web. See the lecture slides here and the corresponding notebooks here and here.
- Module 4: The Hadoop ecosystem. See the lecture slides here.
- Module 5: Introduction to Apache Spark. See the lecture slides here and the corresponding notebook here.
- Module 6: Data wrangling with Spark’s structured APIs. See the lecture slides here and the corresponding notebook here.
- Module 7: Machine Learning with Apache Spark. See the lecture slides here and the corresponding notebook here.
The repository contains the following folders:
- SLIDES: This folder holds all the PowerPoint and Google Slides presentations with lecture notes. Because the presentations are large, I'm not uploading them here, so this folder will be mostly empty. However, the presentations can be found at the link.
- DOCS: This folder contains miscellaneous documents for the course, for instance, the course outline.
- NOTEBOOKS: This folder has all the source code for the tutorials. This includes the notebooks and Python files.
- DATASETS: As the name suggests, this folder has the datasets used in the course. Again, because of their size, these datasets are not uploaded here.
- RESOURCES: In this folder, there are learning resources such as PDF books and articles.
- SOFTWARE: This folder has all the packages required for the course. As some of the installation files are large, they are not available here, but they can be found on the linked Google Drive.
To follow this material, the recommended approach is to tackle the modules in the order they are presented in the outline above. For each topic, go through the slides first and then move on to the tutorials in the notebooks. It's worth mentioning that, since the course was delivered in person, the material isn't necessarily ideal for self-paced learning, but a person with reasonable prerequisite knowledge can still follow the course and grasp the concepts.
For any questions regarding this course content, you can contact me through the two email addresses below: