Apache Spark with Python - Big Data with PySpark and Spark [Video]

James Lee et al.

Monday, April 16, 2018

Learn Apache Spark with Python through 12+ hands-on examples of analyzing big data with PySpark

Video Details

ISBN 13 9781789133394

Course Length 3 hours 18 minutes

Table of Contents

Get Started with Apache Spark

Course Overview
Introduction to Spark
Install Java and Git
Set up Spark
Run our first Spark job

RDD

RDD Basics
Create RDDs
Map and Filter Transformation
Solution to Airports by Latitude Problem
FlatMap Transformation
Set Operations
Solution for the Same Hosts Problem
Actions
Solution to Sum of Numbers Problem
Important Aspects about RDD
Summary of RDD Operations
Caching and Persistence

Spark Architecture and Components

Spark Architecture
Spark Components

Pair RDD

Introduction to Pair RDD
Create Pair RDDs
Filter and MapValues Transformations on Pair RDD
Reduce By Key Aggregation
Solution for the Average House Problem
Group By Key Transformation
Sort By Key Transformation
Solution for the Sorted Word Count Problem
Data Partitioning
Join Operations

Advanced Spark Topics

Accumulators
Solution to StackOverflow Survey Follow-up Problem
Broadcast Variables

Spark SQL

Introduction to Spark SQL
Spark SQL in Action
Spark SQL Practice: House Price Problem
Spark SQL Joins
DataFrame or RDD
DataFrame and RDD Conversion
Performance Tuning of Spark SQL

Running Spark in a Cluster

Introduction to Running Spark in a Cluster
Spark-submit
Run Spark Application on Amazon EMR (Elastic MapReduce) cluster

Video Description

This course covers the fundamentals of Apache Spark with Python and teaches you how to develop Spark applications using PySpark, the Python API for Spark. By the end of this course, you will have gained in-depth knowledge of Apache Spark, along with general big data analysis and manipulation skills, to help your company adopt Apache Spark for building big data processing pipelines and data analytics applications.

This course covers 10+ hands-on big data examples. You will learn how to frame data analysis problems as Spark problems. Together we will work through examples such as aggregating NASA Apache web logs from different sources, exploring price trends in California real estate data, writing Spark applications to find the median salary of developers in different countries from the Stack Overflow survey data, and developing a system to analyze how makerspaces are distributed across different regions of the United Kingdom.

Style and Approach

This course covers 10+ hands-on big data examples. You will learn how to frame data analysis problems as Spark problems.

What You Will Learn

Get an overview of the architecture of Apache Spark.
Develop Apache Spark 2.0 applications using RDD transformations and actions and Spark SQL.
Work with Apache Spark's primary abstraction, resilient distributed datasets (RDDs), to process and analyze large data sets.
Analyze structured and semi-structured data using DataFrames, and develop a thorough understanding of Spark SQL.
Optimize and tune Apache Spark jobs by partitioning, caching, and persisting RDDs.
Scale up Spark applications on a Hadoop YARN cluster through Amazon's Elastic MapReduce service.
Share information across different nodes on an Apache Spark cluster using broadcast variables and accumulators.
Write Spark applications using PySpark, the Python API for Spark.

Authors

James Lee

James Lee is a passionate software wizard working at one of the top Silicon Valley-based start-ups specializing in big data analysis. He has also worked at Google and Amazon. In his day job, he works with big data technologies, including Cassandra and Elasticsearch, and is an absolute Docker geek and IntelliJ IDEA lover. Apart from his career as a software engineer, he is keen on sharing his knowledge with others and guiding them, especially in relation to start-ups and programming. He has been teaching courses and conducting workshops on Java programming and IntelliJ IDEA since he was 21. James also enjoys skiing and swimming, and is a passionate traveler.

Pedro Magalhães Bernardo

Pedro Magalhães Bernardo is a software engineer and data scientist based in Rome, Italy. He currently works as a freelancer on software projects across the globe. Previously, he worked for start-ups and other companies in Brazil, where he helped build data pipelines and software architectures for big data analysis. His main areas of expertise are software architecture and data engineering, and he is pursuing an MSc in Data Science at the University of Rome "La Sapienza".