Analysis of the Data Processing Framework of Pandas and Snowpark Pandas API
This is a process analysis of migrating existing Pandas workflows to the Snowpark Pandas API, an almost lift-and-shift approach to meeting ever-growing data needs.
This article explains how to migrate existing Pandas workflows to the Snowpark Pandas API, allowing data processing needs to scale efficiently without a full code rewrite. It is largely a lift-and-shift approach that gets data processing workflows up and running in minimal time and in a highly secure environment.
Prerequisites
- Proficiency in Python scripting (version 3.8 or later)
- Knowledge of basic and complex SQL for scripting
- Snowflake Account
- Snowflake Warehouse Usage permissions
- AWS S3/Cloud External Stage and Access Integration
Introduction
Pandas has long been the go-to library for data manipulation and analysis. As datasets grow in volume and variety, traditional Pandas can run into memory limitations and performance bottlenecks. The Snowpark Pandas API is a promising tool that brings the power of distributed computing to the Pandas API, within the secure environment of Snowflake.
The Snowpark Pandas API is an extension of Snowflake's Snowpark framework, designed to allow Python developers to run Pandas code directly on data stored in Snowflake. By leveraging Snowflake's computational engine, this API enables scalable data processing without the need to move data out of the platform.
The Snowpark Pandas API is Snowflake’s effort to bridge the gap between the familiar, user-friendly Pandas experience and the powerful, scalable processing capabilities of the Snowflake Data Cloud. It allows Python developers and data scientists to write Pandas-like code while executing computations on Snowflake's infrastructure, benefiting from its scalability, performance, and security.
- Snowpark Pandas API is a Pandas-compatible API that runs on top of Snowpark, Snowflake’s Python library.
- It allows us to write Pandas code (like df.groupby(), df.merge(), df["col"].mean(), etc.), but the computation is offloaded to Snowflake.
- The API translates Pandas operations into SQL queries under the hood, leveraging Snowflake’s processing engine.
Snowpark Pandas operates by translating Pandas operations into SQL queries that Snowflake can execute. This approach maintains the familiar eager execution model of Pandas while optimizing performance through Snowflake's query engine, as the short sketch below illustrates.
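For example, ordinary Pandas-style calls like the ones below run against Snowflake rather than in local memory. This is a minimal sketch, assuming an active Snowpark session (created as shown in the walkthrough that follows) and a hypothetical SALES table with REGION and AMOUNT columns:

```python
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowpark Pandas backend

# Familiar Pandas syntax; each call is translated to SQL and executed in Snowflake
sales = pd.read_snowflake('SALES')
avg_amount = sales['AMOUNT'].mean()                  # aggregate computed in the warehouse
by_region = sales.groupby('REGION')['AMOUNT'].sum()  # GROUP BY executed in the warehouse
print(avg_amount)
print(by_region.head())
```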
Main Benefits
- Familiarity: For Python developers well-versed in Pandas, this API offers a seamless transition. The syntax and operations closely mirror those of native Pandas, reducing the learning curve.
- Scalability: While traditional Pandas operates within a single machine's memory, Snowpark Pandas distributes computations across Snowflake's infrastructure, allowing large datasets to be handled efficiently.
- Security and governance: Data never leaves Snowflake's secure environment, ensuring compliance with organizational data governance policies.
- No additional infrastructure: There is no separate compute cluster to provision or manage; Snowpark Pandas runs on Snowflake's existing warehouses, simplifying operations.
Process Walkthrough
The simple use case we discuss here is setting up a data processing workflow that performs data transformations and writes the results back to Snowflake. The step-by-step process flow is explained below, followed by a consolidated sketch that puts the steps together.
1. Install dependencies:
```bash
pip install snowflake-snowpark-python[modin]
```
Note: Ensure you're using Python 3.9, 3.10, or 3.11, and have Modin version 0.28.1 and Pandas version 2.2.1 installed.
2. Initialize Snowpark session:
```python
from snowflake.snowpark.session import Session

# Create a Snowpark session; replace the placeholders with your connection details
session = Session.builder.configs({
    'account': '<your_account>',
    'user': '<your_user>',
    'password': '<your_password>',
    'role': '<your_role>',
    'database': '<your_database>',
    'schema': '<your_schema>',
    'warehouse': '<your_warehouse>',
}).create()
```
3. Read data into Snowpark Pandas DataFrame:
```python
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowpark Pandas backend

# Read a Snowflake table into a Snowpark Pandas DataFrame; the data stays in Snowflake
df = pd.read_snowflake('<your_table>')
```
4. Perform data operations:
```python
# Pandas-style filtering; the comparison is pushed down to Snowflake as SQL
filtered_df = df[df['column_name'] > 100]
```
5. Write data back to Snowflake:
```python
# Persist the transformed result back to Snowflake, replacing the target table
filtered_df.to_snowflake('<your_table>', if_exists='replace', index=False)
```
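Putting the steps together, the consolidated sketch below reads a table, cleans it, and writes the result to a new table. The environment-variable names, the ORDERS table, and its AMOUNT column are illustrative assumptions; adjust them to your environment:

```python
import os

import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowpark Pandas backend
from snowflake.snowpark.session import Session

# Build the session from environment variables instead of hard-coded credentials
session = Session.builder.configs({
    'account': os.environ['SNOWFLAKE_ACCOUNT'],
    'user': os.environ['SNOWFLAKE_USER'],
    'password': os.environ['SNOWFLAKE_PASSWORD'],
    'role': os.environ['SNOWFLAKE_ROLE'],
    'database': os.environ['SNOWFLAKE_DATABASE'],
    'schema': os.environ['SNOWFLAKE_SCHEMA'],
    'warehouse': os.environ['SNOWFLAKE_WAREHOUSE'],
}).create()

# Read, transform, and write back, all executed inside Snowflake
orders = pd.read_snowflake('ORDERS')
orders = orders[orders['AMOUNT'] > 0]               # drop non-positive amounts
orders['AMOUNT_WITH_TAX'] = orders['AMOUNT'] * 1.1  # illustrative derived column
orders.to_snowflake('ORDERS_CLEAN', if_exists='replace', index=False)
```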
Architecture Overview
- Client-side libraries:
- Modin: Provides a Pandas-like API that supports parallel execution across multiple cores or nodes.
- Snowpark Pandas plugin: Integrates Modin with Snowflake, enabling operations to be executed within the Snowflake environment.
- Snowflake session:
- Establishes a connection to Snowflake, allowing data operations to be performed directly within the platform.
- Snowpark Pandas DataFrame:
- Represents data in a structure similar to Pandas DataFrames but optimized for distributed processing within Snowflake.
- SQL query translation:
- Operations on the Snowpark Pandas DataFrame are translated into SQL queries, leveraging Snowflake's compute engine for execution (see the sketch after this list).
- Execution in Snowflake:
- The translated SQL queries are executed within Snowflake's infrastructure, utilizing its scalability and performance optimizations.
- Results storage:
- The processed data can be returned to the client as a Pandas DataFrame or stored within Snowflake for further use.
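One way to confirm that operations are being pushed down is to inspect the SQL that a Pandas-style call generates. The sketch below uses Snowpark's query history listener and assumes the session and filtered_df from the walkthrough above:

```python
# Record the SQL statements Snowflake executes for a Pandas-style operation
with session.query_history() as history:
    avg_value = filtered_df['column_name'].mean()

for record in history.queries:
    print(record.query_id, record.sql_text)
```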
Limitations/Considerations
- Data types: While Snowpark Pandas aims to closely align with native Pandas, a few data types may have different representations due to Snowflake's type system.
- Local operations: Operations that require data to be moved outside Snowflake, such as to_pandas(), will materialize the data locally and may not benefit from distributed processing (see the sketch below).
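When local materialization is unavoidable, for example to hand results to a plotting library, it helps to reduce the data inside Snowflake first. A minimal sketch, continuing from the walkthrough above:

```python
# Filter and trim in Snowflake, then pull only the small result into local memory
sample = filtered_df.head(1000).to_pandas()  # native Pandas DataFrame on the client
print(type(sample), len(sample))
```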
Use Cases
- Data exploration: Quickly analyze and visualize large datasets without the need for data extraction.
- Data engineering: Perform complex transformations on large datasets directly within Snowflake.
- Data cleansing: Efficient transformation and pre-processing of data at scale, ensuring high-quality inputs for downstream applications.
Conclusion
The Snowpark Pandas API represents a major advancement in data processing, combining the simplicity of Pandas with the scalability of Snowflake. It is an efficient and powerful tool for Python developers looking to leverage the full potential of cloud-based data platforms.
Snowpark Pandas demonstrates significant performance improvements over traditional methods. Reading a 10 million-row dataset into a Snowpark Pandas DataFrame took approximately 4.58 seconds, whereas using the to_pandas() method took about 65 seconds.
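Your numbers will vary with warehouse size, table shape, and network conditions. A minimal sketch for running the same comparison against one of your own tables, assuming an active Snowpark session (the table name is a placeholder):

```python
import time

import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowpark Pandas backend

# Time reading into a Snowpark Pandas DataFrame (data stays in Snowflake)
start = time.perf_counter()
df = pd.read_snowflake('<your_large_table>')
print(f"read_snowflake: {time.perf_counter() - start:.2f}s")

# Time materializing the same data locally as a native Pandas DataFrame
start = time.perf_counter()
local_df = df.to_pandas()
print(f"to_pandas: {time.perf_counter() - start:.2f}s")
```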
For data architects, data engineers, and Pandas enthusiasts, this analysis aims to provide the insight needed to choose the right solution for their environments. For deeper technical details or practical guides, refer to the Snowflake documentation.