Analysis of the Data Processing Framework of Pandas and Snowpark Pandas API

This is a process analysis of migrating existing Pandas workflows to the Snowpark Pandas API using a near lift-and-shift approach to meet ever-growing data needs.

By Prasath Chetty Pandurangan · Jul. 15, 25 · Analysis


This article explains how to migrate existing Pandas workflows to the Snowpark Pandas API, allowing data processing to scale efficiently without a full code rewrite. It is essentially a lift-and-shift approach that gets data processing workflows up and running in minimal time and in a highly secure environment.

Prerequisites

  1. Proficiency in Python scripting (version 3.9 or later)
  2. Working knowledge of basic and complex SQL
  3. A Snowflake account
  4. Snowflake warehouse usage permissions
  5. AWS S3/cloud external stage and access integration

Introduction

Pandas has long been the go-to library for data manipulation and analysis. As datasets grow in volume and variety, traditional Pandas can run into memory limitations and performance bottlenecks. The Snowpark Pandas API is a promising tool that brings the power of distributed computing to the Pandas API, within the secure environment of Snowflake.

The Snowpark Pandas API is an extension of Snowflake's Snowpark framework, designed to allow Python developers to run Pandas code directly on data stored in Snowflake. By leveraging Snowflake's computational engine, this API enables scalable data processing without the need to move data out of the platform.

The Snowpark Pandas API is Snowflake’s effort to bridge the gap between the familiar, user-friendly Pandas experience and the powerful, scalable processing capabilities of the Snowflake Data Cloud. It allows Python developers and data scientists to write Pandas-like code while executing computations on Snowflake's infrastructure, benefiting from its scalability, performance, and security.

  • Snowpark Pandas API is a Pandas-compatible API that runs on top of Snowpark, Snowflake’s Python library.
  • It allows us to write Pandas code (like df.groupby(), df.merge(), df["col"].mean(), etc.), but the computation is offloaded to Snowflake.
  • The API translates Pandas operations into SQL queries under the hood, leveraging Snowflake’s processing engine.

Snowpark Pandas operates by translating Pandas operations into SQL queries that Snowflake can execute. This approach preserves the familiar eager execution model of Pandas while optimizing performance through Snowflake's query engine.
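
As a quick illustration, the short sketch below runs a familiar group-and-aggregate in Pandas syntax while the computation happens inside Snowflake. The table and column names (SALES, REGION, AMOUNT) are hypothetical placeholders, and an active Snowpark session (set up as in the walkthrough below) is assumed.

Python

import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowflake backend for Modin

# Hypothetical table; replace with one in your schema
df = pd.read_snowflake('SALES')

# Familiar Pandas syntax; translated to SQL and executed in Snowflake,
# not in local memory
avg_by_region = df.groupby('REGION')['AMOUNT'].mean()
print(avg_by_region)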

Main Benefits

  1. Familiarity: For Python developers well-versed in Pandas, this API offers a seamless transition. The syntax and operations closely mirror those of native Pandas, reducing the learning curve.
  2. Scalability: While traditional Pandas operates within a single machine's memory, Snowpark Pandas distributes computations across Snowflake's infrastructure, allowing large datasets to be handled efficiently.
  3. Security and governance: Data never leaves Snowflake's secure environment, ensuring compliance with organizational data governance policies.
  4. No additional infrastructure: Snowpark Pandas utilizes Snowflake's existing compute resources, so there is nothing extra to provision or manage.

Process Walkthrough

The simple use case discussed here is a data processing workflow that performs transformations and writes the results back to Snowflake. The step-by-step process flow is explained below.

1. Install dependencies:

Shell

pip install "snowflake-snowpark-python[modin]"

Note: Ensure you're using Python 3.9, 3.10, or 3.11, and have Modin version 0.28.1 and Pandas version 2.2.1 installed.
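
One quick way to confirm the environment matches these requirements is a version check like the following:

Python

import sys
import modin
import pandas

print(sys.version_info[:2])  # expect (3, 9), (3, 10), or (3, 11)
print(modin.__version__)     # expect 0.28.1
print(pandas.__version__)    # expect 2.2.1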

2. Initialize Snowpark session:

Python

from snowflake.snowpark.session import Session

# Replace the placeholders with your Snowflake connection details
session = Session.builder.configs({
    'account': '<your_account>',
    'user': '<your_user>',
    'password': '<your_password>',
    'role': '<your_role>',
    'database': '<your_database>',
    'schema': '<your_schema>',
    'warehouse': '<your_warehouse>',
}).create()

3. Read data into a Snowpark Pandas DataFrame:

Python

import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowflake backend for Modin

df = pd.read_snowflake('<your_table>')

4. Perform data operations:

Python

# Standard Pandas filtering syntax; executed as SQL inside Snowflake
filtered_df = df[df['column_name'] > 100]

5. Write data back to Snowflake:

Python

# Persist the transformed result; if_exists='replace' overwrites the target table
filtered_df.to_snowflake('<your_table>', if_exists='replace', index=False)

Architecture Overview

  1. Client-side libraries:
    • Modin: Provides a Pandas-like API that supports parallel execution across multiple cores or nodes.
    • Snowpark Pandas plugin: Integrates Modin with Snowflake, enabling operations to be executed within the Snowflake environment.
  2. Snowflake session:
    • Establishes a connection to Snowflake, allowing data operations to be performed directly within the platform.
  3. Snowpark Pandas DataFrame:
    • Represents data in a structure similar to Pandas DataFrames but optimized for distributed processing within Snowflake.
  4. SQL query translation:
    • Operations on the Snowpark Pandas DataFrame are translated into SQL queries, leveraging Snowflake's compute engine for execution (see the sketch after this list).
  5. Execution in Snowflake:
    • The translated SQL queries are executed within Snowflake's infrastructure, utilizing its scalability and performance optimizations.
  6. Results storage:
    • The processed data can be returned to the client as a Pandas DataFrame or stored within Snowflake for further use.
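
To see this translation in practice, the Snowpark session's query history can be used to inspect the SQL that Pandas operations generate. Below is a minimal sketch, assuming the session and <your_table> from the walkthrough above; query_history() is part of the Snowpark Session API.

Python

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Record the SQL statements issued while the Pandas operations run
with session.query_history() as history:
    df = pd.read_snowflake('<your_table>')
    result = df['column_name'].mean()

# Each Pandas step above was pushed down as one or more SQL queries
for record in history.queries:
    print(record.sql_text)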

Limitations/Considerations

  • Data types: While Snowpark Pandas aims to closely align with native Pandas, a few data types may have different representations due to Snowflake's type system.
  • Local operations: Operations that require data to be moved outside Snowflake, such as to_pandas(), will materialize the data locally and may not benefit from distributed processing.
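
For example, converting to a native Pandas DataFrame pulls the full result set out of Snowflake into client memory, so it is best reserved for small or already-reduced results. A brief sketch, reusing the filtered_df from the walkthrough above:

Python

# Materializes the full result set locally; do this only after
# filtering or aggregating the data down to a manageable size
local_df = filtered_df.to_pandas()
print(type(local_df))  # a native pandas.DataFrame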

Use Cases

  • Data exploration: Quickly analyze and visualize large datasets without the need for data extraction.
  • Data engineering: Perform complex transformations on large datasets directly within Snowflake.
  • Data cleansing: Efficient transformation and pre-processing of data at scale, ensuring high-quality inputs for downstream applications.

Conclusion

The Snowpark Pandas API represents a major advancement in data processing, combining the simplicity of Pandas with the scalability of Snowflake. It is an efficient and powerful tool for Python developers looking to leverage the full potential of cloud-based data platforms.

Snowpark Pandas demonstrates significant performance improvements over traditional methods. Reading a 10 million-row dataset into a Snowpark Pandas DataFrame took approximately 4.58 seconds, whereas using the to_pandas() method took about 65 seconds.
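
Your numbers will vary with warehouse size and data shape; a simple way to take comparable measurements in your own environment is sketched below, with <your_large_table> as a placeholder.

Python

import time
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Time reading into a Snowpark Pandas DataFrame (data stays in Snowflake)
start = time.perf_counter()
df = pd.read_snowflake('<your_large_table>')
print(f"read_snowflake: {time.perf_counter() - start:.2f}s")

# Time materializing the same data locally as native Pandas
start = time.perf_counter()
local_df = df.to_pandas()
print(f"to_pandas: {time.perf_counter() - start:.2f}s")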

For data architects, data engineers, and Pandas enthusiasts, this analysis aims to provide the insights needed to choose the best solution for their environments. If you need deeper technical insights or practical guides, refer to the Snowflake documentation.

Opinions expressed by DZone contributors are their own.
