Edwardvaneechoud

Stop Drawing ETL Diagrams — Your Python Code Visualizes Itself

Ever wished you could write Python code and get the clarity of a visual data flow? That's exactly what Flowfile offers with FlowFrame — a Polars-like API that silently builds a visual ETL graph as you code.

The problem we're solving

As data engineers and scientists, we often find ourselves in one of two camps:

  • Writing complex Python code that's powerful but hard to explain
  • Drawing diagrams that are clear but disconnected from actual implementation

What if your code could be both powerful AND self-documenting?

Enter FlowFrame: code that visualizes itself

FlowFrame, part of the open-source Flowfile project, bridges this gap by combining the precision of Python coding with the clarity of visual ETL pipelines.

Key benefits:

  • Write familiar, Polars-style Python code
  • Automatically generate visual pipelines you can view, edit, and share
  • Seamlessly switch between code and visual interfaces
  • Full LazyFrame API support (new in v0.3.3!)
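
For a quick taste before the full walkthrough below, here is a minimal sketch that uses only the calls from that example (read_csv, filter, col, open_graph_in_editor); the "Sales" column name is a placeholder for whatever your raw file actually contains:

import flowfile as ff
from flowfile import col, open_graph_in_editor

# Ordinary Polars-style code...
orders = ff.read_csv("files/Superstore Sales Dataset.csv")
big_orders = orders.filter(col("Sales") > 500)

# ...which has quietly built a flow graph you can open in the Designer
open_graph_in_editor(big_orders.flow_graph)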

Getting started

Install Flowfile with a single command:

pip install flowfile

See it in action

Let's explore with a practical example using a sales dataset:

import flowfile as ff
from flowfile import col, open_graph_in_editor, when, lit

# Read the sales data
sales = ff.read_csv("files/Superstore Sales Dataset.csv", separator=",")

# Helper function to clean column names
def correct_column_names(df: ff.FlowFrame) -> ff.FlowFrame:
    columns = df.columns
    return df.select((col(c).alias(c.replace(" ", "_").lower()) for c in columns),
                     description="Column names to lowercase")

sales_clean = correct_column_names(sales)

# Advanced transformations with the FlowFrame API
transformed = (
    sales_clean
    .with_columns([
        # Calculate shipping time in days
        (col("ship_date") - col("order_date")).dt.total_days().alias("shipping_days"),
        # Extract year from order date for trend analysis
        col("order_date").dt.year().alias("order_year"),
        # Create sales tiers based on order value
        when(col("sales") < 50).then(lit("Small"))
        .when(col("sales") < 200).then(lit("Medium"))
        .when(col("sales") < 500).then(lit("Large"))
        .otherwise(lit("Enterprise"))
        .alias("order_tier"),
        # Clean up state codes (handle nulls)
        col("state").fill_null("Unknown").alias("state_clean")
    ], description='Create order characteristics')
    .filter(col("shipping_days") >= 0, description="Remove invalid shipping times")
    .group_by(["category", "segment", "order_tier"])
    .agg([
        col("sales").sum().alias("total_sales"),
        col("sales").mean().alias("avg_order_value"),
        col("shipping_days").mean().alias("avg_shipping_days"),
        col("order_id").n_unique().alias("unique_orders"),
        col("customer_id").n_unique().alias("unique_customers")
    ])
    .with_columns([
        # Calculate customer concentration ratio
        (col("unique_customers") / col("unique_orders") * 100)
        .round(1).alias("customer_concentration_pct")
    ], description='Create nice readable percentages')
    .filter(col("total_sales") > 1000, description='Filter on significant segments')
    .sort(["total_sales", "avg_order_value"], descending=[True, True])
)

# The magic happens here — visualize your pipeline!
open_graph_in_editor(transformed.flow_graph)

This launches the Flowfile Designer, showing your entire pipeline visually. Each node represents an operation, allowing you to:

  • Visualize data flow
  • Debug by inspecting results at each step
  • See data previews
  • Make visual edits that sync back to code
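
And because v0.3.3 brings the full LazyFrame API (covered below), inspecting results doesn't even require the GUI. Assuming collect() behaves as it does on a Polars LazyFrame, you can materialize any stage of the example directly in code; a rough sketch:

# Materialize the final result of the pipeline built above
result = transformed.collect()
print(result.head())

# Or stop at an intermediate stage to check the cleaned column names
print(sales_clean.collect().head())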

[Image: Flowfile Designer showing the visual ETL pipeline for the sales analysis example, with connected nodes for the transformations, filters, and aggregations]

What's new in v0.3.3: full Polars LazyFrame support

The latest release brings massive improvements to the FlowFrame API:

1. Type safety

# Now with full type hints and autocompletion!
transformed.select(col("customer_concentration_pct").cum_sum())  # IDE knows all available methods

2. Dynamic expression methods

All Polars expression methods are now available:

# List operations
df.with_columns(col("tags").list.len().alias("tag_count"))

# String operations  
df.filter(col("email").str.contains("@company.com"))

# Datetime operations
df.with_columns(col("timestamp").dt.hour().alias("hour"))

# Statistical operations
df.select(col("value").rolling_mean(7).alias("weekly_avg"))

3. User-defined functions support

# Custom functions are now tracked in the graph!
from functions import custom_transform 

df.with_columns(
    col("value").map_batches(custom_transform).alias("transformed")
)
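
The functions module here is just a placeholder. In Polars, a map_batches callable receives a Series and returns one of the same length, so (assuming Flowfile passes batches straight through to Polars) custom_transform might look something like this hypothetical sketch:

import polars as pl

def custom_transform(batch: pl.Series) -> pl.Series:
    # Hypothetical example: cap extreme values at the 99th percentile
    cap = batch.quantile(0.99)
    return batch.clip(upper_bound=cap)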

[Image: Flowfile Designer showing a custom function node appearing as a trackable component in the visual pipeline]

Real-world use cases

1. Explaining complex transformations

Ever tried explaining a complex data pipeline to non-technical stakeholders? With FlowFrame, you build the pipeline in Python and then share a visual flow that everyone can understand.

2. Debugging made visual

When something doesn't look right, seeing the transformation flow and inspecting data at each step makes troubleshooting much faster than combing through lines of code.
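
From code, one way to do this is to branch off the step you suspect and open just that slice of the graph, reusing sales_clean and the imports from the example above:

# Branch off the cleaned data to isolate rows that look wrong
suspect = sales_clean.filter(col("sales") <= 0, description="Isolate suspicious rows")

# Open only this branch in the Designer and use the node previews to inspect it
open_graph_in_editor(suspect.flow_graph)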

3. Team collaboration

# Data scientists can start in code
pipeline = create_complex_pipeline()
pipeline.save_graph("quarterly_analysis.flowfile")

# Analysts can continue visually in Flowfile Designer
# No code required!

What's next?

Flowfile is constantly evolving:

  • Tighter code-visual mapping: Every transformation accurately reflected in both directions
  • Visual-to-code generation: Generate clean Python code from visual designs
  • Domain-specific workflows: Specialized support for ML pipelines, time series analysis, and more
  • Cloud integrations: Direct connections to cloud data warehouses

Try it yourself

FlowFrame bridges the gap between code and visual ETL tools, offering a powerful way to build data pipelines that are both efficient and understandable.

# Your turn! Install and try it:
pip install flowfile

# Then run the example above or try your own data

The project is fully open-source on GitHub. We'd love your feedback, contributions, or ideas for new features!
