Edwardvaneechoud

Stop Drawing ETL Diagrams — Your Python Code Visualizes Itself

Ever wished you could write Python code and get the clarity of a visual data flow? That's exactly what Flowfile offers with FlowFrame — a Polars-like API that silently builds a visual ETL graph as you code.

The problem we're solving

As data engineers and scientists, we often find ourselves in one of two camps:

  • Writing complex Python code that's powerful but hard to explain
  • Drawing diagrams that are clear but disconnected from actual implementation

What if your code could be both powerful AND self-documenting?

Enter FlowFrame: code that visualizes itself

FlowFrame, part of the open-source Flowfile project, bridges this gap by combining the precision of Python coding with the clarity of visual ETL pipelines.

Key benefits:

  • Write familiar, Polars-style Python code
  • Automatically generate visual pipelines you can view, edit, and share
  • Seamlessly switch between code and visual interfaces
  • Full LazyFrame API support (new in v0.3.3!)
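
For a quick taste before the full walkthrough below, here is a minimal sketch that uses only the calls from that example (read_csv, filter, col, open_graph_in_editor); the "Sales" column name is a placeholder for whatever your raw file actually contains:

import flowfile as ff
from flowfile import col, open_graph_in_editor

# Ordinary Polars-style code...
orders = ff.read_csv("files/Superstore Sales Dataset.csv")
big_orders = orders.filter(col("Sales") > 500)

# ...which has quietly built a flow graph you can open in the Designer
open_graph_in_editor(big_orders.flow_graph)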

Getting started

Install Flowfile with a single command:

pip install flowfile

See it in action

Let's explore with a practical example using a sales dataset:

import flowfile as ff
from flowfile import col, open_graph_in_editor, when, lit

# Read the sales data
sales = ff.read_csv("files/Superstore Sales Dataset.csv", separator=",")

# Helper function to clean column names
def correct_column_names(df: ff.FlowFrame) -> ff.FlowFrame:
    columns = df.columns
    return df.select((col(c).alias(c.replace(" ", "_").lower()) for c in columns),
                     description="Column names to lowercase")

sales_clean = correct_column_names(sales)

# Advanced transformations with the FlowFrame API
transformed = (
    sales_clean
    .with_columns([
        # Calculate shipping time in days
        (col("ship_date") - col("order_date")).dt.total_days().alias("shipping_days"),
        # Extract year from order date for trend analysis
        col("order_date").dt.year().alias("order_year"),
        # Create sales tiers based on order value
        when(col("sales") < 50).then(lit("Small"))
        .when(col("sales") < 200).then(lit("Medium"))
        .when(col("sales") < 500).then(lit("Large"))
        .otherwise(lit("Enterprise"))
        .alias("order_tier"),
        # Clean up state codes (handle nulls)
        col("state").fill_null("Unknown").alias("state_clean")
    ], description='Create order characteristics')
    .filter(col("shipping_days") >= 0, description="Remove invalid shipping times")
    .group_by(["category", "segment", "order_tier"])
    .agg([
        col("sales").sum().alias("total_sales"),
        col("sales").mean().alias("avg_order_value"),
        col("shipping_days").mean().alias("avg_shipping_days"),
        col("order_id").n_unique().alias("unique_orders"),
        col("customer_id").n_unique().alias("unique_customers")
    ])
    .with_columns([
        # Calculate customer concentration ratio
        (col("unique_customers") / col("unique_orders") * 100)
        .round(1).alias("customer_concentration_pct")
    ], description='Create nice readable percentages')
    .filter(col("total_sales") > 1000, description='Filter on significant segments')
    .sort(["total_sales", "avg_order_value"], descending=[True, True])
)

# The magic happens here — visualize your pipeline!
open_graph_in_editor(transformed.flow_graph)

This launches the Flowfile Designer, showing your entire pipeline visually. Each node represents an operation, allowing you to:

  • Visualize data flow
  • Debug by inspecting results at each step
  • See data previews
  • Make visual edits that sync back to code
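
And because v0.3.3 brings the full LazyFrame API (covered below), inspecting results doesn't even require the GUI. Assuming collect() behaves as it does on a Polars LazyFrame, you can materialize any stage of the example directly in code; a rough sketch:

# Materialize the final result of the pipeline built above
result = transformed.collect()
print(result.head())

# Or stop at an intermediate stage to check the cleaned column names
print(sales_clean.collect().head())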

[Image: Flowfile Designer showing the visual ETL pipeline for the sales analysis example, with connected nodes for the transformations, filters, and aggregations]

What's new in v0.3.3: full Polars LazyFrame support

The latest release brings massive improvements to the FlowFrame API:

1. Type safety

# Now with full type hints and autocompletion!
transformed.select(col("customer_concentration_pct").cum_sum())  # IDE knows all available methods

2. Dynamic expression methods

All Polars expression methods are now available:

# List operations
df.with_columns(col("tags").list.len().alias("tag_count"))

# String operations  
df.filter(col("email").str.contains("@company.com"))

# Datetime operations
df.with_columns(col("timestamp").dt.hour().alias("hour"))

# Statistical operations
df.select(col("value").rolling_mean(7).alias("weekly_avg"))

3. User-defined functions support

# Custom functions are now tracked in the graph!
from functions import custom_transform 

df.with_columns(
    col("value").map_batches(custom_transform).alias("transformed")
)
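
The functions module here is just a placeholder. In Polars, a map_batches callable receives a Series and returns one of the same length, so (assuming Flowfile passes batches straight through to Polars) custom_transform might look something like this hypothetical sketch:

import polars as pl

def custom_transform(batch: pl.Series) -> pl.Series:
    # Hypothetical example: cap extreme values at the 99th percentile
    cap = batch.quantile(0.99)
    return batch.clip(upper_bound=cap)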

[Image: Flowfile Designer showing a custom function node appearing as a trackable component in the visual pipeline]

Real-world use cases

1. Explaining complex transformations

Ever tried explaining a complex data pipeline to non-technical stakeholders? With FlowFrame, you build the pipeline in Python and then share a visual flow that everyone can understand.

2. Debugging made visual

When something doesn't look right, seeing the transformation flow and inspecting data at each step makes troubleshooting much faster than combing through lines of code.
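
From code, one way to do this is to branch off the step you suspect and open just that slice of the graph, reusing sales_clean and the imports from the example above:

# Branch off the cleaned data to isolate rows that look wrong
suspect = sales_clean.filter(col("sales") <= 0, description="Isolate suspicious rows")

# Open only this branch in the Designer and use the node previews to inspect it
open_graph_in_editor(suspect.flow_graph)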

3. Team collaboration

# Data scientists can start in code
pipeline = create_complex_pipeline()
pipeline.save_graph("quarterly_analysis.flowfile")

# Analysts can continue visually in Flowfile Designer
# No code required!

What's next?

Flowfile is constantly evolving:

  • Tighter code-visual mapping: Every transformation accurately reflected in both directions
  • Visual-to-code generation: Generate clean Python code from visual designs
  • Domain-specific workflows: Specialized support for ML pipelines, time series analysis, and more
  • Cloud integrations: Direct connections to cloud data warehouses

Try it yourself

FlowFrame bridges the gap between code and visual ETL tools, offering a powerful way to build data pipelines that are both efficient and understandable.

# Your turn! Install and try it:
pip install flowfile

# Then run the example above or try your own data

The project is fully open-source on GitHub. We'd love your feedback, contributions, or ideas for new features!
