I need to query a large number of domains stored in a CSV file and save the associated metadata from the responses. My final goal is to generate a CSV file after applying filters to this metadata.
The original CSV file can be quite large, and I want to avoid loading everything into memory for performance reasons.
I’m considering two approaches:
- Read the CSV, perform the queries, save the metadata to a new CSV, and then load this new file with Polars to apply the filters and generate the final output (a rough sketch of this approach follows the example below).
- Use Polars from the start to read the CSV, perform the queries using map_batches, apply the filters in the same pipeline, and produce the final file directly.

Here's a simplified example of the second approach:
import asyncio

import httpx
import polars as pl


# Asynchronous function to perform the domain checks for one batch
async def check_multiple_domains(domains: pl.Series) -> list[list[dict]]:
    async with httpx.AsyncClient(follow_redirects=True) as client:
        # check_single_domain (not shown) queries one domain and returns its metadata dicts
        tasks = [check_single_domain(client, domain) for domain in domains]
        results = await asyncio.gather(*tasks)
        return results


# Synchronous wrapper that Polars calls with a batch of the "domain" column
def execute_domain_checks(domains: pl.Series) -> pl.Series:
    results = asyncio.run(check_multiple_domains(domains))
    return pl.Series(results, strict=False)


# Lazily scan the CSV and add the metadata column
lf = (
    pl.scan_csv("domains.csv")
    .with_columns(
        pl.col("domain")
        .map_batches(
            execute_domain_checks,
            # pl.Dict is not a Polars dtype; list the actual metadata fields here
            return_dtype=pl.List(pl.Struct({"status_code": pl.Int64, "final_url": pl.Utf8})),
        )
        # alias so the original domain column is kept next to the metadata
        .alias("metadata")
    )
)
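To finish the second approach, the filters and the final write would be chained onto the same lazy frame. This is only a sketch under the assumptions above: the metadata field names and the status_code == 200 condition are placeholders for my real filtering logic.

final = (
    lf
    # one row per check result; the metadata list becomes individual structs
    .explode("metadata")
    # turn the struct fields into plain columns so they can be filtered and written to CSV
    .unnest("metadata")
    # placeholder filter: keep only results that returned HTTP 200
    .filter(pl.col("status_code") == 200)
)

# sink_csv streams the output to disk; plans containing map_batches may not be fully
# streamable, in which case Polars falls back to materialising the result in memory
final.sink_csv("filtered_domains.csv")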
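For comparison, here is roughly what the first approach would look like. Everything is a sketch: the chunk size, the field names, and check_single_domain (the same helper as above, not shown) are assumptions.

import asyncio
import csv

import httpx
import polars as pl

CHUNK_SIZE = 1000  # domains queried per batch; placeholder value


async def check_chunk(domains: list[str]) -> list[list[dict]]:
    # Query one batch of domains concurrently; check_single_domain is not shown here.
    async with httpx.AsyncClient(follow_redirects=True) as client:
        return await asyncio.gather(*(check_single_domain(client, d) for d in domains))


def flush(domains: list[str], writer: csv.DictWriter) -> None:
    # Run the checks for one batch and append every metadata row to the intermediate CSV.
    for domain, metas in zip(domains, asyncio.run(check_chunk(domains))):
        for meta in metas:
            writer.writerow({"domain": domain, **meta})


# First pass: stream the input CSV batch by batch so it is never fully in memory.
with open("domains.csv", newline="") as src, open("metadata.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["domain", "status_code", "final_url"])  # placeholder fields
    writer.writeheader()
    chunk: list[str] = []
    for row in reader:
        chunk.append(row["domain"])
        if len(chunk) == CHUNK_SIZE:
            flush(chunk, writer)
            chunk = []
    if chunk:
        flush(chunk, writer)

# Second pass: lazily filter the intermediate file and stream the final output.
(
    pl.scan_csv("metadata.csv")
    .filter(pl.col("status_code") == 200)  # placeholder filter
    .sink_csv("filtered_domains.csv")
)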
My questions:
- Is the second approach viable in terms of performance, considering the CSV file might be large?
- What are the advantages and disadvantages of using polars.map_batches for this kind of task compared to a more sequential approach (approach 1)?
- Do you have suggestions on improving memory management or performance in such a scenario?