Locked learning resources

Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Locked learning resources

This lesson is for members only. Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Using Python Functions Inside DuckDB Queries

00:00 In the previous lesson, I showed you three different ways of performing an SQL query in DuckDB using a join as an example. In this lesson, I’ll show you how to use Python functions inside your queries.

00:13 DuckDB, they must have flown away, allows you to integrate Python functions into your queries. You do this by registering a Python function and then you can use it in your SELECT operations.

00:25 This is typically done to do data transformation, making changes to data from one or more columns. DuckDB is written in a lower-level language, so it’s quite performant.

00:35 When you integrate Python like this, you’re leaving that lower-level language and coming up into Python-land. As much as I love Python, it isn’t the fastest beast.

00:43 So although this feature can be useful, it can also be far slower than straight SQL. A lot of what you might use this feature for can also be done with more advanced SQL.

00:53 So if speed is important, make sure you want to pay this cost. Caveats aside, let’s go for a waddle in the REPL. Same old, same old. Now let me create a Python function that combines a first and last name into a single value.

01:21 Nothing fancy here, just a straightforward Python function. Although do note that type hints are important in this case. They tell DuckDB what data types can interact with this function.

01:33 Let’s quickly test it out.

01:39 Yep, good to go. Now, I’ll register it with DuckDB.

01:48 Usually when you’re dealing with a function, you’re calling it, so you might not realize that functions, like everything in Python, are objects. When you use the parentheses on the end, you’re invoking the object as a callable, but without them, you get an object reference just like any other variable.

02:05 This registration call takes a string name and then a reference to the function being called. It doesn’t call the function yet. This is kind of like the set alias in the previous lesson.

02:17 Registering the name with the actual reference for later. Let me get some data.

02:29 I’m using the relation object form here just because it’s convenient. And now for some SQL.

02:43 this version of SELECT uses the first and last_name columns, passing them to our registered function, short_name. The AS keyword says what to call the resulting column.

03:01 To mix it up a bit, I’ve also added another column, this one being a SQL calculation. It’s the delta between the start and end date. It also uses the AS keyword to name the result.

03:18 In creating this course, I ran this code a couple of different times. Once this call took almost a second, but the other times it was almost instantaneous.

03:26 I’m not sure whether that was a cache thing or just what was going on, but like I said, mixing the performance below the water DuckDB world with the less performant dry land slithering Python world, means everything kind of slows down.

03:41 Custom functions aren’t the only integrations that DuckDB supports. It also works directly with pandas. This, of course, is only possible if you’ve already got pandas installed.

04:00 The .df() method turns a relation object into a pandas DataFrame, and if pandas are too cute and cuddly, you could do Polars instead. Note that you have to have both Polars and PyArrow installed for this to work.

04:23 In this case, instead of .df(), you use .pl(), but the idea is the same. It converts the result into a Polars DataFrame.

04:33 At the start of this course, I mentioned that DuckDB isn’t your typical transaction-based database. Its purpose and focus is data analytics, and it does that well.

04:42 This, though, does cause some restrictions. It isn’t really set up for concurrency, a server-based database like Postgres or Oracle are meant to handle multiple transactions at a time from multiple sources.

04:55 DuckDB isn’t for that. If you’ve got multi-threaded code, you can read concurrently, but if two threads try to write at the same time, you’ll run into problems.

05:05 This isn’t really a big deal as it isn’t the purpose of the tool, but it’s something you need to keep in mind. What do you call a mallard standing in front of musicians? A conductor.

05:19 I got good news for you. The next lesson is a summary, which means the duck puns are almost over.

Become a Member to join the conversation.