Interpolating based on non-diagonal neighboring values

Question

I have a comma-separated value (CSV) file as input, and I am supposed to interpolate all missing (nan) values based on neighboring non-diagonal values.

The CSV looks like:

37.454012,95.071431,73.199394,59.865848,nan
15.599452,5.808361,86.617615,60.111501,70.807258
2.058449,96.990985,nan,21.233911,18.182497
nan,30.424224,52.475643,43.194502,29.122914
61.185289,13.949386,29.214465,nan,45.606998

I decided that Python and Pandas would be the most liked by others, but language choice was left to me.

import argparse
import pandas as pd
import sys
import os

parser = argparse.ArgumentParser()
parser._action_groups.pop()
required = parser.add_argument_group('Required arguments')
#optional = parser.add_argument_group('Optional arguments')
required.add_argument('--i', '--input_file', required=True, help = 'input CSV file')
required.add_argument('--o', '--output_file', required=True, help = 'output CSV file')
#optional.add_argument('--m', '--method', required = False, help = 'an option B', default = 'linear')
#return parser.parse_args()
args = parser.parse_args()

if not os.path.isfile(args.i):
    print(f"File \"{args.i}\" doesn't exist", file = sys.stderr)
    sys.exit(1)
df = pd.read_csv(args.i, header = None)
print(df)
rows, cols = df.shape
inter_df = df
def numeric(val):
    if pd.isna(val):
        return True
    return pd.to_numeric(val)
        
print(f"rows = {rows}; cols = {cols}")
for i in range(rows):
    for j in range(cols):
        #print(f"df[{i}][{j}] = {df.iloc[i,j]}")
        if numeric(df.iloc[i,j]) == False:
            printf(f'df[{i}][{j}] is not numeric, and there may be others') # check for non-numeric values
            sys.exit(2)
        if pd.notna(df.iloc[i,j]): # don't interpolate known values
            continue
        adjacent_val = []
        if ((i > 0) and (pd.isna(df.iloc[i-1, j]) == False)):   # cell that's up
            adjacent_val.append(df.iloc[i-1, j])
        if ((i < (rows-1)) and (pd.isna(df.iloc[i+1, j]) == False)): # cell that's down
            adjacent_val.append(df.iloc[i+1, j])
        if ((j > 0) and (pd.isna(df.iloc[i, j-1]) == False)): # left cell
            adjacent_val.append(df.iloc[i, j-1])
        if (((j+1) < cols) and (pd.isna(df.iloc[i, j+1]) == False)): # right cell
            adjacent_val.append(df.iloc[i, j+1])
        inter_df.iloc[i,j] = sum(adjacent_val) / len(adjacent_val) # mean
#--------------
# check for errors of various types
#--------------
total_na_vals = inter_df.isna().sum().sum()
if total_na_vals > 0:
    print(f"Missing vals were found", file = sys.stderr)
    print("Rows with any missing values:\n", inter_df[inter_df.isna().any(axis=1)])
    sys.exit(3)
max_df = df.max().max()
max_inter_df = inter_df.max().max()
if max_inter_df > max_df:
    print('if interpolating, max value cannot be greater than the original data frame')
    sys.exit(4)
min_df = df.min().min()
min_inter_df = inter_df.min().min()
if min_inter_df < min_df:
    print('if interpolating, min value cannot be less than the original data frame')
    sys.exit(5)
# write results to specified output file
inter_df.to_csv(args.o, sep = ',')

"modular code design, dependency management, and testing" were considered insufficient for this task, even though I verified that the code was producing the correct output and had numerous checks.

How could I have written this better?

If two NaNs are adjacent, your interpolation leaves it as NaN (since the sum of neighbours is NaN). Is that intentional, that you're only correcting isolated single-element gaps? — Peter Cordes
– Peter Cordes, Commented Jun 29 at 8:11

toolic · Accepted Answer · 2025-06-27 19:42:05Z

Comments

Delete commented-out code to reduce clutter. For example:

#optional = parser.add_argument_group('Optional arguments')

#optional.add_argument('--m', '--method', required = False, help = 'an option B', default = 'linear')
#return parser.parse_args()

Layout

Move the numeric function to the top after the import lines. Having it in the middle of the code interrupts the natural flow of the code (from a human readability standpoint).

Documentation

The PEP 8 style guide recommends adding docstrings for functions and at the top of the code to summarize its purpose. For example:

def numeric(val):
    """Say what this does"""

Simpler

It is uncommon to compare against False as in this line:

if numeric(df.iloc[i,j]) == False:

It is simpler as:

if not numeric(df.iloc[i,j]):

The same goes for other similar comparisons.

Since the total_na_vals variable is only used once after it is set:

total_na_vals = inter_df.isna().sum().sum()
if total_na_vals > 0:

it can be eliminated:

if inter_df.isna().sum().sum() > 0:

The same is true for:

max_df = df.max().max()
max_inter_df = inter_df.max().max()
if max_inter_df > max_df:

Divide

It is not clear to me if adjacent_val could be empty by the time this line is executed:

inter_df.iloc[i,j] = sum(adjacent_val) / len(adjacent_val) # mean

Review the code to make sure you will not get a divide-by-0 error.

Tools

You could run code development tools to automatically find some style issues with your code.

Both ruff and pylint complain about printf in this line:

printf(f'df[{i}][{j}] is not numeric, and there may be others') # check for non-numeric values

ruff:
F821 Undefined name `printf`

pylint:      
E0602: Undefined variable 'printf' (undefined-variable)

My guess is that you never exercised that line of code in your testing. You should probably replace printf with print.

ruff also finds:

F541 [*] f-string without any placeholders
 |
 | total_na_vals = inter_df.isna().sum().sum()
 | if total_na_vals > 0:
 |     print(f"Missing vals were found", file = sys.stderr)
 |           ^^^^^^^^^^^^^^^^^^^^^^^^^^ F541
 |     print("Rows with any missing values:\n", inter_df[inter_df.isna().any(axis=1)])
 |     sys.exit(3)
 |
 = help: Remove extraneous `f` prefix

Output

I see values written to the output file with many digits:

31.222654000000002

Consider formatting the output file data using fewer digits, as you do for the data printed to stdout, like:

37.454012

Reinderien · Accepted Answer · 2025-06-27 22:59:01Z

I decided that Python and Pandas would be the most liked by others

Python is fine for this application; Pandas would not be my first choice. There are no meaningful column or row indices. Scipy is better for this. Among other reasons, I expect it to be much, much faster.

Put your argument parser in a function. I also think the input and output file paths should be simple positional arguments.

When you can, avoid prototyping with square data. It makes finding transposition errors very difficult. I demonstrate with one row added to your data.

The idea of interpolating based on non-diagonal neighbours is under-defined in your problem. I am going to suggest that in real life, you would want to toss all of your left/right/up/down logic out the window and use an off-the-shelf linear interpolator. It also seems like the implication is for the upper-right nan to be filled with an "interpolated" value from its left and lower neighbours, but crucially this is not interpolation in Euclidean space. Imagine that there is a line extending from the left neighbour diagonally to the lower neighbour. Anything below that line is defined. Anything above that line either

has to be extrapolated rather than interpolated; or
will neither be extrapolated nor interpolated, and left nan; or
needs to be interpolated in Manhattan space rather than in Euclidean space.

The last option would be a non-Euclidean geometry and very strange indeed. I demonstrate the second option.

import argparse
from pathlib import Path

import numpy as np
from scipy.interpolate import LinearNDInterpolator


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file', type=Path, help='input CSV file')
    parser.add_argument('output_file', type=Path, help='output CSV file')
    return parser.parse_args()


def interpolate(input_: np.ndarray) -> np.ndarray:
    m, n = input_.shape  # rows, columns
    i = np.arange(m)     # 1D row index
    j = np.arange(n)     # 1D col index
    jj, ii = np.meshgrid(j, i)  # each 2D broadcast indices

    # (m*n),2 array of all indices
    ij = np.stack((ii, jj), axis=-1).reshape((-1, 2))

    flat = input_.ravel()     # input 1D in row-major order
    missing = np.isnan(flat)  # boolean mask of missing values
    present = ~missing        # boolean mask of present values

    # interpolator object having all non-NA values
    interp = LinearNDInterpolator(
        points=ij[present], values=flat[present],
    )

    # fill in missing values
    flat[missing] = interp(ij[missing])

    # reshape back to input shape
    return flat.reshape(input_.shape)


def main() -> None:
    args = parse_args()
    input_ = np.loadtxt(args.input_file, delimiter=',')
    output = interpolate(input_)
    np.savetxt(args.output_file, output, delimiter=',')


if __name__ == '__main__':
    main()

Padded input:

37.454012,95.071431,73.199394,59.865848,nan
15.599452,5.808361,86.617615,60.111501,70.807258
2.058449,96.990985,nan,21.233911,18.182497
nan,30.424224,52.475643,43.194502,29.122914
61.185289,13.949386,29.214465,nan,45.606998
40,10,30,nan,40

Output:

3.745401199999999875e+01,9.507143100000000402e+01,7.319939399999999807e+01,5.986584799999999973e+01,nan
1.559945199999999943e+01,5.808360999999999663e+00,8.661761500000000069e+01,6.011150099999999696e+01,7.080725800000000447e+01
2.058448999999999973e+00,9.699098499999999490e+01,5.911244800000000055e+01,2.123391099999999909e+01,1.818249700000000146e+01
3.162186900000000023e+01,3.042422399999999882e+01,5.247564299999999804e+01,4.319450199999999995e+01,2.912291400000000152e+01
6.118528899999999737e+01,1.394938600000000051e+01,2.921446500000000057e+01,3.741073149999999714e+01,4.560699799999999726e+01
4.000000000000000000e+01,1.000000000000000000e+01,3.000000000000000000e+01,3.500000000000000000e+01,4.000000000000000000e+01

Graipher · Accepted Answer · 2025-06-28 19:38:50Z

As one of the points brought up was insufficient dependency management, you were probably expected to deliver at least a requirements.txt, possibly and probably better yet a pyproject.toml defining any dependencies (ie non-standard library). Whether you use pip, poetry, UV or some other dependency manager to do this was probably irrelevant, if they did not specify which one they preferred. In this case this initially would have only included pandas, ideally with a version constraint to make sure only versions of pandas can be installed with which your code was tested to work.

Regarding testing, while you do perform some consistency checks at the end to determine if the output makes sense, this is quite far removed from adding unit tests to determine if your code is working correctly. For this you were probably expected to at least have doctests for some simple happy path functionality check and/or proper unit tests with something like pytest.

Any additional tools (like pytest or ruff recommended in another answer) would need to be defined as a dev dependency.

The other answers have already provided enough pointers regarding modular code design, so I will not repeat them here.

Stack Exchange Network

Interpolating based on non-diagonal neighboring values

3 Answers 3

Comments

Layout

Documentation

Simpler

Divide

Tools

Output

You must log in to answer this question.

Hot Network Questions

Interpolating based on non-diagonal neighboring values

3 Answers 3

Comments

Layout

Documentation

Simpler

Divide

Tools

Output

You must log in to answer this question.

Related

Hot Network Questions