
BUG: read_csv(): inconsistent dtype and content parsing #61730

Open

@945fc41467

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

"field1" ,"field2" ,"field3" ,"field4" ,"field5"      ,"field6" ,"field7"
"1"      ,      14 ,       6 ,      21 ,"euia"        ,    0.54 ,    1
"2"      ,      30 ,       5 ,      26 ,"euia"        ,    0.82 ,    1
"2"      ,       1 ,       0 ,       0 ,"eua"         ,    0    ,    0
"3"      ,      27 ,       7 ,      17 ,"euia"        ,    1    ,    1
"4"      ,      14 ,       0 ,       9 ,"euia"        ,    0.64 ,    0.92
"4"      ,      10 ,       0 ,       0 ,"eua"         ,    0    ,    0
"9"      ,      17 ,       1 ,       6 ,"euia"        ,    0.65 ,    0.58
"10"     ,      27 ,       4 ,      13 ,"eu"          ,    1    ,     
"10"     ,         ,       0 ,       0 ,"euia"        ,    0    ,     
"12"     ,      14 ,       1 ,      13 ,"uia"         ,    1    ,    0.75
"12"     ,       5 ,       1 ,       4 ,"ui   eiuaea" ,    1    ,    1
"13"     ,      22 ,       3 ,       7 ," euia"       ,    0.89 ,    1
"6"      ,      22 ,       3 ,       5 ,"euia"        ,    0.84 ,    0.79
"7"      ,      23 ,       5 ,       4 ,"uia"         ,    0.78 ,    1
"8"      ,      26 ,      11 ,       2 ,"euia"        ,    1.12 ,    1.30
"5"      ,      28 ,       3 ,       3 ,"euia"        ,    0.72 ,    0.68



import pandas as pd


pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.width', 1000)
pd.set_option("display.max_colwidth", None)

# Case 1: default parsing
df = pd.read_csv("exemple.csv")
# df = pd.read_csv("exemple.csv", quoting=1)  # changes nothing
list(df.columns)
df.dtypes
list(df["field5      "])

# Case 2: regex separator with the python engine
df = pd.read_csv("exemple.csv", sep=r"\s*,\s*", engine="python")
list(df.columns)
df.dtypes
list(df["field5"])

# Case 3: quoting=2 (csv.QUOTE_NONNUMERIC)
df = pd.read_csv("exemple.csv", quoting=2)
list(df.columns)
df.dtypes
list(df["field5      "])

# Case 4: quoting=3 (csv.QUOTE_NONE)
df = pd.read_csv("exemple.csv", quoting=3)
list(df.columns)
df.dtypes
list(df['"field5"      '])

# Case 5: explicit dtypes (keys match the parsed column names, trailing spaces included)
df = pd.read_csv("exemple.csv", quoting=2, dtype={"field1 ": "object",
                                                  "field2 ": "Int32",  # fails
                                                  "field3 ": "int",
                                                  "field4 ": "int",
                                                  "field5      ": "object",
                                                  "field6 ": "float",
                                                  "field7": "float"  # fails
                                                  })
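For anyone trying to reproduce this without writing the file to disk, the CSV text above can be wrapped in io.StringIO; this is only a reproduction aid (exemple.csv is the file name from the report):

import io
import pandas as pd

# Reproduction aid only: feed the CSV text shown above to read_csv through a
# file-like object instead of writing exemple.csv to disk.
csv_text = '''"field1" ,"field2" ,"field3" ,"field4" ,"field5"      ,"field6" ,"field7"
"1"      ,      14 ,       6 ,      21 ,"euia"        ,    0.54 ,    1
"10"     ,         ,       0 ,       0 ,"euia"        ,    0    ,
'''  # abbreviated; paste the full example above for the real test

df = pd.read_csv(io.StringIO(csv_text))
df.dtypes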

Issue Description

Hello,

I tried to parse a file like the example above and spent an afternoon on it. None of the results look logical to me, so I apologize in advance: I am filing one ticket for everything, because it would take too long to open one per problem. Feel free to split it into several tasks.

The expected column dtypes look easy to guess: the user put quote marks around field1 to force a string type. Fields 2-4 are expected to be integers. It would be almost understandable if field2 were converted to float, because the NumPy integer dtypes cannot hold NA values, but pandas has a nullable integer type that does, so there is no reason for anything else. field5 should be a string column containing the text between the quote marks. Fields 6 and 7 are expected to be float. Let's see what happens.
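To make that concrete, here is the dtype mapping I would expect, written out as a dict (my own summary of the expectation, not pandas output; the exact nullable names are one possible choice):

# Expected dtypes for exemple.csv (summary of the expectation above).
expected_dtypes = {
    "field1": "string",   # quoted values -> text, even if they look numeric
    "field2": "Int32",    # integers with a missing value -> nullable integer
    "field3": "int64",
    "field4": "int64",
    "field5": "string",   # quoted text
    "field6": "float64",
    "field7": "float64",  # floats with missing values -> float + NaN
}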

First try: df = pd.read_csv("exemple.csv")

  • The quote marks are removed from the column names, but the trailing spaces are kept. That is surprising because there is no consistent logic: either quote marks are text delimiters and should be removed, in which case why keep the characters outside the delimiters, or everything is part of the string, in which case everything must be kept.
  • dtypes are problematic:
    • field1 has been implicitly converted to int64, even though the user explicitly asked for a string. The convention “whatever is between quote marks is a string” is widespread and shared by R, C++ and Python. Why not respect it?
    • field2 is converted to a string. Missing values are a common case to handle; I would understand a conversion to float, or an error, but why a conversion to a string?
    • field5 has the same problem as the column names.
    • field7 is converted to a string. Here it is not understandable at all, since NumPy float dtypes handle NA values.
    • The other fields are correct, which is also a little surprising: so leading and trailing spaces are a problem in string fields and empty fields, but not in numeric fields? (A cleanup sketch follows this list.)
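A post-processing sketch for this first case, assuming the columns come back exactly as described above (stray whitespace in the header and in field5, field2/field7 as strings); this is a workaround, not the behaviour I would expect:

import pandas as pd

# Workaround sketch for the default parse described above.
df = pd.read_csv("exemple.csv")

# The parsed column names keep their trailing spaces ("field1 ", "field5      ", ...).
df.columns = df.columns.str.strip()

# field5 values keep their padding too (quotes removed, spaces kept).
df["field5"] = df["field5"].str.strip()

# field1 lost its quotes and became int64; casting back to string is only an
# approximation of the original intent.
df["field1"] = df["field1"].astype("string")

# field2 and field7 came back as strings; coerce them to numbers, turning the
# blank entries into NaN, then make field2 a nullable integer.
df["field2"] = pd.to_numeric(df["field2"], errors="coerce").astype("Int32")
df["field7"] = pd.to_numeric(df["field7"], errors="coerce")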

Case: df = pd.read_csv("exemple.csv", sep=r"\s*,\s*", engine="python")

Here leading and trailing spaces are removed, but not the quote marks. A ticket for that probably already exists somewhere. The field dtypes are OK, except for field2, which should be Int32.
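If the python-engine/regex-separator route is acceptable, the leftover quote marks can be stripped by hand; a sketch, assuming the quote characters survive in the object columns as described:

import pandas as pd

# With a regex separator the quote characters are not treated as quoting, so
# remove them manually afterwards (no-op for columns that have none).
df = pd.read_csv("exemple.csv", sep=r"\s*,\s*", engine="python")
df.columns = df.columns.str.strip('"')

obj_cols = df.columns[df.dtypes.eq(object)]
df[obj_cols] = df[obj_cols].apply(lambda s: s.str.strip('"'))

# field2 should be a nullable integer rather than a float with NaN.
df["field2"] = df["field2"].astype("Int32")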

Case: df = pd.read_csv("exemple.csv", quoting=2)

Here I tried to tell the method explicitly that quote marks mean string. It still does not work. The integer fields are now floats, except for field2 and field7, which are… strings!
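For reference, quoting=2 is csv.QUOTE_NONNUMERIC: per the csv module documentation it instructs the reader to convert all unquoted fields to float and leave quoted fields as text, which is exactly the behaviour being asked for here. Using the named constant makes the intent clearer than the magic number:

import csv
import pandas as pd

# quoting=1/2/3 correspond to the csv module constants.
assert csv.QUOTE_ALL == 1 and csv.QUOTE_NONNUMERIC == 2 and csv.QUOTE_NONE == 3

# QUOTE_NONNUMERIC: quoted fields stay text, unquoted fields become float.
# field2 and field7 coming back as strings anyway is the surprising part.
df = pd.read_csv("exemple.csv", quoting=csv.QUOTE_NONNUMERIC)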

Case: df = pd.read_csv("exemple.csv", quoting=3)

Here the parsing of column names and string fields is wrong, but at least it is logical: it simply keeps everything.
Fields containing NA values are still converted to strings.
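quoting=3 is csv.QUOTE_NONE, which disables quote handling entirely, so the quote characters simply become part of the data; that at least is consistent with the '"field5"      ' column name observed above:

import csv
import pandas as pd

# QUOTE_NONE: no special processing of quote characters at all.
df = pd.read_csv("exemple.csv", quoting=csv.QUOTE_NONE)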

Case: df = pd.read_csv("exemple.csv", quoting=2, dtype={...}) (the explicit-dtype call shown in full in the reproducible example above)

This raises errors and does not handle the field names correctly.
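A workaround sketch for this last case, sidestepping the whitespace-polluted column names by supplying clean names and cleaning each cell with a converter (converters bypass dtype, so the conversions are finished explicitly afterwards; the helper names below are illustrative):

import pandas as pd

# header=0 consumes the original (space-polluted) header row and names=
# replaces it with clean column names.
names = [f"field{i}" for i in range(1, 8)]
strip_cell = lambda value: value.strip().strip('"')

df = pd.read_csv(
    "exemple.csv",
    header=0,
    names=names,
    converters={name: strip_cell for name in names},
)

# Converters return strings, so finish the type conversion explicitly.
for col in ("field2", "field3", "field4", "field6", "field7"):
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["field2"] = df["field2"].astype("Int32")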

Expected Behavior

No implicit conversion. Never.

For string fields: I understand I may have to tweak the quoting and quotechar parameters, but once that is done, everything between quote marks should be a string, not an int or float, and the whitespace outside should be ignored.

For float fields containing NA values: they should be float columns with NA values.

For integer fields containing NA values: ideally they should be parsed as pandas IntXX, which handles NA values; at minimum as a float, but never as a string.
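As a side note, pandas 2.x has an opt-in nullable dtype backend that at least covers the "integers with NA stay integers" part of this expectation; I have not verified whether it helps with the quoting and whitespace issues on this particular file, so treat it as a pointer rather than a fix:

import pandas as pd

# dtype_backend="numpy_nullable" asks read_csv for the nullable extension
# dtypes (Int64, Float64, string, ...) instead of the NumPy defaults, so an
# integer column with missing values can stay an integer column.
df = pd.read_csv("exemple.csv", dtype_backend="numpy_nullable")
df.dtypes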

Installed Versions

INSTALLED VERSIONS
------------------
commit : 2cc3762
python : 3.13.3
python-bits : 64
OS : Linux
OS-release : 6.12.34-1-MANJARO
Version : #1 SMP PREEMPT_DYNAMIC Thu, 19 Jun 2025 15:49:06 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 2.3.0
numpy : 2.3.1
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : None
sphinx : None
IPython : 9.3.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : 3.1.6
lxml.etree : 5.4.0
matplotlib : 3.10.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : 2.9.10
pymysql : None
pyarrow : None
pyreadstat : None
pytest : 8.4.1
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None


Labels

Bug, Needs Triage (issue that has not been reviewed by a pandas team member)
