When reading a file without headers, existing answers correctly say that the header= parameter should be set to None, but none explain why. By default, header='infer', which behaves like header=0 when names= is not passed, so the first row of the file is read as the header. For example, the following code loses the first row of data: it is read as the header and then replaced by col_names.
Note that it's assumed that the columns are separated by a space ' ' here.
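import pandas as pd
col_names = ["Sequence", "Start", "End", "Coverage"]
df = pd.read_csv("path/to/file.txt", sep=' ') # <--- wrong: first data row becomes the header
df.columns = col_names                        # the header (i.e. the first data row) is replaced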
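To make the lost row visible, here is a minimal sketch that reads an in-memory string instead of a file. The two-row sample and its values are made up for illustration; only the column names come from the example above.

```python
import io
import pandas as pd

# Hypothetical two-row, space-separated sample standing in for the real file
sample = "chr1 100 200 0.5\nchr2 300 400 0.8\n"
col_names = ["Sequence", "Start", "End", "Coverage"]

df = pd.read_csv(io.StringIO(sample), sep=" ")  # header defaults to the first row
print(list(df.columns))  # labels were taken from the first data row

df.columns = col_names   # replaces those labels, so the first row is gone
print(len(df))           # number of rows left out of the original two
```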
To get the correct output, you can do one of the following two things:
- set header=None:
df = pd.read_csv("path/to/file.txt", sep=' ', header=None) # <--- OK
df.columns = col_names
- or use the names= parameter to assign column names in one function call:
df = pd.read_csv("path/to/file.txt", sep=' ', names=col_names) # <--- OK
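As a quick check that the two options give the same result, the sketch below compares them on a made-up in-memory sample (the sample values are an assumption for illustration, not the real file).

```python
import io
import pandas as pd

# Hypothetical two-row, space-separated sample standing in for the real file
sample = "chr1 100 200 0.5\nchr2 300 400 0.8\n"
col_names = ["Sequence", "Start", "End", "Coverage"]

# Option 1: read without a header, then assign the labels
via_header = pd.read_csv(io.StringIO(sample), sep=" ", header=None)
via_header.columns = col_names

# Option 2: pass the labels directly
via_names = pd.read_csv(io.StringIO(sample), sep=" ", names=col_names)

print(via_header.equals(via_names))  # compares values, dtypes and labels
print(via_names.shape)               # both data rows are kept
```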
The header=None approach is often preferred if the number of columns is not known in advance or if the specific column names are not important. With names=, it is important that len(col_names) equals the number of columns in the file; if fewer names are passed, only the last len(col_names) columns are read as columns and all preceding columns are read as index levels. For example, calling add_prefix() after read_csv adds a prefix to the default integer column names:
df = pd.read_csv("path/to/file.txt", sep=' ', header=None).add_prefix('col')
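The following sketch illustrates both points on a made-up four-column sample (the data and the shortened name list are assumptions for illustration): passing too few names moves the leading columns into the index, while header=None plus add_prefix() works without knowing the column count.

```python
import io
import pandas as pd

# Hypothetical two-row, space-separated sample with four columns
sample = "chr1 100 200 0.5\nchr2 300 400 0.8\n"

# Only two names for four columns: the two leading columns become index levels
short = pd.read_csv(io.StringIO(sample), sep=" ", names=["End", "Coverage"])
print(short.index.nlevels)
print(list(short.columns))

# No names needed: default integer labels 0..3 get the prefix
prefixed = pd.read_csv(io.StringIO(sample), sep=" ", header=None).add_prefix("col")
print(list(prefixed.columns))
```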
On the other hand, if the file has a header, i.e. the first row in the file is meant to be read as column labels, then passing names= alone will push that header row into the dataframe as its first row of data (passing names= without header= behaves like header=None). In that case, if you want to replace the column labels during the pd.read_csv call, also pass header=0.
import io
import pandas as pd
data = """
ab,bc
10,2.
"""
df = pd.read_csv(io.StringIO(data), names=['a', 'b']) # <--- wrong: 'ab,bc' becomes a data row
df = pd.read_csv(io.StringIO(data), names=['a', 'b'], header=0) # <--- OK: 'ab,bc' is replaced by names
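A side effect worth noticing is that the stray header row also changes the column dtypes, since the labels are text. The sketch below reuses the same two-line data and inspects the result; the variable names pushed and replaced are just illustrative.

```python
import io
import pandas as pd

data = "ab,bc\n10,2.\n"

# names= alone: the header line is kept as the first data row
pushed = pd.read_csv(io.StringIO(data), names=["a", "b"])
# names= with header=0: the header line is consumed and relabeled
replaced = pd.read_csv(io.StringIO(data), names=["a", "b"], header=0)

print(pushed.shape, replaced.shape)
print(pushed["a"].tolist())    # the text 'ab' sits next to '10', so the column is not numeric
print(replaced["a"].tolist())  # only the numeric value remains
```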