I have a big (8 GB) gzipped CSV file that I would like to read into a pandas DataFrame. Since the file is large, I read it in chunks, which works fine, but I'm interested in knowing whether there is a way to read only the last x lines without decompressing the whole file.
1 Answer
I can think of various ways to read the last lines of a DataFrame. As I am not sure I understood what you mean by "without decompressing the whole file", I wonder if any of the options below is of interest to you.
Option 1
When reading a .csv file with pandas.read_csv(), rows can be skipped so they are not included in the import.
For that, pass skiprows=[x], where x is the row number to exclude (note that row numbering is 0-based). To skip a block of rows, pass a list or range of row numbers instead.
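As a sketch of how this could give you only the last x rows: count the lines in a first streaming pass, then skip everything except the header and the tail. Note this still decompresses the file once while counting (the file name and sample data below are made up for the demonstration):

```python
import gzip
import pandas as pd

# Build a small gzipped CSV to stand in for the real 8 GB file.
path = "example.csv.gz"
with gzip.open(path, "wt") as f:
    f.write("id,value\n")
    for i in range(100):
        f.write(f"{i},{i * 2}\n")

# First pass: count lines by streaming the decompressed text.
# This never holds the whole file in memory, but it does decompress it once.
with gzip.open(path, "rt") as f:
    total = sum(1 for _ in f)

x = 5  # number of trailing rows wanted
# Keep the header (row 0) and skip every data row except the last x.
tail = pd.read_csv(path, skiprows=range(1, total - x))
print(tail)
```

The key point is that skiprows accepts any list-like of row numbers, so a range covering everything between the header and the tail does the job.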
Option 2
Another option might be converting the file to HDF5 and selecting a start and stop. Here's an example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': np.random.randn(50000)}, index=pd.date_range('20200528', periods=50000, freq='s'))
store = pd.HDFStore('example.h5', mode='w')
store.append('df', df)
rowsnumber = store.get_storer('df').nrows
store.select('df', start=rowsnumber-5, stop=rowsnumber)  # change the start to the number of rows to display, counted from the end
store.close()
Option 3
Assuming the DataFrame is already associated with the variable df, to read the last 5 rows use df.iloc:
rows = df.iloc[-5:]
Or df.tail:
rows = df.tail(5)