17

I have a 10 GB CSV file that contains some information that I need to use.

As I have limited memory on my PC, I cannot read the whole file into memory in a single batch. Instead, I would like to iteratively read only some rows of the file at a time.

Say that at the first iteration I want to read the first 100 rows, at the second rows 101 to 200, and so on.

Is there an efficient way to perform this task in Python? Does pandas provide something useful for this? Or are there better methods (in terms of memory and speed)?
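For concreteness, a rough sketch of the access pattern I have in mind (the file name and process_batch are placeholders; this naive version skips from the start of the file on every iteration):

import pandas as pd

batch_size = 100

# read rows 1-100, then 101-200, and so on (the first 5 batches here)
for batch_no in range(5):
    batch = pd.read_csv(
        'bigfile.csv',                                 # placeholder file name
        skiprows=range(1, batch_no * batch_size + 1),  # keep line 0, the header
        nrows=batch_size,
    )
    process_batch(batch)                               # placeholder for the actual work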

3
  • Maybe: stackoverflow.com/questions/10717504/… Commented Mar 20, 2017 at 10:09
  • While it is possible to do this in many different ways in Python, sometimes it's more practical to just split the file into smaller files (using e.g. split -l 100 filename) before processing them with Python (see the sketch after these comments). Commented Mar 20, 2017 at 10:13
  • What do you want to do with each iteration? Commented Jan 13, 2024 at 11:45
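In the spirit of that split -l suggestion, a small pure-Python sketch that writes the CSV out as 100-row pieces before any further processing (file names are placeholders, and, like split -l, it assumes no newlines inside quoted fields):

import itertools

def split_csv(path, rows_per_file=100, prefix='part_'):
    """Write the header plus each block of rows_per_file data rows to its own file."""
    with open(path, newline='') as src:
        header = src.readline()
        for file_no in itertools.count():
            rows = list(itertools.islice(src, rows_per_file))
            if not rows:
                break
            with open(f'{prefix}{file_no:05d}.csv', 'w', newline='') as dst:
                dst.write(header)
                dst.writelines(rows)

split_csv('bigfile.csv')  # placeholder file name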

3 Answers

5

You can use pandas.read_csv() with the chunksize parameter:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

import pandas as pd

for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    ...  # each chunk_df contains the next (up to) 100 rows of the whole CSV
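For instance, a minimal sketch that keeps memory bounded by aggregating each chunk as it arrives (the numeric column 'amount' is just an assumption for illustration):

import pandas as pd

total = 0.0
row_count = 0
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # only up to 100 rows are held in memory at any one time
    total += chunk_df['amount'].sum()   # 'amount' is a hypothetical column
    row_count += len(chunk_df)

print('rows:', row_count, 'sum of amount:', total)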

1 Comment

After reading the chunks you can also concatenate them to get the full CSV in a single DataFrame: collect each chunk_df in a list and call df = pd.concat(list_of_chunks, ignore_index=True), as sketched below.
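A minimal sketch of that pattern; note that concatenating only makes sense if the (possibly filtered) result actually fits in memory, and the 'amount' filter is purely illustrative:

import pandas as pd

chunks = []
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # keep only the rows you need so the concatenated result stays small
    chunks.append(chunk_df[chunk_df['amount'] > 0])   # hypothetical filter

df = pd.concat(chunks, ignore_index=True)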
2

This code may help you with the task. It steps through a large .csv file without consuming a lot of memory, so you can do this on a standard laptop.

import pandas as pd
import os

The chunksize here sets the number of rows of the CSV file that are read in each iteration:

chunksize2 = 2000
path = './'

# Read one chunk first just to capture the column headers
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2

# Re-open the file; start_chunk lets you skip chunks that were already processed
start_chunk = 0
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    skiprows=chunksize2 * start_chunk)

for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)

        # Export each chunk to its own new CSV file
        file_name = 'export_csv_' + str(start_chunk + i) + '.csv'
        save_path = os.path.abspath(
            os.path.join(
                path, file_name
            )
        )
        print('saving ...')
        df2.to_csv(save_path, index=False)

    except Exception:
        print('reached the end')
        break
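One caveat about the skiprows line above: when start_chunk > 0, an integer skiprows also skips the header line, so the first remaining data row would be misread as the header. A sketch of one way around that, passing a range that leaves line 0 (the header) in place:

import pandas as pd

chunksize2 = 2000
start_chunk = 3   # e.g. resume from the fourth chunk

data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    # skip the already-processed data rows but keep the header line
                    skiprows=range(1, chunksize2 * start_chunk + 1))

for i, df2 in enumerate(data2, start=start_chunk):
    print('chunk', i, 'has', len(df2), 'rows')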


0

The method of transferring a huge CSV into a database is good because we can then easily use SQL queries. But we also have to take two things into account.

FIRST POINT: an SQL database is not made of rubber either; it will not stretch your memory.

For example, this dataset converted to a db file:

https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

For this db file, the SQL query:

pd.read_sql_query("SELECT * FROM 'table' LIMIT 600000", Mydatabase)

With 16 GB of RAM it can read at most about 0.6 million records (operation time: 15.8 seconds). Somewhat maliciously, one could add that reading directly from the CSV file is a bit more efficient:

giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)

(operation time: 16.5 seconds)

SECOND POINT: to effectively use SQL on data converted from a CSV, we have to remember to store dates in a suitable form. So I propose adding this to ryguy72's code:

df['ColumnWithQuasiDate'] = pd.to_datetime(df['Date'])

The full code for the 311 file mentioned above:

import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()

plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')

chunksize = 100000
i = 0
j = 1

for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    # strip spaces from column names so they are valid SQL identifiers
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    # store the date columns as real datetimes
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    # keep a continuous index across chunks and append to the SQLite table
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1

print(time.time() - start_time)
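Once the CSV is in the SQLite file, the same chunked idea also works on the query side, so a result set never has to fit in memory all at once; a minimal sketch reusing the names from the code above:

import pandas as pd
from sqlalchemy import create_engine

WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')

# read_sql_query also accepts chunksize and then yields DataFrames piece by piece
for chunk in pd.read_sql_query("SELECT * FROM 'table'", WM_csv_datab7, chunksize=100000):
    print(len(chunk), 'rows in this chunk')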

Finally, I would like to add: converting a CSV file directly from the Internet into a database seems like a bad idea to me. I suggest downloading the data first and converting it locally.

