17

I have a 10 GB CSV file that contains some information that I need to use.

As I have limited memory on my PC, I cannot read the whole file into memory in a single batch. Instead, I would like to iteratively read only some rows of the file at a time.

Say that at the first iteration I want to read the first 100 rows, at the second rows 101 to 200, and so on.

Is there an efficient way to perform this task in Python? Does pandas provide something useful for this? Or are there better methods (in terms of memory and speed)?
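For concreteness, a rough sketch of the access pattern I have in mind (the file name and process_batch are placeholders; this naive version skips from the start of the file on every iteration):

import pandas as pd

batch_size = 100

# read rows 1-100, then 101-200, and so on (the first 5 batches here)
for batch_no in range(5):
    batch = pd.read_csv(
        'bigfile.csv',                                 # placeholder file name
        skiprows=range(1, batch_no * batch_size + 1),  # keep line 0, the header
        nrows=batch_size,
    )
    process_batch(batch)                               # placeholder for the actual work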

3
  • Maybe: stackoverflow.com/questions/10717504/… Commented Mar 20, 2017 at 10:09
  • While it is possible to do this in many different ways in Python, sometimes it's more practical to just split the file into smaller files (using e.g. split -l 100 filename) before processing them with Python (see the sketch after these comments). Commented Mar 20, 2017 at 10:13
  • What do you want to do with each iteration? Commented Jan 13, 2024 at 11:45
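In the spirit of that split -l suggestion, a small pure-Python sketch that writes the CSV out as 100-row pieces before any further processing (file names are placeholders, and, like split -l, it assumes no newlines inside quoted fields):

import itertools

def split_csv(path, rows_per_file=100, prefix='part_'):
    """Write the header plus each block of rows_per_file data rows to its own file."""
    with open(path, newline='') as src:
        header = src.readline()
        for file_no in itertools.count():
            rows = list(itertools.islice(src, rows_per_file))
            if not rows:
                break
            with open(f'{prefix}{file_no:05d}.csv', 'w', newline='') as dst:
                dst.write(header)
                dst.writelines(rows)

split_csv('bigfile.csv')  # placeholder file name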

3 Answers

5

You can use pandas.read_csv() with the chunksize parameter:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

import pandas as pd

for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    ...  # each chunk_df contains the next (up to) 100 rows of the whole CSV
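For instance, a minimal sketch that keeps memory bounded by aggregating each chunk as it arrives (the numeric column 'amount' is just an assumption for illustration):

import pandas as pd

total = 0.0
row_count = 0
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # only up to 100 rows are held in memory at any one time
    total += chunk_df['amount'].sum()   # 'amount' is a hypothetical column
    row_count += len(chunk_df)

print('rows:', row_count, 'sum of amount:', total)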

1 Comment

After reading the chunks you can also concatenate them to get the full CSV in a single DataFrame: collect each chunk_df in a list and call df = pd.concat(list_of_chunks, ignore_index=True), as sketched below.
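A minimal sketch of that pattern; note that concatenating only makes sense if the (possibly filtered) result actually fits in memory, and the 'amount' filter is purely illustrative:

import pandas as pd

chunks = []
for chunk_df in pd.read_csv('yourfile.csv', chunksize=100):
    # keep only the rows you need so the concatenated result stays small
    chunks.append(chunk_df[chunk_df['amount'] > 0])   # hypothetical filter

df = pd.concat(chunks, ignore_index=True)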
2

This code may help you with the task. It steps through a large .csv file without consuming a lot of memory, so you can do this on a standard laptop.

import pandas as pd
import os

The chunksize here sets the number of rows of the CSV file that are read in each iteration:

chunksize2 = 2000
path = './'

# Read one chunk first just to capture the column headers
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1")
df2 = data2.get_chunk(chunksize2)
headers = list(df2.keys())
del data2

# Re-open the file; start_chunk lets you skip chunks that were already processed
start_chunk = 0
data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    skiprows=chunksize2 * start_chunk)

for i, df2 in enumerate(data2):
    try:
        print('reading csv....')
        print(df2)
        print('header: ', list(df2.keys()))
        print('our header: ', headers)

        # Export each chunk to its own new CSV file
        file_name = 'export_csv_' + str(start_chunk + i) + '.csv'
        save_path = os.path.abspath(
            os.path.join(
                path, file_name
            )
        )
        print('saving ...')
        df2.to_csv(save_path, index=False)

    except Exception:
        print('reached the end')
        break
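One caveat about the skiprows line above: when start_chunk > 0, an integer skiprows also skips the header line, so the first remaining data row would be misread as the header. A sketch of one way around that, passing a range that leaves line 0 (the header) in place:

import pandas as pd

chunksize2 = 2000
start_chunk = 3   # e.g. resume from the fourth chunk

data2 = pd.read_csv('ukb35190.csv',
                    chunksize=chunksize2,
                    encoding="ISO-8859-1",
                    # skip the already-processed data rows but keep the header line
                    skiprows=range(1, chunksize2 * start_chunk + 1))

for i, df2 in enumerate(data2, start=start_chunk):
    print('chunk', i, 'has', len(df2), 'rows')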


0

The method of transferring a huge CSV into a database is good because we can then easily use SQL queries. But we also have to take two things into account.

FIRST POINT: an SQL database is not made of rubber either; it will not stretch your memory.

For example, this dataset converted to a db file:

https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

For this db file, the SQL query:

pd.read_sql_query("SELECT * FROM 'table' LIMIT 600000", Mydatabase)

With 16 GB of RAM it can read at most about 0.6 million records (operation time: 15.8 seconds). Somewhat maliciously, one could add that reading directly from the CSV file is a bit more efficient:

giga_plik = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
Abdul = pd.read_csv(giga_plik, nrows=1100000)

(operation time: 16.5 seconds)

SECOND POINT: to effectively use SQL on data converted from a CSV, we have to remember to store dates in a suitable form. So I propose adding this to ryguy72's code:

df['ColumnWithQuasiDate'] = pd.to_datetime(df['Date'])

The full code for the 311 file mentioned above:

import time
import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()

plikcsv = 'c:/1/311_Service_Requests_from_2010_to_Present.csv'
WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')

chunksize = 100000
i = 0
j = 1

for df in pd.read_csv(plikcsv, chunksize=chunksize, iterator=True, encoding='utf-8', low_memory=False):
    # strip spaces from column names so they are valid SQL identifiers
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    # store the date columns as real datetimes
    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])
    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])
    # keep a continuous index across chunks and append to the SQLite table
    df.index += j
    i += 1
    df.to_sql('table', WM_csv_datab7, if_exists='append')
    j = df.index[-1] + 1

print(time.time() - start_time)
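Once the CSV is in the SQLite file, the same chunked idea also works on the query side, so a result set never has to fit in memory all at once; a minimal sketch reusing the names from the code above:

import pandas as pd
from sqlalchemy import create_engine

WM_csv_datab7 = create_engine('sqlite:///C:/1/WM_csv_db77.db')

# read_sql_query also accepts chunksize and then yields DataFrames piece by piece
for chunk in pd.read_sql_query("SELECT * FROM 'table'", WM_csv_datab7, chunksize=100000):
    print(len(chunk), 'rows in this chunk')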

Finally, I would like to add: converting a CSV file directly from the Internet into a database seems like a bad idea to me. I suggest downloading the data first and converting it locally.

