Pandas - Read only first few lines of each row

Question

I have a large CSV file with about 10000 rows of text, and each row of my dataset consists of a number of lines. However, I only want to read, say, the first 20 lines of each row of the CSV file.

I came across the nrows parameter of the pandas read_csv method, which limits the number of rows of the dataset that are loaded. Is there also a way to read only the first 20 lines of data from each of the rows in pandas?

Doesn't a newline terminate a "row" in a CSV file? How can a row consist of multiple lines?
– Jan Christoph Terasa, Commented Jun 3, 2020 at 8:33
Thank you. The solution provided by Arun works for me and I would edit my question.
– dmorgan, Commented Jun 3, 2020 at 9:40

TiTo · Accepted Answer · 2020-06-03 10:02:14Z

1

You can read in the CSV with df = pd.read_csv(r'path\file.csv') (a raw string, so that the backslash is not treated as an escape character) and then just select the first 20 rows with df_new = df.head(20). Is that what you were thinking of?
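For illustration, here is a minimal sketch of this approach. The in-memory CSV and its single "text" column are made-up stand-ins for the real file, so treat the names as assumptions. Note that head(20) limits the number of rows in the DataFrame; it does not shorten the text inside each cell.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: 50 rows, one "text" column.
csv_text = "text\n" + "\n".join(f"row {i}" for i in range(50))

# Read the whole CSV, then keep only the first 20 rows.
df = pd.read_csv(io.StringIO(csv_text))
df_new = df.head(20)

print(df.shape)      # (50, 1)
print(df_new.shape)  # (20, 1)
```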

edited Jun 3, 2020 at 10:02

answered Jun 3, 2020 at 8:34


Arun · Accepted Answer · 2020-06-03 11:16:38Z

If I get your question correctly, your CSV file has multiple rows, where each row has multiple lines separated by the newline character '\n'. And you want to choose the first (say, for example) 3 lines from each row.

This can be achieved as:

import pandas as pd

# Read in CSV file using pandas-
data = pd.read_csv("example.csv")

# The first two rows (toy example) of dataset are-
data.iloc[0,0]
# 'Hello, this is first line\nAnd this is the second line\nThird and final line'

data.iloc[1,0]
# 'Today is 3rd June\nThe year is 2020\nSummer weather'

# First row's first line-
data.iloc[0,0].split("\n")[0]
# 'Hello, this is first line'

# First row's first two lines-
data.iloc[0,0].split("\n")[0:2]
# ['Hello, this is first line', 'And this is the second line']

The general syntax to get the first 'n' lines from row 'x' (assuming that the first column has the string data) is:

data.iloc[x,0].split("\n")[:n]

To select the first 'm' lines (assuming there are m lines or more) from the first 'x' rows, use the code:

data.iloc[:x, 0].apply(lambda y: y.split("\n")[0:m])
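To do the same for every row and store the result in a new DataFrame, something along these lines should work. This is only a sketch: the in-memory CSV, the column name "text", and the variable names are assumptions made so the example is self-contained, and it assumes every cell in the column is a string (no missing values).

```python
import io

import pandas as pd

# Hypothetical toy file: quoted fields may contain embedded newlines.
csv_text = (
    'text\n'
    '"Hello, this is first line\nAnd this is the second line\nThird and final line"\n'
    '"Today is 3rd June\nThe year is 2020\nSummer weather"\n'
)
data = pd.read_csv(io.StringIO(csv_text))

n = 2  # number of lines to keep from each row

# Split each cell on newlines, keep the first n lines, and join them back
# into a single string so the new column has the same shape as the old one.
trimmed = data["text"].apply(lambda cell: "\n".join(cell.split("\n")[:n]))

# Store the trimmed text in a new DataFrame.
result = pd.DataFrame({"text": trimmed})

print(result.iloc[0, 0])
print(result.iloc[1, 0])
```

If a list of lines per row is more convenient than a joined string, drop the "\n".join(...) call and keep the slice on its own.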

Does this help?

Thanks, Arun. This is exactly what I was looking for. One other thing: how can I apply your general syntax to an entire dataframe? For example, I have 50 rows of data, and with your syntax I can get the first 'n' lines of a specific row 'x'. But what if I want to do it for all rows and store the results in a new dataframe?
@dmorgan I have added code to answer your question. If it answers your question, can you mark it as helping your question?

Not_a_programmer · Accepted Answer · 2020-06-03 08:43:06Z

0

If TiTo's answer is not what you are looking for, maybe the iloc indexer is. You can store the first 20 rows by doing firstRows = df.iloc[:20].

However, if you only ever need the first 20 rows, you shouldn't load the whole file into memory. As you mentioned, this can be achieved with the nrows parameter.
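As a rough sketch of the nrows approach, again using a made-up in-memory CSV in place of the real file (the data and the "text" column name are assumptions):

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: 50 rows, one "text" column.
csv_text = "text\n" + "\n".join(f"row {i}" for i in range(50))

# Only the first 20 data rows are parsed; the rest of the file is not loaded.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=20)

print(len(first_rows))  # 20
```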

answered Jun 3, 2020 at 8:43

