0

Please help! I have tried different approaches and packages for writing a program that takes 4 inputs and returns the writing score statistics for the group defined by that combination of inputs, using data from a csv file. This is my first project, so I would appreciate any insights/hints/tips!

Here is the csv sample (has 200 rows total):

id   gender  ses     schtyp  prog      write
70   male    low     public  general   52
121  female  middle  public  vocation  68
86   male    high    public  general   33
141  male    high    public  vocation  63
172  male    middle  public  academic  47
113  male    middle  public  academic  44
50   male    middle  public  general   59
11   male    middle  public  academic  34
84   male    middle  public  general   57
48   male    middle  public  academic  57
75   male    middle  public  vocation  60
60   male    middle  public  academic  57

Here is what I have so far:

import csv
import numpy
csv_file_object=csv.reader(open('scores.csv', 'rU')) #reads file
header=csv_file_object.next() #skips header
data=[] #loads data into array for processing
for row in csv_file_object:
    data.append(row)
data=numpy.array(data)

#asks for inputs 
gender=raw_input('Enter gender [male/female]: ')
schtyp=raw_input('Enter school type [public/private]: ')
ses=raw_input('Enter socioeconomic status [low/middle/high]: ')
prog=raw_input('Enter program status [general/vocation/academic]: ')

#makes them lower case and strings
prog=str(prog.lower())
gender=str(gender.lower())
schtyp=str(schtyp.lower())
ses=str(ses.lower())

What I am missing is how to filter and get stats only for a specific group. For example, say I input male, public, middle, and academic -- I'd want to get the average writing score for that subset. I tried the groupby function from pandas, but that only gets you stats for broad groups (such as public vs private). I also tried DataFrame from pandas, but that only gets me filtering for one input, and I'm not sure how to get the writing scores. Any hints would be greatly appreciated!

2
  • Have a read from this section onwards and see how you get on, basically what you ask can be done Commented Oct 7, 2014 at 14:12
  • Seems like a typical case for Boolean Indexing over multiple columns in a dataframe. Can you try following the method outlined here Commented Oct 7, 2014 at 17:57

2 Answers

1

Agreeing with Ramon, Pandas is definitely the way to go, and has extraordinary filtering/sub-setting capability once you get used to it. But it can be tough to first wrap your head around (or at least it was for me!), so I dug up some examples of the sub-setting you need from some of my old code. The variable itu below is a Pandas DataFrame with data on various countries over time.

# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania'  # returns True/False values
itu[subset]  # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania']  # one-line command, equivalent to the above two lines

# Pandas has many built-in functions like .isin() to provide params to filter on    
itu[itu.cntrycode.isin(['USA','FRA'])]  # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])]  # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])]  # Both of above at same time

# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]

# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) & 
    itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]

# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName']  # gives us UAE, UK, & US
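
Applied to the data in the question, the same boolean-indexing idea might look like the sketch below. The inline sample (a few rows copied from the question) and the helper name writing_stats are assumptions made for illustration; in practice the frame would come from pd.read_csv on the real 200-row file.

```python
import io
import pandas as pd

# A few rows copied from the question's sample; the real file has 200 rows.
SAMPLE = """id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
172 male middle public academic 47
113 male middle public academic 44
11 male middle public academic 34
48 male middle public academic 57
60 male middle public academic 57
"""

scores = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")


def writing_stats(df, gender, schtyp, ses, prog):
    """Hypothetical helper: describe() of `write` for rows matching all four inputs."""
    # Each comparison yields a True/False Series; & keeps rows where all are True.
    mask = (
        (df["gender"] == gender.lower())
        & (df["schtyp"] == schtyp.lower())
        & (df["ses"] == ses.lower())
        & (df["prog"] == prog.lower())
    )
    return df.loc[mask, "write"].describe()


stats = writing_stats(scores, "male", "public", "middle", "academic")
print(stats)
```

Each condition needs its own parentheses because & binds more tightly than == in Python. describe() returns count, mean, std, min, max and quartiles; use .mean() instead if only the average is wanted.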

1 Comment

Thanks a bunch TC Allen! It worked. Thanks for giving me some crucial tips and hints as I just begin to learn this program :)
0

Look at pandas. I think it will shorten your csv parsing work and give you the subset functionality you're asking for...

import pandas as pd
data = pd.read_csv('fileName.txt', sep=r'\s+')  # whitespace-separated columns

#get all of the male students
data[data['gender'] == 'male']
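
If stats for every combination are wanted at once, groupby can also take all four columns as keys, which avoids the "broad groups only" problem mentioned in the question. This is a minimal sketch using a few rows from the question's sample loaded from an in-memory string (an assumption for illustration; the real data would be read from the file).

```python
import io
import pandas as pd

# A few rows copied from the question's sample; the real file has 200 rows.
SAMPLE = """id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
84 male middle public general 57
"""

data = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# Grouping on all four columns makes each group one specific combination.
group_stats = (
    data.groupby(["gender", "schtyp", "ses", "prog"])["write"]
    .agg(["count", "mean", "min", "max"])
)

# Look up one combination by its (gender, schtyp, ses, prog) key.
selected = group_stats.loc[("male", "public", "middle", "academic")]
print(selected)
```

Looking up a combination that does not occur in the data raises a KeyError, so a real program would probably want to check for that before reporting results.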
