
I have a CSV file:

ID,"address","used_at","active_seconds","pageviews"
0a1d796327284ebb443f71d85cb37db9,"vk.com",2016-01-29 22:10:52,3804,115
0a1d796327284ebb443f71d85cb37db9,"2gis.ru",2016-01-29 22:48:52,214,24
0a1d796327284ebb443f71d85cb37db9,"yandex.ru",2016-01-29 22:14:30,4,2
0a1d796327284ebb443f71d85cb37db9,"worldoftanks.ru",2016-01-29 22:10:30,41,2

and I need to remove the rows that contain certain words. There are 117 words.

I tried

for line in df:
    if 'yandex.ru' in line:
        df = df.replace(line, '')

but with 117 words it works too slowly, and afterwards, when I create a pivot_table, the words I tried to delete still appear as columns:

             aaa                         10ruslake.ru  youtube.ru 1tv.ru  24open.ru
0   0025977ab2998580d4559af34cc66a4e             0        0       34      43
1   00c651e018cbcc8fe7aa57492445c7a2             230      0       0       23
2   0120bc30e78ba5582617a9f3d6dfd8ca             12       0       0       0
3   01249e90ed8160ddae82d2190449b773             25       0       13      25

Those columns contain only 0.

How can I do this faster and remove the rows so that those words do not appear as columns?

  • Can you add a Minimal, Complete, and Verifiable example? Commented Apr 26, 2016 at 10:28
  • Sorry, you're iterating over your df columns, testing whether a word is present, and replacing it with an empty string? Are your words in a list? If so, you can try pattern = '|'.join(words) then for col in df: df[col] = df[col].str.replace(pattern, '', case=False) Commented Apr 26, 2016 at 10:29
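The approach sketched in the comment above can be written out as follows. This is a minimal sketch on made-up sample data; `words` here stands in for the full list of 117, and escaping the pattern is an addition so the dots are matched literally:

```python
import re
import pandas as pd

# Hypothetical sample standing in for the real CSV
df = pd.DataFrame({
    'ID': ['a1', 'a1', 'a1'],
    'address': ['vk.com', '2gis.ru', 'yandex.ru'],
    'pageviews': [115, 24, 2],
})

words = ['yandex.ru', 'vk.com']  # in practice, the full list of 117 words
# Escape the dots so 'vk.com' cannot match e.g. 'vkxcom'
pattern = '|'.join(re.escape(w) for w in words)

# Keep only the rows whose address does NOT contain any of the words
mask = ~df['address'].str.contains(pattern, case=False)
print(df[mask])
```

One vectorized `str.contains` call over the whole column avoids the per-row Python loop, which is where the slowness came from.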

2 Answers


IIUC you can use isin with boolean indexing:

print df
                                 ID          address              used_at  \
0  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
1  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
2  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
3  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30   
4  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
0            3804        115  
1            3804        115  
2             214         24  
3               4          2  
4              41          2  

words = ['vk.com','yandex.ru']

print ~df.address.isin(words)
0    False
1    False
2     True
3    False
4     True
Name: address, dtype: bool

print df[~df.address.isin(words)]
                                 ID          address              used_at  \
2  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
4  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
2             214         24  
4              41          2  

Then use pivot:

print df[~df.address.isin(words)].pivot(index='ID', columns='address', values='pageviews')
address                           2gis.ru  worldoftanks.ru
ID                                                        
0a1d796327284ebb443f71d85cb37db9       24                2
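Note that `pivot` raises an error if an (ID, address) pair occurs more than once after filtering (as `vk.com` does in the sample above, before it is removed). When duplicates can survive the filter, `pivot_table` with an aggregation is safer. A sketch on illustrative data:

```python
import pandas as pd

# Illustrative frame where one (ID, address) pair repeats
df = pd.DataFrame({
    'ID': ['a', 'a', 'a'],
    'address': ['2gis.ru', '2gis.ru', 'worldoftanks.ru'],
    'pageviews': [24, 10, 2],
})

# pivot_table aggregates duplicate (ID, address) pairs instead of raising
out = df.pivot_table(index='ID', columns='address',
                     values='pageviews', aggfunc='sum')
print(out)
```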

Another solution is to remove rows where some column (e.g. pageviews) is 0:

print df

                                 ID          address              used_at  \
0  0a1d796327284ebb443f71d85cb37db9       youtube.ru  2016-01-29 22:10:52   
1            0a1d796327284ebfsffsdf       youtube.ru  2016-01-29 22:10:52   
2  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
3  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
4  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30   
5  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
0            3804          0  
1            3804          0  
2            3804        115  
3             214         24  
4               4          2  
5              41          2  
print df.pageviews != 0
0    False
1    False
2     True
3     True
4     True
5     True
Name: pageviews, dtype: bool

print df[(df.pageviews != 0)]
                                 ID          address              used_at  \
2  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52   
3  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52   
4  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30   
5  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30   

   active_seconds  pageviews  
2            3804        115  
3             214         24  
4               4          2  
5              41          2  

print df[(df.pageviews != 0)].pivot_table(index='ID', columns='address', values='pageviews')
address                           2gis.ru  vk.com  worldoftanks.ru  yandex.ru
ID                                                                           
0a1d796327284ebb443f71d85cb37db9       24     115                2          2
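If some unwanted addresses still end up as all-zero columns after pivoting, they can also be dropped from the pivoted result itself. A sketch, using an illustrative frame in place of the real pivot output:

```python
import pandas as pd

# Illustrative pivoted frame with one all-zero column
pivoted = pd.DataFrame(
    {'2gis.ru': [24, 0], 'badsite.ru': [0, 0], 'vk.com': [115, 3]},
    index=['id1', 'id2'],
)

# Keep only the columns that have at least one nonzero value
cleaned = pivoted.loc[:, (pivoted != 0).any(axis=0)]
print(cleaned.columns.tolist())
```

This addresses the symptom directly: any column that contains only 0 disappears, whatever its name.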

3 Comments

I asked question from another account stackoverflow.com/questions/36839602/… . I need to delete columns with some url. And I try to delete strings in original csv file and after that create a pivot_table
Hmmm, do you need to delete some rows where the address column contains certain strings? Or substrings? Before pivoting? If so, you need isin with boolean indexing.
Minor point: wouldn't df[df.pageviews != 0] be more readable?

The fastest way I know to work on CSV files is to use the pandas package to create a dataframe from the file.

import pandas as pd

df = pd.read_csv(the_path_of_your_file, header=0)
df.loc[df['address'] == 'yandex.ru', 'address'] = ''

This replaces the cells containing 'yandex.ru' with an empty string. Then you can write it back to a CSV with:

df.to_csv(the_path_of_your_file, index=False)

If what you want is to erase the rows where that URL occurs, use:

df = df.drop(df[df.address == 'yandex.ru'].index)
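With 117 words, the same idea generalizes with `isin` and boolean indexing, which handles the whole list in one pass instead of one `drop` per word. A sketch on made-up data; `words` here is illustrative:

```python
import pandas as pd

# Illustrative data; 'words' stands in for the full list of 117
df = pd.DataFrame({
    'address': ['vk.com', '2gis.ru', 'yandex.ru'],
    'pageviews': [115, 24, 2],
})
words = ['yandex.ru', 'vk.com']

# Equivalent to the drop above, but for the whole word list at once
df = df[~df['address'].isin(words)]
print(df)
```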
