
I need to classify URLs in a DataFrame and modify it based on exact-match and contains conditions:

import pandas as pd


class PageClassifier:
    def __init__(self, contains_pat, match_pat):
        """
        :param match_pat: A dict with exact match patterns in values as lists
        :type match_pat: dict
        :param contains_pat: A dict with contains patterns in values as lists
        :type contains_pat: dict
        """
        self.match_pat = match_pat
        self.contains_pat = contains_pat

    def worker(self, data_frame):
        """
        Classifies pages by type (url patterns)
        :param data_frame: DataFrame to classify
        :return: Classified by URL patterns DataFrame
        """
        try:
            for key, value in self.contains_pat.items():
                reg_exp = '|'.join(value)
                data_frame.loc[data_frame['url'].str.contains(reg_exp, regex=True), ['page_class']] = key

            for key, value in self.match_pat.items():
                data_frame.loc[data_frame['url'].isin(value), ['page_class']] = key

            return data_frame

        except Exception as e:
            print('page_classifier(): ', e, type(e))


df = pd.read_csv('logs.csv',
                 delimiter='\t', parse_dates=['date'],
                 chunksize=1000000)
contains = {'catalog': ['/category/', '/tags', '/search'], 'resources': ['.css', '.js', '.woff', '.ttf', '.html', '.php']}
match = {'info_pages': ['/information', '/about-us']}
    
classify = PageClassifier(contains, match)

new_pd = pd.DataFrame()
for num, chunk in enumerate(df):
    print('Start chunk ', num)
    new_pd = pd.concat([new_pd, classify.worker(chunk)])
new_pd.to_csv('classified.csv', sep='\t', index=False)

But it is very slow and takes too much RAM when I work with files over 10 GB. How can I search and modify the data faster? I need both "exact match" and "contains" pattern searching in one function.


1 Answer


The first thing I notice here that will tank performance the most is:

new_pd = pd.concat([new_pd, classify.worker(chunk)])

cs95 outlines this issue very well in their answer here. The general advice is "NEVER grow a DataFrame!". Essentially, creating a new copy of the DataFrame in each iteration is quadratic in time complexity: the entire accumulated DataFrame is copied on every iteration, and because it keeps getting larger, each copy costs more than the last.
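
As a rough, self-contained illustration of that quadratic cost (a sketch on made-up data, not the log-processing code from the question), growing a frame in a loop slows down with every iteration, while collecting the pieces in a list and concatenating once copies each chunk exactly one time:

import time

import pandas as pd

chunk = pd.DataFrame({'url': ['/about-us'] * 10_000})

# Growing a DataFrame: the accumulated frame is re-copied on every iteration.
start = time.perf_counter()
grown = pd.DataFrame()
for _ in range(200):
    grown = pd.concat([grown, chunk])
print('grow in a loop:', time.perf_counter() - start)

# Collect-then-concat: each chunk is copied exactly once, at the very end.
start = time.perf_counter()
parts = [chunk for _ in range(200)]
combined = pd.concat(parts, ignore_index=True)
print('concat once:', time.perf_counter() - start)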

If we wanted to improve this approach we might consider something like:

df_list = []
for num, chunk in enumerate(df):
    df_list.append(classify.worker(chunk))

new_pd = pd.concat(df_list, ignore_index=True)
new_pd.to_csv('classified.csv', sep='\t', index=False)

However, assuming we don't ever need the entire DataFrame in memory at once, and given that our logs.csv is so large that we need to read it in chunks, we should also consider writing out our DataFrame in chunks:

for num, chunk in enumerate(df):
    classify.worker(chunk).to_csv(
        'classified.csv', sep='\t', index=False,
        header=(num == 0),  # only write the header for the first chunk
        mode='w' if num == 0 else 'a'  # append mode after the first iteration
    )

In terms of reading in the file, we appear to only be using the url and page_class columns. Since we're not using the datetime functionality of the date column, we don't need to take the time to parse it.

df = pd.read_csv('logs.csv', delimiter='\t', chunksize=1000000)
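
Going a step further, and only if nothing downstream of classified.csv needs the other log columns (an assumption about your workflow, not something stated in the question), usecols would let pandas skip them entirely while reading:

# Assumes only 'url' is needed as input; the classifier then adds 'page_class'.
df = pd.read_csv('logs.csv', delimiter='\t', usecols=['url'], chunksize=1000000)
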
  • Thanks for your answer. Especially for "header=(num == 0), mode='w' if num == 0 else 'a'". This was a very useful example for my practice. Commented Oct 15, 2021 at 7:00
