double_j

Python string clean up function with optional args

I have a function that I mainly use while web scraping. It lets me pass in a multi-line address, or a name field with unwanted characters, and get back a cleaned-up string.

The code is below, and I would like to know whether this is the best approach: should I switch to recursion, stick with the while loop, or look at a completely different approach? Examples of input and output are included in the docstring.

def clean_up(text, strip_chars=[], replace_extras={}):
    """
    :type text: str
    :type strip_chars: list
    :type replace_extras: dict
    *************************
    strip_chars: optional arg
    Accepts a list of strings to iterate through.
    Each item found at the beginning or end of the text is
    removed.
    example:
    text input: '       ,  ,      , .,.,.,.,,,......test, \t  this\n.is.a\n.test...,,,         , .'
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^------^^^^----^^-----^^-----^^^^^^^^^^^^^^^^^^
    strip_chars arg: [',', '.']
    output: 'test, this .is.a .test'
    *************************
    replace_extras: optional arg
    Accepts a dict of replacements that override or extend
    the default clean_up_items dict.
    example:
    text input: ' this is one test\n!\n'
                 ^--------^^^-----^^-^^
    replace_extras arg: {'\n': '', 'one': '1'}
    output: 'this is 1 test!'
    *************************
    DEFAULT REPLACE ITEMS
    ---------------------
    These can be overridden and/or appended to using the replace_extras
    argument.
    replace item      - with
    <\\n line ending> - <space>
    <\\r line ending> - <space>
    <\\t tab>         - <space>
    <  double-space>  - <space>
    <text-input>      - <stripped>
    *************************
    """

    clean_up_items = {'\n': ' ', '\r': ' ', '\t': ' ', '  ': ' '}
    clean_up_items.update(replace_extras)

    text = text.strip()

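    # Repeat both passes until a full iteration leaves the text unchanged.
    change_made = True
    while change_made:
        text_old = text
        # Trim each unwanted item (and surrounding whitespace) from both ends.
        for x in strip_chars:
            while text.startswith(x) or text.endswith(x):
                text = text.strip(x).strip()

        # Replace each key until it no longer occurs in the text.
        # Caution: a value that contains its own key would never terminate.
        for key, val in clean_up_items.items():
            while key in text:
                text = text.replace(key, val)

        change_made = text != text_old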

    return text.strip()
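
For comparison with the while loop, here is a rough sketch of what a non-looping alternative might look like. The name `clean_up_sketch` is hypothetical, and the sketch rests on two assumptions that the original does not make: every entry in `strip_chars` is a single character, and the replacements only need one pass rather than being repeated until the text stops changing. It is not a drop-in replacement, only an illustration of the regex-based direction.

```python
import re


def clean_up_sketch(text, strip_chars=(), replace_extras=None):
    """Hypothetical single-pass variant of clean_up (illustrative only)."""
    # Apply the caller's replacements once each, in the order given.
    for old, new in (replace_extras or {}).items():
        text = text.replace(old, new)

    # Collapse every run of whitespace (spaces, tabs, line endings)
    # into a single space.
    text = re.sub(r'\s+', ' ', text)

    # Strip spaces and all strip_chars from both ends in one call.
    # Assumption: each entry of strip_chars is a single character.
    return text.strip(' ' + ''.join(strip_chars))


# The two examples from the docstring above.
messy = ('       ,  ,      , .,.,.,.,,,......test, \t  this\n.is.a\n.test'
         '...,,,         , .')
print(clean_up_sketch(messy, strip_chars=[',', '.']))
# test, this .is.a .test

print(clean_up_sketch(' this is one test\n!\n',
                      replace_extras={'\n': '', 'one': '1'}))
# this is 1 test!
```

Because each replacement runs only once here, results can differ from the looping version when one replacement creates text that another replacement (or the same one) would match again.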