Let's take a look at word_count, which appears to be the central function:
def word_count(_file, max_to_min=False):
txt = open(_file, "rU")
word_dict = {}
for line in txt:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
txt.close()
final_list = process_dict(word_dict)
format_print(final_list, max_to_min, len(word_dict))
_file is not a suitable name as according to PEP 8. It is more Pythonic to use with open(_file, "rU") as f too (known as context managers). With that, I rename _file to filename. These two points are mentioned in vnp's answer; however, I disagree with vnp's suggestion to catch the exception, as there is no need for a graceful exit. The program should crash if the file cannot be opened.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
word_dict = {}
for line in f:
if line.replace(" ", "") != ("\n" or None):
process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)
final_list = process_dict(word_dict)
format_print(final_list, max_to_min, len(word_dict))
Your function calls the process_line function:
def process_line(words, word_dict):
for word in words:
if word in word_dict:
word_dict[word] += 1
else:
word_dict[word] = 1
There's a builtin Python class for this called Counter. It has a dictionary interface too. With that, the process_line function is no longer necessary and we can rewrite this as:
from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
if line.replace(" ", "") != ("\n" or None):
counter.update(filter(None, split("[^a-zA-Z']+", line.lower())))
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))
Secondly, you appear to be removing all spaces from the line so as to find out if the line is just a series of whitespace and contains no actual words. This can be easily done using the strip function.
from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
line = line.strip().lower()
if not line:
continue
counter.update(filter(None, split("[^a-zA-Z']+", line)))
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))
filter can be rewritten as a generator, which feels more natural to me. That also uses less parenthesis, making the code more readable.
from collections import Counter
.
.
.
def word_count(filename, max_to_min=False):
with open(filename, "rU") as f:
counter = Counter()
for line in f:
line = line.strip().lower()
if not line:
continue
counter.update(x for x in split("[^a-zA-Z']+", line) if x)
final_list = process_dict(counter)
format_print(final_list, max_to_min, len(counter))
Now, let's take a look at process_dict.
def process_dict(word_dict):
temp_list = []
for key, value in word_dict.items():
temp_list.append((value, key))
temp_list.sort()
return temp_list
The first few lines can be done with a lambda. The new function looks like this:
def process_dict(counter):
temp_list = map(lambda (a, b): (b, a), counter.items())
temp_list.sort()
return temp_list
But was there really a need for a function on its own? In fact, I'd argue that since your function is named word_count, the function should only count words. Hence, we should just return the counter object and let the printing be handled. Also, we usually name functions as verbs, so I'll change the name to count_words.
The above change affects our whole program structure. Hence, I'll show the final code before explaining the changes I made.
from collections import Counter
from re import split
BANNER = "-" * 35
def format_print(counter, is_reverse=False):
lst = counter.items()
lst.sort(key=lambda (a, b): (b, a), reverse=is_reverse)
print ("[Unique Words: %d]" % len(lst)).center(35, "=")
print "%-16s | %16s" % ("Word", "Count")
print BANNER
for word, count in lst:
print "%-16s | %16d" % (word, count)
def count_words(filename):
counter = Counter()
with open(filename, "rU") as f:
for line in f:
line = line.strip().lower()
if not line:
continue
counter.update(x for x in split("[^a-zA-Z']+", line) if x)
return counter
format_print(count_words("Gettysburg.txt"), is_reverse=False)
I've removed max_to_min=False since we no longer sort the items in count_words.
In format_print, I renamed reverse to is_reverse, assigned it to False by default and removed num_words.
Afterwards, I rewrote the function that sorts the list such that it would sort by the count, then the word, without affecting the structure of the list. This makes the later loop more intuitive.
I've also seperated print statements that had strings seperated by commas, since they were confusing. I declared BANNER as a global variable (which is alright in Python, as long as it is used as a constant). In the process, I made a few minor changes to the output; I hope you don't mind!
It took a long time, but the end result is worth it. I hope that I've managed to show you the process of clearing up your code. :)
EDIT: The code here is not yet tested; I am currently checking all of the code I posted here.
EDIT 2: Updated the fixed version.