Bug
Your second script does not handle subsequent-year prices correctly. It tries to extract a "then $…" price for every line of the input, instead of once per domain.
This bug illustrates two weaknesses in the design of both programs, which together make them vulnerable to such mix-ups:
The parsing works line by line. A more robust approach would be to immediately split the blob of input text into one stanza per domain, then process each stanza independently.
The domains, prices, and prices_after_first_year are separate data structures. You hope that the corresponding entries in each list refer to the same domain, and you blindly zip() them together, but there isn't anything about the code that guarantees that assumption. If, for example, one of the domains is priced in € instead of $, then your data would be inconsistent, and you would never find out, because the code doesn't crash.
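To make the second weakness concrete, here is a minimal sketch of how parallel lists can drift apart without any error. The domain names and prices are made up for illustration; they are not taken from your input file.

```python
# Hypothetical data: '.DE' is priced in euros, so a dollar-only
# pattern finds no price for it and the prices list ends up one short.
domains = ['.COM', '.DE', '.NET']
prices = ['9.99', '12.50']  # '12.50' actually belongs to '.NET'

# zip() pairs entries purely by position and stops at the shortest list.
pairs = list(zip(domains, prices))
print(pairs)  # [('.COM', '9.99'), ('.DE', '12.50')]

# Keeping one record per domain makes a missing price visible instead.
records = [
    {'name': '.COM', 'price': '9.99'},
    {'name': '.DE', 'price': None},
    {'name': '.NET', 'price': '12.50'},
]
missing = [r['name'] for r in records if r['price'] is None]
print(missing)  # ['.DE']
```

In the zipped version, '.DE' silently receives the price of '.NET', and '.NET' disappears from the output altogether.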
Parsing
To extract information from text that fits a certain pattern, use regular expressions. Learning them takes some investment of time (regular expressions are a concept and a language independent of Python), but they really are the right tool for these kinds of tasks.
For this problem, I would use regular expressions in two ways:
To split the input into one stanza per domain, use re.split(). Here, I would look for a newline character followed by a dot. One tricky aspect is to put the dot inside a lookahead assertion (?=\.), so that only the newline is considered to be the delimiter to be discarded, and the dot is retained as the first character of each stanza.
To extract the name, price, and optional subsequent-year price of each domain, I would use a regular expression with named groups; see (?P<name>...) in the documentation. A very tricky technicality is that the regular expression must be correctly formulated to capture "then $…" optionally.
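As a small illustration of both techniques, the sketch below runs them on a made-up two-domain sample. The sample text and the simplified pattern are assumptions for demonstration only; your real input probably has more lines per stanza and needs a more permissive pattern.

```python
import re

# Hypothetical sample; the real input format may differ.
sample = (
    ".COM\n"
    "$9.99 then $14.99/year\n"
    ".NET\n"
    "$11.99\n"
)

# The lookahead (?=\.) checks for the dot without consuming it,
# so each stanza still starts with its ".TLD" name.
stanzas = re.split(r'\n(?=\.)', sample)

# Simplified pattern: the whole "then $..." part sits in an optional
# non-capturing group, so the 'then' group is None when it is absent.
pattern = re.compile(
    r'(?P<name>\.[A-Z]+)\n\$(?P<price>[\d.]+)(?: then \$(?P<then>[\d.]+))?'
)
parsed = [pattern.match(stanza).groupdict() for stanza in stanzas]

for record in parsed:
    print(record)
# {'name': '.COM', 'price': '9.99', 'then': '14.99'}
# {'name': '.NET', 'price': '11.99', 'then': None}
```

Because each stanza is matched on its own, a "then" price can never be attached to the wrong domain.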
Input / output
The open() calls should ideally be placed next to each other, to make it easy to see what the input and output filenames are.
To write tab-delimited output, use the csv module.
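For instance, a minimal sketch of tab-delimited output with csv.DictWriter could look like this. The rows are invented, and an in-memory buffer stands in for the output file so that the result is easy to inspect.

```python
import csv
import io

# Hypothetical rows, shaped like the dictionaries a named-group match returns.
rows = [
    {'name': '.COM', 'price': '9.99', 'then': '14.99'},
    {'name': '.NET', 'price': '11.99', 'then': None},
]

buffer = io.StringIO()  # stands in for a file opened for writing
writer = csv.DictWriter(
    buffer, fieldnames=['name', 'price', 'then'], dialect=csv.excel_tab
)
writer.writeheader()
writer.writerows(rows)  # None is written as an empty field

text = buffer.getvalue()
print(text)
```

When writing to a real file, the csv documentation recommends opening it with newline='' so that the writer's own line endings are not translated a second time.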
Suggested solution
import csv
from operator import itemgetter
import re

REGEX = re.compile(
    r'(?P<name>\.[A-Z]+).*^\$(?P<price>[\d.]+)(?:.*then \$(?P<then>[\d.]+))?',
    flags=re.MULTILINE | re.DOTALL
)

def parse_and_sort_domains(input, output, sort_attr='price'):
    domains = [
        REGEX.match(description).groupdict()
        for description in re.split(r'\n(?=\.)', input.read())
    ]
    writer = csv.DictWriter(
        output, domains[0].keys(), dialect=csv.excel_tab, extrasaction='ignore'
    )
    writer.writerows(sorted(domains, key=itemgetter(sort_attr)))

with open('domain_prices.txt') as input, \
     open('sorted_domain_prices.txt', 'w') as output:
    parse_and_sort_domains(input, output)
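One caveat worth checking against your data: groupdict() returns the captured prices as strings, so sorting by 'price' compares them as text. If the prices have differing numbers of digits, a numeric key may be what you want. The sketch below uses invented prices to show the difference; the float() conversion is a suggested variation, not part of the solution above.

```python
from operator import itemgetter

# Hypothetical parsed records with string prices.
domains = [
    {'name': '.COM', 'price': '9.99'},
    {'name': '.IO', 'price': '32.00'},
    {'name': '.XYZ', 'price': '100.00'},
]

# Text comparison goes character by character: '1' < '3' < '9'.
by_text = [d['name'] for d in sorted(domains, key=itemgetter('price'))]

# Converting to float in the key gives the numeric order.
by_number = [
    d['name'] for d in sorted(domains, key=lambda d: float(d['price']))
]

print(by_text)    # ['.XYZ', '.IO', '.COM']
print(by_number)  # ['.COM', '.IO', '.XYZ']
```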