deleted 3 characters in body

edited May 29, 2018 at 20:49

Bug

Your second script does not handle subsequent-year prices correctly. It tries to extract a "then $…" price for every line of the input, instead of once per domain.

This bug illustrates two weaknesses in the design of both programs, which together make them vulnerable to such mix-ups:

The parsing works line by line. A more robust approach would be to immediately split the blob of input text into one stanza per domain, then process each stanza independently.
The domains, prices, and prices_after_first_year are separate data structures. You hope that the corresponding entries in each list refer to the same domain, and you blindly zip() them together, but there isn't anything about the code that guarantees that assumption. If, for example, one of the domains is priced in € instead of $, then your data would be inconsistent, and you would never find out, because the code doesn't crash.
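
As a minimal sketch of the second point, each domain could be kept in a single record, so that a name can never end up paired with another domain's price. The `DomainPrice` type, its field names, and the sample values are hypothetical, not taken from your code.

```python
from typing import NamedTuple, Optional


class DomainPrice(NamedTuple):
    """One record per domain (hypothetical type; field names are assumptions)."""
    name: str
    price: str
    price_after_first_year: Optional[str] = None


# Each record is built in one step, so its fields cannot drift apart the way
# entries in three independently appended lists can.
records = [
    DomainPrice('.com', '$9.99', '$14.99'),
    DomainPrice('.org', '$12.99'),  # no subsequent-year price
]

for record in records:
    print(record.name, record.price, record.price_after_first_year)
```

With this layout there is nothing to zip(): a missing subsequent-year price shows up as None on the record it belongs to, instead of shifting every later entry by one.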

Parsing

To extract information from text that fits a certain pattern, use regular expressions. Learning them takes some investment of time (they are a concept and language independent of Python), but they really are the right tool for these kinds of tasks.

For this problem, I would use regular expressions in two ways:

To split the input into one stanza per domain, use re.split(). Here, I would look for a newline character followed by a dot. One tricky aspect is to put the dot inside a lookahead assertion (?=\.), so that only the newline is considered to be the delimiter to be discarded, and the dot is retained as the first character of each stanza.
To extract the name, price, and optional subsequent-year price of each domain, I would use a regular expression with named groups — see (?P<name>...) in the documentation. A very tricky technicality is that the regular expression must be correctly formulated to capture "then $…" optionally.
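
The two steps above might be sketched as follows. The sample text is an assumption about what the input looks like (a domain suffix, a "$" price, and sometimes a "then $…" line); the pattern and the group names `name`, `price` and `renewal` are illustrative and would need adjusting to the real data.

```python
import re

# Hypothetical sample input; the real format may differ.
TEXT = """.com
$9.99
then $14.99
.org
$12.99
.io
$39.00
then $59.00"""

STANZA_RE = re.compile(
    r'(?P<name>\.\w+)'                        # domain suffix, e.g. ".com"
    r'\s+\$(?P<price>[\d.]+)'                 # first-year price
    r'(?:\s+then\s+\$(?P<renewal>[\d.]+))?'   # optional "then $…" price
)


def parse(text):
    """Yield one dict per stanza; raise if a stanza does not fit the pattern."""
    # The lookahead keeps the dot: only the newline before it is discarded.
    for stanza in re.split(r'\n(?=\.)', text.strip()):
        match = STANZA_RE.match(stanza)
        if match is None:
            raise ValueError('Unrecognized stanza: {!r}'.format(stanza))
        yield match.groupdict()


for row in parse(TEXT):
    print(row)
```

Because the whole "then $…" part sits inside one optional non-capturing group, `renewal` is simply None for a domain without a subsequent-year price. A stanza priced in € would fail to match and raise an error rather than silently producing misaligned data.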

Input / output

The open() calls should ideally be placed next to each other, to make it easy to see what the input and output filenames are.

To write tab-delimited output, use the csv module.
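
A rough sketch of both suggestions, assuming the parsed data is a sequence of dicts with the hypothetical keys `name`, `price` and `renewal`. The `parse` argument stands in for whatever function turns the input text into those rows, and the filenames are supplied by the caller.

```python
import csv

FIELDS = ['name', 'price', 'renewal']  # assumed column names


def write_tsv(rows, out_file):
    """Write dict rows to an open file as tab-delimited text with a header."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, delimiter='\t')
    writer.writeheader()
    writer.writerows(rows)  # a None value is written as an empty field


def convert(in_path, out_path, parse):
    """Read in_path, parse it with the given callable, write TSV to out_path."""
    # Both open() calls sit side by side, so the filenames are easy to spot.
    # newline='' is what the csv module recommends for files it writes to.
    with open(in_path) as in_file, open(out_path, 'w', newline='') as out_file:
        write_tsv(parse(in_file.read()), out_file)
```

The csv module takes care of the delimiters and line endings, so there is no need to join fields with '\t' by hand.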
