Function in Python to extract web Data

Question

I developed this feature that I think can be improved quite a bit. The result is the desired one, but it took me many lines of code. Any idea to optimize it?

import json
import urllib.request

dni = 71179047
    
def get_cuit(dni):
    request = urllib.request.Request('https://www.cuitonline.com/search.php?q=' + str(dni))
    response = urllib.request.urlopen(request)
    data_content = response.read()
    html = data_content.decode('ISO-8859-1')
    cuit = re.findall(r"\(?\b[0-9]{2}\)?-?[0-9]{7,8}-[0-9]{1}\b", html)[0]
    result = re.sub('[^0-9]','', cuit)
    return result

get_cuit(dni)

It's cliché, but: Do not parse HTML with regexes. Do not parse HTML with regexes. Do not parse HTML with regexes. — Reinderien
– Reinderien, Commented Dec 18, 2020 at 19:52

Reinderien · Accepted Answer · 2020-12-18 22:55:34Z

15

Use Requests for HTTP and let it deal with the encoding
Do not bake the query into the URL string
Use BeautifulSoup for HTML parsing
Use typehints

Nice and simple:

from requests import get
from bs4 import BeautifulSoup


def get_cuit(dni: int) -> str:
    with get(
        'https://www.cuitonline.com/search.php',
        params={'q': dni},
    ) as resp:
        resp.raise_for_status()
        doc = BeautifulSoup(resp.text, 'html.parser')

    span = doc.find('span', class_='cuit')
    return span.text


print(get_cuit(71_179_047))

answered Dec 18, 2020 at 22:55

Reinderien

71.1k5 gold badges76 silver badges256 bronze badges

\$\begingroup\$ As you said, 'nice and simple'. Thanks! ;) \$\endgroup\$

Raymont
– Raymont

2020-12-19 01:52:18 +00:00
Commented Dec 19, 2020 at 1:52
\$\begingroup\$ @Raymont BeautifulSoup really helps too. If you aren't familiar with it, I highly recommend checking it out to see what it can do. \$\endgroup\$

BruceWayne
– BruceWayne

2020-12-20 18:00:37 +00:00
Commented Dec 20, 2020 at 18:00
1

\$\begingroup\$ Why do Python devs tend to prefer from requests import get/get(foo) over import requests/requests.get(foo)? It just seems very short-sighted to me. Surely, reading requests.get(foo) is more informative/self-explanatory than just reading get(foo)? In the latter case, I'd have no idea where this get came from unless I scrolled all the way up to check all the import statements. Why not just keep namespaces separate? Ease of reading is vastly more important than ease of writing. \$\endgroup\$

Will
– Will

2020-12-21 01:08:53 +00:00
Commented Dec 21, 2020 at 1:08
1

\$\begingroup\$ @Will This is a 12-line program. The provenance of get is somewhat painfully obvious. In a much longer file, it would be worth considering the other import style. \$\endgroup\$

Reinderien
– Reinderien

2020-12-21 04:36:20 +00:00
Commented Dec 21, 2020 at 4:36
1

\$\begingroup\$ Reading your answer just happened to remind me that I don't see the merit of the from X import Y import style in any case, generally speaking. That said, I admit that it isn't a big deal per se in your specific example code (even if get seems overly generic to me), and that this wasn't the ideal place to express that notion. \$\endgroup\$

Will
– Will

2020-12-21 08:29:01 +00:00
Commented Dec 21, 2020 at 8:29

| Show 3 more comments

alecxe · Accepted Answer · 2020-12-19 02:47:08Z

Couple additional points to what @Reinderien mentioned:

if you would need to execute this function often, consider initializing a requests.Session() and re-using it for subsequent requests
you could improve the speed of HTML parsing here as well:
- use lxml instead of html.parser (lxml needs to be installed):
```
doc = BeautifulSoup(resp.text, 'lxml')
```
- since you are looking for a single element, SoupStrainer would really help out to allow your parser not waste time and resources parsing other parts of HTML:
```
from bs4 import BeautifulSoup, SoupStrainer

parse_only = SoupStrainer('span', class_='cuit')
doc = BeautifulSoup(resp.text, 'lxml', parse_only=parse_only)
```

CertainPerformance · Accepted Answer · 2020-12-19 01:52:25Z

If you did have to use a regular expression for this task (which you shouldn't in this case - see the other answer), you could improve it by:

[0-9] is equivalent to \d, a digit character
A number inside brackets indicates the number of times the previous token should be repeated. In the case of {1}, it's superfluous - you can leave out the brackets entirely and the token will be matched exactly once anyway.

\(?\b\d{2}\)?-?\d{7,8}-\d\b

Since you don't want all matches, but just the first match, .search could be a bit more appropriate than using .findall to find all matches followed by extracting the [0]th match.

Stack Exchange Network

Function in Python to extract web Data

3 Answers 3

You must log in to answer this question.

Hot Network Questions

Function in Python to extract web Data

3 Answers 3

You must log in to answer this question.

Related

Hot Network Questions