Scrape and parse dictionary website and get meaning of each word in a text file

Question

In this program, I automate a browser with Selenium to find the English meaning of each Sanskrit word on a dictionary website.

The function eng_meaning accepts a word, scrapes the website using Selenium and BeautifulSoup, and finds its English meaning.

def eng_meaning(word):
    """
    returns english meaning of the sanskrit word passed
    """
    url = 'https://www.learnsanskrit.cc/index.php?mode=3&direct=au&script=hk&tran_input='
    driver = webdriver.Chrome("/usr/bin/chromedriver")
    driver.get(url)
    meanings = []

    try:
        #find text box where we enter the word
        text_box = driver.find_element_by_id("tran_input")

        #enter word in the text box
        text_box.send_keys(word)

        #clicking translate button
        buttons = driver.find_elements_by_tag_name("button")
        buttons[-1].click()

        html_page = driver.page_source
        #parse html page
        soup = BeautifulSoup(html_page, 'html.parser')
        #find table which holds meaning
        table = soup.body.find('table', attrs={'class':'table0 bgcolor0'})
        table_body = table.find('tbody')

        #find rows of table which holds meaning
        rows = table_body.find_all('tr')
        process = True #will become false if word/sentence other then passed word is on table
        row_itr = 0

        for row in rows:
            #find coulmns in the current row
            columns = row.find_all('td')
            i = 0

            for col in columns:

                #split, join and strip the current snaskrit or english word and store in meaning
                meaning =  " ".join(col.get_text().split())
                meaning = meaning.strip()

                if (i == 0):
                    tmp = meaning
                    if (row_itr == 0):
                        #add sanskrit word in the list. Only done once
                        sanskrit_word = meaning
                elif (i == 2):
                    if (row_itr == 0):
                        #add english transliteration in list. Only done once
                        english_transliteration = meaning

                    if (word != meaning and word != tmp):
                        #current word is process doesn't match with passed word. So stop processing
                        process = False
                        break
                elif (i == 3):
                    #append sanskrit word and english transliteration in list. Done only once
                    if (row_itr == 0):
                        meanings.append(sanskrit_word)
                        meanings.append(english_transliteration)
                    #append the english meaning to list
                    meanings.append(meaning)
                i+=1
            if (process is False):
                #stop processing
                break
            row_itr += 1
        return meanings
    except(Exception, AttributeError) as e:
        print("No meaning found for", word)

    finally:
        if len(meanings) != 0:
            print("meanings found")
        driver.quit()

The program reading_writing_to_file.py reads a Sanskrit text file, removes duplicate words, and stores the meaning of each word in another text file.

import sanskrit_to_english as se

def remove_duplicates(file_name):
    """
    Remove duplicate word from the file
    """
    unique_words = set()
    with open(file_name, 'r') as file:
        for line in file:
            for word in line.split():
                unique_words.add(word)
        file.close()
    return unique_words


def eng_translation(file_name_r, file_name_w):
    """
    Read sanskrit text file  and find english meaning of each word
    """
    words = list(remove_duplicates(file_name_r))

    with open(file_name_w, 'w') as file:
        for word in words:
            meanings = se.eng_meaning(word)
            if (meanings):
                file.write(meanings[0])
                file.write(" - ")
                file.write(meanings[1])
                file.write(" - ")
                meanings_str = ' \\'.join([meanings[i] for i in range(2, len(meanings))])
                file.write(meanings_str)
                file.write("\n")
    file.close()
    

file_name_r = "data.txt"
file_name_w = "Sanskrit_english.txt"
eng_translation(file_name_r, file_name_w)

I am a beginner in Selenium. How can I improve the run-time performance of this code? It takes approx. 5 minutes to find the meaning of a word.

Sanskrit dictionary website - learnsanskrit.cc

Kate · Accepted Answer · 2021-03-24 20:54:16Z

The most obvious bottleneck in your code is that you are calling the function eng_meaning in a loop, but that function creates a new Selenium instance every time, which is an expensive operation. Imagine searching in Google and systematically closing and reopening your browser after every search. A waste of time, isn't it?

What you should do is restructure your code to start Selenium once at the beginning of your program. Then all you have to do is change the URL, submit new parameters, etc., while keeping the existing instance.
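A minimal sketch of that restructuring, not a tested rewrite of the original program. The names translate_all, main and lookup are hypothetical: lookup(driver, word) stands for the per-word scraping logic currently inside eng_meaning, changed to accept an existing driver instead of creating its own.

```python
def translate_all(words, driver, lookup):
    """Look up every word with one shared browser session.

    `driver` is an already-started Selenium WebDriver; `lookup` is a
    hypothetical callable (driver, word) -> list of meanings that holds
    the per-word scraping logic.
    """
    results = {}
    for word in words:
        try:
            results[word] = lookup(driver, word)
        except AttributeError:
            # The result table was not found for this word: record that
            # and keep the browser alive for the next word.
            results[word] = []
    return results


def main(words, lookup):
    # Imported here so translate_all can be exercised without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()  # started once, not once per word
    try:
        return translate_all(words, driver, lookup)
    finally:
        driver.quit()  # closed once, after the last word
```

With this shape the cost of launching Chrome is paid a single time, however many words the input file contains.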

This should speed up things markedly. I have not run your program, nor have I done any benchmarking. I recommend that you use the timeit module or similar to time the performance of your code, and figure out the sections where there are delays. Or at least add a few prints here and there in your code, showing the current execution step and a timestamp.
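One possible way to get those timestamps is a small context manager built on time.perf_counter rather than timeit. The helper name timed and the labels are illustrative only, and the sleep call merely stands in for a slow step such as driver.get(url).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log=print):
    """Report how long the enclosed block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log(f"{label}: {elapsed:.3f} s")


# Usage: wrap each stage you suspect of being slow.
with timed("simulated page load"):
    time.sleep(0.05)  # stand-in for driver.get(url)
```

Wrapping the driver start-up, the page load and the parsing separately should show which stage accounts for most of the time.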

5 minutes per word is really a lot, but it is also possible that the website is applying some form of rate limiting against heavy users or scrapers (that is you).

Misc remarks

Warning: you have some typos (misspellings).

The naming conventions for functions and variables leave something to be desired.

except(Exception, AttributeError) as e:

is redundant: AttributeError is a subclass of Exception, so Exception alone already covers it. In this context you can just catch AttributeError.

file.close() is not needed when you are using the context manager (with)

In eng_meaning you have a loop on table rows which is somewhat convoluted. Variable names like tmp are not very intuitive. In spite of the comments, it is not immediately clear why variable process exists and what you are really trying to do.

The way you are incrementing variables (row/column counters) is error-prone, especially with two nested loops with condition blocks. Instead of:

for row in rows:

you could have:

for row_counter, row in enumerate(rows, start=1):

then let Python take care of incrementation for you. It's possible that BS already has methods built-in to fetch row/column index, I'm not sure. But it's not something you should be handling manually.
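As a small illustration of enumerate on nested loops, here is a sketch in which plain lists stand in for the BeautifulSoup rows and cells. The cell values are placeholders, not real dictionary output.

```python
# Placeholder data: each inner list stands in for the <td> cells of one <tr>.
rows = [
    ["word", "grammar", "transliteration", "meaning 1"],
    ["word", "grammar", "transliteration", "meaning 2"],
]

cells = []
for row_index, row in enumerate(rows):
    for col_index, cell in enumerate(row):
        # Both counters are maintained by enumerate; nothing to increment by hand.
        cells.append((row_index, col_index, cell))
```

A break or continue inside either loop can no longer leave a counter out of step, because there is no manual increment to skip.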

RootTwo · Accepted Answer · 2021-03-25 23:38:41Z

In addition to @Anonymous' answer, here are a few more observations.

Selenium - finding elements

Getting the translate button based on it being the last button on the page is rather brittle:

    buttons = driver.find_elements_by_tag_name("button")
    buttons[-1].click()

If the web site rearranges things (something a designer could do), it would break the script. It is best to use an id or name, which, unfortunately, the web page you're scraping doesn't have. But Selenium can use CSS selectors too, like

    button = driver.find_element_by_css_selector("button[value='translate']")

Changing the button id, name, or value would require changing the backend code, which is a more significant change. So the theory is that they will change less often.

Loops

A loop that does something different each time through the loop is a bit of a code smell. It makes it difficult to understand what the code does and doesn't really save any lines of code. If the first row needs special treatment write code for the first row and then a loop for the remaining rows. Put repeated code in functions:

SANSKRIT_COL = 0
ENGLISH_COL  = 2

def clean_text(cell):
    # Collapse runs of whitespace in the cell's text into single spaces.
    return " ".join(cell.get_text().split())

def scrape_row(row):
    column = row.find_all('td')

    sanskrit = clean_text(column[SANSKRIT_COL])
    english = clean_text(column[ENGLISH_COL])

    return sanskrit, english

Then the nested loops can be replaced with something like this:

    rows = table_body.find_all('tr')

    sanskrit, english = scrape_row(rows[0])
    meanings = [sanskrit, english]

    for row in rows[1:]:
        sanskrit, english = scrape_row(row)
        
        if sanskrit != word:
            break

        meanings.append(english)
 
    return meanings

try - except - else - finally

It is almost never a good idea to include Exception in an except clause. Doubly so when you don't print or log the exception. It will catch nearly every exception, including ones caused by genuine bugs such as NameError or TypeError, and a bare except: would even catch KeyboardInterrupt (e.g., Ctrl-C). Use the most specific exception you can.

In a try-except-else-finally statement, the else clause is executed if there are no exceptions in the try clause. The finally clause gets executed last, regardless of whether there was an exception or not. So, in your code, if an exception occurs after the first row is processed, the code would print "No meaning found..." in the except clause and then print "meanings found" in the finally clause.
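The order of the clauses can be seen in a short sketch. The function collect and its inputs are made up for illustration: a None entry plays the role of a row that makes the scraping code fail partway through.

```python
def collect(rows):
    """Return (meanings, events); events records which clauses ran."""
    meanings, events = [], []
    try:
        for row in rows:
            meanings.append(row.strip())  # AttributeError if row is None
    except AttributeError as exc:
        events.append(f"except: {exc.__class__.__name__}")
    else:
        events.append("else")  # only when the try block raised nothing
    finally:
        events.append("finally")  # always runs, success or failure
    return meanings, events


ok = collect([" a ", " b "])       # no exception raised
partial = collect([" a ", None])   # fails on the second row
```

In the second call meanings is non-empty even though the except clause ran, which is why a success message belongs in else rather than in finally.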

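`set.update()`

set.update() can take an iterable, so the loop in remove_duplicates() can be:

for line in file:
    unique_words.update(line.split())

Note that set.union() would not work here: it returns a new set and leaves unique_words unchanged.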
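The difference between the two set methods can be checked in a few lines. In this sketch the function takes any iterable of lines (an open file works too), and the sample words are placeholders.

```python
def unique_words(lines):
    """Collect the distinct whitespace-separated words from an iterable of lines."""
    words = set()
    for line in lines:
        words.update(line.split())  # mutates `words` in place
    return words


seen = set()
result = seen.union(["a", "b"])  # builds a new set; `seen` itself is unchanged
```

After this runs, seen is still empty while result holds both words, so a loop that calls union() and discards the return value collects nothing.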

`with open(...) as file`

When using open in a with statement (i.e., as a context manager), the file is automatically closed at the end of the block, so file.close() isn't necessary.
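This is easy to confirm with the file object's closed attribute. The sketch below writes to a throwaway temporary file; the path and contents are placeholders.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "words.txt")  # throwaway file for the demo

with open(path, "w", encoding="utf-8") as file:
    file.write("sample line\n")
    open_inside = not file.closed  # the file is still open inside the block

closed_after = file.closed  # the with statement has already closed it
```

The same holds when the block exits through an exception, which is the main reason to prefer with over a manual close().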
