OK, my coding is very rusty so I've been borrowing and adapting from tutorials.

I started playing around with BeautifulSoup, opening a file with:

import bs4

with open('event.html', encoding='utf8') as f:
    soup = bs4.BeautifulSoup(f, "lxml")

Later, I needed to find a string in the same file, and BeautifulSoup seemed more complicated for that, so I did:

lines = f.readlines()

And put it together with the previous instructions:

with open('event.html', encoding='utf8') as f:
    soup = bs4.BeautifulSoup(f, "lxml")
    lines = f.readlines()

What puzzles me is that if I swap the last two lines, so the block looks like this:

with open('event.html', encoding='utf8') as f:
    lines = f.readlines()
    soup = bs4.BeautifulSoup(f, "lxml")

then the rest of my code breaks. Why is that?

  • the first one works Commented May 15, 2017 at 13:55
  • Because .readlines() advances the file pointer to the end of the file, so when BS tries to read, the pointer is already at the end. Commented May 15, 2017 at 13:57
  • So, should I use a different/better method to extract the lines? Commented May 15, 2017 at 14:00
  • You can reset the pointer to the start of the file as per user3381590's answer, or see stackoverflow.com/questions/10201008/… Commented May 15, 2017 at 14:03
  • Order is unimportant for me, but I was banging my head wondering why the code wasn't working, and was even more confused when I figured out that re-ordering that portion "fixed" it... if anyone has a suggestion I'll take it. Commented May 15, 2017 at 14:04

1 Answer

readlines() leaves the internal file pointer at the end of the file. I haven't used BeautifulSoup myself, but I assume it expects the file to be positioned at the start (offset 0) when it begins reading. Seeking back to the beginning with f.seek(0) should fix that.

with open('event.html', encoding='utf8') as f:
    lines = f.readlines()
    # readlines() left the pointer at the end of the file, so rewind first
    f.seek(0)
    soup = bs4.BeautifulSoup(f, "lxml")

My guess is that BeautifulSoup reads the file and then sets the file pointer back to where it was once it has finished, which would explain why the other order works. I have not verified this.
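For anyone who would rather not juggle the file pointer at all, here is a minimal sketch (not from the original answer) that reads the file once into a string and derives both results from it. The helper name read_text_and_lines and the sample file contents are made up for illustration. BeautifulSoup accepts a string as well as a file handle, so the returned text can be passed to it directly.

```python
import os
import tempfile


def read_text_and_lines(path):
    """Read the file once and return (full_text, list_of_lines)."""
    with open(path, encoding="utf8") as f:
        text = f.read()
    # keepends=True keeps the trailing newline on each line, like readlines().
    # Note: splitlines() also breaks on a few extra separators (e.g. form feed).
    return text, text.splitlines(keepends=True)


# Hypothetical stand-in for event.html so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf8") as tmp:
    tmp.write("<html>\n<body>hello</body>\n</html>\n")
    path = tmp.name

text, lines = read_text_and_lines(path)
os.remove(path)

# With bs4 installed, the same string could be parsed with:
#     soup = bs4.BeautifulSoup(text, "lxml")
print(len(lines))
```

Because both values come from the same string, the order in which they are used no longer matters.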

2 Comments

From my tests, I believe neither BeautifulSoup nor readlines() sets the pointer back. If readlines() runs first, BS gets nothing and crashes the script; if BS runs first, readlines() simply returns an empty list and the script moves on. Your f.seek(0) fixes this. Thanks!
If BS does not set the pointer back, then lines should be an empty list when f.readlines is called.
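
The behaviour discussed in these comments can be checked with the standard library alone. In this sketch a plain f.read() stands in for BeautifulSoup consuming the handle (an assumption made so the example runs without bs4), and the temporary file and its contents are invented for the demo.

```python
import os
import tempfile

# Hypothetical stand-in for event.html.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf8") as tmp:
    tmp.write("<html>\n<body>hello</body>\n</html>\n")
    path = tmp.name

with open(path, encoding="utf8") as f:
    first = f.read()             # consumes everything; pointer is now at EOF
    after_read = f.readlines()   # nothing left to read
    f.seek(0)                    # rewind to the start of the file
    after_seek = f.readlines()   # the full contents are available again

os.remove(path)

print(after_read)       # []
print(len(after_seek))  # 3
```

The second read returns an empty list rather than raising an error, which is why the original ordering appeared to work even though lines ended up empty.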
