I'm wondering how to scrape this site: https://1997-2001.state.gov/briefings/statements/2000/2000_index.html

It only contains 'a' tags with 'href' attributes, no classes or IDs; it's very simply structured. I would like to run a script that scrapes the content behind every link on the page.

I've tried this code using chromedriver, but it only prints a list of the links (I'm quite the amateur at web scraping). Any help would be great.

    >>> elems = driver.find_elements_by_xpath("//a[@href]")
    >>> for elem in elems:
    ...     print(elem.get_attribute("href"))

2 Answers

I hope I understood your question correctly: this script goes through each link, opens it, and prints the document it contains:

import requests
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Each statement link on the index page is an <a> immediately preceded
# by a bullet <img> inside the main td[width="580"] cell.
for a in soup.select('td[width="580"] img + a'):
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    # Fetch the linked page, grab the text of its main cell, and cut
    # everything after the "[end of document]" marker.
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    t = s.select_one('td[width="580"]').get_text(strip=True, separator='\n')
    print(t.split('[end of document]')[0])
    print('-' * 80)

Prints:

https://1997-2001.state.gov/briefings/statements/2000/ps001227.html
Statement by Philip T. Reeker, Deputy Spokesman
December 27, 2000
China - LUOYANG Fire
We were saddened to learn of the terrible fire that killed hundreds of people in the Chinese city of Luoyang.  The United States offers its sincerest condolences to the families of the victims of the tragic December 25 blaze.  We also offer our best wishes for a speedy recovery to the survivors.

--------------------------------------------------------------------------------
https://1997-2001.state.gov/briefings/statements/2000/ps001226.html
Media Note
December 26, 2000
Renewal of the Secretary of State's Advisory Committee
on
Private International Law
The Department of State has renewed the Charter of the Secretary of State's Advisory Committee on Private International Law (ACPIL), effective as of November 20, 2000.   The Under Secretary for Management has determined that ACPIL is necessary and in the public interest.
ACPIL enables the Department to obtain the expert and considered view of the private sector organizations and interests most knowledgeable of, as well as most affected by, international activities to unify private law.  The committee consists of members from private sector organizations, bar associations, national legal organizations, and federal and state government agency and judicial interests concerned with private international law.  ACPIL will follow the procedures prescribed by the Federal Advisory Committee Act (FACA) (Public Law 92-463).  Meetings will be open to the public unless a determination is made in accordance with Section 10(d) of the FACA, 5 U.S.C. 552b(c)(1) and (4), that a meeting or a portion of the meeting should be closed to the public.
Any questions concerning this committee should be referred to the Executive Director, Harold Burman, at 202-776-8420.

--------------------------------------------------------------------------------
https://1997-2001.state.gov/briefings/statements/2000/ps001225.html
Statement by Philip T. Reeker, Deputy Spokesman
December 25, 2000
Parliamentary Elections in Serbia
The United States congratulates the Democratic Opposition of Serbia on their victory in Saturday's election for the Serbia parliament. Official results indicate that the United Democratic Opposition (DOS) won with 64 percent of the vote to just 13 percent for the Socialist Party.
We also congratulate the Serbian people for their widespread participation in what international observers have stated was a free and fair election.  This is the first time the Serbian people have had a free and fair election in over a decade. As such, it is an important milestone in the ongoing democratic transition that began with Milosevic's defeat in September's federal presidential elections. The Democratic Opposition is now in a stronger position to carry out the reforms needed to fully integrate Serbia into the international community.
We look forward to working with the new Serbian government in the same amicable and cooperative spirit we now enjoy with the federal Yugoslav government.

--------------------------------------------------------------------------------

...and so on.
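As an aside, the `img + a` part of the selector is a CSS adjacent-sibling combinator: it matches only `<a>` elements that directly follow an `<img>` element. A minimal offline sketch, using a made-up HTML snippet as a stand-in for the real index page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the index page's structure: statement
# links are prefixed by a bullet <img>, other links are not.
html = '''
<td width="580">
  <img src="bullet.gif"><a href="ps001227.html">China - Luoyang Fire</a><br>
  <img src="bullet.gif"><a href="ps001226.html">Media Note</a><br>
  <a href="2000_index.html">Index (no leading img, so not matched)</a>
</td>
'''
soup = BeautifulSoup(html, 'html.parser')

# `img + a` selects an <a> whose immediately preceding element sibling
# is an <img>; text nodes between them are ignored by CSS combinators.
links = [a['href'] for a in soup.select('td[width="580"] img + a')]
print(links)
```

Only the two bullet-prefixed links are collected; the plain index link is skipped.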

EDIT: Corrected code:

import requests
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td[width="580"] img + a'):
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    # Some pages put the document text in a different container, so try
    # several selectors; select_one() returns the first match.
    t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"])').get_text(strip=True, separator='\n')
    print(t.split('[end of document]')[0])
    print('-' * 80)

EDIT 2 (For 1999):

import requests
from bs4 import BeautifulSoup


url = 'https://1997-2001.state.gov/briefings/statements/1999/1999_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('td[width="580"] img + a'):
    # Some hrefs on this index are already absolute; only prefix the
    # relative ones with the base path.
    if 'http' not in a['href']:
        u = 'https://1997-2001.state.gov/briefings/statements/1999/' + a['href']
    else:
        u = a['href']

    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    tag = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"]), blockquote')
    # Skip pages where none of the known containers is present, to avoid
    # calling get_text() on None.
    if tag:
        t = tag.get_text(strip=True, separator='\n')
        print(t.split('[end of document]')[0])
    print('-' * 80)
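A side note on the relative/absolute check: `urllib.parse.urljoin` from the standard library handles both cases in one call, which may be a more robust replacement for the `'http' not in a['href']` test. The filenames below are hypothetical, just to illustrate the behavior:

```python
from urllib.parse import urljoin

# Resolve hrefs against the index page's own URL.
base = 'https://1997-2001.state.gov/briefings/statements/1999/1999_index.html'

# A relative href is resolved against the base URL's directory...
print(urljoin(base, 'ps991231.html'))
# ...while an absolute href passes through unchanged.
print(urljoin(base, 'https://example.org/other.html'))
```

This also correctly handles edge cases such as hrefs beginning with `./` or `../`, which a substring check would not.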

4 Comments

Thanks, this was very useful! One problem: the code stopped running after the mid-November links rather than gathering every link on the page from January to December. Is there any way to include all of the links' contents?
Sorry for my confusion, but did you modify the code to work for this specific page? I ask because I was hoping to apply the code to similarly structured pages, and when I try it on other links on the page (for example, '1997-2001.state.gov/briefings/statements/1999/…) the code does not work.
@Sarah "The code does not work" Why? What's the error?
"AttributeError: 'NoneType' object has no attribute 'get_text'" for the link I commented; otherwise it simply would not run for the 1998 or 1997 links.
I am not an expert at web scraping, but you can use BeautifulSoup:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
sauce = urllib.request.urlopen(url).read()
soup = BeautifulSoup(sauce, 'lxml')

for a in soup.find_all('a'):
    try:
        print(a['href'])
    except KeyError:
        # Anchors without an href attribute (e.g. <a name="...">).
        print("a element doesn't have href")
