Sunday

How to scrape element with BeautifulSoup out of a table?

7:18 AM beautifulsoup, html, python, python-requests, web-scraping

Issue

I am trying to extract the content on the right-hand side of this page:

https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=idn%3D1173921214

Looking at the HTML, the information is stored in a table.

With my code snippet, I can't reach the text I want:

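import requests
from bs4 import BeautifulSoup

def getDescriptionDNB():
    # Search the DNB catalogue by ISBN
    description = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
    response = requests.get(description)
    soupedDescription = BeautifulSoup(response.content, "html.parser")
    text = soupedDescription.find(class_="amount").text
    # Only continue when the search returned exactly one hit
    if text == "Treffer 1 von 1":
        autor = soupedDescription.find_all("tr")
        for i in autor:
            # Prints the first <td> following each <tr>, not the inner value cell
            test = i.find_next("td").text
            print(test)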

The problem is that I don't know how to get down to the inner <td> tag that holds the information I want.

Do you know how I can solve this problem?

Solution

The main issue is that the page's HTML is broken: some <tr> elements have no <td> and no closing tag.

Try to select your elements more specifically, or store the information in a dict and pick values by key.

Create a dict with CSS selectors:

...
dict(
    row.get_text(':',strip=True).split(':',1) 
    for row in soup.select('tr:has(td:not([colspan]))')
)
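The selector approach above can be tried offline with a self-contained sketch. The HTML below is a hypothetical, simplified stand-in for the result table (not the real page markup), and table_to_dict is an illustrative helper name, not something from the answer. It also skips rows that do not split into exactly a key and a value, which the one-liner above would fail on.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the DNB result table.
SAMPLE_HTML = """
<table>
  <tr><td colspan="2">Treffer 1 von 1</td></tr>
  <tr><td><strong>Titel</strong></td><td>Learning English</td></tr>
  <tr><td><strong>Verlag</strong></td><td>Stuttgart : Klett</td></tr>
</table>
"""

def table_to_dict(html):
    """Map the first cell of each row to the remaining text of that row."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # Keep only rows containing at least one <td> without a colspan attribute.
    for row in soup.select("tr:has(td:not([colspan]))"):
        # Join all text fragments with ':' and split once into key and value.
        parts = row.get_text(":", strip=True).split(":", 1)
        if len(parts) == 2:  # ignore rows with a single cell
            result[parts[0]] = parts[1]
    return result

details = table_to_dict(SAMPLE_HTML)
print(details)
```

Splitting only once keeps colons that belong to the value, such as the one in 'Stuttgart : Klett'.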

Create a dict with pandas.read_html():

import pandas as pd
url = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
pd.read_html(url)[0].dropna().set_index(0)[1].to_dict()
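The same chain can be checked without network access. This is a minimal sketch under assumptions: the HTML is a hypothetical stand-in for the real table, and pandas needs an HTML parser installed (lxml, or bs4 together with html5lib). The markup is wrapped in StringIO because recent pandas versions expect a file-like object rather than a literal HTML string.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real table; the first row has only one cell.
SAMPLE_HTML = """
<table>
  <tr><td>Treffer 1 von 1</td></tr>
  <tr><td>Titel</td><td>Learning English</td></tr>
  <tr><td>Verlag</td><td>Stuttgart : Klett</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
table = pd.read_html(StringIO(SAMPLE_HTML))[0]

# Rows with a missing second cell contain NaN and are dropped;
# column 0 becomes the index and column 1 supplies the values.
details = table.dropna().set_index(0)[1].to_dict()
print(details)
```

Because the table has no header cells, the columns are labelled 0 and 1, which is why the chain indexes them by number.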

Output

Based on the URL in your snippet.

{'Link zu diesem Datensatz': 'https://d-nb.info/94985462X',
 'Titel': 'Learning English - Password red:Teil: Reformierte Rechtschreibung / 3. / [Hauptw.].',
 'Ausgabe': '1. Aufl., 1. Dr.',
 'Verlag': 'Stuttgart ; Düsseldorf ; Leipzig : Klett',
 'Zeitliche Einordnung': 'Erscheinungsdatum: 1997',
 'Umfang/Format': '172 S. ; 25 cm',
 'ISBN/Einband/Preis': '978-3-12-546630-2 Pp. : DM 29.60:3-12-546630-X Pp. : DM 29.60:3-12-54663-0 (falsch) Pp. : DM 29.60',
 'Sprache(n)': 'Englisch (eng), Deutsch (ger)',
 'Frankfurt': 'Signatur: 1997 A 10551:Bereitstellung  in Frankfurt',
 'Leipzig': 'Signatur: 1997 A 10551:Bereitstellung  in Leipzig'}
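Once the record is a dict, single fields can be picked by key. The sketch below uses two entries copied from the output above. The split_joined helper is hypothetical and rests on an assumption: colons with no whitespace on either side are the ':' separators inserted by get_text(':'), while ' : ' and ': ' belong to the original text.

```python
import re

# Two entries copied from the output above.
record = {
    "Titel": "Learning English - Password red:Teil: Reformierte Rechtschreibung / 3. / [Hauptw.].",
    "ISBN/Einband/Preis": "978-3-12-546630-2 Pp. : DM 29.60:3-12-546630-X Pp. : DM 29.60:3-12-54663-0 (falsch) Pp. : DM 29.60",
}

def split_joined(value):
    """Split on colons that have a non-space character on both sides.

    Assumption: such bare colons come from get_text(':'), not from the page text.
    """
    return re.split(r"(?<=\S):(?=\S)", value)

title = record.get("Titel")                     # pick a field by key
isbns = split_joined(record["ISBN/Einband/Preis"])
missing = record.get("Autor", "n/a")            # default for absent keys

for isbn in isbns:
    print(isbn)
```

Using .get with a default avoids a KeyError for records that lack a field.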

Answered By - HedgeHog

This answer, collected from Stack Overflow and tested by AngularFix community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
