Issue
I am trying to extract the content on the right-hand side of this page:
https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=idn%3D1173921214
Looking at the HTML, the information is stored in a table.
With my code snippet, I can't reach the text I want:
import requests
from bs4 import BeautifulSoup

def getDescriptionDNB():
    description = f'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
    response = requests.get(description)
    soupedDescription = BeautifulSoup(response.content, "html.parser")
    text = soupedDescription.find(class_="amount").text
    if text == "Treffer 1 von 1":
        autor = soupedDescription.find_all("tr")
        for i in autor:
            test = i.findNext("td").text
            print(test)
The problem is that I don't know how to get down to the inner <td> tag to get the information I want.
Do you know how I can solve this problem?
Solution
The main issue is that the page's HTML is broken: some <tr> elements have no <td> and no closing tag.
Try to select your elements more specifically, or store the information in a dict and pick values by key.
Create a dict with CSS selectors:
...
dict(
    row.get_text(':', strip=True).split(':', 1)
    for row in soup.select('tr:has(td:not([colspan]))')
)
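To try the selector approach without hitting the live page, here is a minimal, self-contained sketch. The HTML string is a made-up stand-in for the real page (one full-width colspan row and one unclosed <tr> without a <td>), and the name record is only for illustration; the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the problem described above: a colspan
# header row and a stray <tr> that has no <td> and is never closed.
html = """
<table>
  <tr><td colspan="2"><strong>Header row</strong></td></tr>
  <tr>
  <tr><td><strong>Titel</strong></td><td>Learning English</td></tr>
  <tr><td><strong>Verlag</strong></td><td>Stuttgart : Klett</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only rows that contain a <td> without a colspan attribute, join the
# text nodes of each row with ':' and split once into (key, value).
record = dict(
    row.get_text(":", strip=True).split(":", 1)
    for row in soup.select("tr:has(td:not([colspan]))")
)

print(record["Titel"])   # Learning English
print(record["Verlag"])  # Stuttgart : Klett
```

Splitting with a maximum of one split keeps any further colons inside the value, which is why 'Stuttgart : Klett' survives intact. Note that a selected row with only a single text node would make dict() raise a ValueError, so real pages may need an extra filter.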
Create a dict with pandas.read_html():
import pandas as pd
url = f'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
pd.read_html(url)[0].dropna().set_index(0)[1].to_dict()
Output
Based on the URL in your snippet:
{'Link zu diesem Datensatz': 'https://d-nb.info/94985462X',
'Titel': 'Learning English - Password red:Teil: Reformierte Rechtschreibung / 3. / [Hauptw.].',
'Ausgabe': '1. Aufl., 1. Dr.',
'Verlag': 'Stuttgart ; Düsseldorf ; Leipzig : Klett',
'Zeitliche Einordnung': 'Erscheinungsdatum: 1997',
'Umfang/Format': '172 S. ; 25 cm',
'ISBN/Einband/Preis': '978-3-12-546630-2 Pp. : DM 29.60:3-12-546630-X Pp. : DM 29.60:3-12-54663-0 (falsch) Pp. : DM 29.60',
'Sprache(n)': 'Englisch (eng), Deutsch (ger)',
'Frankfurt': 'Signatur: 1997 A 10551:Bereitstellung in Frankfurt',
'Leipzig': 'Signatur: 1997 A 10551:Bereitstellung in Leipzig'}
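Once the data is in a dict, picking by key is straightforward. The sketch below uses two entries copied from the output above. The key 'Schlagwörter' is a hypothetical field used only to show a missing-key fallback, and the regular expression assumes that the colons introduced by joining text nodes have no whitespace around them, while colons that belong to the data (as in 'Pp. : DM') do.

```python
import re

# Two entries copied from the output above.
record = {
    "Titel": "Learning English - Password red:Teil: Reformierte Rechtschreibung / 3. / [Hauptw.].",
    "ISBN/Einband/Preis": "978-3-12-546630-2 Pp. : DM 29.60:3-12-546630-X Pp. : DM 29.60:3-12-54663-0 (falsch) Pp. : DM 29.60",
}

# Pick a field by key; .get() avoids a KeyError when a field is absent.
title = record.get("Titel")
keywords = record.get("Schlagwörter", "n/a")  # hypothetical missing field

# Split the joined ISBN field only at colons with no surrounding whitespace
# (an assumption about how the joined text looks, see above).
isbn_entries = re.split(r"(?<=\S):(?=\S)", record["ISBN/Einband/Preis"])

for entry in isbn_entries:
    print(entry)
# 978-3-12-546630-2 Pp. : DM 29.60
# 3-12-546630-X Pp. : DM 29.60
# 3-12-54663-0 (falsch) Pp. : DM 29.60
```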
Answered By - HedgeHog