Issue
I'm trying to extract information from the following block of HTML code:
<div class="topicicons"><span title="This is a marketplace ad topic." class="icon icon-tag"></span></div>
<a data-nologvisit href="/pinball/forum/forum/games-for-sale" rel="7" class="subforum subforum-7" title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div><div rel="319235" data-vu="" class="topic topic-mb0 sf-7 has-new sfbox-1 topic-featured">
The fields I want to extract are the name (which in this example is "Pirates of the Caribbean (LE)"), the price ($25,000), location (Whiteland, IN), and last post (Last post 3 days ago). So far, I've used this line of code
soup.findAll(True, {'class': ['t', 'by']})
to get the following output:
FS: Pirates of the Caribbean (LE)$ 25,000 Whiteland, IN
By ARW55 (1 year ago) - Last post 3 days ago
However, I am lost on how to extract the information I want from these strings. There are hundreds of other similar entries, e.g.
FS: Teenage Mutant Ninja Turtles (Pro)$ 8,000 (OBO) Downers Grove, IL
By Thorn-in-pinball (3 days ago) - Last post 3 days ago
and I am not sure where to get started. I would appreciate any advice or guidance.
Thank you!
Solution
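Beautiful Soup makes it easy to pull individual pieces out of an element. Because these elements are nested, you can look at the contents of each element returned by find and read the text of each child to get the fields you want. Note that a child can be either a tag (such as the price span) or a bare string (such as the title), so the title is converted with str() rather than read through .text.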
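# Parent element: the topic link, which holds the title, price, and location
title_element = soup.find("a", class_="t")
# First child: the bare title text (a NavigableString, not a tag)
name_node = title_element.contents[0]
# Convert it to a plain string
name = str(name_node).strip()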
A full example, using the HTML from the question, would be:
import bs4
html = """<div class="topicicons"><span title="This is a marketplace ad topic." class="icon icon-tag"></span></div>
<a data-nologvisit href="/pinball/forum/forum/games-for-sale" rel="7" class="subforum subforum-7" title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div><div rel="319235" data-vu="" class="topic topic-mb0 sf-7 has-new sfbox-1 topic-featured">"""
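# Name the parser explicitly to avoid the "no parser specified" warning
soup = bs4.BeautifulSoup(html, "html.parser")
# Parent elements
title_element = soup.find("a", class_="t")
author_element = soup.find("span", class_="by")
# contents[0] is a bare string; the later children are <span> tags
name = str(title_element.contents[0]).strip()  # "FS: Pirates of the Caribbean (LE)"
price = title_element.contents[1].text.strip()  # "$ 25,000"
location = title_element.contents[2].text.strip()  # "Whiteland, IN"
author = str(author_element.contents[0]).strip()  # "By ARW55 (1 year ago)"
post_date = author_element.contents[1].text.strip()  # "- Last post 3 days ago"
by_text = author_element.text  # "By ARW55 (1 year ago) - Last post 3 days ago"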
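Since the question mentions hundreds of similar entries, here is a rough sketch of one way to loop over all of them. It is not part of the original answer. It looks children up by class name (tag-price, tag-loc, last) instead of by position, so an entry with a missing field yields None rather than shifting the other fields. The helper names span_text and parse_price are made up for illustration, and the second entry's markup is a guess, since the question only shows its text output.

```python
import re

import bs4

# The first entry is the markup from the question. The second is a hypothetical
# reconstruction of the Teenage Mutant Ninja Turtles listing (its real markup
# was not shown), assuming "(OBO)" sits inside the price span.
html = """
<div>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span class="tag tag-price">$ 25,000 </span><span class="tag tag-loc">Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div>
<div>
<a class="t" href="#">FS: Teenage Mutant Ninja Turtles (Pro)<span class="tag tag-price">$ 8,000 (OBO) </span><span class="tag tag-loc">Downers Grove, IL</span></a>
<span class="by">By Thorn-in-pinball (3 days ago)<span class="last"> - Last post 3 days ago</span></span>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")


def span_text(parent, css_class):
    """Stripped text of the first <span> with css_class under parent, else None."""
    node = parent.find("span", class_=css_class) if parent is not None else None
    return node.get_text(strip=True) if node is not None else None


def parse_price(price_text):
    """Dollar amount as an int (e.g. "$ 8,000 (OBO)" gives 8000), else None."""
    match = re.search(r"\$\s*(\d[\d,]*)", price_text or "")
    return int(match.group(1).replace(",", "")) if match else None


listings = []
for title in soup.find_all("a", class_="t"):
    # The first string inside the link is the title text itself
    name = next(title.stripped_strings, "")
    if name.startswith("FS:"):
        name = name[3:].strip()
    # The "by" span follows the link under the same parent element
    by = title.find_next_sibling("span", class_="by")
    last_post = span_text(by, "last")
    listings.append({
        "name": name,
        "price": parse_price(span_text(title, "tag-price")),
        "location": span_text(title, "tag-loc"),
        "last_post": last_post.lstrip("- ") if last_post else None,
    })

for listing in listings:
    print(listing)
```

If the real page nests these elements differently, the find_next_sibling call is the part most likely to need adjusting.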
Answered By - Frostyfeet909