Issue
I'm using Python 3.9.12 with requests_html to parse HTML in a local file and extract the text content of the <title> tag. However, when I try to extract the content, it returns more than just the text of the <title> tag. I need help to get just the text "Test - A Sample Website" from the <title> tag.
Here's the relevant part of the HTML file (named simple.html):
<!doctype html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
<!-- Other head elements -->
</head>
<body>
<!-- Body content -->
</body>
</html>
And here's my Python script:
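from requests_html import HTML

# Read the local HTML file into a string
with open('simple.html') as html_file:
    source = html_file.read()

html = HTML(html=source)
match = html.find('title')  # returns a list of matching elements
print(match[0].html)        # prints markup, not just the text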
This script outputs the entire head section starting from the <title> tag, instead of just the text of the <title> tag. How can I modify this script to get only the text "Test - A Sample Website" from the <title> tag?
Solution
In requests_html, the .html attribute returns an element's HTML markup, including its tags and any children, not its text. So match[0].html prints markup, which in this case runs on into the rest of the <head>, rather than just the title text. To get only the text content of the <title> tag, use the .text property instead of .html. Passing first=True to find() returns a single element (or None if nothing matches) instead of a list, so you can check for a missing tag before reading .text.
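from requests_html import HTML

# Read the local HTML file into a string
with open('simple.html') as html_file:
    source = html_file.read()

html = HTML(html=source)
# first=True returns one element, or None if there is no match
match = html.find('title', first=True)
if match:
    print(match.text)  # text content only, without the tags
else:
    print("Title tag not found.")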
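If requests_html isn't available, the same extraction can be sketched with the standard library's html.parser. This is a different technique from the answer above, not part of requests_html, and the names TitleParser and extract_title are made up for this example. It collects the text of the first <title> element and returns None when there is none, mirroring the "not found" branch above.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside the first <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False  # True while we are between <title> and </title>
        self._done = False      # True once the first <title> has been closed
        self._parts = []        # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(source):
    """Return the text of the first <title> in source, or None."""
    parser = TitleParser()
    parser.feed(source)
    parser.close()
    return parser.title


# Inline copy of the sample document from the question
sample = """<!doctype html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
</head>
<body></body>
</html>"""

print(extract_title(sample))  # prints: Test - A Sample Website
```

To run it against the file from the question, read simple.html into a string as in the scripts above and pass that string to extract_title.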
Answered By - str1ng