Issue
I want to print each tag as a whole, the way soup.find_all() does, but using lxml etree. With lxml, the loop below prints only the tag name instead of the whole tag, which I need for comparison purposes. Thank you.
Code:
from bs4 import BeautifulSoup
from lxml import etree
doc = "<p><a></a><a></a>Printable Text"
soup = BeautifulSoup(doc, "lxml")
root = etree.fromstring(str(soup))
tree = etree.ElementTree(root)
for e in tree.iter():
    print(e.tag)
    print("--------------")
Output:
html
--------------
body
--------------
p
--------------
a
--------------
a
--------------
Expected Output:
<html><body><p><a></a><a></a>Printable Text</p></body></html>
--------------
<body><p><a></a><a></a>Printable Text</p></body>
--------------
<p><a></a><a></a>Printable Text</p>
--------------
<a></a>
--------------
<a></a>
--------------
Solution
You don't really need to parse your doc (note that in your question you left out the closing </p> tag) with BeautifulSoup, then re-parse the soup with lxml, and finally wrap that in an ElementTree; a sketch that skips BeautifulSoup entirely follows the output below. But if you want or need to stick with that approach, you can get close (but not 100%) to your expected output by changing your for loop from
for e in tree.iter():
    print(e.tag)
to (as mentioned by @mzjn in the comment):
for e in tree.iter():
    print(etree.tostring(e).decode())
If you want to or can skip the ElementTree step, you can get the same output by using XPath (the //* expression selects every element in the document, including the root):
for e in root.xpath('//*'):
    print(etree.tostring(e).decode())
In either case, the output is the following (lxml serializes the empty elements as <a/> and etree.tostring() includes each element's trailing tail text, which is why this is close to, but not exactly, your expected output):
<html><body><p><a/><a/>Printable Text</p></body></html>
<body><p><a/><a/>Printable Text</p></body>
<p><a/><a/>Printable Text</p>
<a/>
<a/>Printable Text
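As noted above, the BeautifulSoup step isn't strictly necessary. A minimal sketch, assuming lxml's own HTML parser is acceptable for your input (like the lxml parser used by BeautifulSoup, it wraps the fragment in <html><body> and closes the dangling <p>):
from lxml import etree

doc = "<p><a></a><a></a>Printable Text"

# Parse the fragment directly with lxml's HTML parser
root = etree.HTML(doc)

for e in root.iter():
    # Serialize each element, including its children and tail text
    print(etree.tostring(e).decode())
This prints the same five lines as the output above.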
If you can or want to skip the lxml part altogether, you can get your exact expected output by printing directly from the soup with CSS selectors:
for s in soup.select('*'):
    print(s)
Output:
<html><body><p><a></a><a></a>Printable Text</p></body></html>
<body><p><a></a><a></a>Printable Text</p></body>
<p><a></a><a></a>Printable Text</p>
<a></a>
<a></a>
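Since the goal is to use the whole tag for comparison, a minimal sketch of one way to do that, assuming you simply want to compare two tags by their serialized markup (tags_equal is a hypothetical helper, not part of BeautifulSoup):
def tags_equal(tag_a, tag_b):
    # Compare two tags by their full serialized markup (hypothetical helper)
    return str(tag_a) == str(tag_b)

a_tags = soup.select('a')
print(tags_equal(a_tags[0], a_tags[1]))  # True - both serialize to <a></a>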
Answered By - Jack Fleeting