Issue
I'm trying to parse some HTML using Python's xml library. The HTML comes from download.docker.com and looks like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</title>
</head>
<body>
<h1>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</h1>
<hr>
<pre><a href="../">../</a>
<a href="containerd.io_1.5.10-1_amd64.deb">containerd.io_1.5.10-1_amd64.deb</a>
...
</pre><hr></body></html>
Parsing the HTML with the following code raises an error:
import urllib.request
import xml.etree.ElementTree as ET
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = ET.fromstring(html_doc)
>>> ParseError: mismatched tag: line 6, column 2
Unless I'm mistaken, this is because of the unclosed <meta charset="UTF-8"> tag. Using something like lxml, I can make this work with:
import urllib.request
from lxml import html
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = html.fromstring(html_doc)
Is there any way to parse this html using the xml python library instead of lxml?
Solution
Is there any way to parse this html using the xml python library instead of lxml?
The answer is no.
An XML library (for example xml.etree.ElementTree) cannot be used to parse arbitrary HTML. It can be used to parse HTML that also happens to be well-formed XML. But your HTML document is not well-formed: the <meta> and <hr> elements are never closed, which XML does not allow.
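The difference can be illustrated with a small sketch. The two strings below are made-up, trimmed stand-ins for the page in the question (not the real download.docker.com output): one leaves <meta> and <hr> unclosed, the other self-closes them so that the document is also well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed HTML: <meta> and <hr> are never closed,
# so this is valid HTML but not well-formed XML.
html_doc = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Index</title>
</head>
<body>
<hr>
</body>
</html>"""

# The same markup with the void elements self-closed (XHTML style).
xhtml_doc = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8"/>
<title>Index</title>
</head>
<body>
<hr/>
</body>
</html>"""

# ElementTree rejects the first document with a ParseError...
try:
    ET.fromstring(html_doc)
    html_parsed = True
except ET.ParseError as exc:
    html_parsed = False
    print("rejected:", exc)

# ...but accepts the second, because it is well-formed XML.
root = ET.fromstring(xhtml_doc)
title = root.find("head/title").text
print("parsed title:", title)
```

This only shows why the parser fails. Since the markup served by a remote site is usually not under your control, rewriting it into well-formed XML by hand is rarely a practical fix.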
lxml on the other hand can be used for both XML and HTML.
By the way, note that "the xml python library" is ambiguous. There are several submodules in the xml package in the standard library (https://docs.python.org/3/library/xml.html). All of them will reject the HTML document in the question.
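If the aim is only to avoid a third-party dependency, one option outside the xml package is html.parser, also in the standard library, which tolerates unclosed tags. The following is a minimal sketch, assuming only the href values of the links are needed. LinkCollector is a hypothetical helper name, and sample is a shortened copy of the page shown in the question rather than a live download.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Shortened copy of the directory listing from the question.
sample = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</title>
</head>
<body>
<hr>
<pre><a href="../">../</a>
<a href="containerd.io_1.5.10-1_amd64.deb">containerd.io_1.5.10-1_amd64.deb</a>
</pre><hr></body></html>"""

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

Unlike ElementTree or lxml, html.parser is event-based and does not build a tree, so it suits simple extraction like this better than general document navigation.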
Answered By - mzjn