Issue
I am trying to extract only the text of an HTML document whose text contains angle brackets.
For example, the HTML file looks something like this:
<html>
<head></head>
<body>
<div>
<p>1. <someUnicodeString></p>
<p>2. <foo 2012.12.26.></p>
<p>3. <123 2012.12.26.></p>
<p>4. <@ 2012.12.26.></p>
<p>5. foobarbar</p>
</div>
</body>
</html>
I want the outcome of the parsed textfile to be like this:
1. <someUnicodeString>
2. <foo 2012.12.26.>
3. <123 2012.12.26.>
4. <@ 2012.12.26.>
5. foobarbar
I am using Jsoup's parse function to achieve this, as shown below:
Document doc = null;
try {
    doc = Jsoup.parse(new File(path), "UTF-8");
    doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
    doc.outputSettings().escapeMode(EscapeMode.xhtml);
    //set line breaks in readable format
    doc.select("br").append("\\n");
    doc.select("p").prepend("\\n\\n");
    String bodyText = doc.body().html().replaceAll("\\\\n", "\n");
    bodyText = Jsoup.clean(bodyText, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
    File f = new File(textFileName+".txt");
    f.getParentFile().mkdirs();
    PrintWriter writer = new PrintWriter(f, "UTF-8");
    writer.print(Parser.unescapeEntities(bodyText, false));
    writer.close();
} catch(IOException e) {
    //Do something
    e.printStackTrace();
}
However, once Jsoup has parsed the document, it treats each angle bracket that is followed by a letter as an opening tag and adds a matching closing tag:
<p>1. <someUnicodeString></someUnicodeString></p>
<p>2. <foo 2012.12.26.></foo></p>
<p>3. <123 2012.12.26.></p>
<p>4. <@ 2012.12.26.></p>
<p>5. foobarbar</p>
This eventually produces the following output:
1.
2.
3. <123 2012.12.26.>
4. <@ 2012.12.26.>
5. foobarbar
How can I prevent Jsoup from erasing angle-brackets inside text when parsing?
Or is there a way to make Jsoup recognize that certain angle brackets are not HTML elements (perhaps using regex)?
I am new to Jsoup and would very much appreciate any kind of help. Thank you.
Solution
Thanks to the comment by Davide Pastore and the question "Right angle bracket in HTML",
I was able to solve the problem with the following code.
doc = Jsoup.parse(new File(path), "UTF-8");
//replace every left angle bracket that starts a pseudo-tag inside a <p> element with "&lt;"
Elements pTags = doc.select("p");
for (Element tag : pTags) {
    //change the boundary of the regex to whatever suits you
    if (tag.html().matches("(.*)<[a-z](.*)")) {
        String innerHTML = tag.html().replaceAll("<(?=[a-z])", "&lt;");
        tag.html(innerHTML);
    }
}
If you convert each "<" that belongs to the text into &lt;
before you start parsing, you will be able to get the right output.
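The same escaping idea can also be applied to the raw HTML string before it is handed to Jsoup. Below is a minimal sketch using only java.util.regex. The class AngleBracketEscaper, the method escapeNonTags, and the KNOWN_TAGS allow-list are hypothetical names introduced for illustration, not part of Jsoup. It assumes that any "<" not followed by one of the listed tag names is plain text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AngleBracketEscaper {
    // Hypothetical allow-list of names treated as real HTML tags; extend it for your documents.
    private static final Set<String> KNOWN_TAGS =
            new HashSet<>(Arrays.asList("html", "head", "body", "div", "p", "br"));

    // A "<" that does not start a comment, doctype, or processing instruction,
    // optionally followed by "/" and a candidate tag name.
    private static final Pattern OPEN_BRACKET =
            Pattern.compile("<(?![!?])/?([A-Za-z][A-Za-z0-9]*)?");

    static String escapeNonTags(String rawHtml) {
        Matcher m = OPEN_BRACKET.matcher(rawHtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            boolean isTag = name != null
                    && KNOWN_TAGS.contains(name.toLowerCase(Locale.ROOT));
            // Keep real tags untouched; otherwise escape only the leading "<".
            String replacement = isTag ? m.group() : "&lt;" + m.group().substring(1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<p>2. <foo 2012.12.26.></p>";
        System.out.println(escapeNonTags(raw));
    }
}
```

The escaped string could then be passed to Jsoup.parse(String) instead of parsing the file directly. That call is left out here so the sketch has no dependencies. An allow-list is stricter than the [a-z] check above: it also escapes bracketed text that happens to start with a letter, at the cost of having to list every tag the documents actually use.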
Answered By - Joon