Saturday

How to prevent Jsoup from erasing angle-brackets inside text when parsing

Issue

I am trying to extract only the text of an HTML document that contains angle brackets as part of its text.

For example, the html file would look something like this:

<html>
 <head></head> 
 <body> 
  <div>
    <p>1. <someUnicodeString></p> 
    <p>2. <foo 2012.12.26.></p> 
    <p>3. <123 2012.12.26.></p> 
    <p>4. <@ 2012.12.26.></p> 
    <p>5. foobarbar</p> 
  </div>
 </body>
</html>

I want the resulting text file to look like this:

1. <someUnicodeString> 
2. <foo 2012.12.26.> 
3. <123 2012.12.26.> 
4. <@ 2012.12.26.> 
5. foobarbar

I am using Jsoup's parse function to achieve this, as shown below:

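import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist;

Document doc = null;

try {
    doc = Jsoup.parse(new File(path), "UTF-8");
    doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
    doc.outputSettings().escapeMode(EscapeMode.xhtml);

    // Mark line breaks with a literal "\n" placeholder so they survive cleaning
    doc.select("br").append("\\n");
    doc.select("p").prepend("\\n\\n");
    // Turn the placeholders into real newlines, then strip all remaining tags
    String bodyText = doc.body().html().replaceAll("\\\\n", "\n");
    bodyText = Jsoup.clean(bodyText, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));

    File f = new File(textFileName+".txt");
    f.getParentFile().mkdirs();
    // try-with-resources closes the writer even if printing fails
    try (PrintWriter writer = new PrintWriter(f, "UTF-8")) {
        writer.print(Parser.unescapeEntities(bodyText, false));
    }
} catch(IOException e) {
    //Do something
    e.printStackTrace();
}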

However, when Jsoup parses the document, it treats every angle bracket followed by a letter as an opening tag and adds a matching closing tag:

<p>1. <someUnicodeString></someUnicodeString></p> 
<p>2. <foo 2012.12.26.></foo></p> 
<p>3. <123 2012.12.26.></p> 
<p>4. <@ 2012.12.26.></p> 
<p>5. foobarbar</p>

This eventually produces the following output:

1.  
2.  
3. <123 2012.12.26.> 
4. <@ 2012.12.26.> 
5. foobarbar

How can I prevent Jsoup from erasing angle-brackets inside text when parsing?

Or is there a way to make Jsoup recognize that certain angle brackets are not HTML elements (perhaps using a regex)?

I am new to Jsoup and would very much appreciate any kind of help. Thank you.

Solution

Thanks to the comment from Davide Pastore and the question "Right angle bracket in HTML", I was able to solve the problem with the following code.

// Requires org.jsoup.nodes.Element and org.jsoup.select.Elements
doc = Jsoup.parse(new File(path), "UTF-8");
// Replace every tag-opening "<" inside a <p> element with "&lt;"
Elements pTags = doc.select("p");
for (Element tag : pTags) {
    // Change the character class of the regex to whatever suits you.
    // replaceAll returns the string unchanged when nothing matches,
    // so a separate matches() check is not needed.
    String innerHTML = tag.html().replaceAll("<(?=[a-z])", "&lt;");
    tag.html(innerHTML);
}

If you convert each "<" that belongs to the text into "&lt;" before you start parsing, you will get the right output.
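The idea of escaping before parsing can also be applied to the raw HTML string, so that Jsoup never sees the stray brackets as tags in the first place. The following is only a minimal sketch, not part of the original answer: the class name AngleBracketEscaper, the method escapeUnknownTags, and the KNOWN_TAGS allow-list are hypothetical, and it assumes that every real tag in your input appears in that list. It uses only the Java standard library; the escaped string would then be passed to Jsoup.parse(String).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AngleBracketEscaper {
    // Hypothetical allow-list: extend it with every real tag your documents use.
    private static final Set<String> KNOWN_TAGS =
            new HashSet<>(Arrays.asList("html", "head", "body", "div", "p", "br"));

    // "<" or "</" followed by a name that starts with a letter.
    // Sequences such as "<123" or "<@" do not match and are left untouched.
    private static final Pattern TAG_START =
            Pattern.compile("<(/?)(\\p{L}[^\\s/>]*)");

    // Escapes the "<" of every tag-like sequence whose name is not in KNOWN_TAGS.
    static String escapeUnknownTags(String rawHtml) {
        Matcher m = TAG_START.matcher(rawHtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            String replacement = KNOWN_TAGS.contains(name)
                    ? m.group()                            // real tag: keep as is
                    : "&lt;" + m.group(1) + m.group(2);    // text: escape the bracket
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<p>2. <foo 2012.12.26.></p>";
        // Prints: <p>2. &lt;foo 2012.12.26.></p>
        System.out.println(escapeUnknownTags(raw));
        // The result could then be handed to Jsoup, e.g. Jsoup.parse(escaped)
    }
}
```

An allow-list is used here instead of the [a-z] look-ahead because the raw file still contains the real tags, which must not be escaped. The trade-off is that any genuine tag missing from the list would be turned into text.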

Answered By - Joon

This Answer collected from stackoverflow and tested by AngularFix community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
