Thursday

extracting <h2> title text from html where title text might include newlines

9:19 AM. Labels: extract, html, r, text

Issue

I have an html file with some <h2> tags such as

a <- '<section id="sec-standard-stoet-geary" class="level2" data-number="9.4">
      <h2 data-number="9.4" class="anchored" data-anchor-id="sec-standard-stoet-geary">
      <span class="header-section-number">9.4</span> Standardising PISA results</h2>'

b <- '<span class="fu">read_parquet</span>(<span 
     class="st">"&lt;folder&gt;PISA_2015_student_subset.parquet"</span>)</span></code><button 
     title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre> 
     </div>
     </div>
     </section><section id="sec-leftjoin" class="level2" data-number="9.3"><h2 data-number="9.3" 
     class="anchored" data-anchor-id="sec-leftjoin">
     <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
     </h2>
     <p>some text</p>'

c <- paste(a,b,a)

I can extract the title from a using:

str_extract_all(a, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> [1] "Standardising PISA results"

But trying this on b returns nothing:

str_extract_all(b, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> character(0)

and on c it returns only the first and third h2 titles, when it should return all three:

str_extract_all(c, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> [1] "Standardising PISA results" "Standardising PISA results"

EDIT: from the comments, this appears to be because the regex . does not match newline characters by default.

I've tried enabling single-line mode with (?s) in the pattern, but it's still not working.
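To make the newline problem concrete, here is a minimal regex-only sketch. It is not the method used in the answer below, and the helper name h2_titles_regex and the sample html string are made up for illustration. It assumes stringr's regex(..., dotall = TRUE) so that . also matches newlines, and uses a lazy .*? so each match stops at the first closing h2 tag. A greedy .* combined with (?s) would probably run on to the last closing tag in the string, which may be why (?s) alone did not help.

```r
library(stringr)

# By default "." does not match a newline, so a title split over lines is missed.
no_dotall   <- str_detect("a\nb", "a.b")                        # FALSE
with_dotall <- str_detect("a\nb", regex("a.b", dotall = TRUE))  # TRUE

# Hypothetical helper: capture everything between <h2 ...> and </h2>,
# then drop the numbering span and any remaining tags.
h2_titles_regex <- function(x) {
  inner <- str_match_all(x, regex("<h2\\b[^>]*>(.*?)</h2>", dotall = TRUE))[[1]][, 2]
  inner <- str_remove_all(inner, regex("<span\\b[^>]*>.*?</span>", dotall = TRUE))
  str_squish(str_remove_all(inner, "<[^>]+>"))
}

# Made-up sample input with a newline inside the first heading.
html <- '<h2 class="anchored">
  <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
</h2>
<h2><span class="header-section-number">9.4</span> Standardising PISA results</h2>'

titles <- h2_titles_regex(html)
titles
# [1] "Linking data using left_join" "Standardising PISA results"
```

This kind of pattern is fragile (for example, an attribute value containing > would break it), so an HTML parser is usually the safer choice.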

Solution

Here's a helper function that selects H2 elements containing spans but ignores the text of those spans:

library(xml2)
library(stringr)

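geth2 <- function(x) {
  # keep only <h2> elements that have a <span> child
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  # xml_remove() modifies the parsed document in place, so nothing is assigned
  xml_remove(xml_find_all(temp, ".//span"))
  # text of what is left, with whitespace and newlines collapsed
  temp %>% xml_text() %>% str_squish()
}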

geth2(a)
# [1] "Standardising PISA results"
geth2(b)
# [1] "Linking data using left_join"
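The two ideas the helper above relies on can be seen on a tiny made-up document: the XPath predicate //h2[span] only selects headings that have a span child, and xml_remove() changes the parsed document in place. This is just an illustrative sketch; the sample HTML and variable names are assumptions, not part of the original question.

```r
library(xml2)

# Made-up input: one heading with a numbering span, one without.
doc <- read_html('<h2><span>1.1</span> With number</h2><h2>No number</h2>')

# "//h2[span]" keeps only <h2> elements that have a <span> child.
numbered <- xml_find_all(doc, "//h2[span]")
n_numbered <- length(numbered)   # 1
n_all <- length(xml_find_all(doc, "//h2"))   # 2

before <- xml_text(numbered)     # "1.1 With number"

# xml_remove() edits the document in place, so its return value can be ignored.
xml_remove(xml_find_all(numbered, ".//span"))
after <- trimws(xml_text(numbered))   # "With number"
```

Because the removal happens in place, the same nodeset gives different text before and after the call.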

If you want to keep the markup inside the H2, this could work:

geth2 <- function(x) {
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  xml_remove(xml_find_all(temp, ".//span"))
  temp %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish()  
}
geth2(a)
# [1] "Standardising PISA results"
geth2(b)
# [1] "Linking data using <code>left_join</code>"

For a version that works with multiple H2 tags, you can use:

geth2 <- function(x) {
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  xml_remove(xml_find_all(temp, ".//span"))
  cleanup <- . %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish() 
  sapply(temp, cleanup)
}
geth2(c)
# [1] "Standardising PISA results"
# [2] "Linking data using <code>left_join</code>"
# [3] "Standardising PISA results"
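If only the plain text of each heading is needed, the per-node sapply() step may not be necessary, since xml_text() already returns one string per node in a nodeset. A small sketch under that assumption (the name geth2_text and the sample input are hypothetical):

```r
library(xml2)
library(stringr)

# Hypothetical text-only variant for several headings.
geth2_text <- function(x) {
  doc <- read_html(x)
  h2 <- xml_find_all(doc, "//h2[span]")
  xml_remove(xml_find_all(h2, ".//span"))   # drop the numbering spans in place
  str_squish(xml_text(h2))                  # one string per <h2>
}

# Made-up input with two numbered headings.
html <- paste0(
  '<h2><span class="header-section-number">9.3</span> ',
  'Linking data using <code>left_join</code></h2>',
  '<p>some text</p>',
  '<h2><span class="header-section-number">9.4</span> ',
  'Standardising PISA results</h2>'
)

titles <- geth2_text(html)
titles
# [1] "Linking data using left_join" "Standardising PISA results"
```

The sapply() version is still the one to use when the inner markup should be kept, because str_flatten() would otherwise join all headings into a single string.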

Answered By - MrFlick

This answer was collected from Stack Overflow and tested by AngularFix community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
