Issue
I have an HTML file with some <h2> tags, such as:
a <- '<section id="sec-standard-stoet-geary" class="level2" data-number="9.4">
<h2 data-number="9.4" class="anchored" data-anchor-id="sec-standard-stoet-geary">
<span class="header-section-number">9.4</span> Standardising PISA results</h2>'
b <- '<span class="fu">read_parquet</span>(<span
class="st">"<folder>PISA_2015_student_subset.parquet"</span>)</span></code><button
title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre>
</div>
</div>
</section><section id="sec-leftjoin" class="level2" data-number="9.3"><h2 data-number="9.3"
class="anchored" data-anchor-id="sec-leftjoin">
<span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
</h2>
<p>some text</p>'
c <- paste(a,b,a)
I can extract the title from a using:
str_extract_all(a, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> [1] "Standardising PISA results"
But trying this on b returns nothing:
str_extract_all(b, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> character(0)
and c only returns the first and third h2 titles, when it should return all three:
str_extract_all(c, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> [1] "Standardising PISA results" "Standardising PISA results"
EDIT: from the comments, this appears to be because the regex does not match across newline characters.
I've tried enabling single-line mode with (?s), but it's still not working.
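The newline behaviour described in the edit can be reproduced in isolation. This is a minimal sketch, assuming stringr's default ICU regex engine; the string x is a made-up heading, not taken from the file, and the lazy quantifier is an illustrative choice rather than the pattern used above.

```r
library(stringr)

# Hypothetical heading with line breaks inside the <h2> element
x <- "<h2>\n<span>9.3</span> Linking data\n</h2>"

# By default `.` stops at a newline, so the lookahead for </h is never reached
no_dotall <- str_extract(x, "(?<=</span>).*(?=</h)")

# (?s) lets `.` cross newlines; the lazy `.*?` stops at the first </h
dotall <- str_extract(x, "(?s)(?<=</span>).*?(?=</h)")

is.na(no_dotall)
# [1] TRUE
str_squish(dotall)
# [1] "Linking data"
```

Even with (?s), a pattern of this shape starts at the first </span> it finds. In b that span belongs to the highlighted code rather than the heading, which is one reason a regex is fragile here.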
Solution
Here's a helper function that selects the H2 elements that contain spans, drops those spans, and returns the remaining text:
library(xml2)
library(stringr)
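geth2 <- function(x) {
  # Select only the <h2> elements that have a <span> child
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  # Remove every <span> inside those headings (here, the section number)
  xml_remove(xml_find_all(temp, ".//span"))
  temp %>% xml_text() %>% str_squish()
}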
geth2(a)
# [1] "Standardising PISA results"
geth2(b)
# [1] "Linking data using left_join"
If you want to keep the markup inside the H2, this could work:
geth2 <- function(x) {
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  xml_remove(xml_find_all(temp, ".//span"))
  # Serialise the remaining child nodes so inline tags such as <code> survive
  temp %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish()
}
geth2(a)
# [1] "Standardising PISA results"
geth2(b)
# [1] "Linking data using <code>left_join</code>"
For a version that works with multiple H2 tags, you can use:
geth2 <- function(x) {
  temp <- read_html(x) %>% xml_find_all("//h2[span]")
  xml_remove(xml_find_all(temp, ".//span"))
  # Flatten the contents of one heading at a time so titles are not merged
  cleanup <- . %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish()
  sapply(temp, cleanup)
}
geth2(c)
# [1] "Standardising PISA results"
# [2] "Linking data using <code>left_join</code>"
# [3] "Standardising PISA results"
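If only the plain text of each heading is needed, the per-node loop may not be necessary, because xml_text() already returns one string per node in a node set. The following is a minimal sketch under that assumption; the html string is a hypothetical example, not the file from the question.

```r
library(xml2)
library(stringr)

# Hypothetical fragment with two numbered headings
html <- paste0(
  "<h2><span>1.1</span> First <code>title</code></h2>",
  "<p>some text</p>",
  "<h2><span>1.2</span> Second title</h2>"
)

doc <- read_html(html)
h2s <- xml_find_all(doc, "//h2[span]")

# Remove the numbering spans in place, then read the text of every heading
xml_remove(xml_find_all(h2s, ".//span"))
titles <- str_squish(xml_text(h2s))
titles
# [1] "First title"  "Second title"
```

The inline <code> tag is dropped here because xml_text() returns text only; the sapply() version above is the one to use when the markup has to be kept.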
Answered By - MrFlick