Issue
I'm currently trying to scrape some transcripts (example here) to make a visualization. One piece of information I need is who is speaking. Unfortunately, when I scrape the site, this information disappears.
Briefly, here's the situation.
Here's an example snippet of what the reader sees on the site:
EZRA KLEIN: Bribery can be effective as a parent. You think you’re not going to do it, and then you do.
ELIZABETH WARREN: Yeah.
But when I scrape the site, I get:
[n] Bribery can be effective as a parent. You think you’re not going to do it, and then you do.
[n+1] Yeah.
This is after using the following pipeline to get this into R:
read_html('[insert_html]') %>%
html_elements('p') %>%
html_text()
(All the functions come from the rvest package.)
Any idea how this is happening / what I can do to preserve the association of speaker and utterance?
Solution
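When you load this page via rvest::read_html, no JavaScript is executed, so you are working with the page's "noscript" layout, not the one you see when you browse to the page in a desktop web browser. In that layout the speaker names and utterances appear to sit in the dt and dd elements of a dl list, so selecting only p elements returns the text without the speaker.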
To retrieve the speaker name try:
read_html('[insert_html]') %>%
html_elements('dl > dt') %>%
html_text()
To retrieve the associated utterance try:
read_html('[insert_html]') %>%
html_elements('dl > dd') %>%
html_text()
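To keep each utterance tied to its speaker, the two selections above can be combined into one pass over the dt and dd elements in document order. The following is only a sketch: the inline HTML is a hypothetical stand-in for the noscript markup (the real page may be structured differently), and the names html, nodes, and transcript are made up for illustration. It assumes every dd belongs to the most recent dt before it.

```r
library(rvest)

# Hypothetical stand-in for the page's noscript markup (an assumption,
# not copied from the real site). Replace with read_html('[insert_html]').
html <- '
<dl>
  <dt>EZRA KLEIN</dt>
  <dd>Bribery can be effective as a parent.</dd>
  <dd>You think you are not going to do it, and then you do.</dd>
  <dt>ELIZABETH WARREN</dt>
  <dd>Yeah.</dd>
</dl>'

# Select speakers and utterances together so they stay in document order.
nodes <- read_html(html) %>%
  html_elements('dl > dt, dl > dd')

tags <- html_name(nodes)              # "dt" or "dd" for each node
text <- html_text(nodes, trim = TRUE)

# Running count of dt elements: each dd gets the index of the last dt seen.
speaker_idx <- cumsum(tags == 'dt')
speaker_idx[speaker_idx == 0] <- NA   # a dd that appears before any dt

speakers <- text[tags == 'dt']

transcript <- data.frame(
  speaker   = speakers[speaker_idx[tags == 'dd']],
  utterance = text[tags == 'dd'],
  stringsAsFactors = FALSE
)

print(transcript)
```

Selecting both element types in a single call avoids assuming that there are equally many dt and dd elements, which would break if one speaker's turn spans several dd elements.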
Answered By - br00t