Issue
I'm trying to extract every URL of the form "https://....zip" from the <a href=""> elements of the page https://divvy-tripdata.s3.amazonaws.com/index.html using the rvest library, as follows:
link <- "https://divvy-tripdata.s3.amazonaws.com/index.html"
library(rvest)
library(xml2)
html <- read_html(link)
html %>% html_attrs("href")
Output:
Error in html_attrs(., "href") : unused argument ("href")
Can you please help me use R to extract all the URLs from the above link?
HTML: https://i.stack.imgur.com/5BiFU.jpg
Solution
The links come from an additional GET request made by the browser, which returns XML, so they are not in the HTML that read_html() fetches from index.html. You can still use rvest: grab the Key nodes, then complete the URLs.
library(rvest)
base_url <- "https://divvy-tripdata.s3.amazonaws.com"
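files <- read_html(base_url) |>
  html_elements('key') |>    # the HTML parser lower-cases the Key tag name
  html_text() |>             # the text of each node is a file name
  url_absolute(base_url)     # complete each file name into a full URL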
For R versions older than 4.1, which lack the native pipe, swap |> for %>% and add library(magrittr) as an import.
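Since the question asks only for the .zip URLs, here is a minimal sketch of filtering the keys before completing them. It assumes the listing may also contain non-zip entries. The XML string and the file name example-tripdata.zip are made up for illustration so the snippet runs without network access; with the real bucket you would pass base_url to read_html() as in the answer.

```r
library(rvest)
library(xml2)  # url_absolute() is defined in xml2

# Hypothetical, shortened stand-in for the XML listing the bucket returns
listing <- paste0(
  "<ListBucketResult>",
  "<Contents><Key>example-tripdata.zip</Key></Contents>",
  "<Contents><Key>index.html</Key></Contents>",
  "</ListBucketResult>"
)
base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# Same steps as the answer: select the Key nodes and take their text
keys <- read_html(listing) |> html_elements("key") |> html_text()

# Keep only the keys that end in .zip, then complete them into full URLs
zip_keys <- keys[grepl("\\.zip$", keys)]
zip_urls <- url_absolute(zip_keys, base_url)
```

As for the original error: html_attrs() returns all attributes of a node and takes no attribute name, while html_attr(x, "href") is the single-attribute form. Even with that fix, html_attr() would presumably not find the .zip links here, since per the answer they are not part of the static HTML.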
Answered By - QHarr