Monday

I am unable to web-scrape a URL from a hyperlink in this website. | Web-scrape, Python

4:15 PM beautifulsoup, html, python, web-scraping

Issue

I am unable to get the links from this website: https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/

I use the following code:

import requests
import bs4

req = requests.get('https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/')
soup = bs4.BeautifulSoup(req.text,"lxml")

When I use this code, I am unable to find the source for the links that I want.

soup.body

The code above returns the following output:

<body class="calcite">
<div aria-label="loading" id="base-loader">
<svg class="loader-square loader-square-1" viewbox="0 0 56 56" xmlns="http://www.w3.org/2000/svg">
<rect height="56" width="56"></rect>
</svg>
<svg class="loader-square loader-square-2" viewbox="0 0 56 56" xmlns="http://www.w3.org/2000/svg">
<rect height="56" width="56"></rect>
</svg>
<svg class="loader-square loader-square-3" viewbox="0 0 56 56" xmlns="http://www.w3.org/2000/svg">
<rect height="56" width="56"></rect>
</svg>
<div class="loader-bars"></div>
</div>
<script>
      if (typeof customElements !== 'undefined') {
        customElements.efineday = customElements.define;
      }
    </script>
<!-- crossorigin options added because otherwise we cannot see error messages from unhandled errors and rejections -->
<script crossorigin="anonymous" src="https://hubcdn.arcgis.com/opendata-ui/assets/assets/vendor-c4dd10aa0ad3c0cd3c74b496637f5da5.js"></script>
<script crossorigin="anonymous" src="https://hubcdn.arcgis.com/opendata-ui/assets/assets/opendata-ui-01a5d313fd3d0fe1fa17d9270ab5c456.js"></script>
<div id="ember-basic-dropdown-wormhole"></div>
<!-- opendata-ui version: 5.171.0+6ac420726b - Tue, 05 Oct 2021 18:03:32 GMT -->
</body>

I can't seem to find any Google Drive links in the HTML body above.

I want to get all of the Google Drive links from the hyperlinks in the left column.

Solution

Selenium solution:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
# options.add_argument('--headless')  # uncomment to run without a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/")
time.sleep(3)  # give the page's JavaScript time to render the table
# find_elements_by_css_selector was removed in Selenium 4.3; use find_elements with By
links = [a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR, "td p a")]
print(links)
driver.quit()

If you have not used Selenium before, do not forget to install chromedriver.
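The fixed time.sleep(3) above can be too short on a slow connection and wasteful on a fast one. As a rough sketch of the alternative, the helper below polls a condition until it succeeds or a timeout expires. The names wait_until and fake_probe are hypothetical and only illustrate the idea; Selenium itself ships WebDriverWait in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(probe, timeout=10.0, interval=0.5):
    """Call probe() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page whose links only appear on the third check.
# With Selenium, the probe could be something like:
#   lambda: driver.find_elements(By.CSS_SELECTOR, "td p a")
attempts = {"count": 0}


def fake_probe():
    attempts["count"] += 1
    return ["https://example.com/file"] if attempts["count"] >= 3 else []


found = wait_until(fake_probe, timeout=5.0, interval=0.01)
print(found)
```

The helper returns as soon as the probe yields a non-empty list, so the script waits only as long as the page actually needs.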

requests + urllib.parse solution (inspired by the comment of @tromgy)

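import requests
import bs4
from urllib.parse import unquote

req = requests.get('https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/')
soup = bs4.BeautifulSoup(req.text, "lxml")
# The page content is embedded, percent-encoded, in <script id="site-injection">.
# [23:] skips the leading characters before the encoded payload (an offset specific to this page).
url = unquote(soup.select_one("script#site-injection").string[23:].strip())

# Parse the decoded payload as HTML and collect the link targets.
soup = bs4.BeautifulSoup(url, "lxml")
# [2:-2] trims two leftover characters from each end of every href.
links = [i['href'][2:-2] for i in soup.select("td p a")]
print(links)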
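To make the decoding step easier to follow, here is a small offline sketch of the same idea: HTML that has been percent-encoded into a script string is recovered with unquote and then searched for links. The sample HTML, the window.__EXAMPLE prefix and the HrefCollector class are made up for illustration and are not taken from the real page. The sketch also uses the standard library's HTMLParser instead of BeautifulSoup, only to keep it free of dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Hypothetical stand-in for the table HTML that the site renders client-side.
table_html = (
    '<table><tr>'
    '<td><p><a href="https://drive.google.com/file/d/EXAMPLE_1">File 1</a></p></td>'
    '<td><p><a href="https://drive.google.com/file/d/EXAMPLE_2">File 2</a></p></td>'
    '</tr></table>'
)

# Hypothetical script body: a JavaScript assignment holding the percent-encoded HTML.
prefix = 'window.__EXAMPLE="'
script_text = prefix + quote(table_html) + '"'

# Drop the assignment prefix and the closing quote, then undo the percent-encoding.
encoded = script_text[len(prefix):-1]
decoded = unquote(encoded)

collector = HrefCollector()
collector.feed(decoded)
print(collector.hrefs)
```

Computing the offset from a known prefix, as done here with len(prefix), is less brittle than a hard-coded slice index if the script content ever changes.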

If you want only the left column of the table, use td:first-of-type p a as the CSS selector. It matches links inside the first td element of each row and ignores the second (right) one.
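The effect of restricting the match to the first cell of each row can be sketched on a tiny, made-up table. The example below is an illustration under assumptions: the sample HTML and the FirstColumnLinks class are hypothetical, and it reimplements the "first td per row" idea with the standard library's HTMLParser instead of evaluating the CSS selector itself. It does not handle nested tables.

```python
from html.parser import HTMLParser


class FirstColumnLinks(HTMLParser):
    """Collect hrefs of <a> tags that sit inside the first <td> of each <tr>."""

    def __init__(self):
        super().__init__()
        self.cell_index = 0         # number of <td> tags seen in the current row
        self.in_first_cell = False  # True while inside the first <td> of a row
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = 0
        elif tag == "td":
            self.cell_index += 1
            self.in_first_cell = self.cell_index == 1
        elif tag == "a" and self.in_first_cell:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_first_cell = False


# Hypothetical two-column table with one link per cell.
rows_html = (
    '<table>'
    '<tr><td><p><a href="https://example.com/left-1">L1</a></p></td>'
    '<td><p><a href="https://example.com/right-1">R1</a></p></td></tr>'
    '<tr><td><p><a href="https://example.com/left-2">L2</a></p></td>'
    '<td><p><a href="https://example.com/right-2">R2</a></p></td></tr>'
    '</table>'
)

parser = FirstColumnLinks()
parser.feed(rows_html)
print(parser.hrefs)
```

Only the links from the left-hand cells are collected; the right-hand links are skipped.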

Answered By - Muhteva

This answer was collected from Stack Overflow and tested by AngularFix community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
