Issue
I am unable to get the links from this website: https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/
I use the following code:
import requests
import bs4
req = requests.get('https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/')
soup = bs4.BeautifulSoup(req.text,"lxml")
When I use this code, I am unable to find the source for the links that I want.
soup.body
The code above returns the following output:
<body class="calcite">
<div aria-label="loading" id="base-loader">
<svg class="loader-square loader-square-1" viewbox="0 0 56 56" xmlns="http://www.w3.org/2000/svg">
<rect height="56" width="56"></rect>
</svg>
<svg class="loader-square loader-square-2" viewbox="0 0 56 56" xmlns="http://www.w3.org/2000/svg">
<rect height="56" width="56"></rect>
</svg>
<svg class="loader-square loader-square-3" viewbox="0 0 56 56" xmlns="http://www.w3.org/2000/svg">
<rect height="56" width="56"></rect>
</svg>
<div class="loader-bars"></div>
</div>
<script>
if (typeof customElements !== 'undefined') {
customElements.efineday = customElements.define;
}
</script>
<!-- crossorigin options added because otherwise we cannot see error messages from unhandled errors and rejections -->
<script crossorigin="anonymous" src="https://hubcdn.arcgis.com/opendata-ui/assets/assets/vendor-c4dd10aa0ad3c0cd3c74b496637f5da5.js"></script>
<script crossorigin="anonymous" src="https://hubcdn.arcgis.com/opendata-ui/assets/assets/opendata-ui-01a5d313fd3d0fe1fa17d9270ab5c456.js"></script>
<div id="ember-basic-dropdown-wormhole"></div>
<!-- opendata-ui version: 5.171.0+6ac420726b - Tue, 05 Oct 2021 18:03:32 GMT -->
</body>
I can't seem to find any Drive links in the HTML body above.
I want to get all of the Google Drive links from the hyperlinks in the left column.
Solution
Selenium solution:
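from selenium import webdriver
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
# options.add_argument('--headless')  # uncomment to run without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/")
time.sleep(3)  # give the page's JavaScript time to render the table
# find_elements_by_css_selector was removed in Selenium 4.3; use find_elements with By instead
links = [i.get_attribute('href') for i in driver.find_elements(By.CSS_SELECTOR, "td p a")]
print(links)
driver.quit()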
If you have not used Selenium before, do not forget to install chromedriver.
requests + urllib.parse solution (inspired by the comment of @tromgy):
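import requests
import bs4
from urllib.parse import unquote
req = requests.get('https://riwayat-file-vaksinasi-dki-jakarta-jakartagis.hub.arcgis.com/')
soup = bs4.BeautifulSoup(req.text, "lxml")
# The page content sits percent-encoded inside <script id="site-injection">.
# [23:] drops the first 23 characters of the script text, which precede the encoded payload.
url = unquote(soup.select_one("script#site-injection").string[23:].strip())
# Parse the decoded payload as HTML.
soup = bs4.BeautifulSoup(url, "lxml")
# [2:-2] trims two stray characters from each end of every href in the decoded markup.
links = [i['href'][2:-2] for i in soup.select("td p a")]
print(links)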
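The decoding step can be tried offline. The following is only a sketch: the HTML fragment, the EXAMPLE_ID link, and the HrefCollector helper are made up for illustration and do not come from the real page. It shows that percent-encoded markup hides the anchor tags from an HTML parser until unquote restores them.

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote

# Hypothetical stand-in for the markup that the site stores percent-encoded.
original_html = (
    '<table><tr><td><p>'
    '<a href="https://drive.google.com/file/d/EXAMPLE_ID/view">1 Oktober 2021</a>'
    '</p></td></tr></table>'
)
encoded = quote(original_html)  # '<', '>', '"', ':' and spaces become %XX escapes


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


# Parsing the encoded text finds nothing, because it contains no literal tags.
before = HrefCollector()
before.feed(encoded)

# After unquote the markup is back, and the link can be extracted.
decoded = unquote(encoded)
after = HrefCollector()
after.feed(decoded)

print(before.hrefs)
print(after.hrefs)
```

This is why the answer runs unquote first and only then hands the result to BeautifulSoup a second time.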
If you want only the left column of the table, use td:first-of-type p a as the CSS selector. It matches the first td element in each row and ignores the second (right) one.
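To see how the two selectors differ, here is a small sketch on a made-up two-column table. The example.com URLs and the table layout are assumptions for illustration, not the real page content, and html.parser is used only so the sketch does not depend on lxml.

```python
import bs4

# Hypothetical table: each row has a left and a right cell, each holding one link.
sample_html = """
<table>
  <tr>
    <td><p><a href="https://example.com/left-1">left 1</a></p></td>
    <td><p><a href="https://example.com/right-1">right 1</a></p></td>
  </tr>
  <tr>
    <td><p><a href="https://example.com/left-2">left 2</a></p></td>
    <td><p><a href="https://example.com/right-2">right 2</a></p></td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# "td p a" matches links in every cell.
all_links = [a["href"] for a in soup.select("td p a")]

# "td:first-of-type p a" matches only links inside the first td of each row.
left_links = [a["href"] for a in soup.select("td:first-of-type p a")]

print(all_links)
print(left_links)
```

On this sample, the first selector returns four links and the second returns only the two left-column ones.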
Answered By - Muhteva