Thursday

Using Scrapy, when I try to scrape two almost identical pages, the CSS selector works for one but not the other

4:20 AM css, python, scrapy, web-scraping

Issue

The selector is:

response.css('div.dp-conteudo__esquerda span.varpb').extract_first()

Pages are this and that.

The first one returns the correct span, but the second one returns nothing.

The only relevant difference I can see is that, on the second page, a span with the varpb class appears earlier, in a different part of the code, but it is otherwise the same as the one I wanted. On line 581 of the second page's source code:

...
<a class="--link" href="/putear"><span class="varpt">putear</span><span class="varpb">putear</span></a><span class="mx-2" style="color:#888888;">]</span></item> ou
...

Even if it didn't get the "correct" span, shouldn't it get this one? Am I missing something?

To make this clear: I don't care about making it work by other means (such as changing the selector). I want to understand why it doesn't work, please.

If any Portuguese-speaking person wonders why I was scraping these words, it was for a game of Scrabble.

EDIT:

Thanks to Alexander's answer and this other question, I realized Scrapy isn't parsing the HTML I expected, but rather an "incomplete" version of it. I can't confirm this, since I haven't verified it myself (due to my poor understanding of the code; sorry), but apparently it is because the page makes Ajax calls.

Solution

For the first page --> this. You get the result <span class="varpb">putear</span> which I believe is what you are expecting to get from the example in your question. This is the only result because it's the only span le

For the second page --> that. You get nothing because there are no span elements with a class of varpb that are descendants of a div element with a class of dp-conteudo__esquerda. Elements with each of those classes exist, but they don't exist in the same branch of the element tree.
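To illustrate the descendant rule described above, here is a minimal sketch using only Python's standard library. The markup is hypothetical and heavily simplified (it is not the real pages, and the class name "outro" is invented), and the ElementTree path is only a rough stand-in for the CSS selector, since it matches the class attribute as an exact string rather than as a class list.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the span sits inside the target div.
PAGE_ONE = """
<html><body>
  <div class="dp-conteudo__esquerda">
    <p><span class="varpb">putear</span></p>
  </div>
</body></html>
"""

# Hypothetical markup: both classes exist, but the span lives in a
# different branch of the tree than the target div.
PAGE_TWO = """
<html><body>
  <div class="dp-conteudo__esquerda"><p>no matching span here</p></div>
  <div class="outro">
    <a class="--link" href="/putear"><span class="varpb">putear</span></a>
  </div>
</body></html>
"""

# Rough equivalent of 'div.dp-conteudo__esquerda span.varpb':
# a span that is a descendant (at any depth) of the target div.
PATH = ".//div[@class='dp-conteudo__esquerda']//span[@class='varpb']"


def first_match(markup):
    """Return the text of the first matching span, or None if none matches."""
    root = ET.fromstring(markup)
    found = root.find(PATH)
    return None if found is None else found.text


print(first_match(PAGE_ONE))
print(first_match(PAGE_TWO))
```

In this sketch the second call finds nothing, even though a span with class varpb exists in the document, because that span is not nested under the div named in the selector.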

Edit

It appears the issue is that you are not looking at the actual source HTML for the pages you are referring to.

One way to be sure that you are looking at the HTML content that is being parsed by Scrapy is to save response.text to a local HTML file and inspect it yourself. This ensures that you are seeing the same HTML that Scrapy sees.

Alternatively you can use view(response) in scrapy shell if you want to view the html seen by scrapy in your browser.

view(response)

Here is an example of how you might save the response to a file with scrapy shell:

In [2]: fetch('https://dicionario.priberam.org/putear')
2023-12-28 00:22:01 [scrapy.core.engine] INFO: Spider opened
2023-12-28 00:22:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dicionario.priberam.org/putear> (referer: None)

In [3]: with open('putear.html', 'wt', encoding='utf8') as fd:
   ...:     fd.write(response.text)
   ...:

In [4]: fetch('https://dicionario.priberam.org/puteares')
2023-12-28 00:23:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dicionario.priberam.org/puteares> (referer: None)

In [5]: with open('puteares.html', 'wt', encoding='utf8') as fd:
   ...:     fd.write(response.text)
   ...:
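Once the files are saved, a quick check like the following can show whether the class names from the selector appear at all in the HTML that Scrapy received. This is only a sketch: the helper name class_names_present is made up for illustration, and it does a crude substring search on the raw text rather than real HTML parsing, so it can only tell you that a name is absent or present somewhere, not where it sits in the tree.

```python
from pathlib import Path


def class_names_present(html_path, class_names):
    """Map each class name to whether it occurs literally in the saved file.

    This is a plain substring test on the file's text, not an HTML parse.
    """
    text = Path(html_path).read_text(encoding="utf8")
    return {name: name in text for name in class_names}
```

For example, calling class_names_present('puteares.html', ['dp-conteudo__esquerda', 'varpb']) on the file saved above would report whether each name occurs in the response body. If a name shows up in the browser's inspector but not here, the element was probably added after the initial page load.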

Answered By - Alexander

This answer was collected from Stack Overflow and tested by AngularFix community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
