Issue
The selector is:
response.css('div.dp-conteudo__esquerda span.varpb').extract_first()
On the first page, the selector returns the correct span as expected, but on the second page it returns nothing.
The only relevant difference I can see is that, on the second page, a span with the varpb class appears earlier, in a different part of the code, but its content is the same as the one I wanted. In line 581 of the second page's source code:
...
<a class="--link" href="/putear"><span class="varpt">putear</span><span class="varpb">putear</span></a><span class="mx-2" style="color:#888888;">]</span></item> ou
...
Even if it didn't get the "correct" span, shouldn't it get this one? Am I missing something?
To be clear: I don't care about making it work by other means (such as changing the selector); I want to understand why it doesn't work, please.
If any Portuguese-speaking person wonders why I was scraping these words, it was for a game of Scrabble.
EDIT:
Thanks to Alexander's answer and this other question, I realized Scrapy isn't scraping the HTML code I expected, but rather an "incomplete" version of it. I can't confirm this, since I haven't seen it with my own eyes (due to my poor understanding of the code; sorry), but apparently it is caused by Ajax calls made by the page.
Solution
For the first page, you get the result <span class="varpb">putear</span>, which I believe is what you are expecting to get from the example in your question. This is the only result because it's the only span element that matches the selector.
For the second page, you get nothing because there are no span elements with a class of varpb that are descendants of a div element with a class of dp-conteudo__esquerda. Elements with each of those classes exist, but not in the same branch of the element tree.
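The descendant rule can be illustrated with a minimal sketch that uses only Python's standard library. The two markup strings are invented for illustration and are not the real pages from the question. The XPath expression is only a rough stand-in for the CSS selector: unlike CSS class matching, [@class='...'] here requires the class attribute to be exactly that value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; NOT the real pages from the question.
NESTED = """
<html><body>
  <div class="dp-conteudo__esquerda">
    <a href="/putear"><span class="varpt">putear</span><span class="varpb">putear</span></a>
  </div>
</body></html>
"""

SEPARATE_BRANCHES = """
<html><body>
  <div class="dp-conteudo__esquerda"><p>no matching span in here</p></div>
  <div class="other"><span class="varpb">putear</span></div>
</body></html>
"""

# Rough XPath equivalent of the CSS selector
# 'div.dp-conteudo__esquerda span.varpb' (descendant combinator).
PATH = ".//div[@class='dp-conteudo__esquerda']//span[@class='varpb']"


def matching_spans(markup):
    """Return the text of every span.varpb nested inside the target div."""
    root = ET.fromstring(markup.strip())
    return [span.text for span in root.findall(PATH)]


print(matching_spans(NESTED))             # ['putear']
print(matching_spans(SEPARATE_BRANCHES))  # []
```

In the second string both classes are present, yet nothing matches, because the span is not inside the div that carries dp-conteudo__esquerda.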
Edit
It appears the issue is that you are not looking at the actual source HTML for the pages you are referring to.
One way to be sure that you are actually looking at the HTML content that is being parsed by Scrapy is to save the response.text to an HTML file locally and inspect it yourself. This ensures that you are seeing the same HTML that Scrapy sees.
Alternatively, you can use view(response) in the Scrapy shell if you want to view the HTML seen by Scrapy in your browser.
view(response)
Here is an example of how you might save each response to a file from the Scrapy shell:
In [2]: fetch('https://dicionario.priberam.org/putear')
2023-12-28 00:22:01 [scrapy.core.engine] INFO: Spider opened
2023-12-28 00:22:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dicionario.priberam.org/putear> (referer: None)
In [3]: with open('putear.html', 'wt', encoding='utf8') as fd:
   ...:     fd.write(response.text)
   ...:
In [4]: fetch('https://dicionario.priberam.org/puteares')
2023-12-28 00:23:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dicionario.priberam.org/puteares> (referer: None)
In [5]: with open('puteares.html', 'wt', encoding='utf8') as fd:
   ...:     fd.write(response.text)
   ...:
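Once the files are saved, one way to check whether the expected elements are present in the HTML that Scrapy received is to count the class names that appear in it. The following is only a sketch using the standard library; ClassCounter and count_classes are hypothetical helper names, and the sample string stands in for the contents of a saved file.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how many start tags carry each (tag, class name) pair."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                for cls in value.split():
                    self.counts[(tag, cls)] += 1


def count_classes(html_text):
    """Parse html_text and return a Counter keyed by (tag, class name)."""
    parser = ClassCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical stand-in for the contents of a saved file such as putear.html.
sample = '<div class="dp-conteudo__esquerda x"><span class="varpb">putear</span></div>'
counts = count_classes(sample)
print(counts[("span", "varpb")])                 # 1
print(counts[("div", "dp-conteudo__esquerda")])  # 1
print(counts[("span", "varpt")])                 # 0
```

To run it on a saved page, pass the file's text instead of the sample, for example count_classes(open('puteares.html', encoding='utf8').read()). A count of zero suggests the element is missing from the HTML Scrapy downloaded, even if it is visible in the browser.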
Answered By - Alexander