Issue
I want to import a transcript from a website, but analyse data from only half of it. I have fetched the page from its URL, and I want to count the total number of unique words in the text, but only from the line of the transcript “The Rental of the Manor of Mayfield, 1545”. Does anyone know what code I can use to do this? I can't work out how to count words from a URL when only a certain part of the page should be included. So far I have written:
import requests
source = 'http://www.myjacobfamily.com/historical%20manuscripts/mayfield%201.htm'
r = requests.get(source)
print(r.text)
Solution
I've included code below of what I think you were looking for.
import requests
import bs4

# Download the page and parse its HTML
response = requests.get('http://www.myjacobfamily.com/historical%20manuscripts/mayfield%201.htm')
soup = bs4.BeautifulSoup(response.text, 'html.parser')
lines = soup.find_all('p')  # every <p> element on the page

story = []
record = False  # becomes True once the story's title has been seen
for line in lines:
    if "The Rental of the Manor of Mayfield, 1545." in line.text:
        story.append(line.text)
        record = True
        continue
    if record and "---" not in line.text:
        story.append(line.text)
    elif record and "---" in line.text:
        break  # a line of dashes marks the end of the story
print(story)
In this code I extract a single story from the link you posted (perhaps that is what "half of it" means?) by using the BeautifulSoup module to parse all the text between <p> and </p> tags. You can inspect this markup with the developer tools in your browser. Once all the paragraphs have been loaded into lines, the code iterates through them and doesn't start recording until "The Rental of the Manor of Mayfield, 1545." is encountered. From that point it grabs every line until one containing "---" is reached (which seems to be how the site delineates stories). It then breaks out of the loop and prints the story. You can concatenate this list into a single string with:
"".join(story)
I think it would be tremendously easier to copy the story you want into a text document and then process that text document with something like Python. Web scraping would definitely not be my first choice to solve this problem.
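Whichever way you get the text, the unique-word count the question asks for could then be sketched roughly as below. This is only an illustration under some assumptions: the helper names unique_words and unique_words_in_file are made up for this example, a "word" is taken to be a run of letters (so "1545" is not counted), matching is case-insensitive, and the sample list stands in for the real scraped story. Note that the paragraphs are joined with a space rather than an empty string, so that words at paragraph boundaries do not run together.

```python
import re
from pathlib import Path


def unique_words(text):
    """Return the set of distinct lowercase words in a string."""
    # Assumption: a "word" is a run of letters, optionally with internal
    # apostrophes; digits such as "1545" are ignored.
    return set(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


def unique_words_in_file(path):
    """Same result for a story pasted into a UTF-8 text file."""
    return unique_words(Path(path).read_text(encoding="utf-8"))


# Hypothetical sample standing in for the scraped `story` list.
story = ["The Rental of the Manor of Mayfield, 1545.", "The manor's rental."]

# Join with a space so words at paragraph boundaries do not fuse together.
words = unique_words(" ".join(story))
print(len(words))
```

If you go the copy-and-paste route instead, unique_words_in_file can be pointed at whatever file you saved the story to. A different definition of "word" (for example, keeping numbers or hyphenated terms) only needs a change to the regular expression.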
Answered By - Reedinationer