Scraping data from multiple specified web pages with Python: How do I use Python and BeautifulSoup4 to loop over and scrape data from multiple pages of a website...

This post describes a developer's problem scraping golf course information from the PGA.com website with Python and BeautifulSoup4. The current script only captures data from a single page, including the course name, address, ownership, website, and phone number. To get details for all 18,000 courses, roughly 900 pages need to be traversed. The solution is to loop over the pages from 1 to 907 (including the first and last page), updating the page-number parameter in the URL on each iteration so that every page is scraped.


I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and keep a local copy on my computer.

I used Python and Beautiful Soup 4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now having a problem scraping data from multiple pages on the PGA website. I want to extract ALL THE GOLF COURSES, but my script is limited to one page; I want to loop it in a way that captures all golf course data from every page found on the PGA site. There are about 18,000 golf courses and 900 pages of data to capture.

Attached below is my script. I need help creating code that will capture ALL the data from the PGA website, not just one page but all of them. That way it will give me all the data on golf courses in the United States.

Here is my script:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

courses_list = []

for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    try:
        website = item.contents[1].find_all("div", {"class": "views-field-website"})[0].text
    except:
        website = ''
    try:
        Phonenumber = item.contents[1].find_all("div", {"class": "views-field-work-phone"})[0].text
    except:
        Phonenumber = ''

    course = [name, address1, address2, website, Phonenumber]
    courses_list.append(course)

with open('filename5.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

#for item in g_data1:
#    try:
#        print item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text
#    except:
#        pass
#    try:
#        print item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
#    except:
#        pass

#for item in g_data2:
#    try:
#        print item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
#    except:
#        pass
#    try:
#        print item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
#    except:
#        pass
#    try:
#        print item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
#    except:
#        pass

This script only captures 20 courses at a time, and I want to capture everything in one script, which accounts for 18,000 golf courses across 900 pages.

Solution

The PGA website's search has multiple pages, and the URL follows this pattern:

http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here

This means you can read the content of one page, then increase the value of page by 1, read the next page, and so on.

import csv
import requests
from bs4 import BeautifulSoup

for i in range(907):  # Number of pages plus one
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    # Your code for each individual page here
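Putting the two pieces together, below is a minimal sketch of what the full loop might look like, assuming the CSS class names from the question's script still match the site's markup. It is written for Python 3, so the CSV file is opened in text mode with newline='' instead of 'wb', and an explicit "html.parser" is passed to BeautifulSoup. The get_text helper is a small hypothetical function that consolidates the repeated try/except blocks from the original script, and "courses.csv" is an arbitrary output filename.

import csv
import requests
from bs4 import BeautifulSoup

# Base search URL from the question; the page parameter is filled in per iteration.
BASE_URL = ("http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name"
            "&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0")

def get_text(item, css_class):
    """Return the text of the first div with the given class, or '' if it is missing."""
    try:
        return item.contents[1].find_all("div", {"class": css_class})[0].text.strip()
    except (IndexError, AttributeError):
        return ''

courses_list = []

for i in range(907):  # pages assumed to be numbered 0..906, matching the answer's range(907)
    url = BASE_URL.format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # Same per-page parsing as the original script, applied to every page.
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        courses_list.append([
            get_text(item, "views-field-title"),
            get_text(item, "views-field-address"),
            get_text(item, "views-field-city-state-zip"),
            get_text(item, "views-field-website"),
            get_text(item, "views-field-work-phone"),
        ])

# Python 3: open in text mode with newline='' rather than 'wb'.
with open("courses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(courses_list)

The only structural change from the question's script is that the per-item extraction runs inside the page loop and the CSV is written once at the end, after all pages have been read.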
