Scraping a paginated web table with Python pandas & Beautiful Soup

I am a beginner with Python and pandas. I am trying to scrape a paginated table using the Beautiful Soup package. The data is scraped, but the content of each cell comes out on a single row, so I cannot get a coherent CSV file.

Here is my code:

import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

file = open(os.path.expanduser("sites_commerciaux.csv"), "wb")

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

headers = "Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact"
file.write(bytes(headers, encoding='ascii', errors='ignore'))

save = ""
for num in range(0, 22):
    soup = make_soup("http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num))
    for rec in soup.findAll('tr'):
        saverec = ""
        for data in rec.findAll('td'):
            saverec = saverec + "," + data.text
            if len(saverec) != 0:
                save = save + "\n" + saverec[1:]

file.write(bytes(save, encoding='ascii', errors='ignore'))

Can anyone help me fix it, please?

Solution

I did some cleaning. First, why the bytes type? You're writing text. And why ASCII? Please use Unicode; if later in your code you really need ASCII, encode to ASCII at that point. The findAll method is deprecated, so use find_all instead. You also had a possible issue with commas inside the surface values. Finally, always use context managers where possible (here: when working with files).

As for your actual question, you had two problems:

1. Your test if len(saverec) != 0: was inside the inner for-loop, generating lots of useless data.
2. You were not stripping the data of its unneeded whitespace.

import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

save = ""
for num in range(0, 22):
    soup = make_soup("http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num))
    for rec in soup.find_all('tr'):
        saverec = ""
        for data in rec.find_all('td'):
            data = data.text.strip()          # drop the surrounding whitespace
            if "," in data:
                data = data.replace(",", "")  # commas would break the CSV columns
            saverec = saverec + "," + data
        if len(saverec) != 0:                 # now outside the cell loop: one line per <tr>
            save = save + "\n" + saverec[1:]
    print('#%d done' % num)

headers = "Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact"

with open(os.path.expanduser("sites_commerciaux.csv"), "w") as csv_file:
    csv_file.write(headers)
    csv_file.write(save)

This outputs, for the first page:

Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact

ALCORCÓN,ALCORCÓN - MADRID,Ouvert,4298 m²,40,José Carlos GARCIA

Alegro Alfragide,CARNAXIDE,Ouvert,11461 m²,122,

Alegro Castelo Branco,CASTELO BRANCO,Ouvert,6830 m²,55,

Alegro Setúbal,Setúbal,Ouvert,27000 m²,114,

Ancona,Ancona,Ouvert,7644 m²,41,Ettore PAPPONETTI

Angoulême La Couronne,LA COURONNE,Ouvert,6141 m²,45,Juliette GALLOUEDEC

Annecy Grand Epagny,EPAGNY,Ouvert,20808 m²,61,Delphine BENISTY

Anping,Tainan,Ouvert,969 m²,21,Roman LEE

АКВАРЕЛЬ,Volgograd,Ouvert,94025 m²,182,Viktoria ZAITSEVA

Arras,ARRAS,Ouvert,4000 m²,26,Anais NIZON
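
A side note on the comma handling: rather than stripping commas out of the cell text, Python's standard csv module will quote any field that contains one. Below is a minimal sketch of the same scrape writing through csv.writer; it reuses make_soup from above and assumes the same <tr>/<td> page structure.

import csv
import os

rows = []
for num in range(0, 22):
    soup = make_soup("http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num))
    for rec in soup.find_all('tr'):
        cells = [td.text.strip() for td in rec.find_all('td')]
        if cells:  # header rows use <th>, so they yield an empty list and are skipped
            rows.append(cells)

with open(os.path.expanduser("sites_commerciaux.csv"), "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Nom_commercial_du_Site", "Ville", "Etat",
                     "Surface_GLA", "Nombre_de_boutique", "Contact"])
    writer.writerows(rows)  # fields containing commas are quoted automatically

Opening the file with encoding="utf-8" also keeps names like АКВАРЕЛЬ and ALCORCÓN intact, instead of silently dropping characters as errors='ignore' does.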

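Finally, since the question mentions pandas: when a site serves plain <table> markup, pandas.read_html can replace the manual row loop entirely. A sketch, assuming each page exposes exactly one table and that a parser pandas supports (such as lxml) is installed:

import pandas as pd

frames = []
for num in range(0, 22):
    url = "http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num)
    frames.append(pd.read_html(url)[0])  # read_html returns one DataFrame per <table>

result = pd.concat(frames, ignore_index=True)
result.to_csv("sites_commerciaux.csv", index=False)

Whether this works here depends on how the Ceetrus page renders its table; if the rows are generated by JavaScript, read_html will see nothing and the BeautifulSoup approach above remains the way to go.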