Scraping a paginated web table with Python pandas & Beautiful Soup

I am a beginner with Python and pandas. I am trying to scrape a paginated table using the Beautiful Soup package. The data is scraped, but the content of each cell comes out on a single row, so I cannot get a coherent CSV file.

Here is my code:

import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

file = open(os.path.expanduser("sites_commerciaux.csv"), "wb")

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

headers = "Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact"
file.write(bytes(headers, encoding='ascii', errors='ignore'))

save = ""
for num in range(0, 22):
    soup = make_soup("http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num))
    for rec in soup.findAll('tr'):
        saverec = ""
        for data in rec.findAll('td'):
            saverec = saverec + "," + data.text
            if len(saverec) != 0:
                save = save + "\n" + saverec[1:]

file.write(bytes(save, encoding='ascii', errors='ignore'))

Can anyone help me fix it, please?

Solution

I did some cleaning. First, why the bytes type? You're writing text. And why ASCII? Please use Unicode; if later in your code you really need ASCII, encode to ASCII at that point. The findAll method is deprecated, so use find_all instead. You also had a possible issue with commas inside the surface values. Finally, always use context managers where possible (here: when working with files).

As for your actual question, you had two problems:

1. Your test if len(saverec) != 0: was inside the inner for-loop, generating lots of useless data.
2. You were not stripping the data of its unneeded whitespace.

import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

save = ""
for num in range(0, 22):
    soup = make_soup("http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num))
    for rec in soup.find_all('tr'):
        saverec = ""
        for data in rec.find_all('td'):
            data = data.text.strip()          # drop the surrounding whitespace
            if "," in data:
                data = data.replace(",", "")  # commas would break the CSV columns
            saverec = saverec + "," + data
        if len(saverec) != 0:                 # now outside the cell loop: one line per <tr>
            save = save + "\n" + saverec[1:]
    print('#%d done' % num)

headers = "Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact"

with open(os.path.expanduser("sites_commerciaux.csv"), "w") as csv_file:
    csv_file.write(headers)
    csv_file.write(save)

This outputs, for the first page:

Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact

ALCORCÓN,ALCORCÓN - MADRID,Ouvert,4298 m²,40,José Carlos GARCIA

Alegro Alfragide,CARNAXIDE,Ouvert,11461 m²,122,

Alegro Castelo Branco,CASTELO BRANCO,Ouvert,6830 m²,55,

Alegro Setúbal,Setúbal,Ouvert,27000 m²,114,

Ancona,Ancona,Ouvert,7644 m²,41,Ettore PAPPONETTI

Angoulême La Couronne,LA COURONNE,Ouvert,6141 m²,45,Juliette GALLOUEDEC

Annecy Grand Epagny,EPAGNY,Ouvert,20808 m²,61,Delphine BENISTY

Anping,Tainan,Ouvert,969 m²,21,Roman LEE

АКВАРЕЛЬ,Volgograd,Ouvert,94025 m²,182,Viktoria ZAITSEVA

Arras,ARRAS,Ouvert,4000 m²,26,Anais NIZON
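
A side note on the comma handling: rather than stripping commas out of the cell text, Python's standard csv module will quote any field that contains one. Below is a minimal sketch of the same scrape writing through csv.writer; it reuses make_soup from above and assumes the same <tr>/<td> page structure.

import csv
import os

rows = []
for num in range(0, 22):
    soup = make_soup("http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num))
    for rec in soup.find_all('tr'):
        cells = [td.text.strip() for td in rec.find_all('td')]
        if cells:  # header rows use <th>, so they yield an empty list and are skipped
            rows.append(cells)

with open(os.path.expanduser("sites_commerciaux.csv"), "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Nom_commercial_du_Site", "Ville", "Etat",
                     "Surface_GLA", "Nombre_de_boutique", "Contact"])
    writer.writerows(rows)  # fields containing commas are quoted automatically

Opening the file with encoding="utf-8" also keeps names like АКВАРЕЛЬ and ALCORCÓN intact, instead of silently dropping characters as errors='ignore' does.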

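Finally, since the question mentions pandas: when a site serves plain <table> markup, pandas.read_html can replace the manual row loop entirely. A sketch, assuming each page exposes exactly one table and that a parser pandas supports (such as lxml) is installed:

import pandas as pd

frames = []
for num in range(0, 22):
    url = "http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num)
    frames.append(pd.read_html(url)[0])  # read_html returns one DataFrame per <table>

result = pd.concat(frames, ignore_index=True)
result.to_csv("sites_commerciaux.csv", index=False)

Whether this works here depends on how the Ceetrus page renders its table; if the rows are generated by JavaScript, read_html will see nothing and the BeautifulSoup approach above remains the way to go.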