Building a web scraper with PyCharm + Python

Python 3.5 + the bs4 scraping module

Installation steps:
First, open PyCharm's Project Interpreter page: Settings (Ctrl+Alt+S) -> Project Interpreter. It is listed under the entry for your specific project.

Click "+" and type beautifulsoup to find the package you want to install.

For Python 3, install bs4; for Python 2, install BeautifulSoup.
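
Once the install finishes, one quick way to confirm the packages landed in the project interpreter is to ask Python for their versions. This is only a minimal sketch: it assumes Python 3.8 or later, because importlib.metadata is not available on the 3.5 used in this post, and installed_version is a hypothetical helper name. beautifulsoup4 is the distribution that the bs4 package pulls in.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI
for name in ("beautifulsoup4", "lxml"):
    version = installed_version(name)
    print(name, "->", version if version else "not installed in this interpreter")
```

If either line reports "not installed", the package probably went into a different interpreter than the one the project uses.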

Scraper code

from bs4 import BeautifulSoup

# Backslashes in a Windows path must be doubled inside a normal string literal
with open('D:\\PycharmProjects\\web_parse\\the_blah.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'lxml')
    images = Soup.select('body > div.main-content > ul > li > img')
    print(Soup)
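
The note about double backslashes deserves a closer look, because a single backslash can silently change the path. The sketch below compares three ways of writing the same Windows path and shows what goes wrong with an unescaped one. The variable names are only for illustration, and PureWindowsPath is used so the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Three spellings of the same path
escaped = 'D:\\PycharmProjects\\web_parse\\the_blah.html'  # every backslash doubled
raw = r'D:\PycharmProjects\web_parse\the_blah.html'        # raw string, no escaping
forward = 'D:/PycharmProjects/web_parse/the_blah.html'     # forward slashes

# The escaped and raw literals are the identical string
print(escaped == raw)

# Windows treats both separators alike, so the parsed paths compare equal
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)
print(same_path)

# Without escaping, "\n" is read as a newline character
broken = 'web\new_index.html'
print('\n' in broken)
```

A raw string is usually the least error-prone choice when pasting a path copied from Explorer.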

With Python 3.5 this fails with: bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Fix: install lxml.

Reinstalling lxml and bs4 did not help; the error persisted.
Update:
Solved. The cause was a missing libxslt.
brew install libxslt
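
If lxml cannot be installed right away, Beautiful Soup also accepts 'html.parser', the parser that ships with Python, so the script does not have to fail with FeatureNotFound. A minimal sketch of choosing the parser at run time follows; pick_parser is a hypothetical helper, and it only checks whether a module of that name can be found, not whether it loads cleanly.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if a module of that name can be found, else `fallback`."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


parser = pick_parser()
print("Using parser:", parser)
# The result can then be passed along, e.g. BeautifulSoup(wb_data, parser)
```

The two parsers can build slightly different trees for malformed HTML, so selectors are worth re-checking after a switch.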

web_parse.py

from bs4 import BeautifulSoup

# Raw string, so that "\n" in "\new_index.html" is not read as a newline
with open(r'D:\PycharmProjects\code_of_video1_2\web\new_index.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'lxml')
    images = Soup.select('body > div.main-content > ul > li > img')
    titles = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    descs = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')
    cates = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
    rates = Soup.select('body > div.main-content > ul > li > div.rate > span')
    # print(images, titles, descs, cates, rates, sep='\n------------\n')
    # print(cates)

info = []  # one dict per article
for title, image, desc, rate, cate in zip(titles, images, descs, rates, cates):
    data = {
        'title': title.get_text(),
        'rate': rate.get_text(),
        'desc': desc.get_text(),
        'cate': list(cate.stripped_strings),  # collect the tag strings into a list
        'image': image.get('src'),
    }
    info.append(data)

# Keep the entries rated above 3 and print their 'title' and 'cate'
for i in info:
    if float(i['rate']) > 3:
        print(i['title'], i['cate'])

# Selector notes for the first list item (CSS selector, then XPath):
# body > div.main-content > ul > li:nth-child(1) > img
# /html/body/div[2]/ul/li[1]/img
# /html/body/div[2]/ul/li[1]/div[1]/h3/a
# body > div.main-content > ul > li:nth-child(1) > div.article-info > h3 > a
# print(Soup)
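
One thing to watch in the script above is that zip stops at the shortest input. If a single article lacks, say, a rating, the five lists no longer line up and the mismatch passes silently. The sketch below shows the effect on plain lists and adds a length check before zipping. zip_checked is a hypothetical helper and the sample values are made up for illustration.

```python
def zip_checked(**columns):
    """Zip named lists into dicts, refusing lists of unequal length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("selector results differ in length: %r" % lengths)
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Made-up sample data: three titles but only two ratings
titles = ['a', 'b', 'c']
rates = ['4.5', '3.0']

# Plain zip drops the unmatched title without any warning
print(len(list(zip(titles, rates))))

# The checked version reports the mismatch instead
try:
    zip_checked(title=titles, rate=rates)
except ValueError as err:
    print(err)
```

With Beautiful Soup, another option is to select each li first and then look up the image, title, and rating inside it, which keeps the fields of one article together.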

Output for the entries rated above 3, showing 'title' and 'cate':

Sardinia's top 10 beaches ['fun', 'Wow']
How to get tanned ['butt', 'NSFW']
How to be an Aussie beach bum ['sea']
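
The filtering step can also be pulled into a small function that tolerates ratings which are missing or not numeric, since float() raises ValueError on such text. This is a sketch under assumptions: high_rated is a hypothetical helper, and the records below are made-up samples (only the first title and its categories come from the output above; every rating is invented).

```python
def high_rated(records, threshold=3.0):
    """Return (title, cate) pairs whose numeric rating exceeds `threshold`."""
    selected = []
    for record in records:
        try:
            rate = float(record['rate'])
        except (KeyError, ValueError):
            continue  # skip records with a missing or non-numeric rating
        if rate > threshold:
            selected.append((record['title'], record['cate']))
    return selected


# Made-up sample records; the ratings are invented for illustration
sample = [
    {'title': "Sardinia's top 10 beaches", 'rate': '4.5', 'cate': ['fun', 'Wow']},
    {'title': 'Hypothetical low-rated post', 'rate': '2.0', 'cate': ['sea']},
    {'title': 'Hypothetical unrated post', 'rate': 'n/a', 'cate': []},
]

for title, cate in high_rated(sample):
    print(title, cate)
```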
