Simple self-service scraping of a little information with BeautifulSoup

Latest recommended article published 2024-09-20 10:09:44

irisat163

Tags: web crawler

Article link: https://blog.csdn.net/irisat163/article/details/52390241

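#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from bs4 import BeautifulSoup

# Opened in append mode; the batch version further down writes to it.
output_file = open('qczj_brand_changshang.txt', 'a')

url_test = 'http://car.autohome.com.cn/price/series-135.html'

# Fetch the page; urlopen returns a file-like response object.
page_test = urllib2.urlopen(url_test)

# Parse the response with Python's built-in html.parser.
test_soup = BeautifulSoup(page_test, 'html.parser')

# Print every <span class="font-arial"> element; these hold the price range.
for item in test_soup.findAll('span', class_="font-arial"):
    print str(item)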

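If you need to crawl data at scale and store it in a database, you still need a framework.

For ad-hoc collection of a little information, urllib2 and BeautifulSoup are enough.

For example, I needed the price ranges of a few car series on short notice, saved to output_file.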

Put the link in url_test. To scrape prices for a whole set of series, build a list of series codes and loop over it; every page URL has the form 'http://car.autohome.com.cn/price/series-<series code>.html'.
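The URL pattern just described can be sketched as a small helper. This is only an illustration in Python 3 syntax: the names `SERIES_URL_TEMPLATE`, `build_series_url`, and `series_codes` are hypothetical, and the list holds only codes that appear in this post.

```python
# Hypothetical helper: build one price-page URL per series code.
SERIES_URL_TEMPLATE = 'http://car.autohome.com.cn/price/series-{code}.html'


def build_series_url(code):
    """Return the price-page URL for a single series code."""
    return SERIES_URL_TEMPLATE.format(code=code)


# Codes taken from this post; a real run would list whatever series are needed.
series_codes = ['135', '3999', '2593', '1940']
series_urls = [build_series_url(code) for code in series_codes]
```

Keeping the template in one place means a change in the site's URL layout only has to be fixed once.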

Then urllib2.urlopen turns the URL string into a page instance, a file-like response object for that page.
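As a rough sketch of this step in Python 3, where `urllib.request` takes over the role of Python 2's urllib2: the helper name `fetch_page` and the 10-second default timeout are assumptions made for illustration, not part of the original script.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Open the URL and return the raw response body as bytes."""
    # urlopen returns a file-like response object; the with block
    # closes it once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

The returned bytes can be handed to BeautifulSoup in the same way as the response object itself.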

Next, BeautifulSoup builds a <class 'bs4.BeautifulSoup'> object: it takes the page instance just created, page_test, and parses the page content with the html.parser parser. The parsed result is named test_soup here.

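On the page http://car.autohome.com.cn/price/series-135.html, inspecting the page elements shows that the price information for a series sits in a span node with class_="font-arial", as shown in the figure below.

[Figure: the price span in the browser's element inspector]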

So test_soup.findAll('span', class_="font-arial") returns a 'bs4.element.ResultSet', that is, a result set, and we only need to print its elements.

The set contains <span class="font-arial">12.99-16.99万</span>.
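To see what the lookup returns without touching the network, the following sketch parses an inline HTML snippet modeled on the span shown above. It is written for Python 3 and uses `find_all`, the current bs4 spelling of findAll; the surrounding `div`, the second span, and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page; only the first span mirrors the real markup.
sample_html = (
    '<div><span class="font-arial">12.99-16.99万</span>'
    '<span>other</span></div>'
)

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns a bs4.element.ResultSet of the matching Tag objects.
price_spans = sample_soup.find_all('span', class_='font-arial')

for item in price_spans:
    print(str(item))        # the whole tag, markup included
    print(item.get_text())  # only the text inside the tag
```

When only the text is wanted, `get_text()` on each tag is usually enough.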

When necessary, a regular expression can pull the '12.99-16.99万' out of it.
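A minimal sketch of that regular-expression step, using only the standard `re` module on the string form of the tag. The pattern and the names `PRICE_PATTERN` and `extract_price` are illustrative assumptions; they presume the tag is serialized exactly as shown above.

```python
import re

# Capture everything between the opening span tag and the next '<'.
PRICE_PATTERN = re.compile(r'^<span class="font-arial">(?P<price>[^<]*)<')


def extract_price(tag_text):
    """Return the captured price text, or None if the tag does not match."""
    found = PRICE_PATTERN.match(tag_text)
    return found.group('price') if found else None


price = extract_price('<span class="font-arial">12.99-16.99万</span>')
```

Returning None for a non-matching tag lets the caller skip unexpected markup instead of crashing on it.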

With support for a batch list of series added, the program becomes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
import urllib2
from bs4 import BeautifulSoup

output_file = open('qczj_brand_changshang.txt', 'a')

# Series codes to fetch; the original post elides the codes between "2593" and "1940".
chexi_list = ["3999", "2593", "1940"]

# Compile once: capture the text between the opening span tag and the next '<'.
test_pattern = re.compile('^<span class="font-arial">(?P<price>[^<]*)<')

for each_chexi in chexi_list:
    url_test = 'http://car.autohome.com.cn/price/series-' + str(each_chexi) + '.html'
    page_test = urllib2.urlopen(url_test)
    test_soup = BeautifulSoup(page_test, 'html.parser')

    for item in test_soup.findAll('span', class_="font-arial"):
        test_price_str = test_pattern.match(str(item))
        if test_price_str is None:
            # Skip spans whose markup does not fit the expected pattern.
            continue
        price_matched = test_price_str.group('price')
        # One tab-separated line per match: series code, then price range.
        output_file.write(str(each_chexi) + '\t' + price_matched + '\n')

output_file.close()

This way the prices for a whole batch of series are saved to a single file.
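The file-writing step on its own can be sketched as follows in Python 3. The helper `append_prices` is hypothetical: it takes (series code, price) pairs and writes the same tab-separated layout as the script above, using a `with` block so the file is closed even if a write fails. UTF-8 is an assumed encoding.

```python
def append_prices(path, rows):
    """Append one 'code<TAB>price' line per (code, price) pair."""
    with open(path, 'a', encoding='utf-8') as output_file:
        for code, price in rows:
            output_file.write('{}\t{}\n'.format(code, price))
```

Calling `append_prices('qczj_brand_changshang.txt', [('135', '12.99-16.99万')])` would append a single line for series 135.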

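The resulting file looks roughly like this, with one tab-separated line per series code and price range:

[Figure: sample of the output file]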

Collecting a bit of information on your own this way is quite convenient.
