I only started learning this third-party scraping library yesterday and was amazed by how powerful it is. Here is the scraper I wrote:
# coding=utf-8
import urllib2
from bs4 import BeautifulSoup
import re

url = "http://tools.2345.com/m/naonao/"
response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
temp = soup.find_all(name="a", attrs={"class": "box"})
print len(temp)
for i in range(1, len(temp)):
    temp_key = []    # (unused in this script)
    temp_value = []  # (unused in this script)
    sub_url = url + temp[i].get('href')
    print sub_url
    sub_response = urllib2.urlopen(sub_url)
    sub_soup = BeautifulSoup(sub_response)
    name = (sub_soup.find(name="span", attrs={"class": "name"})).string
    date = (sub_soup.find(name="span", attrs={"class": "date"})).string
    theHead = name + "(" + date + ")"  # theHead labels the zodiac sign and the date
    print theHead
    temp1 = sub_soup.find(name="div", attrs={"class": "brief"})
    print temp1.text  # temp1 is the main horoscope text
    temp2 = (sub_soup.find(name="div", attrs={"class": "taglist"})).text  # temp2 lists the cautions in the horoscope
    print temp2
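As a side note, the script imports re but never actually uses it. Purely for illustration, the same name/date extraction could also be sketched with a regex against a static snippet; the HTML and values below are hypothetical stand-ins for the real page's markup, so this runs without a network connection:

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical snippet mimicking the horoscope page's markup; the class
# names ("name", "date") come from the script above, the values are made up.
snippet = u'<span class="name">Leo</span><span class="date">2016-08-01</span>'

name = re.search(u'<span class="name">(.*?)</span>', snippet).group(1)
date = re.search(u'<span class="date">(.*?)</span>', snippet).group(1)
theHead = name + u"(" + date + u")"  # same format as theHead in the script
```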
When I run it, this error appears: WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
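For anyone wondering where that warning comes from: the page is served as GBK, but without a hint the parser may guess the wrong charset, and bytes that are invalid in the guessed encoding get replaced with U+FFFD (the REPLACEMENT CHARACTER the warning mentions). A minimal stdlib-only sketch of the effect, using made-up sample text:

```python
# -*- coding: utf-8 -*-
# The server sends GBK bytes; u"\u72ee\u5b50\u5ea7" ("Leo" in Chinese)
# stands in for actual page content.
gbk_bytes = u"\u72ee\u5b50\u5ea7".encode("GBK")

# Decoding with the wrong codec and errors="replace" yields U+FFFD,
# which is exactly what the warning describes.
wrong = gbk_bytes.decode("utf-8", "replace")

# Decoding with the correct codec recovers the original text.
right = gbk_bytes.decode("GBK")
```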
The workaround found online of importing the sys module and overriding the default encoding did not work for me: after adding it, the program produced no output at all. If anyone reading this knows how to fix it, please leave a hint. Thanks!
Update~~~~
The solution has been found: in the line
sub_soup = BeautifulSoup(sub_response)
pass from_encoding, i.e. sub_soup = BeautifulSoup(sub_response, from_encoding="GBK")
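What from_encoding="GBK" does, in effect, is tell BeautifulSoup to decode the raw response bytes as GBK instead of guessing the charset. A stdlib-only sketch of that decoding step; fake_body is a hypothetical stand-in for sub_response.read():

```python
# -*- coding: utf-8 -*-
# fake_body stands in for the raw bytes of sub_response.read(); the markup
# and text are made up, but the encoding (GBK) matches the fix above.
fake_body = u'<span class="name">\u72ee\u5b50\u5ea7</span>'.encode("GBK")

# Decoding explicitly as GBK is roughly what from_encoding="GBK" arranges,
# so no characters are lost to the replacement character.
decoded = fake_body.decode("GBK")
```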