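Python ships with the urllib and urllib2 modules for fetching web pages, and the third-party requests library is also available. Here we use the easy_install package manager to download the requests and BeautifulSoup libraries: at the CMD command line, change to the directory that contains easy_install and run easy_install followed by the package name.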
easy_install requests
Once the requests package is installed, we can fetch web pages with urllib, urllib2, or requests.
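As a rough illustration of choosing between these libraries, the sketch below prefers requests when it is installed and otherwise falls back to the standard library. The helper names fetch_with_urllib, fetch_with_requests, and pick_fetcher are hypothetical, the 10-second timeout is an arbitrary choice, and the Python 3 import fallback is an assumption added so the snippet also runs on newer interpreters; none of these come from the article itself.

```python
# A minimal sketch, not the article's code: all helper names are hypothetical.
try:
    import urllib2 as urllib_backend          # Python 2 standard library
except ImportError:
    # Assumption: on Python 3, urllib.request plays the role of urllib2.
    import urllib.request as urllib_backend

try:
    import requests                           # third-party, may be missing
except ImportError:
    requests = None


def fetch_with_urllib(url, timeout=10):
    """Return the raw response body using the standard library."""
    response = urllib_backend.urlopen(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


def fetch_with_requests(url, timeout=10):
    """Return the raw response body using requests."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()                      # raise on 4xx/5xx status codes
    return r.content


def pick_fetcher(prefer_requests=True):
    """Pick requests if it is installed and wanted, else the stdlib fetcher."""
    if prefer_requests and requests is not None:
        return fetch_with_requests
    return fetch_with_urllib
```

With this in place, a call such as pick_fetcher()('http://www.csdn.net') would return the page body as bytes whichever library ends up being used, so the rest of a script does not need to care which one is installed.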
1. Fetching web page content
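#!/usr/bin/env python
# coding: utf-8
import urllib
import urllib2
import requests
import sys

url = 'http://www.csdn.net'


def urllib2Test():
    # Fetch the page with urllib2 and return the raw body.
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    thePage = response.read()
    return thePage


def requestsTest():
    # Fetch the page with requests; return status code, body, and headers.
    r = requests.get(url)
    return r.status_code, r.content, r.headers


def urllib2TestEx(url):
    # URLError covers failures such as an unreachable host.
    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
        content = response.read()
        return content
    except urllib2.URLError as e:
        print e.reason


def urlhttperror(url):
    # HTTPError is raised when the server answers with an error status.
    req = urllib2.Request(url)
    try:
        urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        print e.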