Scraping various types of web content (static and dynamic) with Python


Author: Amiber



Article link: https://blog.csdn.net/chenzulong198867/article/details/8245691


Note: the experiments in this article were run on Ubuntu with Python 2.x.

Code-1: Scraping the static title data (no login required)

This example fetches static data from the Taobao home page.

URL: http://www.taobao.com

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author Amiber
# @date   2012-12-01
# @brief  Grab static web page data that contains Chinese text

from BeautifulSoup import BeautifulSoup
import urllib2

# The code assumes this page is GBK-encoded: decode it, then re-encode as UTF-8
url = r"http://www.taobao.com"

resContent = urllib2.urlopen(url).read()
resContent = resContent.decode('gbk').encode('utf8')

soup = BeautifulSoup(resContent)

print soup.title.string

# The code assumes this page is GB18030-encoded (a superset of GBK)
url = r"http://www.news.baidu.com"
resContent = urllib2.urlopen(url).read().decode('gb18030').encode('utf8')

soup = BeautifulSoup(resContent)

print soup.title.string
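
The listing above targets Python 2 and the BeautifulSoup 3 package. For readers on Python 3, the same idea (decode the raw bytes with the page's charset, then read the `<title>` element) can be sketched with the standard library alone. This is an illustrative sketch, not the author's code: `TitleParser`, `extract_title`, and `fetch_title` are hypothetical names, and the standard `html.parser` module is swapped in for BeautifulSoup. The charset is supplied by the caller, as in the listing.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(raw_bytes, encoding):
    """Decode raw page bytes with the given charset and return the <title> text."""
    parser = TitleParser()
    # errors="replace" avoids an exception if the charset guess is wrong
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.title


def fetch_title(url, encoding):
    """Download a page and return its title (requires network access)."""
    with urlopen(url, timeout=10) as response:
        return extract_title(response.read(), encoding)
```

For example, `fetch_title("http://www.taobao.com", "gbk")` would mirror the first request in the listing, assuming the page is still served as GBK. In Python 3 the decoded text is already a Unicode string, so the re-encoding to UTF-8 is not needed.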

Code-2: Scraping table data from a static page (no login required)

This example fetches a static table from a statistics bureau web page.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author Amiber
# @date   2012-12-01
# @brief  Grab table data from a static web page

from BeautifulSoup import BeautifulSoup
import urllib2

def erase(strline, ch):
    # str.replace already removes every occurrence, so no loop is needed
    return strline.replace(ch, '')

url = r"http://www.bjstats.gov.cn/sjfb/bssj/jdsj/2012/201211/t20121130_239295.htm"

resContent = urllib2.urlopen(url).read()

resContent = resContent.decode('gb18030').encode('utf8')

soup = BeautifulSoup(resContent)

print soup('title')[0].string

tab = soup.findAll('table')

# The data table is taken to be the last <table> on the page
trs = tab[-1].findAll('tr')

for trIter in trs:
    tds = trIter.findAll('td')
    for tdIter in tds:
        # Cell text sits inside <span> elements; skip spans without a plain string
        for span in tdIter('span'):
            if span.string:
                # The trailing comma keeps the cells of one row on the same line
                print erase(span.string, ' ').strip(),
    # End the row with a newline
    print
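
The table walk above (take the last `<table>`, then visit each row and cell and strip the spaces) can also be sketched for Python 3 with the standard library. This is a simplified illustration, not the author's code: `TableParser` and `last_table_rows` are hypothetical names, `html.parser` replaces BeautifulSoup, all text inside a cell is collected rather than only the `<span>` contents, and the page is assumed to have closed `<td>` and `<tr>` tags and no nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # tables in document order
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the <td>/<th> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and drop all whitespace inside the cell
            self._row.append("".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def last_table_rows(html_text):
    """Return the rows of the last table in html_text, or [] if there is none."""
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    return parser.tables[-1] if parser.tables else []
```

Each returned row is a list of strings, so `" ".join(row)` reproduces the one-line-per-row output of the listing.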

Code-3: Scraping a document file from a static page (no login required)

This example fetches a zip file from a BBS site.

#!/usr/bin/env	python 
#
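
Independently of the listing above, a minimal Python 3 sketch of saving a binary file such as a zip archive might look like the following. It is not the author's code: `save_stream`, `download_file`, and `list_zip_members` are hypothetical helper names, and the URL and destination path are left to the caller. The key difference from the earlier examples is that the bytes are written to disk unchanged, with no charset decoding.

```python
import shutil
import zipfile
from urllib.request import urlopen


def save_stream(source, dest_path, chunk_size=64 * 1024):
    """Copy a binary file-like object to dest_path in chunks."""
    with open(dest_path, "wb") as dest:
        shutil.copyfileobj(source, dest, chunk_size)
    return dest_path


def download_file(url, dest_path):
    """Download url to dest_path without decoding the bytes (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return save_stream(response, dest_path)


def list_zip_members(path):
    """Return the member names of a zip archive, or None if the file is not a zip."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Checking the result with `list_zip_members` is a cheap way to notice that a server returned an HTML error page instead of the archive.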
