Python Crawlers: A Single-Threaded Crawler

WflytoC

Category: Python. Tags: python, crawler, thread, html

Article link: https://blog.csdn.net/weichuang_1/article/details/47666767

1. Fetching the page source directly

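>>> import requests
>>> url = 'http://www.wutnews.net/'
>>> html = requests.get(url)
>>> # .content is raw bytes; the page declares charset=utf-8, so decode before printing
>>> print(html.content.decode('utf-8'))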
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta name="Keywords" content="经纬,经纬网,武汉理工大学,武汉理工大学门户,新闻经纬,时政视窗,校园文化,皮壳网,选修客,Token,拓垦团队" /><meta name="Description" content="武汉理工大学门户网站" /><meta name="robots" content="index, follow" /><meta name="googlebot" content="index, follow" /><meta name="author" content="Token Team" /><title>
    武汉理工大学经纬网
</title>
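
A note on the last line of the session above: in Python 3, html.content is a bytes object, while html.text is the same body decoded to str using whatever encoding Requests infers, which can be wrong when the server does not declare a charset. The sketch below uses a hard-coded string (the page title from the output above) as a stand-in for a real response, assumes Python 3, and makes no network request.

```python
# Stand-in for html.content: the page title encoded as UTF-8 bytes.
raw = '武汉理工大学经纬网'.encode('utf-8')

print(type(raw))            # a bytes object, as html.content would be
text = raw.decode('utf-8')  # decode with the charset the page declares
print(text)

# Decoding with the wrong charset raises no error here; it silently garbles the text.
garbled = raw.decode('latin-1')
print(garbled == text)
```

If Chinese text comes out garbled, decoding html.content explicitly with the charset from the page's meta tag is usually the first thing to try.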

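2. Fetching the page source with modified HTTP headers
Some sites inspect the program that sends a request, for example allowing only browsers and rejecting crawlers. In that case we can set an HTTP header (here, User-Agent) so that the site treats our crawler as a browser.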

>>> import requests
>>> url = 'http://www.wutnews.net/'
>>> headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'}
>>> html = requests.get(url, headers=headers)
>>> # .content is raw bytes; the page declares charset=utf-8, so decode before printing
>>> print(html.content.decode('utf-8'))
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta name="Keywords" content="经纬,经纬网,武汉理工大学,武汉理工大学门户,新闻经纬,时政视窗,校园文化,皮壳网,选修客,Token,拓垦团队" /><meta name="Description" content="武汉理工大学门户网站" /><meta name="robots" content="index, follow" /><meta name="googlebot" content="index, follow" /><meta name="author" content="Token Team" /><title>
    武汉理工大学经纬网
</title>
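
To see why a site can tell a crawler from a browser, it helps to compare the User-Agent that Requests sends by default with the one set above. The following is a minimal sketch that builds the request without sending it, so nothing goes over the network. The variable names browser_ua, default_ua and prepared are illustrative only.

```python
import requests

url = 'http://www.wutnews.net/'
browser_ua = ('Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
              'AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 '
              'Mobile/12A4345d Safari/600.1.4')

# What Requests identifies itself as when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # begins with "python-requests/"

# Build (but do not send) a request carrying the browser-like header.
prepared = requests.Request('GET', url, headers={'User-Agent': browser_ua}).prepare()
print(prepared.headers['User-Agent'])
```

A server that filters on User-Agent would see the first value from an unmodified crawler and the second once the header is overridden.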

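3. Requests and regular expressions
The basic idea behind a simple single-threaded crawler: fetch the page source with Requests, then use a regular expression to pull out the content of interest.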

import re
import requests

url = 'http://www.wutnews.net/'
html = requests.get(url)
# Capture the src attribute of every <img> tag; (.*?) is non-greedy
img_srcs = re.findall(r'<img src="(.*?)"', html.text)
for each in img_srcs:
    print(each)

Partial output:

>>> 
images/news-more.jpg
images/news-more.jpg
images/subject.jpg
images/newsphoto-more.jpg
http://www.wutnews.net/thumbnail.aspx?width=84&height=80&path=http://www.wutnews.net/uploads/2015-07-26/23393633491043.jpg
http://www.wutnews.net/thumbnail.aspx?width=84&height=80&path=http://www.wutnews.net/uploads/2015-07-17/17164432633597.jpg
http://www.wutnews.net/thumbnail.aspx?width=84&height=80&path=http://www.wutnews.net/uploads/2015-07-17/08452651575529.jpg
images/culture-more.jpg
images/token.jpg
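
The output above mixes relative paths (such as images/token.jpg) with absolute URLs. As a rough sketch, the same pattern can be tried on a fixed HTML string and the relative paths resolved against the page URL with urljoin from the standard library. The sample HTML and the uploads/a.jpg path are made up for illustration and are not the site's real markup; Python 3 is assumed.

```python
import re
from urllib.parse import urljoin

base_url = 'http://www.wutnews.net/'

# Hypothetical snippet standing in for html.text.
sample_html = (
    '<div><img src="images/token.jpg" alt="Token" /></div>'
    '<div><img src="http://www.wutnews.net/uploads/a.jpg" /></div>'
)

# Non-greedy (.*?) stops at the first closing quote after src=".
srcs = re.findall(r'<img src="(.*?)"', sample_html)

# urljoin leaves absolute URLs alone and resolves relative ones against base_url.
absolute = [urljoin(base_url, src) for src in srcs]
for link in absolute:
    print(link)
```

Note that the pattern only matches tags where src is the first attribute after img; a tag written with another attribute first would be missed.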

4. Submitting data to a web page
Method: requests.post

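import re
import requests

url = 'https://www.crowdfunder.com/browse/deals&template=false'
# Form fields sent in the POST body
data = {
    'entities_only': 'true',
    'page': '1',
}
html = requests.post(url, data=data)
# re.S lets . match newlines, so titles that span several lines are still captured
titles = re.findall(r'"card-title">(.*?)</div>', html.text, re.S)
for each in titles:
    print(each)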

Partial output:

>>> 
Electric World Carnival
Intox-Detox
Aquavert by EIJ Industries
CafeBellas, Inc.
SU Labs Accelerator Seed Fund
Net Zero Urban Greens
SixthContinent Inc.
Paul Davis Restoration of Western Michigan
Vinavanti Urban Winery
Pipeline Wizard
Lavon Estates - Cavender Real Estate Group LLC
Jibehealth.com
EM&N8, Controllers Incorporated
AxCent Tuning Systems
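
When data is a dict, requests.post sends it as a form-encoded body. As a rough illustration of what that body looks like for the dictionary above, the standard library's urlencode produces the same key=value&key=value form. The helper build_form and the page range 1 to 3 are hypothetical, added only to show how the page field could be varied; Python 3.7 or later is assumed so that dict order is preserved.

```python
from urllib.parse import urlencode


def build_form(page):
    """Return the form fields for one results page (values are sent as strings)."""
    return {'entities_only': 'true', 'page': str(page)}


# Form-encoded bodies for pages 1 to 3 (hypothetical range).
bodies = [urlencode(build_form(page)) for page in range(1, 4)]
for body in bodies:
    print(body)
```

Passing build_form(n) as the data argument of requests.post would then request page n, provided the site still accepts these fields.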
