Python Web Crawling and Data Mining

This article covers the basics of web crawling with Python: fetching web pages with the urllib and urllib2 libraries, cleaning and parsing the data with BeautifulSoup, and detecting character encodings with the chardet library. It also mentions selenium as a more powerful automation tool.

If you plan to build a crawler in Python, these are the Python modules worth studying:

1 The built-in urllib and urllib2 libraries for fetching data (urllib2 is Python 2 only; in Python 3 its functionality lives in urllib.request).
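A minimal fetch sketch, assuming Python 3, where urllib.request replaces urllib2; the URL and User-Agent header are placeholders:

# Fetch a page with the standard library (Python 3: urllib.request replaces urllib2).
from urllib.request import Request, urlopen

url = "http://www.example.com/"  # placeholder URL
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # some sites reject the default urllib UA

with urlopen(req, timeout=10) as resp:
    raw_bytes = resp.read()  # raw bytes; decode once the charset is known
    charset = resp.headers.get_content_charset() or "utf-8"
    html = raw_bytes.decode(charset, errors="replace")

print(html[:200])  # first 200 characters of the page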

2 Use BeautifulSoup for data cleaning and parsing

http://www.crummy.com/software/BeautifulSoup/
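A small parsing sketch, assuming the bs4 package is installed (pip install beautifulsoup4) and that html is the string fetched above:

# Parse the HTML, drop script/style noise, and extract links and visible text.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for tag in soup(["script", "style"]):  # remove non-content tags before text extraction
    tag.decompose()

title = soup.title.string if soup.title else ""
links = [a.get("href") for a in soup.find_all("a", href=True)]
text = soup.get_text(separator=" ", strip=True)

print(title, len(links))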

Encoding rules

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

1 An encoding you pass in as the fromEncoding argument to the soup constructor.

2 An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.

3 An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.

4 An encoding sniffed by the chardet library, if you have it installed.

5 UTF-8

6 Windows-1252

You can force an encoding by passing the fromEncoding argument to the BeautifulSoup constructor (this is the BeautifulSoup 3 API):

soup = BeautifulSoup(euc_jp, fromEncoding="euc-jp")  # euc_jp holds an EUC-JP encoded document
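In BeautifulSoup 4 the equivalent keyword is from_encoding; a brief sketch, assuming raw_bytes holds a GBK-encoded page:

# bs4 equivalent: force the encoding when the page's declared charset is wrong or missing.
from bs4 import BeautifulSoup

soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")
print(soup.original_encoding)  # the encoding bs4 actually used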

3 Use the Python chardet library for character-encoding detection

http://chardet.feedparser.org/download/
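A detection sketch, assuming chardet is installed (pip install chardet) and raw_bytes holds the fetched page bytes:

# Guess the encoding of raw bytes, then decode with the guess.
import chardet

guess = chardet.detect(raw_bytes)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
encoding = guess["encoding"] or "utf-8"  # fall back if detection fails
text = raw_bytes.decode(encoding, errors="replace")
print(guess)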

4 The more powerful selenium (it drives a real browser, so it can handle JavaScript-rendered pages)
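A minimal sketch, assuming Selenium 4+ with a local Chrome/chromedriver available; the URL is a placeholder:

# Drive a real browser and read the page after JavaScript has run.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH (or via Selenium Manager)
try:
    driver.get("http://www.example.com/")  # placeholder URL
    driver.implicitly_wait(5)  # wait up to 5 seconds for elements to appear
    html = driver.page_source  # HTML after the browser has rendered it
    links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    print(len(links))
finally:
    driver.quit()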

Author: 张大鹏
