Crawling content from search engines by keyword with Python

ShawChen6

Category: Python crawler. Tags: python

Article link: https://blog.csdn.net/qq_39178473/article/details/105348291

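1. Common search engine query formats:
(1) Baidu:
https://www.baidu.com/s?wd=<keyword>&pn=<offset>
wd is the search keyword and pn is the result offset used for paging. Baidu returns ten results per page (the topmost entries may be ads rather than search results), so pn=0 is the first page, pn=10 is the second page, and so on.
For example, https://www.baidu.com/s?wd=python&pn=0 returns the first page of results for "python".
(2) Bing:
http://global.bing.com/search?q=<keyword>
(3) Sogou:
https://www.sogou.com/web?query=<keyword>
(4) 360 Search:
https://www.so.com/s?q=<keyword>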
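
The query formats listed above can be assembled with urlencode from the standard library, which also percent-encodes non-ASCII keywords. This is only a minimal sketch: the names SEARCH_ENGINES and build_search_url are made up for illustration, and the Baidu offset assumes ten results per page as described above.

```python
from urllib.parse import urlencode

# Base URL and keyword parameter name for each engine (from the list above).
SEARCH_ENGINES = {
    "baidu": ("https://www.baidu.com/s", "wd"),
    "bing": ("http://global.bing.com/search", "q"),
    "sogou": ("https://www.sogou.com/web", "query"),
    "360": ("https://www.so.com/s", "q"),
}


def build_search_url(engine, keyword, page=1):
    """Build a search URL for one engine (hypothetical helper)."""
    base, key = SEARCH_ENGINES[engine]
    params = {key: keyword}
    if engine == "baidu":
        # Baidu pages by result offset: page 1 -> pn=0, page 2 -> pn=10, ...
        params["pn"] = (page - 1) * 10
    # urlencode percent-encodes the keyword, including Chinese characters
    return base + "?" + urlencode(params)


print(build_search_url("baidu", "python"))
print(build_search_url("baidu", "python", page=2))
print(build_search_url("sogou", "爬虫"))
```

Paging is only handled for Baidu here because the notes above give a paging parameter for Baidu alone.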

2. Fixing the "Unresolved import requests" error on import in Eclipse PyDev:
https://blog.csdn.net/chris_111x/article/details/52312523

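3. Installing lxml
If the installation fails with an error, the downloaded package was not built for your Python version. The cp tag in the package file name gives the version: cp27 means Python 2.7, and so on. (I had Python 3.8 installed.)
After downloading the build for 3.8, the installation succeeded.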
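
To find out which cp tag to look for, you can derive it from the running interpreter. A small sketch follows; cpython_tag is a hypothetical helper name, and it only formats the major and minor version numbers.

```python
import sys


def cpython_tag(version_info=sys.version_info):
    """Return the cpXY tag for a Python version, e.g. (2, 7) -> 'cp27'."""
    return "cp{}{}".format(version_info[0], version_info[1])


# Tag of the interpreter running this script
print(cpython_tag())
```

Pick the download whose file name contains the printed tag.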

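4. Links returned by a Baidu search are redirect links. Disable automatic redirects to obtain the address they actually point to, and from it the correct domain name.

import requests

def get_real(o_url):
    # Return the address that a redirect URL points to.
    # The try/except catches request failures, e.g. links to videos in the page.
    try:
        # Disable automatic redirects; the timeout keeps a dead link from hanging
        r = requests.get(o_url, allow_redirects=False, timeout=10)
    except requests.RequestException:
        return None
    if r.status_code == 302:
        # Return the target address; fall back to the original URL if the header is missing
        return r.headers.get('location', o_url)
    return o_url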

5. The urlparse function from Python's built-in urllib.parse module can parse URLs:
from urllib.parse import urlparse
result = urlparse(real["sub_url"]).netloc
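
As a quick illustration of what urlparse returns, here is a small sketch. The dictionary real and its key sub_url mirror the line above; the URL value and the helper get_domain are only examples.

```python
from urllib.parse import urlparse


def get_domain(url):
    """Return the network location (domain) part of a URL."""
    return urlparse(url).netloc


# Example value; in the crawler this would be the resolved link
real = {"sub_url": "https://www.baidu.com/s?wd=python&pn=0"}

parts = urlparse(real["sub_url"])
print(parts.scheme)  # scheme, e.g. https
print(parts.netloc)  # domain
print(parts.path)    # path
print(parts.query)   # query string
```

Note that netloc is empty when the URL has no scheme (for example "www.baidu.com/s"), so make sure links are absolute before extracting the domain.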

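6. Reading a page link with k['href'] raised an error:
KeyError: 'href'
This happens when a tag has no href attribute. Change it to:
k.get('href')
This runs successfully and returns the link in href (or None when the tag has no href).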

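7. Deduplicating the collected URLs:
To crawl several levels deep, put the URLs in a queue.
The queue needs duplicate checking. Store every URL in a set (a set never holds duplicates). When a new URL is found, first check whether it is already in the set: if it is, do not add it to the queue; if it is not, add it to the queue.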
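
The queue-plus-set idea above can be sketched as follows. This is a minimal illustration using collections.deque; the function name enqueue_new and the example URLs are assumptions, not part of the original crawler.

```python
from collections import deque


def enqueue_new(urls, queue, seen):
    """Add only unseen URLs to the queue; return how many were added."""
    added = 0
    for url in urls:
        if url in seen:
            continue  # already queued or crawled, skip it
        seen.add(url)
        queue.append(url)
        added += 1
    return added


queue = deque()  # URLs waiting to be crawled, in discovery order
seen = set()     # every URL that has ever been queued

first = enqueue_new(["https://a.example/", "https://b.example/"], queue, seen)
second = enqueue_new(["https://b.example/", "https://c.example/"], queue, seen)

while queue:
    url = queue.popleft()
    print("crawl", url)
```

The set is kept separate from the queue so that a URL stays marked as seen even after it has been taken off the queue.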

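8. jieba word segmentation:
Download: https://github.com/fxsjy/jieba
You can place it under the Python installation directory for easier management.
Open cmd, change to the jieba directory, and run: python setup.py install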

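9. Naming the output file after the current time

import time

# Record the current time
now = time.strftime("%Y%m%d%H%M%S", time.localtime())
# Name the crawled file after the current time and save it in the data folder
filename = "./data/" + now + ".txt"
# Open the file for writing
file = open(filename, 'w+', encoding='utf-8')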
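
A slightly more defensive variant is sketched below: it creates the folder if it is missing and closes the file automatically. The helper names timestamped_path and save_text are hypothetical, and the temporary directory is used only so the example can run anywhere.

```python
import os
import tempfile
import time


def timestamped_path(folder, when=None):
    """Return folder/YYYYmmddHHMMSS.txt for the given (or current) local time."""
    if when is None:
        when = time.localtime()
    name = time.strftime("%Y%m%d%H%M%S", when) + ".txt"
    return os.path.join(folder, name)


def save_text(folder, text):
    """Write text to a time-named file inside folder and return its path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it does not exist
    path = timestamped_path(folder)
    with open(path, "w", encoding="utf-8") as f:  # closed automatically
        f.write(text)
    return path


# Usage: write into a "data" folder under a temporary directory
data_folder = os.path.join(tempfile.mkdtemp(), "data")
saved = save_text(data_folder, "crawled content")
print(saved)
```

Because the name has one-second resolution, two files saved within the same second would share a name and the later one would overwrite the earlier one.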

10. Specify a mirror index when installing PyMySQL:
pip3 install PyMySQL -i https://pypi.douban.com/simple

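11. Matching and concatenating multiple tags
from bs4 import BeautifulSoup

def getWord(html):
    bs = BeautifulSoup(html, "html.parser")  # create the parser object
    # Collect every tag of each type that may hold text
    listDiv = bs.find_all("div")
    listP = bs.find_all("p")
    listTd = bs.find_all("td")
    listLi = bs.find_all("li")
    listSpan = bs.find_all("span")
    listDd = bs.find_all("dd")
    listDt = bs.find_all("dt")
    # Concatenate the results into a single list
    namelist = listDiv + listP + listTd + listLi + listSpan + listDd + listDt
    return namelist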

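12. os.remove(path+dnsFile) fails with:
TypeError: can only concatenate str (not "_io.TextIOWrapper") to str
The message shows that dnsFile is an open file object rather than a file name string, so it cannot be concatenated with path. Build the path from strings first (dnsFile must hold the file name), then pass the single variable to os.remove:
dataPath = path+dnsFile
os.remove(dataPath)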
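
The sketch below shows the same deletion with the path built from two strings. It assumes the file name is kept in its own variable; the helper remove_data_file, the file name dns.txt, and the temporary folder are illustrative only.

```python
import os
import tempfile


def remove_data_file(folder, file_name):
    """Join folder and file name (both strings) and delete that file."""
    data_path = os.path.join(folder, file_name)
    os.remove(data_path)
    return data_path


# Create a throwaway file so the example can run anywhere
folder = tempfile.mkdtemp()
file_name = "dns.txt"  # a string, not the object returned by open()
with open(os.path.join(folder, file_name), "w", encoding="utf-8") as f:
    f.write("example")

# The file is closed once the with block ends, so it can be removed safely
removed = remove_data_file(folder, file_name)
print(removed)
```

If only the file object is at hand, its name attribute holds the path that was passed to open().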
