1. Fetching page content with requests
import requests

myUrl = "http://politics.people.com.cn/GB/1024/index.html"  # URL of the page to scrape
myContent = requests.get(myUrl).content.decode("GB2312")  # the page is GB2312-encoded
print(myContent)
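The page declares a GB2312 charset, which is why the code above decodes `.content` (raw bytes) explicitly instead of relying on `response.text`, whose guessed encoding can garble Chinese text. A minimal offline sketch of what the decode step does (the sample string is made up for illustration):

```python
# GB2312 is a two-byte encoding for simplified Chinese; decoding the bytes
# with the wrong codec (e.g. ASCII) would raise an error or garble the text.
raw = "时政频道".encode("gb2312")   # like the bytes requests returns in .content
text = raw.decode("gb2312")         # like .decode("GB2312") in the code above
print(text)                         # the original Chinese string comes back intact
```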
2. Extracting data with re (regular expressions)
import requests
import re

myUrl = "http://politics.people.com.cn/GB/1024/index.html"
myContent = requests.get(myUrl).content.decode("GB2312")
# print(myContent)
myPattern = "<a href='(.*)' target=_blank>(.*)</a> <em>(.*)</em>"  # groups: href, title, time
myList = re.findall(myPattern, myContent)
for item in myList:
    oneNews = {}
    oneNews['title'] = item[1]
    oneNews['href'] = item[0]
    oneNews['time'] = item[2]
    print(oneNews)
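With three capture groups, re.findall returns a list of 3-tuples in group order: (href, title, time). A self-contained sketch with made-up markup in the same shape as the page's list items:

```python
import re

# Invented HTML line shaped like the news-list entries the pattern targets.
sample = "<a href='/n1/c1024-100.html' target=_blank>Example headline</a> <em>01-01</em>"
pattern = "<a href='(.*)' target=_blank>(.*)</a> <em>(.*)</em>"
matches = re.findall(pattern, sample)
print(matches)  # each match is one tuple: (href, title, time)
```

Because `.*` is greedy, this pattern only behaves well when each news entry sits on its own line, as it does on this page; for multi-entry lines, the non-greedy form `(.*?)` is safer.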
3. Extracting page data with XPath
import requests
from lxml import etree

myUrl = "http://politics.people.com.cn/GB/1024/index.html"
myContent = requests.get(myUrl).content.decode("GB2312")
etreeHtml = etree.HTML(myContent)
myList = etreeHtml.xpath("//li")
for li in myList:
    if not li.xpath("./a") or not li.xpath("./em"):
        continue  # skip <li> elements that are not news entries
    oneNews = {}
    oneNews['title'] = li.xpath("./a")[0].text
    oneNews['href'] = li.xpath("./a/@href")[0]
    oneNews['time'] = li.xpath("./em")[0].text
    print(oneNews)
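The same XPath calls can be tried offline on a small fragment, which makes it easier to see what `./a`, `./a/@href`, and `./em` select (assumes lxml is installed; the markup here is invented for illustration):

```python
from lxml import etree

html = ("<ul>"
        "<li><a href='/news/1.html'>Headline</a> <em>01-01</em></li>"
        "<li>not a news item</li>"
        "</ul>")
root = etree.HTML(html)
for li in root.xpath("//li"):
    links = li.xpath("./a")     # ./a selects child <a> elements of this <li>
    if not links:
        continue                # some <li> elements carry no link; skip them
    title = links[0].text               # text inside the <a> tag
    href = li.xpath("./a/@href")[0]     # @href selects the attribute value
    time = li.xpath("./em")[0].text     # text inside the <em> tag
    print(title, href, time)
```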
Installing the XPath dependency: lxml is not bundled with Python, so at first this section will fail with an error like "from lxml import etree — ModuleNotFoundError: No module named 'lxml'". Press Win+R, open a command window, and run pip install lxml (and pip install tushare if you also need it).
In PyCharm: File > Settings > Project Interpreter, click the + on the right, select the package, then go back and click OK, and you're done.
4. Extracting page data with BeautifulSoup
html5lib is an HTML parsing library that BeautifulSoup can use to parse HTML. You can also use Python's built-in html.parser, but it is less tolerant of malformed HTML than html5lib. html5lib is not bundled with Python, so it must be installed first.
Press Win+R, type cmd, and run:
pip3 install html5lib
Once that finishes, run:
pip3 install beautifulsoup4
Then follow the same PyCharm steps as above: File > Settings > Project Interpreter > click + > search for html5lib > click Install Package (a "success" message appears when it completes). Repeat the same steps for bs4.
import requests
from bs4 import BeautifulSoup as bs

myUrl = "http://politics.people.com.cn/GB/1024/index.html"
myContent = requests.get(myUrl).content.decode("GB2312")
bsHtml = bs(myContent, "html5lib")
myList = bsHtml.find_all('li')
for item in myList:
    if item.find('a') is None or item.find('em') is None:
        continue  # skip <li> elements that are not news entries
    oneNews = {}
    oneNews['title'] = item.find('a').get_text()
    oneNews['href'] = item.find('a').get('href')  # get() reads the attribute; get_text() would return the tag text
    oneNews['time'] = item.find('em').get_text()
    print(oneNews)
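One easy mistake with BeautifulSoup is confusing get_text() with get(): get_text() returns the text inside a tag, while get('href') reads the href attribute. An offline sketch of the difference (assumes bs4 is installed; the markup is invented; html.parser is used here so the example does not also need html5lib):

```python
from bs4 import BeautifulSoup

html = "<li><a href='/news/1.html'>Headline</a> <em>01-01</em></li>"
soup = BeautifulSoup(html, "html.parser")
a = soup.find('li').find('a')
print(a.get_text())    # text inside the tag: Headline
print(a.get('href'))   # attribute value: /news/1.html
```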