1. Fetching page content with requests
import requests

myUrl = "http://politics.people.com.cn/GB/1024/index.html"  # URL of the page to scrape
myContent = requests.get(myUrl).content.decode("GB2312")  # the page is GB2312-encoded
print(myContent)
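The page declares a GB2312 charset, which is why the code above decodes `.content` (raw bytes) explicitly instead of relying on `response.text`, whose guessed encoding can garble Chinese text. A minimal offline sketch of what the decode step does (the sample string is made up for illustration):

```python
# GB2312 is a two-byte encoding for simplified Chinese; decoding the bytes
# with the wrong codec (e.g. ASCII) would raise an error or garble the text.
raw = "时政频道".encode("gb2312")   # like the bytes requests returns in .content
text = raw.decode("gb2312")         # like .decode("GB2312") in the code above
print(text)                         # the original Chinese string comes back intact
```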
2. Extracting data with re (regular expressions)
import requests
import re

myUrl = "http://politics.people.com.cn/GB/1024/index.html"
myContent = requests.get(myUrl).content.decode("GB2312")
# print(myContent)
myPattern = "<a href='(.*)' target=_blank>(.*)</a> <em>(.*)</em>"  # groups: href, title, time
myList = re.findall(myPattern, myContent)
for item in myList:
    oneNews = {}
    oneNews['title'] = item[1]
    oneNews['href'] = item[0]
    oneNews['time'] = item[2]
    print(oneNews)
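With three capture groups, re.findall returns a list of 3-tuples in group order: (href, title, time). A self-contained sketch with made-up markup in the same shape as the page's list items:

```python
import re

# Invented HTML line shaped like the news-list entries the pattern targets.
sample = "<a href='/n1/c1024-100.html' target=_blank>Example headline</a> <em>01-01</em>"
pattern = "<a href='(.*)' target=_blank>(.*)</a> <em>(.*)</em>"
matches = re.findall(pattern, sample)
print(matches)  # each match is one tuple: (href, title, time)
```

Because `.*` is greedy, this pattern only behaves well when each news entry sits on its own line, as it does on this page; for multi-entry lines, the non-greedy form `(.*?)` is safer.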
3. Extracting page data with XPath
import requests
from lxml import etree

myUrl = "http://politics.people.com.cn/GB/1024/index.html"
myContent = requests.get(myUrl).content.decode("GB2312")
etreeHtml = etree.HTML(myContent)
myList = etreeHtml.xpath("//li")
for li in myList:
    if not li.xpath("./a") or not li.xpath("./em"):
        continue  # skip <li> elements that are not news entries
    oneNews = {}
    oneNews['title'] = li.xpath("./a")[0].text
    oneNews['href'] = li.xpath("./a/@href")[0]
    oneNews['time'] = li.xpath("./em")[0].text
    print(oneNews)
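The same XPath calls can be tried offline on a small fragment, which makes it easier to see what `./a`, `./a/@href`, and `./em` select (assumes lxml is installed; the markup here is invented for illustration):

```python
from lxml import etree

html = ("<ul>"
        "<li><a href='/news/1.html'>Headline</a> <em>01-01</em></li>"
        "<li>not a news item</li>"
        "</ul>")
root = etree.HTML(html)
for li in root.xpath("//li"):
    links = li.xpath("./a")     # ./a selects child <a> elements of this <li>
    if not links:
        continue                # some <li> elements carry no link; skip them
    title = links[0].text               # text inside the <a> tag
    href = li.xpath("./a/@href")[0]     # @href selects the attribute value
    time = li.xpath("./em")[0].text     # text inside the <em> tag
    print(title, href, time)
```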
Installing the XPath dependency: lxml is not bundled with Python, so at first this section will fail with an error like "from lxml import etree — ModuleNotFoundError: No module named 'lxml'". Press Win+R, open a command window, and run pip install lxml (and pip install tushare if you also need it).
In PyCharm: File > Settings > Project Interpreter, click the + on the right, select the package, then go back and click OK, and you're done.
4. Extracting page data with BeautifulSoup
html5lib is an HTML parsing library that BeautifulSoup can use to parse HTML. You can also use Python's built-in html.parser, but it is less tolerant of malformed HTML than html5lib. html5lib is not bundled with Python, so it must be installed first.
Press Win+R, type cmd, and run:
pip3 install html5lib
Once that finishes, run:
pip3 install beautifulsoup4
Then follow the same PyCharm steps as above: File > Settings > Project Interpreter > click + > search for html5lib > click Install Package (a "success" message appears when it completes). Repeat the same steps for bs4.
import requests
from bs4 import BeautifulSoup as bs

myUrl = "http://politics.people.com.cn/GB/1024/index.html"
myContent = requests.get(myUrl).content.decode("GB2312")
bsHtml = bs(myContent, "html5lib")
myList = bsHtml.find_all('li')
for item in myList:
    if item.find('a') is None or item.find('em') is None:
        continue  # skip <li> elements that are not news entries
    oneNews = {}
    oneNews['title'] = item.find('a').get_text()
    oneNews['href'] = item.find('a').get('href')  # get() reads the attribute; get_text() would return the tag text
    oneNews['time'] = item.find('em').get_text()
    print(oneNews)
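One easy mistake with BeautifulSoup is confusing get_text() with get(): get_text() returns the text inside a tag, while get('href') reads the href attribute. An offline sketch of the difference (assumes bs4 is installed; the markup is invented; html.parser is used here so the example does not also need html5lib):

```python
from bs4 import BeautifulSoup

html = "<li><a href='/news/1.html'>Headline</a> <em>01-01</em></li>"
soup = BeautifulSoup(html, "html.parser")
a = soup.find('li').find('a')
print(a.get_text())    # text inside the tag: Headline
print(a.get('href'))   # attribute value: /news/1.html
```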