Python Libraries in Detail: Networking (2) -- Parsing Web Pages

小晏

Tags: python, networking, javascript, file, import, website

Article link: https://blog.csdn.net/xiadasong007/article/details/4521844

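Yesterday I tried parsing web pages with the HTMLParser class, and the results were not ideal. Either way, I'm writing down the process in the hope that later readers can build on it and solve the problems I ran into.

I wrote two solutions, and each of them works only for a specific website. Here I mainly cover parsing the BBC home page (www.bbc.co.uk) and the NetEase home page (www.163.com).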

For the BBC:

This one is much simpler, probably because the page's encoding is fairly standard.

import html.parser
import urllib.request

class parseHtml(html.parser.HTMLParser):
    # Each handler only reports which kind of markup was encountered.
    def handle_starttag(self, tag, attrs):
        print("Encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("Encountered a {} end tag".format(tag))
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")

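# Start reading from here. The subclass above is simple: it overrides all of the parent class's handler methods.

# Save the BBC page in binary write mode. This was covered in the previous post (http://blog.csdn.net/xiadasong007/archive/2009/09/03/4516683.aspx), so I won't repeat it here.

file=open("bbc.html",'wb') #it's 'wb',not 'w'
url=urllib.request.urlopen("http://www.bbc.co.uk/")
while(1):
    line=url.readline()
    if len(line)==0:
        break
    file.write(line)
url.close()
file.close() # flush the saved page to disk before it is reopened for reading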

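# Create a parser object

pht=parseHtml()

# For this site I open the file as 'utf-8', otherwise an error occurs; other sites may not need this. UTF-8 is a Unicode encoding.
file=open("bbc.html",encoding='utf-8',mode='r')

# Process the page by feeding it to the parser line by line
while(1):
    line=file.readline()
    if len(line)==0:
        break
    pht.feed(line)
file.close()
pht.close()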
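
The hard-coded 'utf-8' above is the part most likely to break on another site. As a minimal sketch, assuming the server declares its charset in the Content-Type header, the page could be decoded in memory instead of being saved and reopened. The names `TagCounter`, `decode_body`, `fetch_text` and `count_tags` are hypothetical helpers introduced for illustration, not part of the original scripts.

```python
import urllib.request
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts start tags instead of printing them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode page bytes; undecodable bytes are replaced rather than raising."""
    try:
        return raw.decode(charset or fallback, errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python: use the fallback.
        return raw.decode(fallback, errors="replace")


def fetch_text(url):
    """Download a page and decode it with the charset from the HTTP headers."""
    with urllib.request.urlopen(url) as response:
        # Returns None when the server does not declare a charset.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


def count_tags(html_text):
    """Feed the whole document to the parser in one call."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts
```

With this, something like `count_tags(fetch_text("http://www.bbc.co.uk/"))` would replace the save, reopen and readline loops. Pages that declare their charset only in a `<meta>` tag are not handled by this sketch and fall back to UTF-8.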

For 163:

# Parsing this page with the method above raises exceptions at the CSS and JavaScript sections,

# so here I strip those two sections out. Here is the code:

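import html.parser
import urllib.request

# Start reading from here. I defined 4 functions to deal with the CSS and JavaScript sections.

def EncounterCSS(line):
    # Return 1 if this line opens a CSS block, otherwise 0
    if line.find("""<style type="text/css">""")==-1:
        return 0
    return 1
def PassCSS(file,line):
    # Skip lines up to and including the one containing </style>;
    # also stop at end of file so a missing closing tag cannot loop forever
    while line and line.find("</style>")==-1:
        line=file.readline()

def EncounterJavascript(line):
    # Return 1 if this line opens a JavaScript block, otherwise 0
    if line.find("""<script type="text/javascript">""")==-1:
        return 0
    return 1
def PassJavascript(file,line):
    # print(line)  # debug output
    # Skip lines up to and including the one containing </script>, or until end of file
    while line and line.find("</script>")==-1:
        line=file.readline()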

website="http://www.163.com"
file=open("163.html",mode='wb') #it's 'wb',not 'w'
url=urllib.request.urlopen(website)
while(1):
    line=url.readline()
    if len(line)==0:
        break
    file.write(line)
url.close()
file.close() # flush the saved page to disk before it is reopened for reading

pht=parseHtml() # the class defined in the BBC example above
file=open("163.html",mode='r') # no encoding given, so the platform default is used

while(1):
    line=file.readline()
    if len(line)==0:
        break

    # Inside this loop, skip the CSS and JavaScript sections first
    if EncounterCSS(line)==1:
        PassCSS(file,line)
    elif EncounterJavascript(line)==1:
        PassJavascript(file,line)
    else:
        pht.feed(line)
file.close()
pht.close()

Both approaches work, but they are not what I was after: I want a general method for processing web pages.
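
One possible direction for a more general method, sketched here under the assumption of a current Python 3 where `html.parser` passes the contents of `<script>` and `<style>` elements to `handle_data` instead of failing on them: skip those elements inside the parser itself, so no site-specific line matching is needed. `VisibleTextParser` and `visible_text` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects text while ignoring script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html_text):
    """Return the list of text fragments found outside script and style."""
    parser = VisibleTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.chunks
```

Because the skipping is keyed on the tag name rather than on an exact string such as `<style type="text/css">`, the same code applies regardless of how a site writes its attributes or where its line breaks fall.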

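I had planned to try BeautifulSoup, hoping it would solve this, but my Python version is too new for it, so that will have to wait.

Of course, processing web pages may not require the HTMLParser class at all. We can write code tailored to exactly what we need; only then do we gain more control over the parsing, and it does more to build our own skills. In other words, we only need Python to download the page (or its elements), and can handle the parsing ourselves.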
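
To make the idea of hand-written parsing concrete, here is a small sketch that pulls one element out of downloaded text with a regular expression and no HTMLParser at all. The function name `extract_title` is hypothetical, and the pattern is an assumption that suits simple pages; it is not a complete HTML parser.

```python
import re

# Matches <title ...>text</title> in any letter case, across line breaks.
TITLE_PATTERN = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_text):
    """Return the page title with surrounding whitespace removed, or None."""
    match = TITLE_PATTERN.search(html_text)
    if match is None:
        return None
    return match.group(1).strip()
```

This kind of targeted code gives full control over what is extracted, at the cost of having to handle each irregularity in the markup ourselves.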
