The first part mainly parses the HTML and extracts the pieces we need.
>>> import re,urllib.request
>>> from bs4 import BeautifulSoup
>>> from lxml import etree
>>> url = "http://zsb.szu.edu.cn/zanouse_1"
>>> page = urllib.request.urlopen(url)
>>> soup = BeautifulSoup(page,'lxml')
>>> f = open('E:/e8.txt','a+')
>>> print(soup.prettify(),file = f)
>>> f.close()
>>> f = open('E:/e8.txt','r')
>>> html = f.read()
>>> selector = etree.HTML(html)
>>> content = selector.xpath('/html/body/a/descendant-or-self::*')
>>> for i in content:
...     print(i.text)
...
[2018-3-22]
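The write-to-file/read-back round trip above can be skipped by handing the HTML straight to lxml. A minimal sketch of the same `descendant-or-self` XPath, run on an inline HTML snippet (the snippet is made up for illustration; the real source comes from `urlopen(url).read()`):

```python
from lxml import etree

# Small stand-in for the page source; shaped like one entry on the real page
html = """
<html><body>
  <a href="/zanouse_1">深圳大学<span>[2018-3-22]</span></a>
</body></html>
"""

selector = etree.HTML(html)
# descendant-or-self::* matches the <a> element itself plus every element inside it
content = selector.xpath('/html/body/a/descendant-or-self::*')
for node in content:
    print(node.text)
```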
>>> text = """ [2018-3-22]
..."""#直接复制了前面的运行结果,省略中间文段
--------------------------------------
Next comes the formatting step, mainly with regular expressions.
>>> words = text.split()  # avoid naming this `list`, which shadows the built-in
>>> print(type(words))
>>> text = " ".join(words)
>>> print(text)
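`split()` + `join()` collapses every run of whitespace (spaces, tabs, newlines) into a single space. Since these notes mention regex, the equivalent with `re.sub`, sketched on a made-up sample string:

```python
import re

# Made-up sample with mixed whitespace, shaped like the scraped text
raw = " [2018-3-22]\n\t深圳大学   招生信息 "

# split()/join() approach: split on any whitespace, rejoin with single spaces
joined = " ".join(raw.split())

# regex approach: collapse whitespace runs, then trim the ends
collapsed = re.sub(r'\s+', ' ', raw).strip()

print(joined)
print(joined == collapsed)  # True — both normalize to the same string
```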
>>> tt = open('E:/323232.txt','a+')
>>> tt.write(text)
500
>>> print(tt)
<_io.TextIOWrapper name='E:/323232.txt' mode='a+' encoding='cp936'>
>>> tt.close()
>>> tt = open('E:/323232.txt','r')
>>> lines = tt.readlines()
>>> # d and t (a list of dates and a list of titles) were built from lines in a step not shown in these notes
>>> for a,b in zip(d,t):
...     print(a,"\t",b)
[2018-3-22] 深圳大学
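The lists `d` and `t` are never defined in the transcript. One guess at how they could be built is to pull the dates out of the normalized text with a regex and treat the text between dates as titles (the pattern and sample line are assumptions, not the author's actual code):

```python
import re

# Made-up normalized line in the shape of the notes' output
text = "[2018-3-22] 深圳大学 [2018-3-20] 招生办"

date_pat = r'\[\d{4}-\d{1,2}-\d{1,2}\]'
d = re.findall(date_pat, text)                                    # dates like [2018-3-22]
t = [s.strip() for s in re.split(date_pat, text) if s.strip()]    # the text between dates

for a, b in zip(d, t):
    print(a, "\t", b)
```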
>>>
There are still some problems; honestly, Excel's built-in "get data from web" import might scrape this better. Still, I'm recording it here for now in case it can be improved later.