Python crawler for logging in to a school academic-affairs system: HTML parsing


Category: python学习 (Python learning). Tags: python, 南科大教务系统爬虫 (SUSTech academic-affairs system crawler)

Article link: https://blog.csdn.net/marxwolf/article/details/51548827


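Summary (auto-generated by CSDN): This article describes using Python's BeautifulSoup and regular expressions to parse the HTML returned after logging in to a school academic-affairs system. While parsing, the author found that the HTML was written in an outdated style and contained large amounts of whitespace and line breaks, and handled this with the BeautifulSoup library. The article notes two main challenges: Chinese text has to be encoded as UTF-8, and the saved data displays poorly because field lengths vary, which still needs improvement.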

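Parsing HTML with Python is fairly convenient. I used BeautifulSoup together with regular expressions. Regular expressions are very powerful for string processing, but their logic is harder to get right.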

<link href="/css/newcss/project.css" rel="stylesheet" type="text/css">

<body leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" style="overflow:auto;">



	

    

        

    	<a name="2013-2014学年秋(两学期)" /></a>

<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">

<tr><td class="Linetop"></td>

</tr>

</table>

<table width="100%"  border="0" cellpadding="0" cellspacing="0" class="title" id="tblHead">

<tr>

					<td width="80%" >

					<table border="0" align="left" cellpadding="0" cellspacing="0" >

					

					<tr>

					<td> </td>

					<td valign="middle"> <b>2013-2014学年秋(两学期)</b>

					 </td>

					</tr>

					</table>

					</td>

					<td width="20%" >					

						<table border="0" align="left" cellpadding="0" cellspacing="0" width="100%" >

						

						<tr>

						<td> </td>						

					<td width="5"></td>

					</tr>

					</table>

					</td>

					</tr>

</table>

<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0"  >

 <tr>

  <td class="Linetop"></td>

 </tr>

</table>

	<table width="100%" border="0" cellpadding="0" cellspacing="0" class="titleTop2">

					 <tr>

					  <td class="pageAlign">

					   <table cellpadding="0" width="100%" class="displayTag" cellspacing="1" border="0" id="user">

					    <thead>

							<tr>

							

             

       		<th align="center" width="10%" class="sortable">

                    课程号

           	</th>

           

       		<th align="center" width="10%" class="sortable">

                    课序号

           	</th>

           

       		<th align="center" width="10%" class="sortable">

                    课程名

           	</th>

           

       		<th align="center" width="10%" class="sortable">

                    英文课程名

           	</th>

           

       		<th align="center" width="10%" class="sortable">

                    学分

           	</th>

           

       		<th align="center" width="10%" class="sortable">

                    课程属性

           	</th>

           

       		<th align="center" width="10%" class="sortable">

                    成绩

           	</th>

           

	</tr>

	</thead>

          

        <tr class="odd" onMouseOut="this.className='even';" onMouseOver="this.className='evenfocus';">

			<td align="center">

                GE1102

            </td>

            <td align="center">

            	 02

            </td>

            <td align="center">

                 高等数学（上）

            </td>

            <td align="center">

                 Calculus I

            </td>

            <td align="center">

                 4

            </td>

            <td align="center">

                 通必

            </td>

            <td align="center">

            

              		<p align="center">90.0 </P>

             </td>

        </tr>

        

        <tr class="odd" onMouseOut="this.className='even';" onMouseOver="this.className='evenfocus';">

			<td align="center">

                GE1104

            </td>

            <td align="center">

            	 01

            </td>

            <td align="center">

                 大学物理（上）

            </td>

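The above is part of the raw HTML document as retrieved. You can see how dated the developer's markup style is, and all those spaces and line breaks are a real headache.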
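To illustrate the whitespace problem, here is a minimal sketch of cleaning one cell's text with a regular expression. The helper name `clean_cell` is hypothetical and is not the author's `infoRe` method, whose implementation is not shown in this post.

```python
import re


def clean_cell(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()


# Cell text as it comes out of the page: padded with newlines and spaces.
raw_name = "\n\n                 高等数学（上）\n\n            "
raw_english = "\n\n                 Calculus   I\n\n            "

print(clean_cell(raw_name))
print(clean_cell(raw_english))
```

Collapsing with `\s+` rather than deleting all whitespace keeps the space inside values such as `Calculus I`.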

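BeautifulSoup is a very powerful parsing tool that can parse HTML and XML (it is not a JSON parser). Its documentation is even available in Chinese, which suggests that Chinese developers have earned some recognition. See the BeautifulSoup documentation for details.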
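As a rough illustration of how BeautifulSoup copes with this kind of markup, the sketch below parses a shortened, hand-trimmed version of the page shown earlier. It assumes the `bs4` package (Beautiful Soup 4) and the standard-library `html.parser` backend. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the real page: one semester header, one grade table.
html = """
<table><tr><td valign="middle"> <b>2013-2014学年秋(两学期)</b> </td></tr></table>
<table class="displayTag" id="user">
  <thead><tr><th> 课程号 </th><th> 课程名 </th></tr></thead>
  <tr><td align="center">
        GE1102
  </td><td align="center">
        高等数学（上）
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Semester titles sit inside <b> tags.
semesters = [b.get_text(strip=True) for b in soup.find_all("b")]

# The grade table is located by its attributes; strip=True drops the padding.
table = soup.find("table", attrs={"class": "displayTag", "id": "user"})
cells = [td.get_text(strip=True) for td in table.find_all("td")]

print(semesters)
print(cells)
```

Here `get_text(strip=True)` removes the surrounding spaces and newlines from each cell, so no extra cleanup is needed for simple cells.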

import time

from bs4 import BeautifulSoup  # assumes Beautiful Soup 4

sid = xxxxxxxx  # student ID (redacted in the original post)

# Login is the author's own helper class; it is not shown in this post.
marx = Login(str(sid))
while True:
    # Download the captcha image, filter it, and run OCR on it.
    imagePath = marx.get_code()
    image = marx.imagefilter(imagePath)
    marx.image_to_string(image)
    time.sleep(0.5)
    imageString = marx.getString()
    print(imageString)
    time.sleep(0.5)
    marx.mProcess.terminate()
    # Retry with a fresh captcha until the server answers 200.
    resp_login = marx.login(str(sid), str(sid), imageString)
    if resp_login.status_code == 200:
        break

resp_info = marx.get_info()
soup = BeautifulSoup(resp_info.text, 'html.parser')
# Each semester title is wrapped in a <b> tag.
semesterList = [b.string for b in soup.find_all('b')]
tableAttrs = {'border': "0", 'cellpadding': "0", 'cellspacing': "1",
              'class': "displayTag", 'id': "user", 'width': "100%"}

if semesterList:
    # Opening the file as UTF-8 handles the Chinese text.
    with open(str(sid) + 'info.txt', 'w', encoding='utf8') as f:
        f.write(str(sid) + '\n\n')
        table = None
        for count, title in enumerate(semesterList):
            # The first grade table is searched from the document root;
            # each later one is the next matching table after the previous.
            if table is None:
                table = soup.find("table", attrs=tableAttrs)
                f.write('\n' + title + '\n')
            else:
                table = table.find_next("table", attrs=tableAttrs)
                f.write('\n\n' + title + '\n\n')
            info = marx.infoRe(table.text)
            for i in range(len(info) - 1):
                # Every 7th entry starts a new line (the table has 7 columns);
                # as in the original, that entry itself is not written.
                if i % 7 == 0:
                    f.write('\n')
                else:
                    f.write(info[i] + '\t\t\t\t\t')
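Because the code above separates fields with a fixed run of tabs, columns drift when values differ in length. One possible alternative, sketched here under assumptions, is to group the cells into rows of seven and write them as CSV. The sketch assumes a flat list of cell strings in row order. The actual return value of `marx.infoRe` is not shown in this post, and `to_rows` is a hypothetical helper.

```python
import csv
import io

COLUMNS = 7  # the grade table has seven columns


def to_rows(cells, width=COLUMNS):
    """Split a flat list of cell strings into rows of `width` cells."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Cell values taken from the HTML excerpt; the second row is cut off there too.
cells = ["GE1102", "02", "高等数学（上）", "Calculus I", "4", "通必", "90.0",
         "GE1104", "01", "大学物理（上）"]

buffer = io.StringIO()  # stands in for a file opened with encoding='utf8'
csv.writer(buffer).writerows(to_rows(cells))
print(buffer.getvalue())
```

A CSV file opens cleanly in a spreadsheet, which sidesteps the alignment problem instead of trying to pad mixed Chinese and English text to equal widths.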

First, based on the structure of the HTML, I extract the time period (the semester title), namely: