My manager recently asked me to pull some data from a website. Doing it by hand was too slow, so I found some Python snippets online and put together a script.
```python
import re

import xlrd
import requests
from bs4 import BeautifulSoup
from xlutils.copy import copy


def read_excel(path):
    # Open the workbook and take the first sheet (sheet indexes start at 0)
    workbook = xlrd.open_workbook(path)
    sheet1 = workbook.sheet_by_index(0)

    pattern1 = '^https://ews-aln-core.cisco.com/applmgmt/view-appl/+[0-9]*$'
    pattern2 = '^https://ews-aln-core.cisco.com/applmgmt/view-endpoint/+[0-9]*$'
    pattern3 = '^https://ews-aln-core.cisco.com/applmgmt/view-appl/by-name/'

    for i in range(sheet1.nrows):
        # The first column holds the URL; strip stray quote characters
        url = sheet1.cell_value(i, 0).replace('\'', '')
        print(url, i)
        response = get_responseHtml(url)
        soup = get_beautifulSoup(response)
        if pattern_match(url, pattern1) or pattern_match(url, pattern3):
            priority = (soup.find("table", class_="main_table_layout")
                            .find("tr", class_="centered sub_section_header")
                            .find_next("tr", align="center")
                            .find_all("td"))
        elif pattern_match(url, pattern2):
            priority = (soup.find("table", class_="main_table_layout")
                            .find("tr", class_="centered")
                            .find_next("tr", align="center")
                            .find_all("td"))
        else:
            print("no pattern")
            continue
        try:
            priorityNumber = 'P' + get_last_td(priority)
        except Exception:
            print("not found: " + url)
            continue
        write_excel(path, i, 1, priorityNumber)


def write_excel(path, row, col, value):
    # xlutils.copy needs an xlrd workbook object, not the xls file name
    oldwb = xlrd.open_workbook(path)
    wb = copy(oldwb)
    ws = wb.get_sheet(0)
    ws.write(row, col, value)
    wb.save(path)


def get_last_td(result):
    # Text of the last <td> in the matched row
    return result[-1].contents[0]


def get_beautifulSoup(markup):
    return BeautifulSoup(markup, 'html.parser', from_encoding='utf-8')


def get_responseHtml(url):
    # A browser-like User-Agent keeps the server from closing the connection
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/56.0.2924.87 Safari/537.36'}
    return requests.get(url, auth=(userName, passWord), headers=headers).content


def pattern_match(s, pattern, flags=0):
    return re.match(re.compile(pattern), s, flags)


if __name__ == '__main__':
    userName = '*'
    passWord = '*'
    path = r'*'
    read_excel(path)
```
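The URL classification can be exercised without touching the network or the site itself. A minimal sketch of the `pattern_match` logic, using made-up IDs in the example URLs (the IDs are hypothetical, only the patterns come from the script):

```python
import re

def pattern_match(s, pattern, flags=0):
    # re.match anchors at the start of the string; patterns that also
    # end with $ must then match the whole URL.
    return re.match(re.compile(pattern), s, flags)

pattern1 = '^https://ews-aln-core.cisco.com/applmgmt/view-appl/+[0-9]*$'
pattern2 = '^https://ews-aln-core.cisco.com/applmgmt/view-endpoint/+[0-9]*$'
pattern3 = '^https://ews-aln-core.cisco.com/applmgmt/view-appl/by-name/'

# Hypothetical IDs, for illustration only
print(bool(pattern_match('https://ews-aln-core.cisco.com/applmgmt/view-appl/123', pattern1)))          # True
print(bool(pattern_match('https://ews-aln-core.cisco.com/applmgmt/view-endpoint/456', pattern2)))      # True
print(bool(pattern_match('https://ews-aln-core.cisco.com/applmgmt/view-appl/by-name/foo', pattern3)))  # True
print(bool(pattern_match('https://ews-aln-core.cisco.com/other/789', pattern1)))                       # False
```

Note that `pattern3` has no trailing `$`, so it matches any URL that merely starts with the `by-name/` prefix.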
There were quite a few pitfalls along the way:
1. At first I used an .xlsx file, and after save the file wouldn't open; changing the Excel format to .xls fixed it (xlwt only writes the old .xls format).
2. The header came from the web; with it the request is not treated as a crawler, which otherwise fails with: http.client.RemoteDisconnected: Remote end closed connection without response.
3. The argument to copy must be a workbook object, not the xls file name, otherwise: AttributeError: 'str' object has no attribute 'datemode'.
4. At first I wanted to save to a new file under a new path, but that doesn't work: every pass of the for loop copies from the source Excel again, so in the end only one row was actually written.
5. For Beautiful Soup usage in Python, a very complete reference: the Beautiful Soup 4.2.0 documentation.
6. I had never written Python before; this first attempt took half a day, and there is still plenty that could be improved.
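Pitfall 2 can be checked without hitting the real site. The script uses requests, but the same browser-style User-Agent can be attached with the standard-library urllib as well; the sketch below only builds the request object and never sends it (example.com is a placeholder URL):

```python
from urllib.request import Request

# Browser-like User-Agent from the script; Python's default client string
# is what triggers the RemoteDisconnected error on some servers.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/56.0.2924.87 Safari/537.36'
}

req = Request('https://example.com/page', headers=headers)
# urllib normalizes header names via str.capitalize(), hence 'User-agent'
print(req.get_header('User-agent'))
```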