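This post reads the contents of a table from a website, does some light processing, and writes the result to a CSV file. The parts worth noting are checking whether a character exists in a string, finding its position, and slicing the string by that position; Python is quite flexible for this. Later I will also compare two files and delete content from a file. To make what is already implemented reusable, the logic should be extracted into functions shared across scenarios rather than copied and pasted, which makes the code hard to maintain. That is the idea of continuous refactoring.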
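The three string operations mentioned above (existence check, position lookup, slicing by position) can be sketched as follows. The helper name `split_at_dot` and the sample text are made up for illustration; the link text on the real site may look different.

```python
def split_at_dot(text):
    """Split 'seq.rest' at the first '.'; return None when there is no '.'."""
    pos = text.find(".")           # find() returns -1 when "." is absent
    if pos == -1:
        return None
    return text[:pos], text[pos + 1:]


sample = "12.Widget A-1001"        # hypothetical link text
print("." in sample)               # True
print(sample.find("."))            # 2
print(split_at_dot(sample))        # ('12', 'Widget A-1001')
```

`in` is enough when only existence matters; `find` is the one to use when the position is needed for slicing afterwards.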
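import csv
import datetime
import re
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# `browser` (a Selenium WebDriver) and `wait` (a WebDriverWait) are assumed to
# be created earlier; `findPos` is a helper that is not shown in this post.

# Timestamp the file name so repeated runs do not overwrite each other.
fileName = "SplitNo"
nowTime = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
filePath = fileName + nowTime + ".csv"
csvFile = open(filePath, 'w', newline='', encoding='utf-8')
writer = csv.writer(csvFile, dialect='excel')
head = ["Page", "Seq. no.", "Code", "Product name", "Raw text"]
writer.writerow(head)
startPage = 1
totalPage = 1260

# Jump to the first page: type the page number, then click the button.
wait.until(EC.presence_of_element_located((By.ID, "content")))
browser.find_element(By.ID, "page").clear()
browser.find_element(By.ID, "page").send_keys(str(startPage))
browser.find_elements(By.XPATH, '//input[@src="images/dataanniu_11.gif"]')[0].click()
time.sleep(3)
n = startPage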
preText = ""  # text of the entry currently being assembled
while n < totalPage:
    wait.until(EC.presence_of_element_located((By.ID, "content")))
    content = browser.find_element(By.ID, "content")
    oneThanOneLine = False
    for attr in content.find_elements(By.TAG_NAME, "a"):
        text = str(attr.get_attribute('innerHTML'))
        text = text.replace('\r', '').replace('\n', '').replace('\t', '')
        print(text + " position of '.': " + str(text.find(".")))
        if text.find(".") != -1:
            # A "." marks the start of a new entry: build one CSV row.
            csvRow = [str(n)]
            pos = findPos(text)
            if pos != -1:
                name = text[0:pos-1]
                notext = text[pos:-1]
                parts = name.split(".")
                csvRow.append(parts[0])
                csvRow.append(notext.split(" ")[0])
                if len(parts) > 1:
                    csvRow.append(parts[1])
            csvRow.append(text)
            writer.writerow(csvRow)
            preText = text
            oneThanOneLine = False
        else:
            # No ".": treat the text as a continuation of the previous entry.
            preText = preText + text
            # Debugging aid (disabled): print each tag found by re.findall(cleanr, preText).
            # Strip HTML tags from the merged text.
            cleanr = re.compile('<.*?>')
            preText = re.sub(cleanr, '', preText)
            print(preText)
            oneThanOneLine = True
    # Move on to the next page.
    n = n + 1
    wait.until(EC.presence_of_element_located((By.ID, "page")))
    browser.find_element(By.ID, "page").clear()
    browser.find_element(By.ID, "page").send_keys(str(n))
    browser.find_elements(By.XPATH, '//input[@src="images/xxxx.gif"]')[0].click()
    print("Switched to page: " + str(n))
csvFile.close()
browser.close()
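One optional hardening of the file handling, sketched under assumptions: the script opens the CSV at the start and closes it only at the very end, so an exception partway through the pages can leave the file unclosed. Opening it in a `with` block closes it in every case. The helper name `write_rows` and the sample row are hypothetical.

```python
import csv
import datetime


def write_rows(prefix, head, rows):
    """Write head and rows to '<prefix><timestamp>.csv' and return the path."""
    stamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    path = prefix + stamp + ".csv"
    # The with-block closes the file even if writing a row raises.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, dialect='excel')
        writer.writerow(head)
        for row in rows:           # rows may be a list or a generator
            writer.writerow(row)
    return path


path = write_rows("SplitNo", ["Page", "Raw text"], [["1", "12.Widget A-1001"]])
print(path)
```

Because `rows` can be a generator, the scraping loop could yield rows one at a time instead of collecting them all in memory first.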
Problems encountered:
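1. TypeError: expected string or bytes-like object
Scenario:
content = browser.find_element(By.ID, "content")
# Pattern assumed from the variable name: capture the contents of each <td> cell.
tdList = re.findall(r'<td[^>]*>(.*?)</td>', str(content.get_attribute('innerHTML')), re.I | re.M)
if tdList:
    for item in tdList:
        print(item)
Function used: re.findall(pattern, string, flags)
The pattern is matched against a string, so the second argument must be a str (or bytes). Converting it with str() fixes the error.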
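A minimal reproduction of this error without a browser. What exactly was passed to `re.findall` in the original run is not recorded, so the non-string `object()` below is only a stand-in; the sample HTML and the helper name `td_texts` are made up. The sketch uses `re.S` rather than `re.M` so that a cell spanning several lines still matches, which is a choice of this example.

```python
import re

TD_PATTERN = re.compile(r'<td[^>]*>(.*?)</td>', re.I | re.S)


def td_texts(value):
    """Return the inner text of every <td> in value, converting it to str first."""
    return TD_PATTERN.findall(str(value))


html = "<tr><td>1</td><td class='n'>Widget</td></tr>"   # made-up sample
print(td_texts(html))                  # ['1', 'Widget']

try:
    re.findall(TD_PATTERN, object())   # not a str or bytes: raises TypeError
except TypeError as exc:
    print("TypeError:", exc)
```

Note that `str()` silences the error even for `None` (it becomes the text "None"), so an empty result may still deserve a check.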
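2. Treating the last element of a list specially inside a loop
Scenario:
aTag = attr.find_elements(By.TAG_NAME, "a")
if len(aTag) > 1:
    for aText in aTag:
        # True only when aText is the last element of the list.
        if len(aTag) - 1 == aTag.index(aText):
            print(aText)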
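An alternative sketch for the same task using `enumerate`. `list.index` returns the position of the first equal element and scans the list on every call, so with duplicate values the "last element" test can miss; tracking the index directly avoids both issues. The helper name `mark_last` and the sample list are hypothetical.

```python
def mark_last(items):
    """Yield (item, is_last) pairs; is_last is True only for the final item."""
    last = len(items) - 1
    for i, item in enumerate(items):
        yield item, i == last


links = ["first", "middle", "first"]   # duplicate value on purpose
for text, is_last in mark_last(links):
    if is_last:
        print("last:", text)           # last: first
```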
3. For page jumps that may time out, waiting for an element with wait.until is more reliable
wait.until(EC.presence_of_element_located((By.ID, "content")))
print(str(n) + "img[@src='images/dataanniu_07.gif'])")
content = browser.find_element(By.ID, "content")
content.find_elements(By.XPATH, '//img[@src="images/dataanniu_07.gif"]')[0].click()
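The page-jump steps (wait for the page box, clear it, type the number, click the button, wait for the content) appear more than once in the script, so they are a natural candidate for a shared function. The sketch below is one possible shape, not the code actually used: the name `go_to_page` and its parameters are hypothetical, and the locator strings `"id"` and `"xpath"` are the string values behind `By.ID` and `By.XPATH`, written out here only so the snippet has no import of its own.

```python
def go_to_page(browser, wait, page, button_xpath):
    """Type page into the #page box and click the button found by button_xpath."""
    # wait.until retries the callable until it returns an element or times out.
    box = wait.until(lambda d: d.find_element("id", "page"))
    box.clear()
    box.send_keys(str(page))
    browser.find_elements("xpath", button_xpath)[0].click()
    # Block until the result container is present again.
    wait.until(lambda d: d.find_element("id", "content"))


# Intended use inside the scraping loop (needs a live browser session):
# go_to_page(browser, wait, n, '//input[@src="images/dataanniu_11.gif"]')
```

Passing `browser` and `wait` in as arguments keeps the function free of globals, which also makes it easy to exercise with stand-in objects.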
4. Finding the HTML tags in a string
p = re.compile(r'<[^>]+>', re.S)  # assumed pattern: any <...> tag
matches = re.findall(p, preText)
for match in matches:
    print(match)
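A runnable sketch of this lookup, assuming the intent is to list every `<...>` tag in the text. The pattern `<[^>]+>` and the sample string are assumptions made for this example.

```python
import re

TAG = re.compile(r'<[^>]+>')           # "<", then anything except ">", then ">"

sample = 'Widget<br><span class="no">A-1001</span>'   # made-up sample
print(TAG.findall(sample))             # ['<br>', '<span class="no">', '</span>']
```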
5. Removing the HTML tags from a string
cleanr = re.compile('<.*?>')
preText = re.sub(cleanr, '', preText)
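A small sketch of tag stripping that shows why the non-greedy form `<.*?>` matters: the greedy `<.*>` runs from the first `<` to the last `>` and swallows the text in between. The helper name `strip_tags` and the sample string are made up; regex stripping is a heuristic for simple markup, not a full HTML parser.

```python
import re


def strip_tags(text):
    """Remove every <...> tag from text using a non-greedy match."""
    return re.sub(r'<.*?>', '', text)


sample = "<b>Widget</b> A-1001"        # made-up sample
print(repr(strip_tags(sample)))              # 'Widget A-1001'
print(repr(re.sub(r'<.*>', '', sample)))     # ' A-1001' (greedy removes too much)
```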
6. Recording the program's execution time
import datetime
startTime = datetime.datetime.now()
# ... the work being timed goes here ...
endTime = datetime.datetime.now()
print((endTime - startTime).seconds)
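One detail worth knowing when timing a long crawl: the `seconds` attribute of a `timedelta` holds only the seconds component (it excludes whole days and the fractional part), while `total_seconds()` returns the full duration. The sketch below uses a short `sleep` as a stand-in for the real work.

```python
import datetime
import time

startTime = datetime.datetime.now()
time.sleep(0.2)                        # stand-in for the real work
elapsed = datetime.datetime.now() - startTime
print(elapsed.total_seconds())         # full duration as a float

long_run = datetime.timedelta(days=1, seconds=5)
print(long_run.seconds)                # 5
print(long_run.total_seconds())        # 86405.0
```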