I'm trying to scrape multiple websites (using Python 2.7) to determine whether specific keywords appear in them. My code:

import urllib2
import socket
import ssl
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'EXPORT_Results!.csv'

# LIST OF KEYWORDS (MATCHING THE FIELD NAMES ABOVE)
keywords = ['@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv', 'r') as f:
    for line in f:
        strdomain = line.strip()
        # INITIALIZE DICT
        data = {'Website': strdomain}
        if '.nl' in strdomain:
            try:
                req = urllib2.Request(strdomain)
                response = urllib2.urlopen(req)
                html_content = response.read()
                # ITERATE THROUGH EACH KEYWORD AND UPDATE DICT
                for searchstring in keywords:
                    if searchstring.lower() in html_content.lower():
                        print (strdomain, searchstring, 'found')
                        data[searchstring] = 'found'
                    else:
                        print (strdomain, searchstring, 'not found')
                        data[searchstring] = 'not found'
                # CALL METHOD PASSING DICT AND OUTPUT FILE
                csv_writer(data, csv_output_file)
            except urllib2.HTTPError:
                print (strdomain, 'HTTP ERROR')
            except urllib2.URLError:
                print (strdomain, 'URL ERROR')
            except socket.error:
                print (strdomain, 'SOCKET ERROR')
            except ssl.CertificateError:
                print (strdomain, 'SSL Certificate ERROR')
However, my crawler does not seem to be very accurate about this. For example: I am scraping a list of websites to determine whether their source code contains keywords such as @media and {}. Once the script has finished, I check the results by hand. In that manual check (searching for the keywords in each URL's code via Chrome's Inspect Element), I found that some websites do contain @media and/or {} in their code, while my crawler reports that they do not.
Perhaps this is because a site's code as shown by Chrome's Inspect Element does not exactly match its view-source code. For example, this site contains googleadservices.com/pagead/conversion_async.js in its Inspect Element code but not in its view-source code.
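That gap can be reproduced without a browser: urllib2 (like any plain HTTP fetch) returns only the raw HTML the server sends, whereas Inspect Element shows the DOM after JavaScript has run. A minimal sketch (the page below is made up for illustration):

```python
# Raw HTML as a server might deliver it: the keyword URL is assembled by
# JavaScript at runtime, so it never appears as a literal substring.
raw_html = """
<html><head>
<script>
  var s = document.createElement('script');
  s.src = '//www.googleadservices' + '.com/pagead/conversion_async.js';
  document.head.appendChild(s);
</script>
</head><body></body></html>
"""

keyword = 'googleadservices.com/pagead/conversion_async.js'

# A plain substring scan over the raw source -- what the crawler does --
# misses the keyword, because the full URL only exists after the script runs.
print(keyword.lower() in raw_html.lower())  # False

# Once a browser executes the script, the rendered DOM contains
# <script src="//www.googleadservices.com/pagead/conversion_async.js">,
# which is why Inspect Element finds the keyword.
```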
My question: does my crawler only scrape a site's view-source code, rather than its Inspect Element code (where it should also be looking)?

If that is my problem, how can I fix it?
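Yes: urllib2 downloads only the raw view-source HTML and never executes JavaScript, so anything injected at runtime is invisible to the crawler. To search the rendered DOM you need something that runs JavaScript, e.g. a headless browser. A sketch using Selenium (a third-party package; the headless Chrome setup below assumes chromedriver is on your PATH):

```python
def scan_keywords(html, keywords):
    """Return {keyword: 'found'/'not found'} via a case-insensitive substring scan."""
    html_lower = html.lower()
    return {kw: ('found' if kw.lower() in html_lower else 'not found')
            for kw in keywords}

def fetch_rendered_html(url):
    """Load the page in headless Chrome and return the DOM after JavaScript ran."""
    from selenium import webdriver  # deferred import: optional dependency
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)  # assumes chromedriver on PATH
    try:
        driver.get(url)
        return driver.page_source  # serialized *rendered* DOM, not raw source
    finally:
        driver.quit()
```

fetch_rendered_html could replace the urllib2 fetch inside the loop, and scan_keywords mirrors the existing keyword check, so the CSV-writing code can stay as it is.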