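This is my first time writing a post, so the code style may be ugly and the wording clumsy; please bear with me.
On to the topic: I was handed a last-minute task to scrape the trending search terms from Qichacha and to keep them updated on a schedule. The page to scrape is shown below.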
[Image: screenshot of the page to be scraped]
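I had written parsing code for this page before, but it was so long ago that I can no longer find it, which is a bit annoying. Fortunately the page has no anti-scraping measures. Without further ado, here is the code.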
import time

import requests
from lxml import etree

url = 'https://www.qichacha.com/cms_topsearch'
# Placeholder request headers; the original post does not show its own.
headers = {'User-Agent': 'Mozilla/5.0'}
# conn is assumed to be an open DB-API connection created elsewhere
# (the original post does not show how it is created).
# The %s placeholders assume a MySQL driver such as pymysql.
sql = ('insert into top_search(type_,company,search_num,company_url,sj_time) '
       'values(%s,%s,%s,%s,%s)')


def qcc_reci():
    ht = requests.get(url=url, headers=headers)
    et = etree.HTML(ht.text)
    uls = et.xpath('//ul[@class="list-group topsearch-list"][1]/a')
    # The three lists come back as one flat list of links.
    # Slice boundaries are kept exactly as in the original code.
    sections = [
        ('今日热搜', uls[:51]),     # today's hot searches
        ('一周热搜', uls[51:101]),  # this week's hot searches
        ('一月热搜', uls[101:]),    # this month's hot searches
    ]
    cursor = conn.cursor()
    for type_, links in sections:
        for ul in links:
            search_num = ul.xpath('./span[last()]/text()')[0]
            company = ul.xpath('./span[last()-1]/text()')[0]
            company_url = 'https://www.qichacha.com' + ul.xpath('./@href')[0]
            date = time.strftime('%Y-%m-%d %H:%M:%S')
            print(company, search_num, company_url, date)
            # Pass the values separately so the driver handles quoting.
            cursor.execute(sql, (type_, company, search_num, company_url, date))
            conn.commit()
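Passing the values to execute separately from the SQL text lets the database driver handle quoting, instead of relying on Python's repr to produce valid SQL literals. Below is a minimal sketch of that pattern. It uses the standard-library sqlite3 with an in-memory database purely as a stand-in, since the post does not say which database is behind conn. The helper name save_rows and the sample row are hypothetical, and note that sqlite3 uses ? placeholders while MySQL drivers such as pymysql use %s.

```python
import sqlite3

CREATE_SQL = (
    "create table if not exists top_search("
    "type_ text, company text, search_num text, company_url text, sj_time text)"
)
INSERT_SQL = (
    "insert into top_search(type_, company, search_num, company_url, sj_time) "
    "values (?, ?, ?, ?, ?)"
)


def save_rows(conn, rows):
    """Insert all rows in one transaction and return how many were written."""
    conn.execute(CREATE_SQL)
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT_SQL, rows)
    return len(rows)


# Hypothetical sample row; the company name deliberately contains a quote.
sample_rows = [
    ("今日热搜", "O'Neil Trading Co.", "12345",
     "https://www.qichacha.com/firm_example.html", "2019-01-01 10:30:00"),
]
demo_conn = sqlite3.connect(":memory:")
print(save_rows(demo_conn, sample_rows))
print(demo_conn.execute("select company from top_search").fetchall())
```

Writing all rows in one transaction also avoids committing after every single insert.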
The page parsing is fairly simple; as a beginner, I mainly wanted to get familiar with the workflow.
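To check the extraction logic without hitting the site, the same span and href lookups can be run against a small hand-written snippet. The sketch below rests on assumptions: the markup is guessed from the XPath used in the scraper (links inside the ul, with the name and the count in the last two span elements), and it swaps lxml for the standard-library xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup. The function name extract_rows and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, guessed from the XPath used in the scraper.
SAMPLE = """
<div>
  <ul class="list-group topsearch-list">
    <a href="/firm_a.html"><span>1</span><span>Company A</span><span>9000</span></a>
    <a href="/firm_b.html"><span>2</span><span>Company B</span><span>8000</span></a>
  </ul>
</div>
"""

BASE_URL = "https://www.qichacha.com"


def extract_rows(markup):
    """Return (company, search_num, company_url) for every link in the list."""
    root = ET.fromstring(markup.strip())
    rows = []
    for a in root.findall(".//ul[@class='list-group topsearch-list']/a"):
        search_num = a.find("span[last()]").text   # last span: search count
        company = a.find("span[last()-1]").text    # second-to-last span: name
        rows.append((company, search_num, BASE_URL + a.get("href")))
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```

If the real page changes its layout, a fixture like this is one cheap way to notice that the selectors no longer match.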
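The next step is to turn it into a scheduled job. I used schedule, which is a third-party Python library (installed with pip install schedule) rather than part of the standard library. Some examples of its syntax: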
schedule.every(1).minutes.do(job)               # run the job every minute
schedule.every().hour.do(job)                   # run the job every hour
schedule.every().day.at("10:30").do(job)        # run the job every day at 10:30
schedule.every(5).to(10).days.do(job)           # run the job every 5 to 10 days
schedule.every().monday.do(job)                 # run the job every Monday at this time of day
schedule.every().wednesday.at("13:15").do(job)  # run the job every Wednesday at 13:15
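import time

import schedule


def seach():
    # Register the scraper to run every 20 seconds.
    schedule.every(20).seconds.do(qcc_reci)
    while True:
        schedule.run_pending()  # run whatever is due right now
        time.sleep(1)


seach()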
run_pending: runs every job that is currently due.
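To make the loop more concrete, here is a small standard-library model of the idea behind run_pending: each job remembers when it is next due, and each pass runs only the jobs whose time has come. This is a simplified illustration and not how the schedule package is implemented; the class name TinyScheduler and its methods are hypothetical.

```python
import time


class TinyScheduler:
    """Toy interval scheduler; a model of the idea, not the schedule package."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.jobs = []  # each entry: [interval_seconds, next_run_time, func]

    def every(self, seconds, func):
        """Register func to run every `seconds` seconds, starting one interval from now."""
        self.jobs.append([seconds, self.clock() + seconds, func])

    def run_pending(self):
        """Run every job that is due and return how many ran."""
        now = self.clock()
        ran = 0
        for job in self.jobs:
            interval, next_run, func = job
            if now >= next_run:
                func()
                job[1] = now + interval  # schedule the next run
                ran += 1
        return ran


# Demo with a fake clock so the example finishes instantly.
fake_now = [0.0]
calls = []
scheduler = TinyScheduler(clock=lambda: fake_now[0])
scheduler.every(20, lambda: calls.append(fake_now[0]))

for tick in range(0, 61):  # simulate 60 one-second ticks
    fake_now[0] = float(tick)
    scheduler.run_pending()

print(calls)  # [20.0, 40.0, 60.0]
```

The one-second sleep in the real loop plays the role of the tick here: it only decides how often the due check happens, not how often the job runs.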
This is my first post on Jianshu, so there is a lot of formatting I don't know how to use yet...