Scraping Baidu News
This post scrapes news results for a given keyword. The example searches for the latest news about 华山 (Mount Hua) and writes the results to an Excel file with three fields: author, time, and content (the news headline).
The overall flow:
- Build the URL, i.e. URL-encode the keyword into the query string.
- Clean the scraped text, filtering out noise such as newlines and non-breaking spaces.
- Write out the results.
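As a quick sketch of the first two steps: `urllib.parse.urlencode` handles the keyword encoding, and `re.sub` does the cleanup (the sample strings below are made up for illustration):

```python
import re
from urllib import parse

# Step 1: URL-encode the keyword into a query string
query = parse.urlencode({'word': '华山', 'tn': 'news'})
url = 'https://www.baidu.com/s?' + query
# '华山' becomes '%E5%8D%8E%E5%B1%B1' in the query string

# Step 2: strip newlines and non-breaking spaces from scraped text
raw = '\n人民网\xa02019年10月01日 12:00\n'  # made-up sample of an author/time string
clean = re.sub(r'\n|\xa0', ' ', raw).strip()
```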
Here is the code:
```python
import re
from urllib import parse
import time

import requests
from bs4 import BeautifulSoup
import pandas as pd


def html_decode(url):
    """Fetch one result page and extract the title, author and publish time."""
    agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36')
    headers = {"User-Agent": agent}
    time.sleep(5)  # throttle requests so Baidu doesn't block us
    html = requests.get(url, headers=headers)
    # BeautifulSoup 4 takes from_encoding (fromEncoding was the BS3 spelling)
    # and wants an explicit parser
    soup = BeautifulSoup(html.content, 'html.parser', from_encoding='utf-8')
    content_list = []
    author_list = []
    time_list = []
    for mulu in soup.find_all('div', attrs={'class': 'result'}):
        # get subject (the news headline)
        subject = mulu.h3.a.get_text()
        author_time = mulu.div.p.get_text()
        # drop newlines and non-breaking spaces (\xa0) from the scraped text
        text1 = re.sub(r"\n|\xa0", '', subject)
        text2 = re.sub(r"\n|\xa0", '', author_time).strip()
        # author and time are separated by an ordinary space
        author = text2.split(' ')[0]
        pub_time = text2.split(' ')[1]  # renamed so it doesn't shadow the time module
        content_list.append(text1)
        author_list.append(author)
        time_list.append(pub_time)
    return content_list, author_list, time_list


def get_context(keyword, page):
    all_content = []
    all_author = []
    all_time = []
    if not isinstance(keyword, str):
        print('url wrong!')
        return all_content, all_author, all_time
    main_url = r'https://www.baidu.com/'
    parameter = {'ie': 'utf-8',
                 'cl': 2,
                 'medium': 0,
                 'rtt': 1,
                 'bsst': 1,
                 'rsv_dl': 'news_t_sk',
                 'tn': 'news',
                 'word': keyword,
                 'rsv_sug3': 8,
                 'rsv_sug4': 445,
                 'rsv_sug1': 7,
                 'rsv_sug2': 0,
                 'inputT': 3929,
                 'rsv_sug': 1,
                 'x_bfe_rqs': '03E80',
                 'x_bfe_tjscore': 0.481969,
                 'tngroupname': 'organic_news',
                 'pn': page * 10}
    # get the first page: its URL carries no pn parameter,
    # so cut the query string off at &inputT
    url_data = parse.urlencode(parameter)
    first_page_parameter = str(url_data).split('&inputT')[0]
    first_page = parse.urljoin(main_url, 's?' + first_page_parameter)
    first_item, first_author_list, first_time_list = html_decode(first_page)
    all_content.extend(first_item)
    all_author.extend(first_author_list)
    all_time.extend(first_time_list)
    # get the remaining pages
    for num in range(1, page + 1):
        print('this is page %d!' % num)
        parameter['pn'] = num * 10
        url_data = parse.urlencode(parameter)
        all_url = parse.urljoin(main_url, 's?' + url_data)
        other_items, other_author_list, other_time_list = html_decode(all_url)
        all_content.extend(other_items)
        all_author.extend(other_author_list)
        all_time.extend(other_time_list)
    return all_content, all_author, all_time


if __name__ == '__main__':
    date = time.strftime('%m-%d', time.localtime())
    key_word = '华山'
    page_num = 5
    all_content, all_author, all_time = get_context(key_word, page_num)
    result = pd.DataFrame({'author': all_author, 'time': all_time,
                           'content': all_content})
    result.to_excel('result_%s.xlsx' % date)
    print('%d items in total!' % len(all_content))
    print('Done scraping!')
```
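To see what the exported table looks like without actually scraping, the same three-column DataFrame can be built from a couple of hand-made rows (the author/time/content values below are invented; writing `.xlsx` additionally requires the openpyxl package):

```python
import pandas as pd

# Hypothetical scraped rows, just to show the output shape
rows = {
    'author': ['人民网', '新华网'],
    'time': ['2019年10月01日', '2019年10月02日'],
    'content': ['标题一', '标题二'],
}
result = pd.DataFrame(rows)
print(result.shape)  # (2, 3)
# result.to_excel('result_demo.xlsx')  # needs openpyxl installed
```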
If anything is wrong or could be improved, please leave a comment. Thanks!
Referenced blog posts:
https://blog.csdn.net/yilovexing/article/details/80939039
https://blog.csdn.net/qq_40691189/article/details/100515940
https://blog.csdn.net/qq_38423499/article/details/103338930
https://www.cnblogs.com/hixiaowei/p/9734695.html