Building an automatic HDU AC machine with a Python crawler
Every hard-up CS undergrad's programming journey starts with hdu, and once the practice is mandatory it stops being fun, so you have to find your own amusement. A few days after I started learning web scraping I had an idea: could I write something that submits solutions automatically, just like the Zhihuishu auto-answer scripts, so I would never again have to worry about being scolded for solving too few hdu problems?
The first step, of course, is learning to crawl the solution code from CSDN, using regular expressions and BeautifulSoup to pull the code out of a CSDN page.
import re
import time
import random
import urllib.request
import urllib.error
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def search_code(url):  # given the URL of a solution page, return the code text
    headers = {
        xxxxxxxx  # your request headers (User-Agent and so on) go here
    }
    request = urllib.request.Request(url, headers=headers)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    item = ''
    soup = BeautifulSoup(html, 'html.parser')
    # newer CSDN posts wrap code in <span class="cpp">
    for item in soup.find_all('span', class_="cpp"):
        item = str(item)
        item = re.sub("<.*?>", "", item)   # strip the HTML tags
        item = re.sub("&lt;", "<", item)   # decode escaped angle brackets back into real code
        item = re.sub("&gt;", ">", item)
        item = translate_code(item)
        if item != '':
            return item
    # fall back to plain <code> blocks
    for item in soup.find_all('code'):
        item = str(item)
        item = re.sub("<.*?>", "", item)
        item = re.sub("&lt;", "<", item)
        item = re.sub("&gt;", ">", item)
        item = translate_code(item)
        return item
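The translate_code helper isn't shown above; a minimal sketch of one way it might work, assuming all it does is unescape the leftover HTML entities and throw away text that doesn't look like a full program (the checks here are only an illustration, not the real implementation):

import html

def translate_code(text):
    text = html.unescape(text)  # turn &amp;, &quot; and friends back into characters
    # keep only snippets that look like complete C/C++ or Java programs
    if '#include' in text or 'import java' in text or 'int main' in text:
        return text
    return ''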
This only fetches the code from a single page; we still need to search across many result pages, the goal being to collect more URLs of solution write-ups.
https://so.csdn.net/so/search/s.do?q=hdu1100&t=blog&u=
You can see that changing only the part after q= is all it takes to search for a different problem's solutions.
def search_answer(tihao):  # given a problem id, collect solution links and try them one by one
    url = 'https://so.csdn.net/so/search/all?q=hdu'+str(tihao)+'&t=all&p=1&s=0&tm=0&lv=-1&ft=0&l=&u='
It seems CSDN changed its page layout recently, so my earlier code stopped working and I had to switch to webdriver to drive a real browser instead.
    driver = webdriver.Chrome(r'C:\Program Files\Google\Chrome\Application\chromedriver.exe')
    driver.get(url)
    time.sleep(3)  # give the search results time to render
    html = driver.page_source
    link = []
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('div', class_="list-item"):
        item = str(item)
        link1 = re.findall(findpic, item)  # extract the CSDN article URL
        if len(link1) > 0:
            link.append(link1[0])
The earlier version of this code:
    url = 'https://so.csdn.net/so/search/s.do?t=all&s=&tm=&v=&l=&lv=&u=&q=hdu' + str(tihao)
    request = urllib.request.Request(url, headers=headers)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    link = []
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('div', class_="container-list container-other-list active"):
        item = str(item)
        link = re.findall(findpic, item)  # extract the CSDN article URLs
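Both versions rely on a regex called findpic that isn't defined in the snippets above; a minimal sketch, assuming it only has to capture the href of each blog link (the exact pattern is a guess and needs to match the real markup):

findpic = re.compile(r'href="(https://blog\.csdn\.net/.*?)"')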
Finally, a for loop (still inside search_answer) walks through the collected links, pulls the code from each one, and submits it:
    for i in range(0, len(link)):  # try the solution URLs one by one
        it = str(link[i])
        code = search_code(it)  # fetch the code from this page
        submit(tihao, code, i + 1)
        if query_result(tihao):
            break
        else:
            print(str(tihao) + ' failed')
        time.sleep(random.randint(1, 3))
        if i > 5:  # give up after a handful of attempts
            break
Of course there is still the submit part. The way I understand it, you first simulate logging in:
session = requests.Session()
session.post(url, data=data, headers=headers)  # url and data here are the login endpoint and the login form fields
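For reference, a minimal sketch of that login request, assuming the endpoint and field names match what a DevTools capture of the HDU login form shows (fill in your own account and verify against your own capture):

login_url = 'http://acm.hdu.edu.cn/userloginex.php?action=login'
data = {
    'username': 'yourID',
    'userpass': 'yourPassword',
    'login': 'Sign In'
}
session = requests.Session()
session.post(login_url, data=data, headers=headers)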
After logging in, the code is POSTed to hdu. The submit URL can be obtained by capturing the request in Chrome's DevTools, and the form fields being sent can be read off from the same capture.
if daima.find('import') != -1:  # looks like Java code
    data = {
        'check': '0',
        'problemid': str(tihao),
        'language': str(5),  # 5 is Java in HDU's submit form
        'usercode': daima
    }
else:
    data = {
        'check': '0',
        'problemid': str(tihao),
        'language': str(0),  # 0 is G++
        'usercode': daima
    }
r = session.post(url, data=data, headers=headers)  # url here is the submit endpoint from the capture
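Putting the pieces together, the submit(tihao, code, i) used in the loop above can be written roughly as below; the submit endpoint is an assumption based on the captured request and may need adjusting:

def submit(tihao, daima, attempt):
    submit_url = 'http://acm.hdu.edu.cn/submit.php?action=submit'  # assumed endpoint, check your own capture
    lang = 5 if daima.find('import') != -1 else 0  # Java if the code imports packages, otherwise G++
    data = {
        'check': '0',
        'problemid': str(tihao),
        'language': str(lang),
        'usercode': daima
    }
    r = session.post(submit_url, data=data, headers=headers)  # reuses the logged-in session
    print('attempt ' + str(attempt) + ' for ' + str(tihao) + ': HTTP ' + str(r.status_code))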
Once it's submitted, we obviously need to check whether it got AC.
def query_result(pid):  # check whether "yourID" already has an Accepted run for this problem
    url = 'http://acm.hdu.edu.cn/status.php?first=&pid=' + str(pid) + '&user=' + "yourID" + '&lang=0&status=0'
    headers = {
        xxxxxxxxxxxxxxxxxxx  # your request headers go here
    }
    html = requests.get(url, headers=headers)
    # HDU renders an Accepted verdict in red, so any red cell in this filtered status list means AC
    pattern_query = r'<td><font color=red>(.*?)</font>'
    query_result = re.findall(pattern_query, html.text)
    if len(query_result) > 0:
        return True
    else:
        return False
Finally, a little driver function and the program more or less runs:
def start():
    for i in range(1200, 1500):  # sweep a range of problem ids
        if query_result(i):
            print(str(i) + ' already AC')
        else:
            search_answer(i)
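To run the whole thing end to end, a standard entry point does the trick:

if __name__ == '__main__':
    start()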
The code is pretty ugly and there are plenty of problems with it, but it just about runs; I'll come back and improve it when I have time.