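*----------------------------------------------------------------A Programming Newbie-------------------------------------------------------*
Task: starting from a Zhihu user's page, count that user's followers, then keep crawling through those followers. Finally, sort the users by follower count.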
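import re
import requests


class crawlUser:
    # Constructor
    def __init__(self, userid, cookie):
        self.userId = userid
        self.fellowCount = 0
        self.fellowlist = []
        self.cookie = cookie
        self.page = ""

    # Download the user's followers page and keep its HTML text
    def getpage(self):
        url = "http://www.zhihu.com/people/" + self.userId + "/followers"
        self.response = requests.get(url, cookies=self.cookie)
        self.page = self.response.text

    # Extract follower ids from the page and keep every other match
    # (the markup apparently lists each follower twice)
    def getfellowcount(self):
        # Raw string so \$ is a regex escape; [^"]+ stops at the closing quote
        reg = r'data-tip="p\$t\$([^"]+)"'
        pattern = re.compile(reg)
        self.fellowCount = re.findall(pattern, self.page)
        count = 0
        for x in self.fellowCount:
            if (count % 2) == 0:
                self.fellowlist.append(x)
            count = count + 1
        return self.fellowlist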
m_cookie={"_za":"****************","a2404_times":"129",
"q_c1":"*********************",
"_xsrf":"********************",
"cap_id":"**************************",
"__utmt":"1","z_c0":"***********************1d5358",
"unlock_ticket":"***********************6d",
"__utma":"********************************",
"__utmb":"15**************","__utmc":"*****",
"__utmz":"155**********************************/",
"__utmv":"***************************************1"}
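userlist = {}              # user id -> number of followers found
tempUserList = ["****"]    # crawl queue, seeded with the starting user id
flag = 0                   # index of the next user to crawl
iter = 0                   # number of users crawled so far
# Stop when the queue is exhausted or after 10 users
while flag < len(tempUserList) and iter < 10:
    user = crawlUser(tempUserList[flag], m_cookie)
    user.getpage()
    fellowlist = user.getfellowcount()
    userlist[tempUserList[flag]] = len(fellowlist)
    # Queue every follower that has not been seen yet
    for x in fellowlist:
        if x not in tempUserList:
            tempUserList.append(x)
    print("user %s has %d followers" % (tempUserList[flag], userlist[tempUserList[flag]]))
    flag = flag + 1
    iter = iter + 1
print("crawl is done!")
# Sort (user id, follower count) pairs by follower count, ascending
sortRes = sorted(userlist.items(), key=lambda d: d[1])
"""f=open("userlist.txt","w+")
f.write(str(userlist))
f.close()"""
print(sortRes)
print("sort is done!")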
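The crawl loop above can also be written as a breadth-first traversal with a queue and a "seen" set. The sketch below is only an illustration that runs offline: `fetch_followers` is a hypothetical stand-in for `getpage()` plus `getfellowcount()`, and `FAKE_FOLLOWERS` is a made-up graph, not real Zhihu data. It sorts largest first, which is a choice of this sketch (the script above sorts ascending).

```python
from collections import deque

# Made-up follower graph used instead of real Zhihu pages (assumption for the demo).
FAKE_FOLLOWERS = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["alice", "bob", "dave"],
    "dave": [],
}


def fetch_followers(user_id):
    """Hypothetical stand-in for crawlUser.getpage() + getfellowcount()."""
    return FAKE_FOLLOWERS.get(user_id, [])


def crawl(seed, max_users=10, fetch=fetch_followers):
    """Breadth-first crawl from seed; returns {user_id: follower_count}."""
    counts = {}
    queue = deque([seed])
    seen = {seed}  # set membership is O(1), unlike `x not in some_list`
    while queue and len(counts) < max_users:
        user_id = queue.popleft()
        followers = fetch(user_id)
        counts[user_id] = len(followers)
        for follower in followers:
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return counts


def rank(counts):
    """Sort (user, count) pairs by count, largest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(rank(crawl("alice")))
```

Passing `fetch` as a parameter keeps the traversal separate from the network code, so the real page download could be swapped in later without touching the loop.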
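【Process】
1. After the small exercises of the past two days (downloading images and the like), today's task is a step harder, though still not very difficult.
2. There was no way around the login check, so I used a local cookie copied straight from the browser. That feels a bit crude; pointers to a better approach are welcome.
3. I capped the crawl at 10 users, because the server cut me off when I sent requests too frequently. I haven't studied this area enough yet; Mr. Wang said I could add a sleep to the process.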
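The sleep suggestion in item 3 might look roughly like the sketch below. The `Throttle` class and its parameter names are hypothetical, and the 0.2-second interval is an arbitrary value for the demo; the interval a real site tolerates is not known here. The idea is simply to call `wait()` before each request so that consecutive requests are spaced out.

```python
import time


class Throttle:
    """Make sure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.2)
    start = time.monotonic()
    for user_id in ["a", "b", "c"]:  # placeholder ids
        throttle.wait()
        # a real crawler would download the page for user_id here
    print("elapsed: %.1fs" % (time.monotonic() - start))
```

In the crawler above, the `throttle.wait()` call would go at the top of the `while` loop, just before `user.getpage()`.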
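【Takeaways】
1. Hands-on practice is the fastest way to learn.
2. requests is far easier to use than urllib.
3. PyCharm is only slightly better than Notepad.
4. The road ahead is long.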
*-------------------------------------------This blog is a record of my learning; comments and guidance from more experienced developers are welcome----------------------------------------*