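URL
https://space.bilibili.com/
Opening this link may redirect you to the login page. Log in and inspect the page; your own profile page looks like this:
Then open someone else's profile and compare the URLs:
The only difference is the number at the end of the URL. Open a few more profiles to confirm this.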
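Since profile URLs differ only in the trailing number, the mapping between a user ID and its URL can be sketched as below. The helper names `space_url` and `mid_from_url` are made up for illustration and are not part of the original code.

```python
from urllib.parse import urlparse

BASE_URL = 'https://space.bilibili.com/'


def space_url(mid):
    """Build the profile URL for a numeric user ID (hypothetical helper)."""
    return BASE_URL + str(mid)


def mid_from_url(url):
    """Extract the numeric user ID from a profile URL (hypothetical helper)."""
    # Take the first path segment so a trailing slash or query string is ignored.
    segment = urlparse(url).path.strip('/').split('/')[0]
    if not segment.isdigit():
        raise ValueError('no numeric user ID in ' + url)
    return int(segment)


print(space_url(4899781))
print(mid_from_url('https://space.bilibili.com/4899781/'))
```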
Next, build the request headers.
The code is as follows:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Referer': 'https://space.bilibili.com/4899781/',
    'Origin': 'http://space.bilibili.com',
    'Host': 'space.bilibili.com',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
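Set up the proxy IPs:
The code is as follows:
proxies = {
    # A dict keeps only the last value for a repeated key,
    # so each proxy is mapped to its own scheme here.
    'http': 'http://118.190.95.35:9001',
    'https': 'http://121.49.110.65:8888',
}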
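Because a Python dict holds only one value per key, rotating between several proxies needs a list. Below is a minimal sketch, assuming the two addresses from the snippet are still reachable (free proxies usually expire quickly). `PROXY_POOL` and `pick_proxies` are illustrative names only.

```python
import random

# Candidate proxies, taken from the snippet above; they may no longer be alive.
PROXY_POOL = [
    'http://118.190.95.35:9001',
    'http://121.49.110.65:8888',
]


def pick_proxies(pool=PROXY_POOL):
    """Return a requests-style proxies dict using one randomly chosen proxy."""
    proxy = random.choice(pool)
    # Route both plain and TLS traffic through the same proxy.
    return {'http': proxy, 'https': proxy}


proxies = pick_proxies()
print(proxies)
```

Calling `pick_proxies()` once per request spreads the traffic over the pool instead of pinning every request to a single address.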
Build the URL list:
The code is as follows:
urls = []
for x in range(1, 2):
    # With x = 1 this covers user IDs 100 to 199.
    for i in range(x * 100, (x + 1) * 100):
        url = 'https://space.bilibili.com/' + str(i)
        print(url)
        urls.append(url)
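Fetch the data:
import random
import time

import requests

def getSource(url):
    # uas is a list of User-Agent strings that must be defined beforehand.
    ua = random.choice(uas)
    headers = {
        'User-Agent': ua,
        # Referer with a randomly generated seid
        'Referer': 'https://space.bilibili.com/' + str(1) + '?from=search&seid=' + str(random.randint(10000, 50000))
    }
    # Assumption: the original does not show how payload is built;
    # here the user ID is taken from the end of the URL.
    payload = {'mid': url.rstrip('/').split('/')[-1]}
    time1 = time.time()
    jscontent = requests.session().post('http://space.bilibili.com/ajax/member/GetInfo', headers=headers, data=payload, proxies=proxies).text
    time2 = time.time()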
Parse the data:
The response is a dictionary, so the parsing code is as follows:
# Continues inside getSource(), right after the request; requires `import json` at the top of the file.
try:
    jsDict = json.loads(jscontent)
    statusJson = jsDict['status'] if 'status' in jsDict else False
    if statusJson == True:
        if 'data' in jsDict:
            jsData = jsDict['data']
            mid = jsData['mid']
            name = jsData['name']
            sex = jsData['sex']
            rank = jsData['rank']
            face = jsData['face']
            # Convert the registration timestamp into a formatted time string
            regtimestamp = jsData['regtime']
            regtime_local = time.localtime(regtimestamp)
            regtime = time.strftime("%Y-%m-%d %H:%M:%S", regtime_local)
            spacesta = jsData['spacesta']
            birthday = jsData['birthday'] if 'birthday' in jsData else 'nobirthday'
            sign = jsData['sign']
            level = jsData['level_info']['current_level']
            OfficialVerifyType = jsData['official_verify']['type']
            OfficialVerifyDesc = jsData['official_verify']['desc']
            vipType = jsData['vip']['vipType']
            vipStatus = jsData['vip']['vipStatus']
            toutu = jsData['toutu']
            toutuId = jsData['toutuId']
            coins = jsData['coins']
            print("Succeed get user info:" + str(mid) + '\t' + str(time2 - time1))
except Exception as e:
    print(e)
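The field-by-field extraction can also be written more defensively. The sketch below assumes the response has the shape implied by the parsing code (`status`, `data`, and the nested `level_info` and `vip` objects). The sample payload is invented for illustration and is not real API output, and `parse_user` is a hypothetical helper name.

```python
import json
import time


def parse_user(jscontent):
    """Return a flat dict of user fields, or None if the response is unusable."""
    js_dict = json.loads(jscontent)
    if not js_dict.get('status') or 'data' not in js_dict:
        return None
    data = js_dict['data']
    # Convert the registration timestamp into a readable local time string.
    regtime = time.strftime('%Y-%m-%d %H:%M:%S',
                            time.localtime(data.get('regtime', 0)))
    return {
        'mid': data['mid'],
        'name': data.get('name', ''),
        'sex': data.get('sex', ''),
        'regtime': regtime,
        # Optional field: fall back to the same placeholder the tutorial uses.
        'birthday': data.get('birthday', 'nobirthday'),
        'level': data.get('level_info', {}).get('current_level'),
        'vipType': data.get('vip', {}).get('vipType'),
    }


# Invented sample response, only to exercise the function.
sample = json.dumps({
    'status': True,
    'data': {'mid': 100, 'name': 'demo', 'sex': 'secret',
             'regtime': 1245823483,
             'level_info': {'current_level': 3},
             'vip': {'vipType': 0}},
})
print(parse_user(sample))
```

Returning a dict rather than a set of local variables makes it easier to hand the result to the storage step, and `dict.get` avoids a `KeyError` when an optional field is missing.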
Next, store the data in a database.
MySQL is used here. First we need to create a table; the statements are as follows:
DROP TABLE IF EXISTS `bilibili_user_info`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `bilibili_user_info` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `mid` int(20) unsigned NOT NULL,
  `name` varchar(45) NOT NULL,
  `sex` varchar(45) NOT NULL,
  `rank` varchar(45) NOT NULL,
  `face` varchar(200) NOT NULL,
  `regtime` varchar(45) NOT NULL,
  `spacesta` varchar(45) NOT NULL,
  `birthday` varchar(45) NOT NULL,
  `sign` varchar(300) NOT NULL,
  `level` varchar(45) NOT NULL,
  `OfficialVerifyType` varchar(45) NOT NULL,
  `OfficialVerifyDesc` varchar(100) NOT NULL,
  `vipType` varchar(45) NOT NULL,
  `vipStatus` varchar(45) NOT NULL,
  `toutu` varchar(200) NOT NULL,
  `toutuId` int(20) unsigned NOT NULL,
  `coins` int(20) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `bilibili_user_info`
--

LOCK TABLES `bilibili_user_info` WRITE;
UNLOCK TABLES;
Store the data in the database. The code is as follows:
import pymysql

try:
    # Please write your MySQL's information.
    conn = pymysql.connect(host='localhost', user='root', password='123456', database='weixin', charset='utf8')
    cur = conn.cursor()
    # NULL lets MySQL assign the AUTO_INCREMENT id (a fixed id of 1 fails on the second insert).
    # Values are passed as parameters so the driver escapes quotes in fields such as sign.
    cur.execute(
        'INSERT INTO bilibili_user_info VALUES (NULL, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)',
        (mid, name, sex, rank, face, regtime, spacesta, birthday, sign, level,
         OfficialVerifyType, OfficialVerifyDesc, vipType, vipStatus,
         toutu, toutuId, coins))
    conn.commit()
except Exception as e:
    print('Failed to store into the database', e, url)
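Passing values as query parameters lets the driver handle quoting, which matters for free-text fields such as `sign`. Below is a minimal sketch of building such a statement from a dict. `build_insert` is a hypothetical helper, and the `%s` placeholder style is the one PyMySQL's `cursor.execute` accepts.

```python
def build_insert(table, row):
    """Build a parameterized INSERT statement and its argument tuple."""
    columns = list(row)
    # Table and column names are interpolated directly, so they must come
    # from trusted code, never from crawled data.
    column_sql = ', '.join('`{}`'.format(c) for c in columns)
    placeholders = ', '.join(['%s'] * len(columns))
    sql = 'INSERT INTO `{}` ({}) VALUES ({})'.format(
        table, column_sql, placeholders)
    return sql, tuple(row[c] for c in columns)


sql, args = build_insert('bilibili_user_info',
                         {'mid': 100, 'name': 'demo', 'sign': 'it\'s "me"'})
print(sql)
print(args)
# With an open PyMySQL cursor this would be run as: cur.execute(sql, args)
```

Naming the columns explicitly also means the `id` column can simply be left out, so MySQL fills it in through `AUTO_INCREMENT`.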
That covers every step of the task.
Combining the code above into one script makes large-scale crawling possible.
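One way to combine the steps is to map a fetch function over the URL list with a small thread pool. This is only a sketch under stated assumptions: `crawl_all` and `fake_fetch` are hypothetical names, and the stand-in fetcher makes no network request. In practice a function like `getSource` would be passed in instead.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, workers=4):
    """Call fetch(url) for every URL and return {url: result}; failures map to None."""
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as e:  # keep going when a single user fails
            print('failed', url, e)
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with urls.
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url):
    """Stand-in for the real request: returns the trailing ID of the URL."""
    return int(url.rsplit('/', 1)[1])


urls = ['https://space.bilibili.com/' + str(i) for i in range(100, 105)]
print(crawl_all(urls, fake_fetch))
```

Threads suit this workload because each request spends most of its time waiting on the network. Keeping `workers` small also reduces the chance of the proxies or the site rejecting the traffic.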
The resulting database looks like this:
This completes the crawl of registered users' information.