Mini Program Cloud Development Tutorial, Part 2: Getting the Data (Web Scraper)

Latest recommended article published on 2024-09-19 15:25:50

JohnnyLiao_WJ

Categories: Mini Program, python. Tags: mini program, cloud function, crawler

Article link: https://blog.csdn.net/u013338742/article/details/82843099


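Where does the data come from? That is easy to solve: take some scraper code from the web, adapt it, and add the fields we want.
We will use a Zhihu article on scraping Qiubai (qiushibaike.com) as the reference: https://zhuanlan.zhihu.com/p/37626163
We take its code as-is and make two changes: import the time module (to delay requests and go easy on the Qiubai site), and add a few values we want to display before each record is written to the database.
First, a look at the overall data structure:
[Image: overall data structure]
Prerequisite: a working Python environment. Mine is Python 2.7.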
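
The shape of each stored record can be read off from the scraper script itself. As a minimal sketch, the fields are collected below in one place; `make_item` is a hypothetical helper name used only for illustration and does not appear in the original script, and the sample values are made up.

```python
def make_item(username, content, vote, image, item_id):
    """Build one record in the shape the scraper stores (hypothetical helper)."""
    return {
        'username': username,  # author name shown on the post
        'content': content,    # text of the joke
        'vote': vote,          # upvote count as scraped (a string)
        'image': image,        # list of image URLs, may be empty
        'id': item_id,         # sequential id, used later for paging
        'shareNum': 0,         # share count, starts at zero
        'comment': '',         # comment text, empty at first
        'commentNum': 0,       # number of commenters, starts at zero
    }


# Made-up sample values, only to show the resulting structure
sample = make_item('someone', 'a short joke', '123', [], 0)
print(sample)
```

The first four fields come from the scraped page; the last four are the extra values added for the mini program to display and update later.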

The code for qiubai.py is as follows:

#!/usr/bin/python
#-*- coding:utf-8 -*-

import pymongo
import requests
from lxml import etree
import time
import random

ID = 0

def getPage(url):
    # Build the request headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    time.sleep(random.randint(1, 5))  # Random delay of 1-5 seconds between requests

    if response.status_code == 200:
        response.encoding = 'UTF-8'
        return response.text
    else:
        print('err')
        return None

def parsePage(html):
    global ID
    print(ID)
    html_lxml = etree.HTML(html)
    # Each post lives in a div whose id contains "qiushi_tag"
    datas = html_lxml.xpath('//div[contains(@id, "qiushi_tag")]')

    for data in datas:
        username = data.xpath('.//h2')[0].text.strip()

        content = data.xpath('.//div[@class="content"]/span')[0].text.strip()

        # Scraped comment count; not stored in the record below
        comments = data.xpath('.//span[@class="stats-comments"]/a/i')[0].text

        vote = data.xpath('.//span[@class="stats-vote"]/i')[0].text

        # List of image URLs; empty when the post has no image
        image = data.xpath('.//div[@class="thumb"]/a/img/@src')

        item = {
            'username': username,
            'content': content,
            'vote': vote,  # Number of upvotes
            'image': image,  # Image links, not always present
            'id': ID,  # id needed later for paging; WeChat's skip function would actually be the better choice
            'shareNum': 0,  # Number of shares
            'comment': '',  # Comments
            'commentNum': 0  # Number of commenters
        }
        ID += 1
        print(item)
        print(ID)
        insertMongoDB(item)


def insertMongoDB(item):
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.qiubai
    colle = db.duanzi
    # insert_one replaces Collection.insert, which is deprecated since PyMongo 3.0
    result = colle.insert_one(item)
    print('Saved successfully')


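def main(num):
    # num is the number of pages to scrape
    for index in range(num):
        # Other listings that can be scraped the same way:
        # https://www.qiushibaike.com/8hr/page/
        # https://www.qiushibaike.com/hot/page/
        # Page numbers are assumed to start at 1
        url = 'https://www.qiushibaike.com/imgrank/page/' + str(index + 1) + '/'
        html = getPage(url)
        if html:  # Skip pages that failed to download
            parsePage(html)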


if __name__ == '__main__':
    # Qiubai only has 13 pages; scrape just 1 here
    main(1)
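
One fragile spot in the script is the pattern `data.xpath(...)[0].text.strip()`: it raises an exception whenever a post lacks the element or the element has no text. A minimal sketch of a more forgiving lookup follows. `first_text` and `FakeNode` are hypothetical names introduced here for illustration (they are not in the original script), and `FakeNode` only stands in for an lxml element so the example runs without downloading a page.

```python
def first_text(nodes, default=''):
    """Return the stripped text of the first node, or `default`.

    `nodes` is the list returned by an xpath() call; it may be empty,
    and the first element's `text` attribute may be None.
    """
    if not nodes:
        return default
    text = getattr(nodes[0], 'text', None)
    if text is None:
        return default
    return text.strip()


class FakeNode(object):
    """Stand-in for an lxml element, used only in this demonstration."""
    def __init__(self, text):
        self.text = text


print(first_text([FakeNode('  some user \n')]))  # element found, text stripped
print(first_text([]))                            # element missing
print(first_text([FakeNode(None)], 'n/a'))       # element present but empty
```

Inside `parsePage`, a call such as `username = first_text(data.xpath('.//h2'))` could then take the place of the indexed lookup, so one malformed post does not stop the whole run.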

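Next, download Studio 3T and choose the free trial when installing.

Then, in the folder containing qiubai.py, hold Shift and right-click, and choose the option to open a command window in that folder.
Enter: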