Python 3 Web Scraping Beginner Notes (2): Scraping Reddit with PRAW and the API

Letitia96

Categories: python3, crawlers · Tags: python3, crawlers, reddit

Article link: https://blog.csdn.net/Letitia96/article/details/88525429

I. Reddit

1. Preparation

API · Reddit: read the documentation.
OAuth2 · Reddit: create an app to obtain a client id and secret.

Acquire a token:

import requests
import requests.auth

client_auth = requests.auth.HTTPBasicAuth('client_id','secret')
post_data = {"grant_type": "password", "username": "XXX", "password": "XXX"}
headers = {"User-Agent": "ChangeMeClient/0.1 by YourUsername"}
response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
print(response.json())

Result (the token expires; expires_in is given in seconds):

{'access_token': 'XXX', 'token_type': 'bearer', 'expires_in': 3600, 'scope': '*'}
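
As a minimal sketch of how this response can be used afterwards, the helpers below build the Authorization header from the token dictionary and compute when the token stops being valid. The names build_auth_header and token_expiry are hypothetical, and the dictionary simply mirrors the example response above.

```python
import time


def build_auth_header(token):
    """Return the Authorization header for a token response dict."""
    # The header value is "<token_type> <access_token>", e.g. "bearer XXX".
    return {"Authorization": "{} {}".format(token["token_type"], token["access_token"])}


def token_expiry(token, obtained_at):
    """Return the Unix time at which the token expires (expires_in is in seconds)."""
    return obtained_at + token["expires_in"]


if __name__ == "__main__":
    token = {"access_token": "XXX", "token_type": "bearer", "expires_in": 3600, "scope": "*"}
    obtained_at = time.time()
    print(build_auth_header(token))
    print("expired:", time.time() >= token_expiry(token, obtained_at))
```

Once the expiry time has passed, the token request has to be repeated to get a fresh access_token.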

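2. Table structure

Field	Type	Description
id	int	Unique identifier
url	varchar(255)
url_md5	varchar(255)
title	varchar(255)
author	varchar(255)
created_utc	datetime
selftext	text	The post's own text, or the hyperlink it points to
score	int	Score
num_comments	int	Number of comments
upvote_ratio	float	Ratio of upvotes

Note that titles, body text, and other fields may contain emoji, so the code and the database must both use the utf8mb4 encoding.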
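
To illustrate why utf8mb4 matters: MySQL's older utf8 character set stores at most three bytes per character, while most emoji take four bytes in UTF-8. The check below is a small sketch (the function name needs_utf8mb4 is hypothetical) that flags text containing such characters.

```python
def needs_utf8mb4(text):
    """Return True if any character in text takes four bytes in UTF-8."""
    # Characters outside the Basic Multilingual Plane (code point > 0xFFFF)
    # are encoded as four bytes, which a three-byte utf8 column cannot store.
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ["plain title", "中文标题", "launch day \U0001F680"]:
        print(repr(sample), needs_utf8mb4(sample))
```

Chinese characters fit in three bytes, so only the sample containing the emoji is flagged.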

3. Implementation

3.1 The API wrapper PRAW

PRAW: The Python Reddit API Wrapper. PRAW wraps the Reddit API and provides a Python interface for calling it.

Spider:

import datetime

import praw
import scrapy

# RedditItem is the project's scrapy.Item subclass, imported from the project's items module.


class RedditSpider(scrapy.Spider):
    name = "reddit"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com"
    ]

    def parse(self, response):
        # Authenticate with the client id and secret
        reddit = praw.Reddit(client_id='XXX', client_secret='XXX',
                             grant_type='client_credentials', user_agent='mytestscripts/1.0')

        """
        sub = reddit.submission(id='9klf7s')
        #print(sub.title)
        #pprint.pprint(vars(sub))
        """
        # subreddit.stream.submissions() can be used to watch a subreddit for new posts
        #subreddit = reddit.subreddit('dapps')
        #for sub in subreddit.stream.submissions():

        # limit=None removes the limit and fetches as many posts as Reddit returns; the default is 100
        # The set of attributes returned may differ from one request to the next
        subs = reddit.subreddit('dapps').new(limit=None)
        for sub in subs:
            item = RedditItem()

            item['html'] = response.body
            #print(item['html'])

            # permalink is only the path of the post on the site; join it with the site address to get the full link
            url = 'https://{}{}'.format(self.allowed_domains[0], sub.permalink)
            item['url'] = url
            # ......

            redditor = sub.author
            # The author may be None
            #print("author:", redditor.name)
            if redditor is not None:
                item['author'] = redditor.name
            else:
                item['author'] = ""

            # sub.created_utc is a UTC timestamp and must be converted to a datetime
            #print("created utc:", sub.created_utc)
            item['created_time'] = datetime.datetime.utcfromtimestamp(sub.created_utc)

            # If the post is only a hyperlink, sub.selftext is empty, so store the link instead
            item['selftext'] = sub.selftext
            if not sub.is_self:
                item['selftext'] = sub.url
            # ......

            yield item
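
The field handling in the spider can be tried without network access. The sketch below applies the same rules (permalink joining, empty author, timestamp conversion, link posts) to a stand-in object built with SimpleNamespace. The function name submission_to_row is hypothetical, and computing url_md5 as the hex MD5 digest of the URL is an assumption, since the original elides that step.

```python
import datetime
import hashlib
from types import SimpleNamespace


def submission_to_row(sub, domain="reddit.com"):
    """Map a submission-like object to a dict of column values."""
    url = "https://{}{}".format(domain, sub.permalink)
    created = datetime.datetime.fromtimestamp(sub.created_utc, tz=datetime.timezone.utc)
    return {
        "url": url,
        # Assumption: url_md5 is the hex MD5 digest of the URL.
        "url_md5": hashlib.md5(url.encode("utf-8")).hexdigest(),
        "title": sub.title,
        # The author may be None, as the spider above checks.
        "author": sub.author.name if sub.author is not None else "",
        "created_time": created.strftime("%Y-%m-%d %H:%M:%S"),
        # Link posts have an empty selftext, so store the target URL instead.
        "selftext": sub.selftext if sub.is_self else sub.url,
        "score": sub.score,
        "num_comments": sub.num_comments,
        "upvote_ratio": sub.upvote_ratio,
    }


if __name__ == "__main__":
    fake = SimpleNamespace(
        permalink="/r/dapps/comments/abc123/example/", title="Example",
        author=None, created_utc=1552550400, selftext="", is_self=False,
        url="https://example.com/", score=3, num_comments=1, upvote_ratio=0.8,
    )
    print(submission_to_row(fake))
```

The timestamp is converted with an explicit UTC timezone, which avoids depending on the local timezone of the machine running the spider.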

Pipeline:

import pymysql
from pymysql.converters import escape_string


class RedditPipeline(object):  # placeholder class name; the original omits the class header
    html_insert = '''insert into reddit_dapps_html(html) values('{html}')'''
    reddit_insert = '''insert into reddit_dapps(url, url_md5, title, author,
    created_time, selftext, score, num_comments, upvote_ratio)
                        values('{url}', '{url_md5}', '{title}', '{author}',
                        '{created_time}', '{selftext}', '{score}', '{num_comments}', '{upvote_ratio}')'''

    def process_item(self, item, spider):
        html = item['html']
        if html:
            item['html'] = html.strip().decode(encoding="utf-8")
        # ......

        # Format the creation time as a string
        created_time = item['created_time']
        if created_time:
            item['created_time'] = created_time.strftime("%Y-%m-%d %H:%M:%S")

        selftext = item['selftext']
        if selftext:
            item['selftext'] = selftext.replace('\n', '').replace('  ', ' ')
        # ......

        sqltext1 = self.html_insert.format(
            html=escape_string(item['html']))

        # score and the other numeric fields must be converted to strings first
        sqltext2 = self.reddit_insert.format(
            url=escape_string(item['url']),
            # ......
            score=escape_string(str(item['score'])),
            num_comments=escape_string(str(item['num_comments'])),
            upvote_ratio=escape_string(str(item['upvote_ratio'])))
        self.cursor.execute(sqltext1)
        self.cursor.execute(sqltext2)
        # Commit here unless autocommit is enabled on the connection

        return item

    def open_spider(self, spider):
        # Connect to the database, selecting the 'utf8mb4' character set
        # (self.settings is assumed to be populated elsewhere from the crawler settings)
        self.connect = pymysql.connect(
            host=self.settings.get('MYSQL_HOST'),
            port=self.settings.get('MYSQL_PORT'),
            database=self.settings.get('MYSQL_DBNAME'),
            user=self.settings.get('MYSQL_USER'),
            password=self.settings.get('MYSQL_PASSWD'),
            charset='utf8mb4',
            use_unicode=True)
        # Cursor used by process_item
        self.cursor = self.connect.cursor()

Reference: setting the database character set.
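
An alternative to formatting values into the SQL string is to pass them as query parameters and let the driver handle quoting. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server; the table is reduced to three columns and the function name insert_post is hypothetical. With pymysql the same pattern applies, but the placeholder is %s instead of ?.

```python
import sqlite3


def insert_post(conn, row):
    """Insert one post using query parameters instead of string formatting."""
    conn.execute(
        "insert into reddit_dapps(url, title, score) values (?, ?, ?)",
        (row["url"], row["title"], row["score"]),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("create table reddit_dapps(url text, title text, score integer)")
    # Quotes and emoji need no manual escaping when passed as parameters.
    insert_post(conn, {"url": "https://reddit.com/r/dapps/x",
                       "title": "it's live \U0001F680", "score": 5})
    print(conn.execute("select title, score from reddit_dapps").fetchall())
```

Numeric values such as score can be passed as they are, so the str() conversion needed for string formatting goes away.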

3.2 Calling the API directly (testing only)

Access requires an OAuth token.
reddit.com: api documentation

Many endpoints on reddit use the same protocol for controlling pagination and filtering. These endpoints are called Listings and share five common parameters: after / before, limit, count, and show.

Listings do not use page numbers because their content changes so frequently. Instead, they allow you to view slices of the underlying data. Listing JSON responses contain after and before fields which are equivalent to the “next” and “prev” buttons on the site and in combination with count can be used to page through the listing.

The common parameters are as follows:

after / before - only one should be specified. these indicate the fullname of an item in the listing to use as the anchor point of the slice.
limit - the maximum number of items to return in this slice of the listing.
count - the number of items already seen in this listing. on the html site, the builder uses this to determine when to give values for before and after in the response.
show - optional parameter; if all is passed, filters such as “hide links that I have voted on” will be disabled.

To page through a listing, start by fetching the first page without specifying values for after and count. The response will contain an after value which you can pass in the next request. It is a good idea, but not required, to send an updated value for count which should be the number of items already fetched.
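
The paging procedure described above can be sketched as a small generator. This is only an illustration under stated assumptions: iter_listing and fetch_page are hypothetical names, and fetch_page stands for any callable that takes the query parameters and returns the parsed JSON of one listing page. The demo feeds it a fake two-page listing instead of real HTTP responses.

```python
def iter_listing(fetch_page, limit=25, max_items=None):
    """Yield children from a listing, following `after` until it runs out.

    fetch_page(params) must return the parsed JSON of one listing page.
    """
    after = None
    count = 0
    while True:
        params = {"limit": limit}
        if after is not None:
            # Anchor the next slice after the last item seen and report
            # how many items have been fetched so far.
            params["after"] = after
            params["count"] = count
        data = fetch_page(params)["data"]
        for child in data["children"]:
            yield child
            count += 1
            if max_items is not None and count >= max_items:
                return
        after = data["after"]
        if after is None or not data["children"]:
            return


if __name__ == "__main__":
    # Fake two-page listing standing in for real HTTP responses.
    pages = {
        None: {"data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
        "t3_a": {"data": {"children": [{"data": {"name": "t3_b"}}], "after": None}},
    }

    def fake_fetch(params):
        return pages[params.get("after")]

    print([c["data"]["name"] for c in iter_listing(fake_fetch, limit=1)])
```

In real use, fetch_page would wrap an authenticated GET request to a listing endpoint and return response.json().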

import datetime

import requests

# Excerpt from inside a spider method, hence self.allowed_domains below.
# The Authorization value is "<token_type> <access_token>" from the token response.
slice_headers = {'Authorization': 'bearer XXX',
                 'User-Agent': 'ChangeMeClient/0.1 by YourUsername'}
print(slice_headers)

params = {'limit': '1'}  # limit the number of items returned per request
count = 1
while count < 3:
    response = requests.get("https://oauth.reddit.com/r/dapps/new", headers=slice_headers, params=params)
    count += 1  # count the requests so the test loop stops after two pages

    print(response.status_code)

    # Check the response status code
    if response.status_code == 200:
        response_json = response.json()
        #print(response_json)

        # The submission data is in json['data']['children']
        for child in response_json['data']['children']:
            print("submission json:", child)

            url = 'https://{}{}'.format(self.allowed_domains[0], child['data']['permalink'])
            print("url:", url)

            print("title:", child['data']['title'])

            print("author:", child['data']['author'])

            print("created time:", datetime.datetime.utcfromtimestamp(child['data']['created_utc']))

            if not child['data']['is_self']:
                print("self text:", child['data']['url'])
            else:
                print("self text:", child['data']['selftext'])

            print("score: ", child['data']['score'])
            print("num comments:", child['data']['num_comments'])

        # Use the returned "after" value as the anchor for the next request
        after = response_json['data']['after']
        if after is None:
            break
        params = {'limit': '1', 'after': after}

    else:
        print("null")
        break
