Automating Data Collection from Reddit to Invest in Stocks


NO INVESTMENT ADVICE. The Content is for informational purposes only.


During the COVID-19 pandemic lockdown, many new retail investors jumped into stock trading, partly because there wasn't much else to do, but also because electronic brokers like Robinhood and TD Ameritrade began offering low or even zero commissions on stock trades. These factors, along with the boredom of staying at home, have been the main drivers of the abnormally high levels of retail trading.


Among these 'new' traders, the group that has seen the largest growth is Millennials. Many spent their free time on Reddit, where subreddits like r/wallstreetbets lured users into 'getting rich quick' by trading options. The subreddit's popularity grew enormously during the pandemic, as anonymous users posted large profits and made the whole process look deceptively easy. Interestingly enough, users who lost money were treated as heroes and a source of memes.


r/wallstreetbets has many daily posts on several different topics related to stocks and 'investing'. Among them are specific posts called 'DD', which stands for Due Diligence. DD posts aim to research a specific stock or group of stocks before an investment decision is made. People upvote the DD's they consider useful or interesting, or in some cases, posts that are simply 'funny'. This data, however, is not necessarily of high quality. Most users on the platform are amateur traders, and the community is well known for taking high-risk trades (also known as YOLO's). On the other hand, many experts are analyzing these recommendations, and the market has seen price movements related to r/wallstreetbets activity that have been attributed to this community: https://www.bloomberg.com/news/articles/2020-09-15/big-investors-are-dying-to-know-what-the-little-guys-are-doing


This means that in some cases the recommendations might be useful. Our goal is to analyze these posts to see whether there is any consensus about trade activity coming from them, and whether they help make investment decisions. For this we will use the PRAW package for Python and build the workflow to extract and analyze the data.


PRAW for Python

PRAW, an acronym for “Python Reddit API Wrapper”, is a python package that allows for simple access to reddit’s API. PRAW aims to be as easy to use as possible and is designed to follow all of reddit’s API rules.


PRAW has limitations due to the data that Reddit provides, which restricts the number of queries and the time frame from which we can pull data. For this use case, however, we can automate the code to run daily and expand our library of content as we keep applying the script.


In order to use PRAW, we first need to open an account on Reddit and also create an app on their platform. Here we will assume you know how to open a Reddit account, but if not, I recommend going here and opening your account: https://upcity.com/blog/how-to-create-an-account-and-recommend-content-on-reddit/


Creating a Reddit App

As we mentioned, the second step is to create the app. In order to do this, we first need to log in to our Reddit account and then access this page: https://www.reddit.com/prefs/apps



Then, click on the button <are you a developer? create an app…>. Next, fill out the form:


1- Create your own name for the app.
2- Pick 'script' as the app type, so that you can run it from your own computer.
3- Add a description for your app (optional).
4- 'About URL' is a URL where you host the documentation for your app (optional).
5- 'redirect uri' is the location the authorization server sends the user to once the app has been successfully authorized and granted an authorization code or access token. Because you are going to use it on your own computer, use http://localhost:8080



Now, click on <create app> and you should get a personal use script (14 characters) and a secret key (24 characters).



Save those keys and make sure you don't share them. Now you can use them in your Python code.


Using PRAW

Now we are ready to start downloading data from Reddit. First, open a Python notebook and make sure you have PRAW installed. Let's first import PRAW and the other libraries we will be using.


# Pip install praw. Uncomment if you don't already have the package
# !pip install praw

# Imports
import praw # imports praw for reddit access
import pandas as pd # imports pandas for data manipulation
import datetime as dt # imports datetime to deal with dates

Next, let's access our Reddit app using our usual login credentials, the client id (the 14-character key), and the secret key (the 24-character key).


reddit = praw.Reddit(client_id='Your_14_character_client_id',
                     client_secret='Your_24_character_secret_key',
                     user_agent='Your_api_name',
                     username='Your_Reddit_user_name',
                     password='Your_Reddit_password')
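A quick way to confirm the credentials were picked up correctly (a minimal sketch; the call simply asks Reddit which account is authenticated) is:

# Sanity check: prints your Reddit username if authentication succeeded
print(reddit.user.me())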

The object we called 'reddit' is a handle that connects us to the Reddit site. Now we need to access the subreddit from which we want to pull data. In this case, r/wallstreetbets.


# Access subreddit r/wallstreetbets
subreddit = reddit.subreddit('wallstreetbets')

Finally, within the subreddit we need to filter the content we want to see from all the posts on the site. PRAW allows you to pull posts in several different ways: .hot, .new, .controversial, .gilded, .search and .top. To find out more about these access methods, we recommend referring to the documentation: https://praw.readthedocs.io/en/latest/code_overview/models/multireddit.html?highlight=.gilded#praw.models.Multireddit.gilded

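As a quick illustration of how these methods work, here is a small sketch using .hot (the other methods follow the same pattern):

# Print the titles and scores of the 5 current 'hot' posts in the subreddit
for post in subreddit.hot(limit=5):
    print(post.title, post.score)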

Since 'DD' is a specific flair (the equivalent of a tag in other forum formats), we search by flair. We want to sort the data by date, and we are going to pull the latest 100 posts.


# Pull latest 100 posts with flair 'DD' sorted from newest to oldest
DD_subreddit = subreddit.search('flair:"DD"', limit=100, sort='new')

If we inspect this object, we see it is a ListingGenerator, which means the results are yielded as a sequence of submission objects. We can choose which variables we want to keep in our analysis by inspecting the attributes of the posts it returns. We show an example below of how to do this.


# pprint is part of the Python standard library, so no installation is needed
import pprint

# Loop through the variable names in a post
for posts in DD_subreddit:
    pprint.pprint(vars(posts))

The output will show something like this. Be careful: DD_subreddit is a generator, so iterating over it consumes it. After running this loop, the script statements that follow will return nothing, and we need to invoke the subreddit.search() statement again in order to get results.


[Image: Sample output from the ListingGenerator]
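One way to avoid re-running the search (a small sketch, not part of the original walkthrough; DD_posts is just an illustrative name) is to materialize the generator into a plain list once and reuse that list in the later loops:

# Consume the generator once and keep the submissions in a regular list
DD_posts = list(subreddit.search('flair:"DD"', limit=100, sort='new'))

# The list can be iterated as many times as needed
for posts in DD_posts:
    pprint.pprint(vars(posts))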

Finally we need to convert our data pull to tabular form so that we can manipulate the text. To do this, we create a dictionary where we store the posts we retrieved.


# Create a dictionary with the variables we want to save
DD_dict = {"title": [],
           "score": [],
           "id": [],
           "url": [],
           "comms_num": [],
           "date": [],
           "body": []}

# We now loop through the posts we collected and store the data
for posts in DD_subreddit:
    DD_dict["title"].append(posts.title)
    DD_dict["score"].append(posts.score)
    DD_dict["id"].append(posts.id)
    DD_dict["url"].append(posts.url)
    DD_dict["comms_num"].append(posts.num_comments)
    DD_dict["date"].append(posts.created)
    DD_dict["body"].append(posts.selftext)

We are almost done. There is a minor fix we need to make to the date variable: the 'created' field is a numeric timestamp, so we convert it to a proper date using the datetime library.


# First convert dictionary to DataFrame
DD_data = pd.DataFrame(DD_dict)

# Function takes a numeric timestamp and converts it to a date
def get_date(date):
    return dt.datetime.fromtimestamp(date)

# We run this function and save the result in a new object
_date = DD_data["date"].apply(get_date)

# We replace the previous date variable with the new date variable
DD_data = DD_data.assign(date=_date)

# Let's check the output table
DD_data
[Image: Our data table with the latest 100 stock posts]
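Since the plan mentioned earlier is to run this pull daily and expand our library of content, a minimal sketch of that step (the file name reddit_dd_library.csv is just an example) could append each day's table to a growing CSV, keeping one row per post id:

import os  # used to check whether the library file already exists

# Append today's pull to a growing CSV and de-duplicate on the post id
library_file = 'reddit_dd_library.csv'
if os.path.exists(library_file):
    library = pd.concat([pd.read_csv(library_file), DD_data])
else:
    library = DD_data
library = library.drop_duplicates(subset='id')
library.to_csv(library_file, index=False)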

Data Manipulation

And now it is time for the fun part. Given that this section is more involved in terms of the script used, we will omit the code, but the entirety of the code used here can be found in the repository.


We can take this data and perform an exploratory analysis to see how these 'traders' operate. First, we are going to look at how the DD posts break out over time. To do this, we index the data by date and then plot the time series to see which day had the most posts.

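A minimal sketch of the idea (assuming matplotlib is available; the full version is in the repository) is to count posts per calendar day and plot that series:

import matplotlib.pyplot as plt  # plotting library

# Count the number of DD posts per calendar day and plot the series
posts_per_day = DD_data.groupby(DD_data['date'].dt.date).size()
posts_per_day.plot(kind='bar', title='DD posts per day')
plt.show()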

[Image: DD posts over time]

We can also look at how the market performed during the same time period to compare it with the activity in these DD's.


[Image: SP500 performance in the latest 10 days (Source: Marketwatch)]

We can guess that as the S&P 500 index built momentum from September 10th through September 15th, speculation about further growth continued. However, as the market declined in the following days, the number of DD's declined as well. Could this be because r/wallstreetbets is mainly 'bullish' on the market? Let's test this hypothesis.


Let's create a word cloud from the DD's we have downloaded to see which words are most predominant in these posts.

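A rough sketch of how such a word cloud could be generated (assuming the wordcloud package is installed; the exact code is in the repository):

from wordcloud import WordCloud  # word cloud generator
import matplotlib.pyplot as plt

# Join the text of all DD posts and build a word cloud from it
all_text = ' '.join(DD_data['body'].fillna(''))
wc = WordCloud(width=800, height=400, background_color='white').generate(all_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()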

[Image: Word cloud from the last 100 DD's from r/wallstreetbets]

Here we see that the main sentiment of these posts suggests that the most commented stocks are Nikola ($NKLA) and Tesla ($TSLA). It might be that the underlying trend is to buy, but since the key terms we see relate to both sides of options trades, calls and puts, both actions seem to be relevant in these posts. So we need to find out when these terms are being used in relation to the stocks we are seeing. For simplicity we will stick to Nikola and Tesla. And just for the reader's understanding, the term 'tendie' refers to the profit or reward once a trade has been successful.


So now we can filter the posts that contain a recommendation on Nikola or Tesla and determine whether the recommendations on any given day are mostly buy or sell. Since Reddit users give 'Karma' points to posts they consider useful, we can use that score to weight the strength of each post's direction. We can show this by date in order to see how these posts progress over time. This analysis is very simplistic: it uses words in the post such as 'bullish' or 'buy' to estimate the trade recommendation.

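The full code lives in the repository, but a simplified sketch of this keyword-based scoring (the word lists and the ticker filter below are illustrative assumptions, not the exact lists used) could look like this:

# Illustrative keyword lists; the real analysis may use different terms
buy_words = ['buy', 'bullish', 'calls', 'long']
sell_words = ['sell', 'bearish', 'puts', 'short']

def trade_signal(text):
    # Positive score = more 'buy' words, negative score = more 'sell' words
    text = text.lower()
    return sum(text.count(w) for w in buy_words) - sum(text.count(w) for w in sell_words)

# Keep only posts that mention Nikola or Tesla
mask = DD_data['body'].str.contains('NKLA|TSLA|Nikola|Tesla', case=False, na=False)
tickers = DD_data[mask].copy()

# Weight each post's signal by its Karma score and aggregate by day
tickers['signal'] = tickers['body'].apply(trade_signal) * tickers['score']
daily_signal = tickers.groupby(tickers['date'].dt.date)['signal'].sum()
print(daily_signal)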

[Image: Daily buy/sell recommendation strength for NKLA and TSLA]

As we can see, on September 10th the prevailing view was mainly to sell these two companies. However, as we approached September 14th, optimism grew. Following this, a higher volume of posts recommended strategies buying and selling both stocks, and finally, after September 17th, no real consensus on whether to buy or sell these stocks was reached. For reference, this was the performance of both stocks during the same time period.


[Image: Price trend for TSLA (dark blue) and NKLA (light blue) (Source: Yahoo! Finance)]

As we can see, there is a fairly high correlation between price movement and DD postings. Does this mean traders on r/wallstreetbets are geniuses? The answer is clearly not. This correlation simply means that when prices move in one direction, users are more likely to upvote a DD post that is in line with the market's current daily trend.

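To put a rough number on that relationship, here is a sketch of one possible check (assuming the yfinance package; TSLA and the September dates are used purely as an example, and daily_signal comes from the sketch above):

import yfinance as yf  # convenience wrapper for Yahoo! Finance data

# Daily TSLA closing prices over the period covered by our posts
prices = yf.download('TSLA', start='2020-09-10', end='2020-09-18')['Close'].squeeze()
returns = prices.pct_change().dropna()

# Align daily returns with the daily buy/sell signal computed earlier
returns.index = returns.index.date
combined = pd.concat([returns.rename('return'), daily_signal.rename('signal')], axis=1).dropna()
print(combined.corr())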

This, of course, is only the beginning. In order to do a more in-depth analysis, we should work on better understanding the body and sentiment of each post so that we can extract better insights from these posts. But these initial steps should get you going using Reddit for your investment analyses.


Conclusion

We can see there is some consensus coming from the DD posts, and we could make decisions based mainly on these posts. However, the correlation follows the direction the market is already moving that day, which makes these recommendations a little suspicious. This is a big limitation of the process, since the posts that get the most upvotes may simply reflect the market euphoria of that day.


There are no major implications in terms of ethical considerations, given that, from a personal-information point of view, most users on Reddit are anonymous. But it could be argued that in some cases this information could be used by corporations to profit from the trade decisions of these users. Other aspects such as harassment are uncommon in this subreddit given the nature of the application, but people using these APIs should consider deleting any personal identification they find and deleting non-aggregated data after it has been processed for the application.

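For instance, a small sketch of stripping Reddit usernames out of the collected text before storing it (the regular expression below only covers u/username mentions and is an illustrative assumption, not an exhaustive scrub):

import re  # regular expressions for simple text scrubbing

def scrub_usernames(text):
    # Replace 'u/username' mentions with a neutral placeholder
    return re.sub(r'\bu/[A-Za-z0-9_-]+', '[user]', text)

DD_data['body'] = DD_data['body'].apply(scrub_usernames)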

Closing Thoughts

Investing is a very complex task that takes a lot of practice and training to master. And even after understanding companies and the market, it can be a treacherous process. So the obvious question is: should we rely solely on the opinions of people on the internet in order to invest? You probably answered a sound no, but taking into consideration different data points, including this one, could improve your investment model.


By the way, if your answer was actually yes, then you surely belong in r/wallstreetbets. Go check it out!


Translated from: https://medium.com/social-media-theories-ethics-and-analytics/automating-data-collection-from-reddit-to-invest-in-stocks-2c86fe365db9
