Scrape social news from Reddit

/r/news
Reddit is a news aggregator and discussion forum with approximately 1.1 billion users worldwide as of 2022, making it one of the most-visited websites on the internet.

Reddit is divided into “subreddits”, or communities of interest, where users share links to relevant articles and comment on them. “Good” links get voted to the top, while “bad” or irrelevant links receive less visibility.

Today we will explore /r/news - a leading community for news on Reddit. We will compile a news dataset that can be used for qualitative analysis.

Other similar subreddits include /r/worldnews, /r/politics, etc., which can be a good source of more focused links and discussions.

Step 1: Installing packages

For this project, we will need two third-party libraries: pmaw, a wrapper/helper around the Pushshift API (a continuously updated archive of snapshots of Reddit submissions and comments), and newspaper3k, which will help us extract information from online articles, e.g. authors, publish date, text, and top image.

You can learn more about these two packages here:
https://github.com/mattpodolak/pmaw
https://pypi.org/project/newspaper3k/#description

pip install pmaw newspaper3k

If you are working inside a Jupyter Notebook environment, please use the following command instead:

!pip install pmaw newspaper3k

Step 2: Writing the program

First, we import the datetime package and give it the short-hand name dt, as per convention. Because computers typically don’t understand human-readable dates such as “April 26”, we will use the module’s datetime class to convert dates and times into a computer-readable format that we can later pass to the API.

Second, we import the pprint function from the module of the same name. Although similar to the built-in print function, it will help us better visualize (“pretty print”) the structured data (in our case, a list of dictionaries) that we receive from the API.

import datetime as dt
from pprint import pprint
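
To see the difference, here is a quick comparison on a made-up nested dictionary (the data below is purely illustrative and not part of the tutorial's dataset):

example = {"author": "someuser", "score": 42, "media": {"type": "link", "domain": "example.com"}}

print(example)   # everything printed on one long line
pprint(example)  # wrapped and with keys sorted, which is easier to read for large nested structures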

Then, we import the PushshiftAPI class from pmaw, instantiate it, and assign the instance to a variable called api.

PushshiftAPI has two methods that let us perform searches: search_comments and search_submissions. Because we are looking to collect links to news articles, this type of search falls under the submissions category.

If we look at the documentation, we can see that these two methods accept a number of parameters. In our case, we can supply the minimum needed: q (our query/search term), before (fetch results before a certain date), after (fetch results after a certain date), subreddit, and limit (the number of submissions to return - we will set it to 10 for demonstration purposes, but you can omit this parameter if you want to receive all of the submissions available).

First, we need to construct a query. In this tutorial we will be looking at submissions that have the word “vaccine”, or one of its variations, in the title. Pushshift allows us to perform complex queries that combine or exclude certain terms. For example, because the word “vaccine” can appear in different forms and contexts such as vaccination, vax, antivax, etc., we can instruct Pushshift to search for any of these terms with the logical OR operator |. Other operators include AND + (e.g. vaccine+Uzbekistan if we want to search for mentions of vaccine in the context of Uzbekistan) and negation - (e.g. vax-antivax if we want to search for “vax” but NOT “antivax”). For a full list of operators please refer to this page: https://reddit-api.readthedocs.io/en/latest/#comment-attribute-parameters
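
To make the operators concrete, here are a few illustrative query strings (these particular searches are just examples taken from the explanation above, not the query we will actually use):

# Logical OR: match submissions containing either term
q_or = "vaccine|vax"

# Logical AND: match submissions mentioning both "vaccine" and "Uzbekistan"
q_and = "vaccine+Uzbekistan"

# Negation: match "vax" but NOT "antivax"
q_not = "vax-antivax"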

In this case, I would like to look for every possible variation of the word, so we end up with the following query:
vaccine|vaccination|vaccinate|vaccinates|vaccinating|vaccinated|vax|vaxx|antivax|antivaxx|anti-vax|anti-vaxx

I would also like to limit the search to a certain time period, e.g. the year 2021. However, as I have mentioned before, the Pushshift API doesn’t understand human-readable dates - we need to provide a UNIX timestamp, i.e. the number of seconds that have elapsed since January 1, 1970. If we take our “after” date (January 1, 2021 at 00:00:00) and convert it to a timestamp, we get 1609430400. We do the same for our “before” date (December 31, 2021 at 23:59:59) and get 1640966399 - a difference of 31,535,999 seconds, i.e. one second short of a whole (non-leap) year. Note that Python converts a naive datetime to a timestamp using your local timezone, so the exact numbers you get may differ from the ones shown here.

Conveniently, the datetime module can help us with that: int(dt.datetime(2021, 1, 1, 0, 0, 0).timestamp()).
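
As a quick sanity check, we can compute both timestamps and confirm the difference mentioned above (remember that the exact printed values depend on your local timezone):

import datetime as dt

after_ts = int(dt.datetime(2021, 1, 1, 0, 0, 0).timestamp())
before_ts = int(dt.datetime(2021, 12, 31, 23, 59, 59).timestamp())

print(after_ts, before_ts)   # e.g. 1609430400 1640966399
print(before_ts - after_ts)  # 31535999 seconds, one second short of a non-leap year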

from pmaw import PushshiftAPI

api = PushshiftAPI()

q = "vaccine|vaccination|vaccinate|vaccinates|vaccinating|vaccinated|vax|vaxx|antivax|antivaxx|anti-vax|anti-vaxx"

submissions = api.search_submissions(
    q=q,  # query
    after=int(dt.datetime(2021, 1, 1, 0, 0, 0).timestamp()),
    before=int(dt.datetime(2021, 12, 31, 23, 59, 59).timestamp()),
    subreddit="news",
    limit=10,
)

# We convert the special Response object to a list of dictionaries that is easier to work with
submissions = list(submissions)

pprint(submissions[0])  # print first result

If we look at the first result in the list, we can see that it is a dictionary that contains keys and values with information about a submission, such as author, view count, etc. Because we are only interested in article links, the key that we are looking for is url.

urls = set()

for submission in submissions:
    urls.add(submission["url"])

pprint(urls)

We loop through the list of submissions, extract the url key from every submission dictionary, and add it to a set.

There is a chance that we will get duplicate URLs in submissions (for example, if two people post the same link under different titles). To mitigate that, we define an empty set that de-duplicates the links as they are added to it. In Python, a set is a data structure with traits of both a list and a dictionary - you can think of it as a list that can only hold unique values, just like a dictionary can only have unique keys. Adding a value that is already present in a set has no effect, so each URL ends up in the set exactly once.
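
A quick illustration of how a set silently drops duplicates (using a made-up URL):

urls_demo = set()
urls_demo.add("https://example.com/story")
urls_demo.add("https://example.com/story")  # adding the same URL again does nothing
print(len(urls_demo))  # 1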

Step 3: Extracting information from an article

In the previous step, we scraped Reddit for news articles and got a set of unique URLs that we can now feed to the newspaper package and compile a dataset for qualitative analysis.

First, we import the Article class from the package. Then we loop through the set of URLs and parse each article by supplying its URL to Article. We prepare an empty “data” list that will hold the extracted article information in a dictionary format I unimaginatively called “datum”. We use a list of dictionaries because it will be very easy to convert that structure to a Pandas dataframe in the next step.

Because article is an object, an instance of the Article class, we access the extracted information as properties (e.g. article.title). We then assign these properties to the respective dictionary keys (e.g. datum["title"]) and append each dictionary to data. We also add the original article url to the dictionary for more context.

from newspaper import Article, ArticleException

data = []
for url in urls:
    article = Article(url)

    # Fetch the article HTML (parse() below raises ArticleException if the download failed)
    article.download()

    try:
        article.parse()
    except ArticleException:
        # The article could not be downloaded or parsed; its fields remain empty
        pass

    datum = dict()

    # Copy the extracted properties of the Article object into a dictionary
    datum["title"] = article.title
    datum["authors"] = article.authors
    datum["top_image"] = article.top_image
    datum["text"] = article.text
    datum["publish_date"] = article.publish_date

    # Keep the original URL for context
    datum["url"] = url

    data.append(datum)

Sometimes, when there is a connection error or the article that you are trying to download has been deleted (especially if you are dealing with historical data dating back several years), the article.parse method can raise an ArticleException. To account for this error, we first need to import the exception definition from the newspaper module. Then we put our article.parse call inside a try... except... block and pass, i.e. ignore, the error. In this case, article fields such as “title” and “authors” will be empty (see below).
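
As a small variation (this is my own suggestion, not part of the code above), you could use continue instead of pass so that failed articles are skipped entirely and never produce a mostly-empty row:

from newspaper import Article, ArticleException

data = []
for url in urls:
    article = Article(url)
    article.download()
    try:
        article.parse()
    except ArticleException:
        continue  # skip this URL instead of keeping an empty record
    data.append({
        "title": article.title,
        "authors": article.authors,
        "top_image": article.top_image,
        "text": article.text,
        "publish_date": article.publish_date,
        "url": url,
    })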

Step 4: Creating a Pandas dataframe

In this step we will create a table representation of our data that we can inspect, clean, and analyze. We use the pandas.DataFrame constructor to read our list of dictionaries, and print the table.

import pandas as pd

df = pd.DataFrame(data)

# Use print(df) if you are running this as a plain script;
# in a Jupyter Notebook, simply evaluating df displays the table.
df

As we can see, the newspaper package did a good job extracting article features such as title, text, and top image. However, because different websites have different, nonuniform structures, extracting correct author and date information is a difficult task, and the result may not always be perfect. If we look at rows 0 and 6, we can see that the author column contains words like “Https” and “December”, which are not valid names. Moreover, some publish_date values are missing outright, and for any article that raised an ArticleException, every column except url is empty as well.
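
To get a rough idea of how many rows are affected, we can run a couple of quick checks (a hypothetical inspection using standard pandas calls; the column names are the ones we created above):

print(df["publish_date"].isna().sum())  # rows where no publish date was extracted
print((df["title"] == "").sum())        # rows where parsing failed and the title is empty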

Now we can save the dataset to a CSV file. CSV stands for Comma-Separated Values, which means that the default sep (separator) is a comma. However, because dataset columns such as authors and text are very likely to contain commas, we set the delimiter to the tab character \t, which is very unlikely to be present in our data.

df.to_csv("data.csv", sep="\t")
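
If you later want to load the file back into pandas, remember to pass the same separator (a minimal sketch, assuming the data.csv file created above):

import pandas as pd

df = pd.read_csv("data.csv", sep="\t", index_col=0)  # index_col=0 restores the saved row index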

In the next tutorial, we will look at how to remove invalid information from this dataset.
