Reddit is a news aggregator and discussion forum with approximately 1.1 billion users worldwide in 2022. It is one of the most-visited websites online.
Reddit is divided into “subreddits”, or communities of interest, where users share links to relevant articles and comment on them. “Good” links get voted to the top, “bad” or irrelevant links enjoy lower visibility.
Today we will explore /r/news, a leading news community on Reddit. We will compile a news dataset that can be used for qualitative analysis.
Other similar subreddits include /r/worldnews, /r/politics, etc., which can be a good source of more focused links and discussions.
Step 1: Installing packages
For this project, we will need two third-party libraries: pmaw, a wrapper/helper around the Pushshift API (a continuously updated archive of snapshots of Reddit submissions and comments), and newspaper3k, which will help us extract information from online articles, e.g. authors, publish date, text, and top image.
You can learn more about these two packages here:
https://github.com/mattpodolak/pmaw
https://pypi.org/project/newspaper3k/#description
pip install pmaw newspaper3k
If you are working inside a Jupyter Notebook environment, please use the following command instead:
!pip install pmaw newspaper3k
Step 2: Writing the program
First, we import the datetime package and give it the shorthand name dt, as per convention. Because computers typically don’t understand human-readable dates such as “April 26”, we will use the datetime class of the dt module to convert dates and times into a computer-readable format to later use with the API.
Second, we import the pprint function. Although similar to the built-in print function, it will help us better visualize (“pretty print”) the structured data (in our case, a list of dictionaries) that we receive from the API.
import datetime as dt
from pprint import pprint
Then, we import the PushshiftAPI class from pmaw, instantiate it, and assign the instance to a variable named api.
PushshiftAPI has two methods that let us perform searches: search_comments and search_submissions. Because we are looking to collect links to news articles, this type of search falls under the submissions category.
If we look at the documentation, we can see that these two methods accept a number of parameters. In our case, we can supply the minimum needed: q (our query/search term), before (fetch results before a certain date), after (fetch results after a certain date), subreddit, and limit (the number of submissions to return; we will set it to 10 for demonstration purposes, but you can omit this parameter if you want to receive all of the results available).
First, we need to construct a query. In this tutorial we will be looking at submissions that have the word “vaccine” (and its variations) in the title. Pushshift allows us to perform complex queries and combine or exclude certain terms. For example, because the word “vaccine” can be used in different variations and contexts, such as vaccination, vax, antivax, etc., we can instruct Pushshift to search for any of these terms with the logical OR operator |. Other operators include AND + (e.g. vaccine+Uzbekistan if we want to search for mentions of vaccine in the context of Uzbekistan), negation - (e.g. vax-antivax if we want to search for “vax” but NOT “antivax”), etc. For a full list of operators please refer to this page: https://reddit-api.readthedocs.io/en/latest/#comment-attribute-parameters
In this case, I would like to look for every possible variation of the word, so we end up with the following query:
vaccine|vaccination|vaccinate|vaccinates|vaccinating|vaccinated|vax|vaxx|antivax|antivaxx|anti-vax|anti-vaxx
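A query string this long is easier to maintain if we build it from a list of variants and join them with the OR operator, rather than typing it out by hand. A quick sketch:

```python
# Build the Pushshift query by joining word variants with the OR operator "|"
variants = [
    "vaccine", "vaccination", "vaccinate", "vaccinates", "vaccinating",
    "vaccinated", "vax", "vaxx", "antivax", "antivaxx", "anti-vax", "anti-vaxx",
]
q = "|".join(variants)
print(q)
# vaccine|vaccination|vaccinate|vaccinates|vaccinating|vaccinated|vax|vaxx|antivax|antivaxx|anti-vax|anti-vaxx
```

This way, adding or removing a variant later is a one-line change.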
I would also like to limit the search to a certain time period, e.g. the year 2021. However, as I have mentioned before, PushshiftAPI doesn’t understand human-readable dates: we need to provide a UNIX timestamp, i.e. the number of seconds that have elapsed since January 1, 1970. If we take our “after” date (January 1, 2021 at 00:00:00) and convert it to a timestamp, we get 1609430400. We do the same for our “before” date (December 31, 2021 at 23:59:59) and get 1640966399 (the difference is 31,535,999 seconds, or one whole year). Conveniently, the datetime module can do this conversion for us: int(dt.datetime(2021, 1, 1, 0, 0, 0).timestamp()). Note that calling timestamp() on a naive datetime object interprets it in your local timezone, so the exact values you get may differ from the ones above.
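As a sanity check, we can compute the two timestamps ourselves. This sketch also shows the timezone-pinned variant, which produces the same numbers on every machine (the UTC values differ from the ones quoted above, which reflect the author’s local timezone):

```python
import datetime as dt

# Naive datetimes are interpreted in the local timezone, so these values
# vary by machine; the span between them is still one year minus one second.
after = int(dt.datetime(2021, 1, 1, 0, 0, 0).timestamp())
before = int(dt.datetime(2021, 12, 31, 23, 59, 59).timestamp())
print(before - after)  # 31535999 in most timezones

# Pinning the datetimes to UTC gives reproducible values everywhere.
after_utc = int(dt.datetime(2021, 1, 1, tzinfo=dt.timezone.utc).timestamp())
before_utc = int(dt.datetime(2021, 12, 31, 23, 59, 59, tzinfo=dt.timezone.utc).timestamp())
print(after_utc, before_utc)  # 1609459200 1640995199
```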
from pmaw import PushshiftAPI
api = PushshiftAPI()
q = "vaccine|vaccination|vaccinate|vaccinates|vaccinating|vaccinated|vax|vaxx|antivax|antivaxx|anti-vax|anti-vaxx"
submissions = api.search_submissions(
    q=q,  # query
    after=int(dt.datetime(2021, 1, 1, 0, 0, 0).timestamp()),
    before=int(dt.datetime(2021, 12, 31, 23, 59, 59).timestamp()),
    subreddit="news",
    limit=10,
)
# We convert the special Response object to a list of dictionaries that is easier to work with
submissions = list(submissions)
pprint(submissions[0]) # print first result
If we look at the first result in the list, we can see that it is a dictionary that contains keys and values with information about a submission, such as author, view count, etc. Because we are only interested in article links, the key that we are looking for is url.
urls = set()
for submission in submissions:
    urls.add(submission["url"])
pprint(urls)
We loop through the list of submissions, extract the url key from every submission dictionary, and add it to a set.
There is a chance that we can get duplicate URLs among the submissions (for example, if two people post the same link under different titles). To mitigate that, we define an empty set that de-duplicates the links as they are added. In Python, a set is a data structure similar in spirit to both a list and a dictionary: you can think of it as a list that can only hold unique values, just like a dictionary can only have unique keys. If we add a URL that is already in the set, the set simply remains unchanged, which de-duplicates the URLs for us.
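The de-duplication behaviour is easy to verify on its own. Using a couple of placeholder URLs, adding the same link twice leaves the set with a single copy:

```python
urls = set()
urls.add("https://example.com/story")
urls.add("https://example.com/story")  # duplicate: adding it again is a no-op
urls.add("https://example.com/other")
print(len(urls))  # 2
```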
Step 3: Extracting information from an article
In the previous step, we scraped Reddit for news articles and got a set of unique URLs that we can now feed to the newspaper package to compile a dataset for qualitative analysis.
First, we import the Article class from the package. Then we loop through the set of URLs, create an Article object for each URL, download it, and parse it. We prepare an empty “data” list that will hold the extracted article information in dictionary form (a dictionary I unimaginatively called “datum”). We use a list of dictionaries because it will be very easy to convert that structure to a Pandas dataframe in the next step.
Because article is an object, an instance of the Article class, we access the extracted information as properties (e.g. article.title). We then assign these properties to the respective dictionary keys (e.g. datum["title"]) and append each dictionary to data. We also add the original article url to the dictionary for more context.
from newspaper import Article, ArticleException
data = []
for url in urls:
    article = Article(url)
    article.download()
    try:
        article.parse()
    except ArticleException:
        pass
    datum = dict()
    datum["title"] = article.title
    datum["authors"] = article.authors
    datum["top_image"] = article.top_image
    datum["text"] = article.text
    datum["publish_date"] = article.publish_date
    datum["url"] = url
    data.append(datum)
Sometimes, when there is a connection error or the article you are trying to download has been deleted (especially if you are dealing with historical data dating back several years), the article.parse method can raise an ArticleException. To account for this error, we first need to import the error definition from the newspaper module. Then, we put our article.parse call inside a try... except... block and pass, i.e. ignore, the error. In this case, article fields such as “title” and “authors” will be empty (see below).
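The effect of this pattern is that a failed article still produces a (mostly empty) row rather than crashing the loop. Here is a standalone sketch of the same control flow, using a stand-in exception and a fake parse function in place of newspaper’s real ones, so it runs without network access:

```python
class FakeArticleException(Exception):
    """Stand-in for newspaper's ArticleException in this offline sketch."""

def fake_parse(url):
    # Pretend the second URL points to a deleted article.
    if url == "https://example.com/deleted":
        raise FakeArticleException(url)
    return {"title": f"Title of {url}"}

data = []
for url in ["https://example.com/ok", "https://example.com/deleted"]:
    try:
        parsed = fake_parse(url)
    except FakeArticleException:
        parsed = {"title": ""}  # keep a mostly-empty row, as in the loop above
    parsed["url"] = url
    data.append(parsed)
print(len(data))  # 2: the failed article is kept as a row with empty fields
```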
Step 4: Creating a Pandas dataframe
In this step we will create a table representation of our data that we can inspect, clean, and analyze. We use the pandas.DataFrame constructor to read our list of dictionaries, and print the table.
import pandas as pd
df = pd.DataFrame(data)
# print(df)
df  # in a Jupyter Notebook, a bare df on the last line renders as a table
As we can see, the newspaper package did a good job extracting article features such as title, text, and top image. However, because different websites have different, nonuniform structures, extracting correct author and date information is a difficult task, and the result may not always be perfect. If we look at rows 0 and 6, we can see that the author column contains words like “Https” and “December”, which are not valid names. Moreover, some publish date values are missing outright, and if we encounter an ArticleException, all fields in that row except for the url will be empty too.
Now we can save the dataset to a CSV file. CSV stands for Comma-Separated Values, meaning that the default sep (separator) is a comma. But because dataset columns such as authors and text are very likely to contain commas, we set the delimiter to the tab character \t instead, because it is very unlikely to be present in our data.
df.to_csv("data.csv", sep="\t")
In the next tutorial, we will look at how to remove invalid information from this dataset.