In the first tutorial of the series, we looked at how to use the Pushshift Reddit API and newspaper3k
package together to create a news dataset on the topic of vaccination. In this tutorial, we will look at how to clean and prepare the dataset for further analysis.
Loading data
First, we load the data using the read_csv
method of the Pandas module. If you remember from the last tutorial, we chose the tab character \t
as the CSV delimiter, because the default comma ,
already appears in the text of the news articles and would break the parsing.
import pandas as pd
df = pd.read_csv("data.csv", sep="\t")
df
Exploring the dataset
In this step, we will explore and identify issues with our data and implement possible solutions. We will use standard pandas
methods, as well as RegEx and pycld2
, a language detection package based on the Chromium project, to clean and normalize the data.
It is a good idea to move from left to right and work on one column at a time.
Issue 1: Leftover column “Unnamed: 0”
The very first issue comes from the way Pandas reads and writes data. If we look at the offending column, we can see that it duplicates the index of the dataframe.
Every dataframe needs an index; this is how Pandas identifies the rows. Our data already contains an index column from when it was saved, but because we didn’t tell Pandas which column to use as the index when loading, it generated a fresh one. The leftover index was therefore loaded as an ordinary column, duplicating the information.
There are two ways to fix this issue. On the one hand, we can set the index ourselves by passing the index_col
parameter - the position of the column we want to use as the dataframe index - to the read_csv
method:
df = pd.read_csv("data.csv", sep="\t", index_col=0)
Alternatively, since we have already loaded the data, we can simply drop the column, because Pandas has already assigned an index for us.
We get the same result in both cases.
df.drop("Unnamed: 0", axis=1, inplace=True)
The axis=1
parameter specifies that we want to remove a column with the name “Unnamed: 0” and not a row with the index name “Unnamed: 0”, which is the default behavior of the drop
method. We also ask Pandas to perform this operation in place; otherwise it would return a new dataframe that we would have to reassign to a variable, like so:
df = df.drop("Unnamed: 0", axis=1)
Issue 2: Authors column contains unwanted symbols like [ ] ’ ’ and incorrect values like “Https” and “December” for author names
This issue comes from the newspaper3k
package and how it extracts data from a webpage (see Part 1 of the tutorial series). If we look closely at one of the values in the authors
column, we can see that it is a string representation of a list of values such as “Https” and “Abc News”. There is a clear pattern that we can utilize to extract correct values from this string.
First, we note that the values we need are located inside single quotes ''
and, second, since Western names usually consist of a first and a last name, two-word strings like “Kaitlan Collins” are likely valid names, while one-word or many-word strings such as “Https” and “Media-Cldnry.S-Nbcnews.Com Image Up…” are not.
Given the observations above, we can start by preparing a regular expression to separate valid names from invalid ones. We start with an empty expression r""
. Because what we need is wrapped in single quotes, we plug that first piece of the puzzle into the expression: r"''"
. We then place a capturing group - a pair of parentheses - to signify that we want to extract exactly the characters between those quotes: r"'()'"
.
Moreover, we are looking for two-word strings, so our next step is to construct an expression that matches only two-word strings and discards everything else. If we look at a sample name “Kaitlan Collins”, we can see it consists of upper- and lower-case letters with a single space between the names. The RegEx cheatsheet tells us that we need the character class [A-Za-z]
for names and \s
for a single space. We now plug that into our pattern:
r"'([A-Za-z]+\s[A-Za-z]+)'"
The special character +
specifies that we are looking for one or more occurrences of the preceding pattern. For example, the name Kaitlan
is 7 characters long and Ameya
is 5. A name obviously cannot be 0 characters long, and there is no firm upper limit on how long a name can be, so this quantifier fits us well. Another way to write a length quantifier is with curly braces {}
. In this case we can replace the +
character with the expression {1,}
, where the missing number after the ,
means no upper limit. To set a more precise limit, we can specify the upper bound: {1,9}
.
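To sanity-check the pattern before wiring it into a function, we can run it on a hypothetical string shaped like the values in the authors column (the names here are made up for illustration):

```python
import re

# Hypothetical raw value, shaped like the strings newspaper3k produced
raw = "['Https', 'Kaitlan Collins', 'Abc News']"

pattern = r"'([A-Za-z]+\s[A-Za-z]+)'"
names = re.findall(pattern, raw)
print(names)  # ['Kaitlan Collins', 'Abc News']
```

The one-word string 'Https' has no space, so the pattern skips it, while both two-word names are captured.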
Our regular expression is now ready to be used inside a cleaning function.
import re

def extract_authors(row_val):
    regex = r"'([A-Za-z]+\s[A-Za-z]+)'"
    # The `findall` method returns a list of matches
    authors: list = re.findall(regex, row_val)
    # We then return our list of authors as a string separated by commas
    authors_str: str = ", ".join(authors)
    return authors_str

df["authors"] = df["authors"].apply(extract_authors)
Next, we select the column (series) that we want to clean and use the apply
method to pass every value in that column to the cleaning function, overwriting each value with the function’s return value.
This discards the unwanted elements and leaves us with clean author names extracted from the supplied strings.
Issue 3: Empty rows
If we look at row 2 or 4 in our dataset, we can see that we have empty values in columns such as authors
, publish_date
, and text
While missing authors
or publish_date
values are probably not detrimental to our analysis (unless we are doing a longitudinal study), text
is the unit of analysis we are after, so it cannot be empty.
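Before dropping anything, it can help to count the missing values per column; a minimal sketch on a toy dataframe (the column names mirror ours, the values are made up):

```python
import pandas as pd
import numpy as np

# Toy dataframe standing in for the real dataset
toy = pd.DataFrame({
    "authors": ["Kaitlan Collins", np.nan],
    "text": [np.nan, "Some article text"],
})

# Count missing values in each column
print(toy.isna().sum())

# Drop rows where the unit of analysis, the text, is missing
toy = toy.dropna(subset=["text"])
print(len(toy))  # 1
```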
Because of that, we simply drop the rows where the text
column is empty.
df.dropna(subset=["text"], inplace=True)
df
Issue 4: Presence of \n\n
(new line) characters in text
If we look closely at the first row of the text
column, we can see that it contains doubled newline characters \n\n that visually separate the paragraphs. These symbols serve no informational purpose and are purely stylistic.
df["text"][0]
We can remove those characters with the following string method (in this case, we substitute them with a single space):
df["text"] = df["text"].str.replace("\n\n", " ")
Issue 5: German language news in the otherwise English-centric dataset
Although English is the main language of Reddit, as a multilingual forum it often hosts posts in other languages. If we look at row 5 of the dataset, we can see a news article written in German. Depending on the requirements of our study, there are two ways to deal with this irregularity: remove the row, or translate it into English, which normalizes the dataset without losing any entries. Both are acceptable; in this tutorial, we will go with the former.
First, we need to identify which rows are in English and which are not. In order to do that, we will use a package called pycld2
.
pip install pycld2
Just like in the previous use of the apply
method, we write a function get_language
that is applied to every value in the title
column, with the return value written to a new column called lang
. We chose the title because it is one or two sentences at most (as opposed to text
), which keeps detection computationally cheap.
We use the detect
method of the pycld2
module, which takes a string of text and returns three values isReliable
, textBytesFound
, and details
. We discard the first two values, because we are only interested in the third one - details
, which is a tuple of tuples similar to this:
(('ENGLISH', 'en', 98, 957.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0))
What we are looking to extract is either the first or second value - “ENGLISH” or “en”, respectively. We can work with either one, but I prefer the short-hand version en
, which is a language code.
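Indexing into the sample details tuple above shows exactly where that code sits:

```python
# Sample `details` tuple, copied from the pycld2 output shown above
details = (('ENGLISH', 'en', 98, 957.0),
           ('Unknown', 'un', 0, 0.0),
           ('Unknown', 'un', 0, 0.0))

# details[0] is the best guess; within it, index 0 is the
# language name and index 1 is the short language code
print(details[0][0])  # ENGLISH
print(details[0][1])  # en
```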
Then we keep only the rows that are in English and drop the lang
column, as we no longer need it.
import pycld2 as cld2

def get_language(text: str) -> str:
    # We discard the first two return values.
    # We could name them _ to make it clear
    # to readers that they are unused:
    # _, _, details = cld2.detect(text)
    isReliable, textBytesFound, details = cld2.detect(text)
    return details[0][1]

df["lang"] = df["title"].apply(get_language)
df = df[df["lang"] == "en"]
df.drop("lang", axis=1, inplace=True)
Issue 6: Empty strings ""
should be NaN
If we look at the authors
column, we can see some empty strings where we couldn’t identify author names. Another step toward normalizing the data is to uniformly mark all the empty values. By default, Pandas uses a special notation NaN
to mark missing values. We can do a quick scan of the whole dataframe and replace empty strings (if any) with NaN
.
import numpy as np
df = df.replace("", np.nan)
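On a toy dataframe, the exact-match replacement looks like this (a pattern such as r"^\s*$" passed with regex=True would additionally catch whitespace-only strings):

```python
import pandas as pd
import numpy as np

# Toy frame: one real name, one empty string left by the cleaning step
toy = pd.DataFrame({"authors": ["Kaitlan Collins", ""]})

# Exact-match replacement: only cells that are precisely "" become NaN
toy = toy.replace("", np.nan)
print(toy["authors"].isna().sum())  # 1
```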
Bonus step: Sorting the dataframe by date
Currently, our dataset appears to be in random order. It is always a good idea to sort datasets according to certain criteria. For example, if we want to see the progression of headlines over time, the logical choice is to sort them by publish_date
. We can do that with the following method:
df.sort_values(by="publish_date")
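One caveat: publish_date was read in as plain strings, so the sort is lexicographic. Converting the column with pd.to_datetime first guarantees a truly chronological order; a sketch on made-up dates:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["later", "earlier"],
    "publish_date": ["2021-03-05", "2021-01-12"],
})

# Parse the strings into real timestamps, then sort chronologically
toy["publish_date"] = pd.to_datetime(toy["publish_date"])
toy = toy.sort_values(by="publish_date")
print(toy["title"].tolist())  # ['earlier', 'later']
```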
Finally, we save our cleaned dataset to data_cleaned.csv
. Passing index=False prevents the index from being written out as a column again, which is exactly what caused Issue 1.
df.to_csv("data_cleaned.csv", sep="\t", index=False)
In the next tutorial, we will continue with a full textual analysis of the dataset.