Dataset

First of all, we need a dataset. We could use the Reddit API, but it limits how many posts you can retrieve. Luckily, you can find a dump of everything from Reddit at files.pushshift.io/reddit. Let’s download a couple of monthly dumps of submissions:

wget https://files.pushshift.io/reddit/submissions/RS_2020-02.zst
wget https://files.pushshift.io/reddit/submissions/RS_2020-03.zst

Next, we need to read the data and select only the subreddits and columns we’re interested in. Each dump is big even compressed (over 5 GB), and uncompressed it takes up to 20 times more space. So, instead of decompressing everything at once, we will read the file line by line, decide if we need the record, and only then process it. We can do it using the zstandard library (and tqdm to watch the progress).

from datetime import datetime
import json
import io

import zstandard
from tqdm import tqdm

paths = [
    '/home/gram/Downloads/RS_2020-02.zst',
    '/home/gram/Downloads/RS_2020-03.zst',
]
subreddits = {'python', 'datascience'}
posts = []

for path in paths:
    with open(path, 'rb') as fh:
        # stream-decompress the archive instead of loading it into memory
        dctx = zstandard.ZstdDecompressor()
        stream_reader = dctx.stream_reader(fh)
        text_stream = io.TextIOWrapper(stream_reader, encoding='utf-8')
        for line in tqdm(text_stream):
            # every line of the dump is a JSON object describing one submission
            post = json.loads(line)
            if post['subreddit'].lower() not in subreddits:
                continue
            posts.append((
                datetime.fromtimestamp(post['created_utc']),
                post['domain'],
                post['num_comments'],
                post['id'],
                post['score'],
                post['subreddit'],
                post['title'],
            ))

In the real world, you’d be better off using a NamedTuple to store the filtered records. However, it’s OK to sacrifice readability for simplicity in a one-off script like this.
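For illustration, here is a minimal sketch of what that could look like (the Post class and its field types are my assumption, mirroring the tuple above):

from datetime import datetime
from typing import NamedTuple

# hypothetical record type, one field per column we keep
class Post(NamedTuple):
    created: datetime
    domain: str
    comments: int
    id: str
    score: int
    subreddit: str
    title: str

With it, posts.append(Post(...)) makes later access self-documenting: post.score instead of post[4].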

On my machine, it took about half an hour to complete. So, take a break.

Pandas

Let’s convert the filtered data into a pandas data frame:

import pandas

df = pandas.DataFrame(posts, columns=[
    'created', 'domain', 'comments', 'id', 'score', 'subreddit', 'title',
])
df.head()

At this point, we can save the data frame, so that later we can get back to work without filtering the data again:

# dump
df.to_pickle('filtered.bin')
# load
df = pandas.read_pickle('filtered.bin')

Numbers

Let’s see some numbers. Feel free to play with the data as you like. For example, here is the share of posts with a score above a threshold (comparing a Series to a number yields a boolean Series, and the mean of booleans is the fraction of True values):

threshold = 5
subreddit = 'python'

(df[df.subreddit.str.lower() == subreddit.lower()].score > threshold).mean()

Table

Now, we’ll make a new dataset where, for every hour, we count the total number of posts and the number of “survived” posts (those with a score above the threshold, 5 in our case):

# filter the subreddit
df2 = df[df.subreddit.str.lower() == subreddit.lower()]
# keep only the hour and a flag showing whether the post survived
df2 = pandas.DataFrame(dict(
    hour=df2.created.apply(lambda x: x.hour),
    survived=df2.score > threshold,
))
# group by hour and count how many posts survived and how many there were in total
df2 = df2.groupby(['hour'], as_index=False)
df2 = pandas.DataFrame(dict(
    hour=range(24),
    survived=df2.survived.sum().survived,
    total=df2.count().survived,
))
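The same grouping can be written more compactly with pandas named aggregation; this sketch assumes every hour of the day occurs in the data (true for any reasonably active subreddit):

df2 = df2.groupby('hour', as_index=False).agg(
    survived=('survived', 'sum'),
    total=('survived', 'count'),
)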

Charts

Now, let’s draw the charts.

Chart for total and survived posts:

import plotnine as gg
(
    gg.ggplot(df2)
    + gg.theme_light()
    + gg.geom_col(gg.aes(x='hour', y='total', fill='"#3498db"'))
    + gg.geom_col(gg.aes(x='hour', y='survived', fill='"#c0392b"'))
    # make a custom legend
    + gg.scale_fill_manual(
        name=f'rating >{threshold}',
        guide='legend',
        values=['#3498db', '#c0392b'],
        labels=['no', 'yes'],
    )
    + gg.xlab('hour (UTC)')
    + gg.ylab('posts')
    + gg.ggtitle(f'Posts in /r/{subreddit} per hour\nand how many got rating above {threshold}')
)
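By the way, if you want to save a chart into a file instead of displaying it, plotnine plots have a save method (the variable and file names here are arbitrary):

chart = gg.ggplot(df2) + gg.theme_light()  # plus the same layers as above
chart.save('posts_per_hour.png')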

Chart for the survival ratio:

(
    gg.ggplot(df2)
    + gg.theme_light()
    + gg.geom_col(gg.aes(x='hour', y='survived / total * 100'), fill="#c0392b")
    + gg.geom_text(
        gg.aes(x='hour', y=1, label='survived / total * 100'),
        va='bottom', ha='center', angle=90, format_string='{:.0f}%', color='white',
    )
    # fix the y axis to 0-100%
    # so charts for different subreddits can be visually compared
    + gg.ylim(0, 100)
    + gg.xlab('hour (UTC)')
    + gg.ylab(f'% of posts with rating >{threshold}')
    + gg.ggtitle(f'Posts in /r/{subreddit} with rating >{threshold} per hour')
)

Results

Here is what I’ve got for some subreddits.

r/datascience

r/python

r/golang