Statistics on reddit's top 10,000 titles with NLTK

07 February 2012

Drawing inspiration from this blog post on title virality, I wanted to investigate what makes these top 10,000 titles the best of their breed. Which are the best superlatives? Who or what is the most popular subject? Let's start with some statistics:

Adjectives - reddit loves "new", "old", "good" and "right"

Top Adjective, Superlative - "Best" is the best
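For the curious, here is a rough sketch of how counts like these could be produced. It is not the exact script behind the graphs. It assumes each title has already been turned into (word, tag) pairs, for example with `nltk.pos_tag(nltk.word_tokenize(title))`, and that the tags follow the Penn Treebank scheme, where `JJ` marks a plain adjective and `JJS` a superlative. The function name `top_words_by_tag` and the sample data are made up for illustration.

```python
from collections import Counter


def top_words_by_tag(tagged_titles, tags, n=10):
    """Return the n most common lowercased words whose POS tag is in `tags`."""
    counts = Counter(
        word.lower()
        for title in tagged_titles
        for word, tag in title
        if tag in tags
    )
    return counts.most_common(n)


# Hand-tagged toy titles (hypothetical data), in the same (word, tag)
# shape that nltk.pos_tag returns for a tokenized title.
sample = [
    [("The", "DT"), ("best", "JJS"), ("new", "JJ"), ("game", "NN")],
    [("Best", "JJS"), ("old", "JJ"), ("photo", "NN")],
    [("A", "DT"), ("new", "JJ"), ("president", "NN")],
]

adjectives = top_words_by_tag(sample, {"JJ"})     # plain adjectives only
superlatives = top_words_by_tag(sample, {"JJS"})  # superlatives only
```

Passing a different tag set, such as `{"WRB", "WP"}` for question words or `{"NN", "NNS"}` for nouns, would give the other breakdowns in the same way.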

Questions - reddit loves "how"

What's reddit talking about? People.

Or news, the president, man...

Reddit appreciates personal content about you, this, it and I.

Even NLTK doesn't understand these...

I'm pretty sure you don't need example links for these...

The top 10,000 seem to come mostly from 17:00 UTC and rarely from around 12:00 UTC

This isn't exactly the probability of hitting the front page, since it's not clear at what time of day the total number of submissions is highest. But it's something.
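A minimal sketch of the hourly count, and of the normalization that is missing, might look like this. It assumes each submission comes with a Unix timestamp in UTC (reddit's JSON exposes one as `created_utc`). The function names and sample values are hypothetical, and the second function only helps if you also have per-hour totals for all submissions, which I don't.

```python
from collections import Counter
from datetime import datetime, timezone


def submissions_per_utc_hour(timestamps):
    """Count submissions for each UTC hour; returns a list of 24 counts."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]


def hit_rate_per_hour(top_counts, all_counts):
    """Divide top-title counts by total submissions per hour.

    Hours with no submissions at all yield None instead of a rate.
    """
    return [
        top / total if total else None
        for top, total in zip(top_counts, all_counts)
    ]


# Toy timestamps (hypothetical data): two posts in the 17:00 hour, one at 12:00.
sample_timestamps = [
    datetime(2012, 2, 7, 17, 0, tzinfo=timezone.utc).timestamp(),
    datetime(2012, 2, 7, 17, 30, tzinfo=timezone.utc).timestamp(),
    datetime(2012, 2, 7, 12, 0, tzinfo=timezone.utc).timestamp(),
]

per_hour = submissions_per_utc_hour(sample_timestamps)
```

With totals for every hour, `hit_rate_per_hour(per_hour, totals)` would turn the raw counts into something closer to a success rate.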

An apology

This is my first time using NLTK, and though I'm OK at coding, I most certainly have no idea how to parse natural language. Here's hoping this was somewhat insightful. I have no idea what I'm doing.

More graphs and data

Appendix