Statistics on reddit's top 10,000 titles with NLTK
Drawing inspiration from this blog post on title virality, I wanted to investigate what makes these top 10,000 titles the best of their breed. Which are the best superlatives? Who/what's the most popular subject? Let's start with some statistics:
- On Feb. 03, 14:10:45 (UTC) the all-time top 10,000 submissions on reddit (/r/all) had a total of 82,751,429 upvotes and 62,655,532 downvotes (56.9% liked it).
- 5.2 years between the oldest and newest submission
- 8,331,382 comments. That's about 833 comments per submission.
- The #1 post has 26,758 - 4,882 = 21,876 points
- The #10,000 post has 15,166 - 13,679 = 1,487 points
- And now some graphs....
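Before the graphs, a quick sanity check: the derived figures in the list above (percentage liked, comments per submission, point totals) follow directly from the raw counts. Here is a minimal Python sketch of that arithmetic. The variable and function names are made up for illustration; only the numbers come from the list.

```python
# Raw figures copied from the statistics listed above.
upvotes = 82_751_429
downvotes = 62_655_532
comments = 8_331_382
submissions = 10_000


def liked_percentage(up, down):
    """Share of all votes that were upvotes, as a percentage."""
    return 100.0 * up / (up + down)


def score(up, down):
    """Points shown for a submission: upvotes minus downvotes."""
    return up - down


liked = round(liked_percentage(upvotes, downvotes), 1)
comments_per_submission = comments // submissions  # whole comments only
top_score = score(26_758, 4_882)      # the #1 post
last_score = score(15_166, 13_679)    # the #10,000 post

print(liked, comments_per_submission, top_score, last_score)
```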
Adjectives - reddit loves "new", "old", "good" and "right"
Top Adjective, Superlative - "Best" is the best
Questions - reddit loves "how"
What's reddit talking about? People.
Or news, the president, man...
Reddit appreciates personal content about you, this, it and I.
Even NLTK doesn't understand these...
I'm pretty sure you don't need example links for these...
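For anyone curious how word-category counts like the ones in these graphs can be produced, here is a rough sketch and not necessarily the exact code used for this post. It assumes NLTK's `pos_tag` with its default Penn Treebank tags (`JJ` adjective, `JJS` superlative, `NN` noun, `PRP` personal pronoun) and a naive whitespace split instead of a proper tokenizer. The helper names `nltk_tagger` and `count_words_by_tag` are hypothetical.

```python
from collections import Counter


def nltk_tagger(tokens):
    """Default tagger: NLTK's pos_tag (needs NLTK and its tagger model installed)."""
    import nltk  # imported lazily so the counting logic runs without NLTK
    return nltk.pos_tag(tokens)


def count_words_by_tag(titles, tag_prefix, tagger=nltk_tagger):
    """Count lowercased words whose part-of-speech tag starts with tag_prefix.

    "JJ" matches JJ, JJR and JJS (all adjectives); "JJS" matches superlatives only.
    """
    counts = Counter()
    for title in titles:
        tokens = title.split()  # assumption: whitespace split, punctuation stays attached
        for word, tag in tagger(tokens):
            if tag.startswith(tag_prefix):
                counts[word.lower()] += 1
    return counts
```

With a list of title strings in `titles`, something like `count_words_by_tag(titles, "JJS").most_common(10)` would give the ten most frequent superlatives. Passing a different `tagger` makes it easy to try the counting on a small hand-tagged sample first.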
The top 10,000 submissions were most often posted around 17:00 UTC and least often around 12:00 UTC.
This isn't exactly the probability of hitting the front page, since it's not clear at what time of day the overall submission count is highest. But it's something.
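To make that caveat concrete, here is a small sketch of both quantities: the per-hour count of top submissions (what the graph shows) and the per-hour success rate you could compute if the total number of submissions per hour were also known. It assumes each submission comes with a Unix timestamp in seconds; the function names are hypothetical, and the per-hour totals are exactly the data missing from this post.

```python
from collections import Counter
from datetime import datetime, timezone


def submissions_per_utc_hour(timestamps):
    """Count submissions by the UTC hour (0-23) in which they were posted."""
    hours = Counter()
    for ts in timestamps:
        hours[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    return hours


def success_rate_per_hour(top_counts, all_counts):
    """Fraction of all submissions in each hour that reached the top list.

    all_counts (total submissions per hour) is an assumed input; without it
    only the raw per-hour counts of top submissions are available.
    """
    return {
        hour: top_counts.get(hour, 0) / total
        for hour, total in all_counts.items()
        if total > 0
    }
```

An hour with many top submissions can still have a low success rate if far more posts are submitted during it, which is why the raw counts alone don't tell you the best time to post.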
This is my first time using NLTK, and though I'm OK at coding, I most certainly have no idea how to parse natural language. Here's hoping this was somewhat insightful.