Our latest project on Sizzle is a visualization of the Top 10k Posts of All Time on Hacker News.
To create the visualization, we first needed to collect the data.
I noticed that there was an old copy of the Hacker News dataset available on BigQuery. But I needed an up-to-date copy, so I looked into the Hacker News Firebase API.
The API allows you to get each item by ID. You can start by retrieving the current max ID, then walking backwards from there. (Items may be stories, comments, etc.; the same API endpoint serves all item types.)
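For example, against the public v0 endpoints (a minimal Python sketch; the requests library and the tiny 5-item walk are just for illustration):

import requests

BASE = "https://hacker-news.firebaseio.com/v0"

# the current highest item ID
max_id = requests.get(f"{BASE}/maxitem.json").json()

# walk backwards from the max ID; stories, comments, etc. all come back
# from the same /item/{id}.json endpoint
for item_id in range(max_id, max_id - 5, -1):
    item = requests.get(f"{BASE}/item/{item_id}.json").json()
    print(item_id, item.get("type") if item else None)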
There is no rate limit, so I created the following script to generate a text file with 10MM lines containing all of the URIs to retrieve. (We will then feed this file into wget using xargs.)
Note: 10MM items works out to ~5 years' worth of data.
Script to create the 10MM line file of URIs to retrieve:
https://gist.github.com/aaronhoffman/1f753c660d7364bb594a36af350b227c
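That gist is what I actually used; as a rough sketch of the same idea in Python (assuming the hn-uri.txt filename used below and the 10MM item count), it amounts to something like this:

import requests

BASE = "https://hacker-news.firebaseio.com/v0"
ITEM_COUNT = 10_000_000  # roughly five years of items

# start at the newest item and write one URI per line, newest to oldest
max_id = requests.get(f"{BASE}/maxitem.json").json()
with open("hn-uri.txt", "w") as f:
    for item_id in range(max_id, max_id - ITEM_COUNT, -1):
        f.write(f"{BASE}/item/{item_id}.json\n")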
That script takes about 10 minutes to produce a file that's around 560MB in size.
After the file is generated, you can feed it to wget using xargs to retrieve all the URIs.
For example:
cat hn-uri.txt | xargs -P 100 -n 100 wget --quiet
(-P 100 runs up to 100 wget processes in parallel, and -n 100 passes up to 100 URIs to each invocation.)
wget will save the result of each GET request to a separate file named {id}.json.
Caution: That command took just over 30 hours to complete on my MacBook. (It also killed Finder a couple of times, and I had to disable Spotlight on the folder I was saving all the .json files to.)
I found that it can be difficult to work with 10MM files in a single directory on my Mac, so I will try to save you the trouble.
Here are a couple copies of the dataset I retrieved:
1. Here is a zip of the directory containing the 10MM .json files. (4GB)
2. Here is a SQL Server backup of a database containing a single table with one record per .json file. (2GB)
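If you want to build that table yourself from the raw files, here is a rough sketch of the idea (Python with SQLite as a stand-in for SQL Server; the hn-items directory name is hypothetical):

import os
import sqlite3

# one table, one record per {id}.json file
conn = sqlite3.connect("hn.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, raw_json TEXT)")

json_dir = "hn-items"  # directory holding the downloaded .json files
for name in os.listdir(json_dir):
    if not name.endswith(".json"):
        continue
    item_id = int(name[: -len(".json")])
    with open(os.path.join(json_dir, name)) as f:
        conn.execute("INSERT OR REPLACE INTO items VALUES (?, ?)", (item_id, f.read()))

conn.commit()
conn.close()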
Hope this helps!
Aaron