Hacker News book suggestions

Analyzing Hacker News book suggestions in Python

An analysis of a Hacker News thread using Python, the Hacker News API and the Goodreads API, and the definitive top 20 book suggestion list!

Alessandro Mozzato

A few days ago the traditional “what books did you read this year” thread popped up on Hacker News. The thread is full of very nice book suggestions. Attempting to make a reading list for next year, I thought it would be fun to get the data and analyze it. In the following article I will show how I used the Hacker News API to scrape the posts’ content, how I selected the most common titles and checked them against the Goodreads API, and finally how I came up with the definitive top 20 most recommended books. As always, dealing with text data is anything but straightforward. The final result, however, is quite satisfying!

Scraping the thread: Hacker News API

The first step is getting the data. Luckily, Hacker News provides a very nice API to freely scrape all of its content. The API has endpoints for posts, users, top posts and a few others. For this article we will use the one for posts. It’s very simple to use; the basic syntax is v0/item/{id}.json, where id is the item we are interested in. In this case the thread’s id is 18661546, so here is an example of how to get the main page data:

import requests
main_page = requests.get('https://hacker-news.firebaseio.com/v0/item/18661546.json').json()

The same API call is also used for the sub posts of a thread or a post, whose ids can be found in the kids key of the parent post. Looping over the kids we can get the text of every post in the thread.
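The loop over kids can be sketched roughly as follows. This is a minimal illustration rather than the exact code used for the analysis: the names HN_ITEM_URL, fetch_item and collect_texts are made up for this example, and it uses the standard library's urllib so that it runs without extra packages (requests.get(...).json() would work just as well). Replies can have replies of their own, so the sketch walks the whole tree and skips deleted or empty items.

```python
import json
from urllib.request import urlopen

# Item endpoint of the Hacker News API
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch a single Hacker News item (story or comment) as a dict."""
    with urlopen(HN_ITEM_URL.format(item_id), timeout=10) as response:
        return json.load(response)


def collect_texts(item_id, fetch=fetch_item):
    """Return the text of the item and of every post below it (order not preserved)."""
    texts = []
    stack = [item_id]
    while stack:
        item = fetch(stack.pop())
        # Missing items come back as null (None in Python).
        if not item:
            continue
        # Deleted posts carry no usable text.
        if item.get("text") and not item.get("deleted"):
            texts.append(item["text"])
        # "kids" holds the ids of the direct replies, if any.
        stack.extend(item.get("kids", []))
    return texts
```

Calling collect_texts(18661546) would then return the raw text of the whole thread, at the cost of one request per post. The fetch parameter is only there so the traversal can be tried on fake data without hitting the network.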

Cleaning the data

Now that we have the text data we want to extract book titles from it. One possible approach would be to look for all Amazon or Goodreads links in the thread and just group by those. This is a clean approach because it doesn’t depend on any text processing. However, a quick look at the thread makes it clear that the vast majority of suggestions do not have any link associated with them. So I decided to go for the more difficult route: grouping ngrams together and matching those ngrams with possible books.

So, after eliminating special characters from the text, I grouped together bigrams, trigrams, 4-grams and 5-grams and counted the occurrences. Here is an example that counts bigrams:

import re
from collections import Counter
import operator

# clean special characters
text_clean = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for t in text for k in t.split("\n")]

# count occurrences of bigrams in different posts
countsb = Counter()
words = re.compile(r'\w+')
for t in text_clean:
    w = words.findall(t.lower())
    countsb.update(zip(w, w[1:]))

# sort results
bigrams = sorted(countsb.items(), key=operator.itemgetter(1), reverse=True)
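The same idea extends to trigrams, 4-grams and 5-grams by zipping more shifted copies of the word list. The following is one possible generalization, not necessarily how the original counts were produced; count_ngrams and the two sample lines are invented for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")


def count_ngrams(lines, n):
    """Count n-grams (tuples of lowercase words) over a list of text lines."""
    counts = Counter()
    for line in lines:
        words = WORD.findall(line.lower())
        # Zipping n shifted copies of the word list yields every window of n consecutive words.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


# Made-up sample lines, only to show the call
sample = ["I read Bad Blood this year", "Bad Blood is great"]
ngram_counts = {n: count_ngrams(sample, n) for n in range(2, 6)}
print(ngram_counts[2].most_common(1))  # [(('bad', 'blood'), 2)]
```

Lines shorter than n words simply contribute nothing, because zip stops at the shortest of the shifted lists.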

Usually in text applications one of the first things to do while processing the data is to eliminate stopwords, i.e. the most common words in a language, like articles and prepositions. In our case we have not eliminated stopwords from the text yet, so the most frequent ngrams are almost exclusively composed of stopwords. In fact, here is a sample output of the top 10 most common bigrams in our data:

[((u'of', u'the'), 147), ((u'in', u'the'), 76), ((u'it', u's'), 67), ((u'this', u'book'), 52), ((u'this', u'year'), 49), ((u'if', u'you'), 45), ((u'and', u'the'), 44), ((u'i', u've'), 44), ((u'to', u'the'), 40), ((u'i', u'read'), 37)]

Having stopwords in our data is fine: most book titles have stopwords in them, so we want to keep these. However, to avoid looking up too many combinations, we eliminate the ngrams that are composed solely of stopwords and keep all the others.
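A filter along these lines could look like the sketch below. The stopword set here is a tiny hand-picked one for illustration (a real run would use a fuller list, for example the one shipped with an NLP library), and drop_stopword_only is a hypothetical helper name. The sample counts are taken from the bigram output shown earlier.

```python
# Small illustrative stopword set; "s" and "ve" are what is left of "it's" and "I've"
# after special characters have been replaced by spaces.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "it", "s", "this",
             "if", "you", "i", "ve", "to", "is", "on", "for"}


def drop_stopword_only(ngram_counts, stopwords=STOPWORDS):
    """Keep every ngram that contains at least one non-stopword."""
    return {
        ngram: count
        for ngram, count in ngram_counts.items()
        if not all(word in stopwords for word in ngram)
    }


counts = {("of", "the"): 147, ("this", "book"): 52, ("if", "you"): 45, ("i", "read"): 37}
kept = drop_stopword_only(counts)
print(kept)  # {('this', 'book'): 52, ('i', 'read'): 37}
```

Note that an ngram such as ("this", "book") survives because only one of its words is a stopword, which is exactly what we want for titles that contain articles or prepositions.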

Checking book titles: the Goodreads API

Now that we have a list of possible ngrams we will use the Goodreads API to check if these ngrams correspond to book titles. In case multiple matches are available for a search I decided to take the most recent publication as the result of the search. This is assuming that the most recent book would be the most likely match for this context. This is of course an assumption that might lead to errors.

The Goodreads API is a bit less straightforward to use than the Hacker News one, as it returns results in XML, which is less friendly to work with than JSON. In this analysis I used the xmltodict Python package to convert the XML to JSON. The API method we need is search.books, which lets us search books by title, author or ISBN. Here is a code sample to get the book title and author for the most recently published search result:

import json
import numpy as np
import requests
import xmltodict

# grkey is your Goodreads developer key
res = requests.get("https://www.goodreads.com/search/index.xml", params={"key": grkey, "q": 'some book title'})
# parse the XML and round-trip through JSON to get plain dicts and lists
xpars = xmltodict.parse(res.text)
json1 = json.dumps(xpars)
d = json.loads(json1)
lst = d['GoodreadsResponse']['search']['results']['work']
# pick the search result with the latest original publication year
ys = [int(lst[j]['original_publication_year']['#text']) for j in range(len(lst))]
title = lst[np.argmax(ys)]['best_book']['title']
author = lst[np.argmax(ys)]['best_book']['author']['name']
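When running this over many ngrams, two edge cases are worth guarding against: xmltodict returns a single dict instead of a list when a search has exactly one result, and a result may lack a publication year. The helper below is a defensive sketch of the "most recent publication" selection. It assumes the parsed structure has the same shape as in the sample above; most_recent_work and the placeholder titles and authors are invented for the example.

```python
def most_recent_work(works):
    """Return (title, author) of the work with the latest publication year, or None."""
    # A single <work> element is parsed into a dict rather than a one-element list.
    if isinstance(works, dict):
        works = [works]
    if not works:
        return None

    def year(work):
        # Works without a year have no '#text' entry; rank them last.
        text = (work.get("original_publication_year") or {}).get("#text")
        return int(text) if text else float("-inf")

    best = max(works, key=year)["best_book"]
    return best["title"], best["author"]["name"]


# Placeholder data mimicking the parsed search result
works = [
    {"original_publication_year": {"#text": "1998"},
     "best_book": {"title": "Older Title", "author": {"name": "Author A"}}},
    {"original_publication_year": {"@nil": "true"},
     "best_book": {"title": "Undated Title", "author": {"name": "Author C"}}},
    {"original_publication_year": {"#text": "2015"},
     "best_book": {"title": "Newer Title", "author": {"name": "Author B"}}},
]
print(most_recent_work(works))  # ('Newer Title', 'Author B')
```

Using max with a key function also removes the need for numpy here.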

This method allows us to associate ngrams with possible books. We then check the list of books obtained by matching all ngrams against the Goodreads API against the full text data. Before performing the actual check we trim the book names, eliminating punctuation (particularly semicolons) and subtitles. We only consider the main title, on the assumption that most of the time only this part of the title would be used (some of the full titles in the list are actually really long!). Ranking the results by number of occurrences in the thread, we get this list:
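The trimming and counting step might be sketched as follows. This is an illustration under stated assumptions, not the exact procedure behind the list: it assumes a subtitle starts at the first colon, semicolon or opening parenthesis, it matches whole words only (a design choice so that a short title does not match inside a longer word), and main_title, count_title_mentions and the two sample lines are made up for the example.

```python
import re


def main_title(full_title):
    """Return the lowercase main title, without subtitle or punctuation."""
    # Assumption: the subtitle starts at the first colon, semicolon or parenthesis.
    head = re.split(r"[:;(]", full_title, maxsplit=1)[0]
    # Same cleaning as for the thread text: non-alphanumerics become spaces.
    return re.sub(r"[^a-zA-Z0-9]+", " ", head).strip().lower()


def count_title_mentions(titles, clean_lines):
    """Rank titles by how often their main title appears in the cleaned text."""
    counts = {}
    for title in titles:
        key = main_title(title)
        if not key:
            continue
        # Word boundaries keep the match to whole words.
        pattern = re.compile(r"\b" + re.escape(key) + r"\b")
        counts[title] = sum(len(pattern.findall(line.lower())) for line in clean_lines)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


titles = ["Bad Blood: Secrets and Lies in a Silicon Valley Startup", "The Magicians"]
lines = ["bad blood was the best book i read", "i loved the magicians and bad blood"]
ranked = count_title_mentions(titles, lines)
print(ranked[0])  # ('Bad Blood: Secrets and Lies in a Silicon Valley Startup', 2)
```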

Books with more than 3 counts in the thread

So Bad Blood looks to be the most recommended book in the thread. Checking the other results, most of them seem to make sense and match the thread, including the counts. The only big mistake I could spot in the list is at position number 2, where the book Magi was identified instead of The Magicians by Lev Grossman. The latter is indeed cited 7 times in the text. This error is caused by the assumption we
