
Hacker News book suggestions

Analyzing Hacker News book suggestions in Python

An analysis of a Hacker News thread, using Python, the Hacker News API and the Goodreads API, and the definitive top 20 book suggestion list!

Alessandro Mozzato


A few days ago the traditional “what books did you read this year” thread popped up on Hacker News. The thread is full of very nice book suggestions. While putting together a reading list for next year, I thought it would be fun to get the data and analyze it. In this article I will show how I used the Hacker News API to scrape the content of the posts, how I selected the most common titles and checked them against the Goodreads API, and finally how I came up with the definitive top 20 most recommended books. As always, dealing with text data is anything but straightforward. The final result, however, is quite satisfying!

Scraping the thread: Hacker News API

The first step is getting the data. Luckily, Hacker News provides a very nice API to freely scrape all of its content. The API has endpoints for posts, users, top posts and a few others. For this article we will use the one for posts. It’s very simple to use; here is the basic syntax: v0/item/{id}.json, where id is the id of the item we are interested in. In this case the thread’s id is 18661546, so here is an example of how to get the main page data:

import requests
main_page = requests.get('https://hacker-news.firebaseio.com/v0/item/18661546.json').json()

The same API call is also used for the sub posts of a thread or a post, whose ids can be found in the kids key of the parent post. Looping over the kids we can get the text of every post in the thread.
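The loop over the kids can be sketched as a small recursive function. This is only a sketch under a few assumptions: the names fetch_item and collect_texts are made up for illustration, the standard library's urllib is swapped in for requests so the block has no dependencies, and deleted or dead comments are skipped because they carry no usable text.

```python
import json
from urllib.request import urlopen

HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Download one item (story or comment) as a dict; the API may return null."""
    with urlopen(HN_ITEM_URL.format(item_id)) as resp:
        return json.load(resp)


def collect_texts(item_id, fetch=fetch_item):
    """Return the text of an item and of every comment below it, depth first."""
    item = fetch(item_id) or {}
    texts = []
    # deleted or dead comments have no usable text, so skip them
    if item.get("text") and not item.get("deleted") and not item.get("dead"):
        texts.append(item["text"])
    # the ids of the direct replies live in the "kids" key
    for kid in item.get("kids", []):
        texts.extend(collect_texts(kid, fetch))
    return texts
```

Calling collect_texts(18661546) would issue one request per comment, so it can take a while on a large thread. The fetch parameter exists so that the traversal can be tried on canned data without touching the network.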

Cleaning the data

Now that we have the text data we want to extract book titles from it. One possible approach would be to look for all Amazon or Goodreads links in the thread and just group by those. This is a clean approach because it doesn’t depend on any text processing. However, a quick look at the thread makes it clear that the vast majority of suggestions do not have any link associated with them. So I decided to go for the more difficult route: grouping ngrams together and matching those ngrams with possible books.

So, after eliminating special characters from the text, I grouped together bigrams, trigrams, 4-grams and 5-grams and counted the occurrences. This is an example of how to count bigrams:

import re
from collections import Counter
import operator

# clean special characters
text_clean = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for t in text for k in t.split("\n")]

# count occurrences of bigrams in different posts
countsb = Counter()
words = re.compile(r'\w+')
for t in text_clean:
    w = words.findall(t.lower())
    countsb.update(zip(w, w[1:]))

# sort results
bigrams = sorted(countsb.items(), key=operator.itemgetter(1), reverse=True)
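The same idea extends to trigrams, 4-grams and 5-grams. Here is a minimal sketch of a generic counter; the function name count_ngrams and the two sample posts are invented for illustration and are not taken from the thread.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def count_ngrams(posts, n):
    """Count n-grams (tuples of n consecutive lowercase words) across posts."""
    counts = Counter()
    for post in posts:
        words = WORD_RE.findall(post.lower())
        # zipping n shifted views of the word list yields consecutive n-grams
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


# made-up sample posts, only to show the shape of the result
posts = ["I read Bad Blood this year", "Bad Blood was great"]
counts_by_n = {n: count_ngrams(posts, n) for n in range(2, 6)}
print(counts_by_n[2].most_common(1))  # [(('bad', 'blood'), 2)]
```

Posts shorter than n words simply contribute nothing, because zip stops at the shortest of the shifted views.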

In text applications, one of the first steps when processing the data is usually to eliminate stopwords, i.e. the most common words in a language, like articles and prepositions. We have not eliminated stopwords from our text yet, so the most frequent ngrams are composed almost exclusively of stopwords. In fact, here is a sample output of the top 10 most common bigrams in our data:

[((u'of', u'the'), 147), ((u'in', u'the'), 76), ((u'it', u's'), 67), ((u'this', u'book'), 52), ((u'this', u'year'), 49), ((u'if', u'you'), 45), ((u'and', u'the'), 44), ((u'i', u've'), 44), ((u'to', u'the'), 40), ((u'i', u'read'), 37)]

Having stopwords in our data is fine: most book titles have stopwords in them, so we want to keep these. However, to avoid looking up too many combinations, we eliminate the ngrams that are composed solely of stopwords and keep all the others.
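A filter of this kind could look like the sketch below. The stopword list here is a tiny hand-picked one, chosen only to cover the bigrams shown above; a real run would need a fuller list, and which list was actually used is not stated, so treat both the list and the function name drop_stopword_only as assumptions.

```python
# Tiny illustrative stopword list (an assumption, not the list used in the analysis)
STOPWORDS = {"a", "an", "and", "i", "if", "in", "it", "of",
             "s", "the", "this", "to", "ve", "you"}


def drop_stopword_only(counts, stopwords=STOPWORDS):
    """Keep only the n-grams that contain at least one non-stopword."""
    return {
        ngram: count
        for ngram, count in counts.items()
        if not all(word in stopwords for word in ngram)
    }


# four of the bigram counts from the sample output above
sample = {("of", "the"): 147, ("in", "the"): 76,
          ("this", "book"): 52, ("i", "read"): 37}
print(drop_stopword_only(sample))
# {('this', 'book'): 52, ('i', 'read'): 37}
```

Note that an ngram such as "this book" survives because only one of its words is a stopword, which is exactly what lets titles containing stopwords through.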

Checking book titles: the Goodreads API

Now that we have a list of possible ngrams we will use the Goodreads API to check whether these ngrams correspond to book titles. When a search returns multiple matches, I take the most recent publication as the result, assuming that the most recent book is the most likely match in this context. This assumption can of course lead to errors.

The Goodreads API is a bit less straightforward to use than the Hacker News one, as it returns results in XML, which is less friendly to work with than JSON. In this analysis I used the xmltodict Python package to convert the XML into JSON-like dictionaries. The API method we need is search.books, which allows searching for books by title, author or ISBN. Here is a code sample to get the book title and author for the most recently published search result:

import json
import numpy as np
import requests
import xmltodict

# grkey holds your Goodreads API key
res = requests.get("https://www.goodreads.com/search/index.xml",
                   params={"key": grkey, "q": 'some book title'})
xpars = xmltodict.parse(res.text)
# round-trip through JSON to turn the nested OrderedDicts into plain dicts
json1 = json.dumps(xpars)
d = json.loads(json1)
lst = d['GoodreadsResponse']['search']['results']['work']
# publication year of every search result
ys = [int(work['original_publication_year']['#text']) for work in lst]
# keep the most recently published result
latest = lst[np.argmax(ys)]
title = latest['best_book']['title']
author = latest['best_book']['author']['name']
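The snippet above assumes that the search returns several results and that each one has a publication year. A more defensive version of the "most recent result" step might look like the following sketch. It works on the already parsed dictionary, the function name most_recent_work is made up, and the way a missing year is represented (an entry without a '#text' key) is an assumption about the response rather than something documented here.

```python
def most_recent_work(parsed):
    """Return (title, author) of the most recently published result, or None."""
    results = parsed["GoodreadsResponse"]["search"]["results"]
    if not results:  # an empty <results/> element is parsed as None
        return None
    works = results["work"]
    if isinstance(works, dict):  # xmltodict gives a dict, not a list, for one <work>
        works = [works]

    def year(work):
        raw = work.get("original_publication_year")
        text = raw.get("#text") if isinstance(raw, dict) else raw
        # assumed: a missing year has no text; rank such results last
        return int(text) if text else float("-inf")

    best = max(works, key=year)["best_book"]
    return best["title"], best["author"]["name"]
```

Ranking undated results last is a design choice: it keeps them as a fallback when nothing else matches instead of raising an error halfway through a long batch of lookups.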

This method allows us to associate ngrams with possible books. We then check the list of books returned by the Goodreads API for all ngrams against the full text data. Before performing the actual check we trim the book names, eliminating punctuation (particularly semicolons) and subtitles. We only consider the main title, on the assumption that most of the time only this part of the title would be used (some of the full titles in the list are actually really long!). Ranking the results by number of occurrences in the thread, we get this list:
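The trimming and counting step can be sketched as follows. The exact rules used for cutting subtitles are not given, so splitting at the first colon, semicolon or opening parenthesis is an assumption, as are the helper names and the two sample posts, which are invented for illustration.

```python
import re
from collections import Counter


def normalise(text):
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-zA-Z0-9]+", " ", text).lower().split())


def main_title(full_title):
    """Keep the part of a title before the first colon, semicolon or parenthesis."""
    return normalise(re.split(r"[:;(]", full_title, maxsplit=1)[0])


def rank_titles(full_titles, posts):
    """Rank titles by how often their main title occurs in the posts."""
    # pad with spaces so that only whole-word matches are counted
    cleaned = [f" {normalise(post)} " for post in posts]
    counts = Counter()
    for full in full_titles:
        key = f" {main_title(full)} "
        counts[full] = sum(post.count(key) for post in cleaned)
    return counts.most_common()


# made-up posts, only to show the mechanics
posts = ["Bad Blood was gripping.", "I loved Bad Blood, and The Magicians too."]
titles = ["Bad Blood: Secrets and Lies in a Silicon Valley Startup", "The Magicians"]
print(rank_titles(titles, posts))
# [('Bad Blood: Secrets and Lies in a Silicon Valley Startup', 2), ('The Magicians', 1)]
```

Matching on whole words is a deliberate choice in this sketch: it keeps a short title from being counted every time it happens to appear inside a longer word.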

Books with more than 3 counts in the thread

So Bad Blood looks to be the most recommended book in the thread. Checking the other results, most of them seem to make sense and match the thread, including the counts. The only big mistake I could spot in the list is at position number 2, where the book Magi was identified instead of The Magicians by Lev Grossman. The latter is indeed cited 7 times in the text. This error is caused by the assumption we
