I worked on a Security Layer for the NutchServer of Apache Nutch as my GSoC 2016 project and have now finished it. In this blog post, I’ll explain how it works and how to use it. First of all, if you haven’t already, I suggest reading my previous post about my GSoC 2016 acceptance: http://furkankamaci.com/gsoc-2016-acceptance-for-apache-nutch/
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:
Nutch 1.x: A well-matured, production-ready crawler. 1.x enables fine-grained configuration, relying on Apache Hadoop data structures, which are great for batch processing.
Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.
Nutch 2.x had a REST API, but it had no security layer on it. I’ve implemented Basic Authentication, Digest Authentication and SSL support as authentication mechanisms, as well as fine-grained authorization support, in NutchServer.
To enable security on your NutchServer API, follow these steps:
Enable security in nutch-site.xml by setting the restapi.auth property to BASIC, DIGEST or SSL. NONE is the default and provides no security.

Set the restapi.auth.users property if you have selected BASIC or DIGEST as the authentication type. The username, password and role of each user should be delimited by the pipe character (|), and users should be separated by the comma character (,), e.g. admin|admin|admin,user|user|user. The default is admin|admin|admin,user|user|user.

Set the restapi.auth.ssl.storepath, restapi.auth.ssl.storepass and restapi.auth.ssl.keypass properties if you have selected SSL as the authentication mode in the restapi.auth property.

The sections below show how your client code can connect to the NutchServer API under each mechanism.
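To make the restapi.auth.users format more concrete, here is a minimal sketch of how such a value could be split into individual users. This is not NutchServer's actual parsing code; the names AuthUsersSketch, User and parseUsers are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: it only illustrates the
// username|password|role,username|password|role format described above.
final class AuthUsersSketch {

    // One username|password|role entry.
    static final class User {
        final String name;
        final String password;
        final String role;

        User(String name, String password, String role) {
            this.name = name;
            this.password = password;
            this.role = role;
        }
    }

    // Splits a restapi.auth.users style value into User objects.
    static List<User> parseUsers(String value) {
        List<User> users = new ArrayList<>();
        for (String entry : value.split(",")) {
            // The pipe is a regex metacharacter, so it has to be escaped.
            String[] parts = entry.trim().split("\\|");
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                        "Expected username|password|role but got: " + entry);
            }
            users.add(new User(parts[0], parts[1], parts[2]));
        }
        return users;
    }

    public static void main(String[] args) {
        // The default value of restapi.auth.users mentioned above.
        for (User user : parseUsers("admin|admin|admin,user|user|user")) {
            System.out.println(user.name + " has role " + user.role);
        }
    }
}
```

Run against the default value, the sketch yields two users, admin and user, each with a role of the same name. One consequence of this format is that a username, password or role cannot itself contain a pipe or a comma.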
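1. Basic Authentication

    import org.restlet.data.ChallengeScheme;
    import org.restlet.resource.ClientResource;
    import org.restlet.resource.ResourceException;

    // protocol, domain, port, path, username and password are your own connection values;
    // challengeScheme is ChallengeScheme.HTTP_BASIC for Basic Authentication.
    ClientResource resource = new ClientResource(protocol + "://" + domain + ":" + port + path);
    resource.setChallengeResponse(challengeScheme, username, password);
    try {
        resource.get();
    } catch (ResourceException rex) {
        // The request failed, e.g. with 401 Unauthorized; handle it here.
    }

2. Digest Authentication

Use the same code as in step 1 and add the following after it: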
    import org.restlet.data.ChallengeRequest;
    import org.restlet.data.ChallengeResponse;
    import org.restlet.data.ChallengeScheme;

    // Use the server's data to complete the challengeResponse object
    ChallengeRequest digestChallengeRequest = retrieveDigestChallengeRequest(resource);
    ChallengeResponse challengeResponse = new ChallengeResponse(digestChallengeRequest,
            resource.getResponse(), username, password.toCharArray());
    resource.setChallengeResponse(challengeResponse);
    try {
        resource.get();
    } catch (ResourceException rex) {
        // The authenticated request failed; handle it here.
    }
    ...
    // Returns the server's digest challenge, or null if the server did not send one.
    private ChallengeRequest retrieveDigestChallengeRequest(ClientResource resource) {
        ChallengeRequest digestChallengeRequest = null;
        for (ChallengeRequest cr : resource.getChallengeRequests()) {
            if (ChallengeScheme.HTTP_DIGEST.equals(cr.getScheme())) {
                digestChallengeRequest = cr;
                break;
            }
        }
        return digestChallengeRequest;
    }

3. SSL

Follow the same procedure as for Basic Authentication, but do not forget to add the SSL certificate to your trust store.
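For the SSL case, one possible way to make the client trust the server's certificate is to point the JVM at a trust store that contains it. The sketch below assumes your client connector relies on the JVM's default SSL settings and that the certificate has already been imported into a trust store (for example with `keytool -importcert -alias nutchserver -file nutchserver.crt -keystore truststore.jks`, where the alias and file names are placeholders). The class and method names are hypothetical; only the two `javax.net.ssl.*` system properties are standard.

```java
// Hypothetical helper, not part of Nutch or Restlet.
final class TrustStoreSketch {

    // Points the JVM's default SSL configuration at a custom trust store.
    // Call this before the first HTTPS connection is opened, because the
    // default SSL context is created once and then reused.
    static void useTrustStore(String path, String password) {
        System.setProperty("javax.net.ssl.trustStore", path);
        System.setProperty("javax.net.ssl.trustStorePassword", password);
    }

    public static void main(String[] args) {
        // Placeholder path and password; replace them with your own values.
        useTrustStore("/path/to/truststore.jks", "changeit");
        System.out.println("Trust store: " + System.getProperty("javax.net.ssl.trustStore"));
    }
}
```

The same properties can also be passed on the command line with `-Djavax.net.ssl.trustStore=...` and `-Djavax.net.ssl.trustStorePassword=...`, which keeps the path and password out of the code.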
NutchServer provides access to many functions through its REST API. Implementing authentication and authorization lets users communicate with it in a secure way.