User Profile Store: Advanced Data Modeling Part 1

Virtually all organizations have a need to record and recall user data. Frequently, this user data is permanent, such as user accounts with defined settings, histories and preferences. Equally frequently, the data is transient, as in the case of anonymous sessions with only recent activity. In both cases, the organization will require a user profile store/session store to persist and query the data. As this type of data underpins many applications within most organizations, it is imperative that the profile store be performant, available, and scalable. This post explores such a system.

We’ll look at a user profile store service (REST API) that uses Couchbase as the database, with a data model that plays to the strengths of Couchbase. We will start with all of the data in one document, then look at how and why we might normalize it into separate documents. Throughout, the goal is to choose what works best for the user profile service and makes the most of Couchbase Server.

Defining the User Profile Service

Let’s start by defining the example service and its parameters. Here are the design parameters I gave myself for this project:

It shall be the authoritative profile store for all of a fictitious company’s customer facing micro-services and expose a REST API to support CRUD operations against the data.

The API shall be versioned with functionality retained and deprecated according to the design document.

Uptime target is 99.999% or better.

It shall be capable of keeping user profiles for X years after their last login. Preferably, the database shall manage the deletion of the account data, with the application setting a time to live (TTL) on each user's objects.

The service should be able to perform a minimum of 50,000 authentications and 10,000 main user profile lookups per second on relatively modest AWS instance sizes.
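The retention requirement above can be made a little more concrete with a small sketch. It only computes the point in time at which a profile would expire, X years after the last login; the class and method names are hypothetical. As far as I know, Couchbase treats expiry values larger than 30 days' worth of seconds as absolute Unix timestamps rather than relative offsets, so a multi-year retention would be expressed as an absolute time, but confirm this against your SDK's documentation.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper: works out when a user's documents should expire.
final class ProfileExpiry {

    // Returns the expiry as seconds since the Unix epoch,
    // retentionYears after the user's last login (calendar years, UTC).
    static long expiryEpochSeconds(Instant lastLogin, int retentionYears) {
        return lastLogin.atOffset(ZoneOffset.UTC)
                .plusYears(retentionYears)
                .toEpochSecond();
    }

    public static void main(String[] args) {
        Instant lastLogin = Instant.parse("2016-08-01T15:03:40Z");
        long expiry = expiryEpochSeconds(lastLogin, 2); // 2 is an example value for X
        System.out.println(Instant.ofEpochSecond(expiry));
    }
}
```

The application would recompute this value on every successful login and apply it to each of the user's documents, so that the clock restarts whenever the user comes back.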

Defining the Services Exposed by the REST API

It is necessary to start by mapping how the data will be used: which interfaces my service will expose, and which of them will be used the most and the least. This will help direct how best to design my documents. Couchbase allows me to design a document schema based on how, when, and how frequently the application will use the data, not on how the database will store it, as you might be forced to in an RDBMS.

Here are a few example methods one might have for this service. I am not going to list them all and waste time, but you get the idea.

getUserProfile(userID) - This gets all of the objects in the DB for the user. It will read in an array of the established key pattern object name (e.g. ) and then do a bulk get operation on all the user’s objects.

getSecurityQuestions(userID) - Checks the login-info document that the user is enabled and if so, returns the user’s security question document. If the user is disabled, it will return an error.

setSecurityQuestions(userID) - writes the user’s security question document.

authorizeUser(userID, passwordHash, IP) - gets the user’s authentication document, calls the isEnabled function and, if the account is enabled, checks to see if the passwords match. If they match, it updates the document with the IP and last-login.

isEnabled(userID) - checks the user's login-info document to see if the account is enabled or not in the system. It will return true or false.

Look for a full list of the methods in the later blog post that discusses the application development and sample code.
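To make the flow of isEnabled and authorizeUser more concrete, here is a minimal sketch. It is not the service's real code: a HashMap stands in for the Couchbase bucket, and the class name, the putLoginInfo and field helpers, and the extra timestamp parameter are assumptions made for illustration. The field names (pword, enabled, loc, lastlogin) and the login-info:: key prefix are the ones used in this post.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: an in-memory map stands in for the Couchbase bucket.
final class ProfileServiceSketch {
    private final Map<String, Map<String, Object>> store = new HashMap<>();

    // Builds the login-info key for a user from the key pattern.
    static String loginInfoKey(String userID) {
        return "login-info::" + userID;
    }

    // Hypothetical helper to seed a login-info document.
    void putLoginInfo(String userID, String passwordHash, boolean enabled) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("pword", passwordHash);
        doc.put("enabled", enabled);
        doc.put("username", userID);
        store.put(loginInfoKey(userID), doc);
    }

    // True only if the login-info document exists and is marked enabled.
    boolean isEnabled(String userID) {
        Map<String, Object> doc = store.get(loginInfoKey(userID));
        return doc != null && Boolean.TRUE.equals(doc.get("enabled"));
    }

    // Checks the enabled flag, compares the hashes, and on success
    // records the caller's IP and the login time in the document.
    boolean authorizeUser(String userID, String passwordHash, String ip, String timestamp) {
        if (!isEnabled(userID)) {
            return false;
        }
        Map<String, Object> doc = store.get(loginInfoKey(userID));
        Object stored = doc.get("pword");
        if (!(stored instanceof String)) {
            return false;
        }
        // Compare via MessageDigest.isEqual rather than String.equals.
        boolean matches = MessageDigest.isEqual(
                ((String) stored).getBytes(StandardCharsets.UTF_8),
                passwordHash.getBytes(StandardCharsets.UTF_8));
        if (!matches) {
            return false;
        }
        doc.put("loc", ip);
        doc.put("lastlogin", timestamp);
        return true;
    }

    // Hypothetical helper to read one field back, or null if absent.
    Object field(String userID, String name) {
        Map<String, Object> doc = store.get(loginInfoKey(userID));
        return doc == null ? null : doc.get(name);
    }

    public static void main(String[] args) {
        ProfileServiceSketch svc = new ProfileServiceSketch();
        svc.putLoginInfo("hernandez94", "app-hashed-password", true);
        System.out.println(svc.authorizeUser(
                "hernandez94", "app-hashed-password", "10.0.0.5", "2016-08-01 15:03:40"));
    }
}
```

Note that every step works from a single small document that is found by its key, which is the property the rest of the data model is designed around.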

The Initial Main User Document

Here is an example of a user profile document I might start with that captures all of the data that my service will need and this would be a perfectly valid way of doing this.

key : hernandez94
{
  "username" : "hernandez94",
  "firstName" : "Jennifer",
  "middleName" : "Maria",
  "lastName" : "Hernandez",
  "addresses" : [
    { "type" : "home", "addr1" : "1929 Crisanto Ave", "addr2" : "Apt 123", "addr3" : "c/o J. Hernandez", "city" : "Mountain View", "state" : "CA", "country" : "USA", "pcode" : "94040" },
    { "type" : "work", "addr1" : "2700 W El Camino Real", "addr2" : "Suite #123", "city" : "Mountain View", "state" : "CA", "country" : "USA", "pcode" : "94040" },
    { "type" : "billing", "addr1" : "P.O. Box 123456", "city" : "Mountain View", "state" : "CA", "country" : "USA", "pcode" : "94041" }
  ],
  "emails" : [
    { "type" : "work", "addr" : "work@email.com" },
    { "type" : "personal", "addr" : "personal@email.com", "primary" : true }
  ],
  "phones" : [
    { "type" : "work", "num" : "+12345678900" },
    { "type" : "mobile", "num" : "+12345678901" }
  ],
  "createdate" : "2014101454",
  "lastlogin" : "a-date-goes-here",
  "pword" : "app-hashed-password",
  "loc" : "IP or fqdn",
  "enabled" : true,
  "sec-questions" : [
    { "question1" : "Security question 1 goes here", "answer" : "Answer to security question 1 goes here" },
    { "question2" : "Security question 2 goes here", "answer" : "Answer to security question 2 goes here" },
    { "question3" : "Security question 3 goes here", "answer" : "Answer to security question 3 goes here" }
  ],
  "sec-roles" : [101, 301, 345],
  "type" : "user"
}

This document is about 1.41 KB, and not all of the data is even filled out yet, so it could get larger depending on the user. If you are, or work with, a developer who likes to keep adding data to a document or table, it could grow a lot over the evolution of your application or service. Let’s look at optimizing this document for how and when the application will use this data.

A Quick Word on Data Access in Couchbase

For the main parts of this service I need the power and performance of accessing my data via key and querying for data as little as possible. Without rehashing my other blog post on data access too much, Couchbase has multiple ways of getting at data. If you provide the ObjectID of the object(s) you want, your application can read and write those objects exceptionally fast from the managed cache of the Data Service.

If you want to query using Couchbase’s N1QL language (~98% compliant with ANSI 92 SQL), that means you have a question of the data, and that can be done as well through the Query Service. By its nature of having to do more work, querying is less performant than working with data by ObjectID. It is the difference between having a question (query) and already knowing the answer (ObjectID). (For more information on why I say this, refer to my other blog post, "How Functional and Performance Needs Determine Data Access in Couchbase".)

For this schema design, we will make sure that our objects can be accessed efficiently by both means when needed. The “hot paths” of the application will be accessed by ObjectID as much as we can, but we will query where that is needed as well. We have to meet the requirements laid out earlier in this post.

Optimizing the Main User Document

In designing a data model for this service, we have some options available that are somewhat unique to Couchbase and how it scales, performs and manages memory. Since all objects are eligible to be kept in the managed cache of the Data Service, we want the document data model and key pattern used by our service to be flexible enough so that the data we need frequently is in the managed cache and the data used infrequently may just be on disk, which is ok. Other NoSQL document databases might have you put everything in one document and then query that data. Querying with N1QL is available in Couchbase too, and we will be using that for targeted purposes. Again, Couchbase’s architecture with its flexible data model and built-in managed cache affords us the ability to pick and choose where/how we want to utilize this power to our advantage.

I will use a key pattern and a normalized data model, where it makes sense. By doing this, we are able to read and write only the exact data we need, when we need it and how we need it, but when we need to query data with N1QL, we can do that too. Some of these design patterns may look odd to you, but this is how you will be able to easily run and scale Couchbase to achieve tens or hundreds of thousands of operations a second, or even more. In later posts, I will go over how to query some of this data, for example for analytics.

Again, I want to try and design the documents around the application getting only the data it needs, exactly when it needs it and reading it with the key and not querying. No need for indexes, views, joins, etc. That is how you will get the absolute best performance and scalability with Couchbase.

The Login-Info Document

I broke out the user’s login information into its own document for a few reasons, and I will apply the same philosophy to other documents later on.

The login information is needed quite often, as users authenticate a lot in our application, and when the service does need this data, it usually doesn’t need the rest of the user’s information. Yes, Couchbase has sub-document editing, but like most NoSQL databases, on the back end the document is still reassembled and transmitted in full, e.g. for inter-cluster replication (XDCR). I prefer to avoid this where I can.

I’d rather the service pull back from the DB to the application a document that is 141 bytes than the full 1.41 KB or more of the main document. Sub-document editing can mitigate this, but not the next two reasons.

For performance, I’d rather keep a ton of 141-byte documents in the managed cache instead of the larger ones when, again, I do not always need the full document. Keeping the whole document in cache may not matter much when I have 100,000 users and am only doing 1,000 ops a second, but when I have millions of users and need to do tens if not hundreds of thousands of ops a second while keeping performance up, it adds up, both in performance and in monetary resources, namely RAM.

Doing this also means that for my service’s regular visitors, their login-info document should always be in the managed cache, so they get the best performance.

I also want to update this document’s lastlogin value with the current date, and record the IP they came in from, each time the user successfully logs in. This is not tracked by Couchbase automatically, but I need it for business analysis purposes and to delete users who have not logged in for a long time. In this case, I am only updating a very small document as well.

Since it is not critical to the daily operations of authenticating users, we will save the requirement of deleting users that have not logged into the system in X years for another blog post.

There is also an entity in the document called "enabled" that can be used in the service to see if the user is even allowed to log in or not.

key : login-info::hernandez94
{
  "lastlogin" : "2016-08-01 15:03:40",
  "pword" : "app-hashed-password",
  "loc" : "IP or fqdn",
  "enabled" : true,
  "type" : "login-info",
  "username" : "hernandez94"
}

A critical part here is the key for the document. This kind of key pattern is something that is consistent in my application, as you’ll see later. I can easily have the application concatenate the "login-info::" prefix with the unique username. That key is important because I will not have to query for this document in Couchbase; I already have the key. Knowing the key means the Couchbase SDK already knows the answer to the question “Where is this piece of data I am looking for?”, whereas querying a database is asking that question of the DB and using the DB’s horsepower (and all that goes with that) to go find the answer and bring back the data. So knowing the key and telling the database to read/write that object, especially if it is already in the managed cache, will be extremely fast with Couchbase. Even if the object is on disk and not in the managed cache, it will be fast, and the object is then placed in the managed cache to be ready for follow-up reads. That cache management happens behind the scenes and is not something your application has to know about.
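As a sketch of what that concatenation might look like in the application, here is a small helper that builds each key from the username. The class and method names are hypothetical; the prefixes and the "::" separator are the ones used for the documents in this post.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that centralizes the key pattern in one place.
final class ProfileKeys {
    private static final String SEP = "::";

    private ProfileKeys() {
    }

    // The main user document is keyed by the bare username.
    static String profile(String username) {
        return username;
    }

    static String loginInfo(String username) {
        return "login-info" + SEP + username;
    }

    static String secQuestions(String username) {
        return "sec-questions" + SEP + username;
    }

    static String userSecRoles(String username) {
        return "user-sec-roles" + SEP + username;
    }

    // Every key belonging to one user, ready for a bulk get.
    static List<String> allFor(String username) {
        return Arrays.asList(
                profile(username),
                secQuestions(username),
                loginInfo(username),
                userSecRoles(username));
    }

    public static void main(String[] args) {
        System.out.println(allFor("hernandez94"));
    }
}
```

Keeping the pattern in one helper means no other part of the service has to know how keys are assembled, which makes the pattern easier to keep consistent as more document types are added.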

I was also thinking of breaking out the "enabled" entity of the JSON document into its own key/value pair object since Couchbase supports this. I was thinking that there might be quite a few places where I might just need to know if the user is enabled in the system or not. While possible, I decided not to because I was already starting to normalize the data a lot and I have to stop somewhere. This is one where for complexity's sake, it made sense to let it be.

One other thing to notice in this new document is that I added the doc-type and username entities to the document schema. You’ll see these in all of the other documents as well. The doc-type is there so I can do secondary indexing on it later. The username is there for indexing too, but also so I can do JOINs in N1QL between any of these documents and the new main document we end up with. In later parts of this series, you’ll see more about why that matters and how we use it to our advantage.

To recap, we broke out this chunk of information from the main user document into a smaller document for quite a few good reasons. We need the document often, we keep it small for efficient reads/writes and so it should be kept in the managed cache for the bulk of the users we care about. On top of all of that, because we can get the object by its key, we do not have to query the database so lookups are very fast. We also added values to the JSON document for easier querying later on.
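To put rough numbers on the cache argument, here is a back-of-the-envelope calculation. The user count of 10 million is a hypothetical figure chosen for illustration, 1.41 KB is taken as 1,410 bytes, and the estimate deliberately ignores keys, per-document metadata, and replicas, so treat it as a sketch of the ratio rather than a sizing guide.

```java
// Back-of-the-envelope sizing; ignores keys, per-document metadata, and replicas.
final class CacheFootprint {

    // Total bytes needed to hold one document of the given size per user.
    static long totalBytes(long users, long bytesPerDoc) {
        return users * bytesPerDoc;
    }

    // Converts bytes to decimal gigabytes (1 GB = 1,000,000,000 bytes).
    static double toGigabytes(long bytes) {
        return bytes / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        long users = 10_000_000L;                 // hypothetical user count
        long loginOnly = totalBytes(users, 141);   // login-info documents only
        long fullDocs = totalBytes(users, 1_410);  // original single document
        System.out.printf("login-info only: %.2f GB%n", toGigabytes(loginOnly));
        System.out.printf("full documents:  %.2f GB%n", toGigabytes(fullDocs));
    }
}
```

Under these assumptions, keeping every user's login-info document resident takes about 1.41 GB, against about 14.1 GB for the full documents, a tenfold difference in RAM for the hottest path in the service.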

Security Questions Document

I created the security questions document because, unlike the login-info document that is needed very often, this data should hopefully be needed very rarely. When the service does need this information, it specifically does not need any of the rest of the user’s information. Therefore, it made sense to break it out into its own object with a key that, again, the application can easily concatenate to get the document by key.

key : sec-questions::hernandez94
{
  "sec-questions" : [
    { "question1" : "Security question 1 goes here", "answer" : "Answer to security question 1 goes here" },
    { "question2" : "Security question 2 goes here", "answer" : "Answer to security question 2 goes here" },
    { "question3" : "Security question 3 goes here", "answer" : "Answer to security question 3 goes here" }
  ],
  "doc-type" : "sec-questions",
  "username" : "hernandez94"
}

In addition, because the document is not used very often it is possible this document could fall out of the managed cache and I do not care. It is used so infrequently by users, if there is a minor performance hit to go to disk for it, suck it up user! I am not paying to keep all this data in RAM because you cannot remember something about your user account.

User’s Security Roles Document

I wanted to have a list of security roles associated with a user. Like lots of other things, when I need this information, I do not need other info about the user. I will probably have another document somewhere defining the security roles these map to.

key : user-sec-roles::hernandez94
{
  "sec-roles" : [101, 301, 345],
  "doc-type" : "user-roles",
  "username" : "hernandez94"
}

Security Roles Document

I will have another set of documents defining the security roles these map to.

key : sec-roles::101
{
  "name" : "Administrator",
  "descr" : "Administrators of the service",
  "doc-type" : "sec-role"
}

This is an interesting one, as it could get me into trouble later on. Because this document could be read quite often, I might create a hot spot depending on how often my application uses it. I have seen one customer that kept their application configuration in Couchbase in one document, but read it 10,000+ times per second. While a single node of Couchbase can handle that easily, this was on top of all their other traffic, so the one node of the cluster that had this document was overwhelmed. This should not be a problem here, but it is something to watch. I will probably follow up later with a post on how one might fix this, but 99 times out of 100, this should not be a problem. I have only ever seen a hot spot in Couchbase this one time, and it really was an app design problem that caused it.
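One possible mitigation, offered here only as a sketch and not necessarily the fix a follow-up post would propose, is to keep rarely changing documents such as role definitions in a small in-process cache with a short time to live, so that most reads never reach the cluster. Everything in the example is hypothetical: the class name, the loader function standing in for a get by key, and the injectable clock that keeps the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical read-through cache with a per-entry time to live.
final class RoleCacheSketch {

    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final Function<String, String> loader; // stands in for a get by key
    private final LongSupplier clock;              // current time in milliseconds
    private final long ttlMillis;
    private int loads = 0;

    RoleCacheSketch(Function<String, String> loader, LongSupplier clock, long ttlMillis) {
        this.loader = loader;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached value, reloading it only if absent or expired.
    synchronized String get(String key) {
        long now = clock.getAsLong();
        Entry entry = entries.get(key);
        if (entry == null || now >= entry.expiresAtMillis) {
            entry = new Entry(loader.apply(key), now + ttlMillis);
            entries.put(key, entry);
            loads++;
        }
        return entry.value;
    }

    // Number of times the loader was actually called.
    synchronized int loads() {
        return loads;
    }

    public static void main(String[] args) {
        RoleCacheSketch cache = new RoleCacheSketch(
                key -> "{\"name\":\"Administrator\"}", System::currentTimeMillis, 60_000);
        cache.get("sec-roles::101");
        cache.get("sec-roles::101");
        System.out.println("loader calls: " + cache.loads());
    }
}
```

The trade-off is staleness: a change to a role document is not seen by an application instance until its cached copy expires, so the time to live should be chosen with that in mind.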

Email, Addresses, Phones Documents

You could separate the emails or addresses array out from the original document, depending on how your application uses the data. Perhaps we have a getEmail() or getAddress() method that is exposed as part of this service. The getEmail() method would take the unique username as input and would be able to easily construct the key (email-addr::hernandez94) for the document. As with normalizing in any database, you can go too far, so be careful and only do it if you have a solid reason to.

The New Main Document

Here is the main document now. If my application’s usage patterns dictated it, I would break out things like addresses and phones, but for example purposes I decided to leave them in, as all the data I left in here is data I either do not need all that often or need at the same time. Could I normalize more? Yes. For example, I could break out the createdate into a key/value object, since I want to record it but will very rarely need it. For the moment, though, this gets me where I need to be, and I do not want to go too crazy. Also, the more documents you have, the more metadata you have as well.

key : hernandez94
{
  "firstName" : "Jennifer",
  "middleName" : "Maria",
  "lastName" : "Hernandez",
  "addresses" : [
    { "type" : "home", "addr1" : "1929 Crisanto Ave", "addr2" : "Apt 123", "addr3" : "c/o J. Hernandez", "city" : "Mountain View", "state" : "CA", "country" : "USA", "pcode" : "94040" },
    { "type" : "work", "addr1" : "2700 W El Camino Real", "addr2" : "Suite #123", "city" : "Mountain View", "state" : "CA", "country" : "USA", "pcode" : "94040" },
    { "type" : "billing", "addr1" : "P.O. Box 123456", "city" : "Mountain View", "state" : "CA", "country" : "USA", "pcode" : "94041" }
  ],
  "phones" : [
    { "type" : "work", "num" : "+12345678900" },
    { "type" : "mobile", "num" : "+12345678901" }
  ],
  "createdate" : "2016-08-01 15:03:40",
  "type" : "user"
}

Sample Java Code

This is a Java code example of a bulk get operation to get all of the documents associated with the user:

This allows you to get all of the documents that are associated with this user very easily and very quickly. It is kind of like putting different pieces of data into different tables in a relational database and then doing a join to get all of the information, except this would be MUCH faster than that.

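import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import java.util.List;
import rx.Observable;
import rx.functions.Func1;

// Connect to the cluster and open the default bucket.
Cluster cluster = CouchbaseCluster.create();
final Bucket bucket = cluster.openBucket();

// userID is the unique username passed in to the service method.
// Build each key from the key pattern, fetch all of them asynchronously,
// then block until every document that was found has been collected.
List<JsonDocument> foundDocs = Observable
    .just(
        userID,
        "sec-questions::" + userID,
        "login-info::" + userID,
        "user-sec-roles::" + userID)
    .flatMap(new Func1<String, Observable<JsonDocument>>() {
        @Override
        public Observable<JsonDocument> call(String id) {
            // A key with no document emits nothing, so it is simply
            // absent from the resulting list.
            return bucket.async().get(id);
        }
    })
    .toList()
    .toBlocking()
    .single();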

I might think about modifying the code to make sure that I had all of the documents when I was done with this call and if not do a replica read for the missing documents, in case a node is down, but you get the idea.
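The completeness check described above could start with something as simple as comparing the keys that were requested against the IDs that came back. The sketch below covers only that comparison; the class and method names are hypothetical, and with the SDK the found IDs would be collected by calling id() on each returned document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: works out which requested keys came back empty.
final class BulkGetCheck {

    // Returns the requested keys, in request order, that are not in foundIds.
    static List<String> missingKeys(List<String> requested, Set<String> foundIds) {
        List<String> missing = new ArrayList<>();
        for (String key : requested) {
            if (!foundIds.contains(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList(
                "hernandez94",
                "sec-questions::hernandez94",
                "login-info::hernandez94",
                "user-sec-roles::hernandez94");
        // Pretend the login-info document did not come back.
        Set<String> found = new HashSet<>(Arrays.asList(
                "hernandez94",
                "sec-questions::hernandez94",
                "user-sec-roles::hernandez94"));
        // These are the keys a replica read could then be attempted for.
        System.out.println(missingKeys(requested, found));
    }
}
```

A missing key does not necessarily mean a node is down, since the document may simply not exist for that user, so the caller still has to decide which documents are mandatory.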

For more information on how to do bulk gets, please see your respective Couchbase SDK documentation.

Summary

Couchbase affords you a different way of modeling your data as opposed to other document databases and especially relational databases. By breaking out some data into different documents, but related by information in the key and document, we can get the exact data we need for the application when we need it. We keep exactly the data we want fresh in the managed cache that is used often, but if we need all of the user’s documents, we can do a bulk operation on them all very quickly. The final important part is the key pattern used here. Again, accessing objects by key is the difference between already knowing the answer and having to ask a question like in other databases.

All of this talk of performance and key patterns ignores the times when we actually do want to query the data, when we have questions like “How many users have a home address postal code of 98038?” As I said before, we will save that for another blog post in this series.
