Data-Ink Maximization is the concept of making every keystroke count (including the delete character), popularized by Edward Tufte. One famous example of this is how he redesigned the scatterplot into what is known as a rugplot.
Simplify, then minimize. Add lines, that’s key.

So, let’s try visualizing time-dependent graphs with Tufte’s inspiration, with a twist. Let’s visualize rotating infrastructures. That is, let’s capture new hostnames (for example mail.google.com) that are resolving to a hosting IP from hour to hour.
One additional restriction is to find a solution using Matplotlib and NetworkX. Maybe we can write something quickly. Pasted below is source code to do this yourself.

One Hosting IP
Given two sets of fictitious hosting IPs and the hostnames they host, one at one hour and one at the next, we can build a graph for each time. The challenge is to visualize the evolution. In other words, the challenge is to compare two graphs that are time-dependent.
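To make the setup concrete, here is a minimal sketch of building one NetworkX graph per hour. The IP labels, the hostnames, and the `build_graph` helper are made up for illustration; they are not taken from the post's source code.

```python
import networkx as nx

# Fictitious (hosting IP, hostname) pairs for two consecutive hours.
# All labels and hostnames are assumptions made up for this sketch.
hour1 = [("A", "a1.example"), ("A", "a2.example"), ("H", "h1.example")]
hour2 = [
    ("A", "a1.example"), ("A", "a2.example"),
    ("A", "x1.example"), ("A", "x2.example"),
    ("H", "h1.example"), ("H", "x1.example"), ("H", "x2.example"),
]

def build_graph(pairs):
    """Build an undirected graph linking each hosting IP to its hostnames."""
    g = nx.Graph()
    for ip, hostname in pairs:
        g.add_node(ip, kind="ip")              # tag node types so they can be styled later
        g.add_node(hostname, kind="hostname")
        g.add_edge(ip, hostname)
    return g

g1 = build_graph(hour1)  # graph for the first hour
g2 = build_graph(hour2)  # graph for the second hour
print(g1.number_of_nodes(), g1.number_of_edges())
print(g2.number_of_nodes(), g2.number_of_edges())
```

Keeping one graph per hour, rather than one merged graph, is what lets us draw them side by side and connect the same node across both.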
Here’s our simple answer: draw lines from one hour to the next. Draw a line from hosting IP A to A between the time windows. Below is an example of doing just this:
FIGURE 1: Following hosting IP A from one hour to the next.
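A figure along these lines can be sketched with two Matplotlib subplots and a `ConnectionPatch` joining the same node in both. This is only one way to do it, not necessarily how the figure above was produced; the toy data, the `trace` helper, and the output filename `guidelines.png` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for an interactive window
import matplotlib.pyplot as plt
from matplotlib.patches import ConnectionPatch
import networkx as nx

# Toy graphs (made up): hosting IP "A" resolves more hostnames in hour 2.
g1 = nx.Graph([("A", "a1.example"), ("A", "a2.example")])
g2 = nx.Graph([("A", "a1.example"), ("A", "a2.example"),
               ("A", "a3.example"), ("A", "a4.example")])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
pos1 = nx.spring_layout(g1, seed=1)  # fixed seed keeps the layout reproducible
pos2 = nx.spring_layout(g2, seed=1)
nx.draw_networkx(g1, pos=pos1, ax=ax1, node_size=80, font_size=7)
nx.draw_networkx(g2, pos=pos2, ax=ax2, node_size=80, font_size=7)
ax1.set_title("hour 1")
ax2.set_title("hour 2")

def trace(node):
    """Draw a guideline from `node` in the left plot to the same node in the right plot."""
    line = ConnectionPatch(
        xyA=tuple(pos1[node]), coordsA="data", axesA=ax1,
        xyB=tuple(pos2[node]), coordsB="data", axesB=ax2,
        color="red", linewidth=1,
    )
    fig.add_artist(line)  # figure-level artist, so it can span both axes
    return line

# Only nodes present in both hours can be traced.
guides = [trace(n) for n in ["A"] if n in g1 and n in g2]
fig.savefig("guidelines.png", dpi=150)
```

Adding the line to the figure rather than to either axes keeps it from being clipped at the subplot boundary.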
With the guideline following hosting IP A from hour to hour, we see the density of connected hostnames begin to vary. This variation is due to A resolving more hostnames in the second hour.

From a security perspective, an increase in the number of hostnames resolving on a hosting IP may indicate malicious or unintended behavior. For example, if we assume a hosting IP resolves a constant number of hostnames from one hour to the next (obviously a huge assumption), the increase in the number of hostnames resolving may be due to an IP starting to host a series of exploit kit or phishing domains.

Multiple Hosting IPs
Our next example just builds on the first by overlaying more lines. Notice how the lines begin to convey a certain amount of information about the complexity and density of the clusters in the graph.
FIGURE 2: Following hosting IPs A, B, C, H, and S from one hour to the next.
By increasing the number of guidelines, we are now tracing multiple hosting IPs from one hour to the next. We can compare the density of the connected hostnames per hosting IP. In addition, we can begin to identify any connections from hosting IP to hostname to hosting IP.
That is, hosting IPs A and H had nothing in common in the first hour, while in the second hour they had two hostnames in common. With the guidelines we can quickly re-trace two time-dependent subgraphs and map their evolution.
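The same comparison can be made numerically. The sketch below counts hostnames per hosting IP and finds the hostnames two IPs share in each hour; the data and the helper names `hostname_counts` and `shared_hostnames` are hypothetical, chosen to mirror the A and H situation described above.

```python
import networkx as nx

# Toy graphs (made up): A and H share nothing in hour 1, two hostnames in hour 2.
g1 = nx.Graph([("A", "a1.example"), ("A", "a2.example"), ("H", "h1.example")])
g2 = nx.Graph([
    ("A", "a1.example"), ("A", "a2.example"),
    ("A", "x1.example"), ("A", "x2.example"),
    ("H", "h1.example"), ("H", "x1.example"), ("H", "x2.example"),
])

def hostname_counts(g, ips):
    """Number of hostnames attached to each hosting IP that is present in g."""
    return {ip: g.degree(ip) for ip in ips if ip in g}

def shared_hostnames(g, ip_a, ip_b):
    """Hostnames that resolve to both hosting IPs in graph g."""
    if ip_a not in g or ip_b not in g:
        return set()
    return set(g[ip_a]) & set(g[ip_b])  # intersect the two neighbor sets

for label, g in (("hour 1", g1), ("hour 2", g2)):
    print(label, hostname_counts(g, ["A", "H"]), sorted(shared_hostnames(g, "A", "H")))
```

A jump in an IP's count, or a shared set that goes from empty to non-empty, is the kind of change the guidelines make visible in the figure.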
From a security perspective, if hosting IPs A and H had something in common in both hours, the resulting grid-lines would have completed a rectangle, a cycle, between the two time-dependent graphs. In this case, they form a tree-like structure. What makes this interesting is that while hosting IPs A and H obviously have something in common in the later hour, it is not clear they did in the previous one. With the grid-lines we recognize there might be evidence that the hostnames in the previous hour are related.
We may therefore proceed, perhaps, from a known malicious hostname and begin to test whether other hostnames within the weakly drawn cluster (of the hostnames resolving to A and H) in the previous hour are also malicious.

Next
The above examples simply traced one or several hosting IPs from one hour to the next. But notice, we could vary this. We could trace domains just as easily, or a combination of users and domains.
If you’re interested in graph analytics on time-dependent graphs, definitely check out this paper authored by folks at AmpLab, Databricks, and Uber.

Source
You’ll want your data stored in files like g1.txt and g2.txt looking like this:
Then you can run: