Capturing 0day Exploits with PERFectly Placed Hardware Traps
By: Cody Pierce, Matt Spisak, and Kenneth Fitch / August 22, 2016
As we discussed in an earlier post, most defenses focus on the post-exploitation stage of an attack, by which point it is too late and the attacker will always maintain the advantage. Instead, we leverage the enforcement of coarse-grained Control Flow Integrity (CFI) to enhance detection at the exploitation stage. Existing implementations of CFI require recompilation or extensive software updates, or incur a significant performance penalty, making them difficult to adopt and use in the enterprise. At Black Hat USA 2016, we presented our hardware-assisted technique, which has proven successful at blocking exploits while minimizing the impact on performance to ensure operational utility at scale.
To enable earlier detection while limiting the impact on performance, we have developed a new concept we’re calling Hardware Assisted Control Flow Integrity, or HA-CFI. This technology uses hardware features available in Intel processors to monitor and prevent exploitation in real time, with manageable overhead. By leveraging these hardware features, we can detect exploits before they reach the “Post-Exploitation” stage and provide stronger protections while the defense still has the upper hand.
Prior Art and Operational Constraints
Our work builds on previous research that identified the Performance Monitoring Unit (PMU) of microprocessors as a good candidate for enforcing control-flow integrity. The PMU is a specialized unit in most microprocessor architectures that provides performance measurement facilities for developers. Most features of the unit are intended to count hardware-level events during program execution to aid in program optimization and debugging.
In their paper, Yuan et al. [YUAN11] introduced the novel application of these events to exploit detection for software security. Their research focused on using PMU events along with Branch Trace Store (BTS) messages to correlate and detect code-injection and code-reuse attacks without source code. Xia et al. explored the idea further in their paper CFIMon [XIA12], combining precise event context gathering with the BTS and Precise Event-Based Sampling (PEBS) to enforce real-time control-flow integrity. In addition to these foundational papers, others have pursued variations on the idea to specifically target exploit techniques such as Return-Oriented Programming.
Alternatively, just-in-time CFI solutions have been proposed using dynamic instrumentation frameworks such as PIN [PIN12] or DynamoRIO [DYN16]. These frameworks dynamically interpret code as it is executed while providing instrumentation functionality to developers. Applying control-flow policies with a framework like PIN allows for flexible and reliable checking of code. However, it often incurs significant CPU overhead, on the order of 10x to 100x, making it unusable in the enterprise.
To make our approach to dynamic run-time CFI practical for enterprise security while still providing significant detection and prevention assurances, we established several functional requirements, including support for both 32-bit and 64-bit operating systems and deployment without recompiling software or requiring access to source code.
Approach
HA-CFI uses PMU-based traps to apply coarse-grained CFI to indirect calls on the x86 architecture. The system uses the PMU to count and trap mispredicted indirect branches in order to validate branch destinations in real time. In addition to a carefully tuned PMU, a practical implementation of this approach requires support from Intel’s Last Branch Record (LBR) feature and a method for tracking thread context switches in a given OS. It also requires an algorithm for validating branch destination addresses, all while keeping performance overhead to a minimum. After more than a year of fine-tuning these hardware features, we have proven our model is capable of generically detecting control-flow hijacks in real time with acceptable performance overhead on both Windows and Linux.
Because control-flow hijack attacks often stem from a corrupted or modified VTable, many CFI designs focus on validating all indirect branches. In a hijack, however, the affected call site has never before jumped to the attacker-controlled address, so the indirect call is almost always mispredicted by the branch prediction unit. Therefore, by focusing only on mispredicted indirect call sites, we greatly limit the number of places where a CFI check is necessary.
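To make the misprediction argument concrete, here is a minimal sketch in Python. It assumes a toy "last target" predictor that simply remembers the previous destination of each call site; real branch prediction units are far more sophisticated, and the function name, addresses, and trace are hypothetical and chosen only for illustration. The point is that a call site redirected to a never-before-seen address shows up as a misprediction, while its repeated benign calls do not.

```python
def mispredicted_calls(trace):
    """Replay (call_site, target) pairs through a toy last-target predictor.

    Returns the indices of trace entries whose target differed from the
    prediction, i.e. the only entries a mispredict-driven CFI check would see.
    This is an illustrative model, not how a real branch predictor works.
    """
    last_target = {}  # call_site -> most recently observed target
    misses = []
    for index, (site, target) in enumerate(trace):
        if last_target.get(site) != target:
            misses.append(index)
        last_target[site] = target
    return misses


# Hypothetical addresses, chosen only for illustration.
CALL_SITE = 0x401000
LEGIT_TARGET = 0x402000
ATTACKER_TARGET = 0x41414141

# 1000 benign virtual calls, then one call through a corrupted VTable entry.
trace = [(CALL_SITE, LEGIT_TARGET)] * 1000 + [(CALL_SITE, ATTACKER_TARGET)]
misses = mispredicted_calls(trace)

print(f"{len(misses)} of {len(trace)} calls needed a check: {misses}")
```

In this toy trace only the first (cold) call and the hijacked call are mispredicted, so a check that runs only on mispredictions inspects 2 of 1001 calls and still sees the hijack.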
HA-CFI configures the Intel PMU on each core to count and generate an interrupt on every mispredicted indirect branch. The PMU is capable of delivering an interrupt any time an event counter overflows, so HA-CFI sets the initial counter value to -1 and resets the counter to -1 from the Interrupt Service Routine (ISR) to generate a trap for every occurrence of the event. In this way, the HA-CFI ISR becomes our CFI component, capable of validating each mispredicted call and determining whether it is the result of malicious behavior. To validate indirect branch target addresses, HA-CFI builds a comprehensive whitelist of valid code pointer addresses as each .dll/.so is loaded into protected processes. When a counter overflows, the ISR compares the destination of the mispredicted branch against the whitelist and determines whether the branch is anomalous.
Figure 1: High-level design of HA-CFI using the PMU to validate mispredicted branches
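The counter-reload trick and the whitelist check can be sketched as a small simulation. This is an illustrative model only, not the HA-CFI driver: the class names, the 48-bit counter width, and all addresses are assumptions made for the example. In the real system the branch source and destination would be recovered from hardware rather than passed in directly.

```python
COUNTER_BITS = 48                      # assumed counter width for this sketch
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class ToyCounter:
    """Event counter that calls a handler when it wraps around to zero."""

    def __init__(self, on_overflow):
        self.on_overflow = on_overflow
        self.value = 0

    def load(self, value):
        # -1 becomes the all-ones bit pattern, one increment away from overflow.
        self.value = value & COUNTER_MASK

    def count_event(self, context):
        self.value = (self.value + 1) & COUNTER_MASK
        if self.value == 0:
            self.on_overflow(context)


class ToyHACFI:
    """Hypothetical monitor: traps on every event and checks a whitelist."""

    def __init__(self, valid_targets):
        self.whitelist = set(valid_targets)
        self.alerts = []
        self.counter = ToyCounter(self.isr)
        self.counter.load(-1)          # arm: the very next event overflows

    def isr(self, branch):
        source, destination = branch
        if destination not in self.whitelist:
            self.alerts.append(branch)
        self.counter.load(-1)          # re-arm so the following event traps too

    def on_mispredicted_indirect_call(self, source, destination):
        self.counter.count_event((source, destination))


# Hypothetical whitelist of valid code pointers collected at module load.
monitor = ToyHACFI(valid_targets={0x7FF600001000, 0x7FF600002000})
monitor.on_mispredicted_indirect_call(0x401000, 0x7FF600001000)  # benign
monitor.on_mispredicted_indirect_call(0x401000, 0x41414141)      # hijacked

print([(hex(s), hex(d)) for s, d in monitor.alerts])
```

Loading -1 leaves the counter one event away from overflow, which is why every event traps. In this model, loading a more negative value would trap only every Nth event, trading coverage for lower overhead.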
To minimize the overhead of HA-CFI while maintaining an extremely low false-positive rate, we had to make several key design decisions, described below.
The Indirect Branch: On the Intel x86 architecture, an indirect branch can occur at either a CALL or a JMP instruction. We focus exclusively on the CALL instruction for several reasons. First, indirect JMP instructions are frequently used to implement switch statements, and in our experimentation on Linux we found that roughly 12% of hijackable indirect branches were indirect JMPs, with an even smaller share on Windows. Second, ignoring mispredicted JMP instructions further reduces the overhead of HA-CFI. Therefore, we opted to omit mispredicted JMP branches during this research, which can be achieved with settings on the PMU and LBR.
Figure 2: A breakdown of hijackable indirect JMP vs CALL instructions found in Windows and Linux x64 binaries
Added Precision with the LBR: Given our requirement for real-time detection and prevention of control-flow hijacks, unlike the majority of previous research, we couldn’t use the Intel Branch Trace Store (BTS), which does not permit analysis of the trace data in real-time. Instead, to precisely resolve the exact branch that caused the PMU to generate an interrupt, we make use of Intel’s Last Branch Record (LBR) stack. A powerful feature of the LBR is the ability to filter the types of branches that are recorded. For example, returns, indirect calls, indirect jumps, and conditional branches ca