As the need for safety and security grows across application areas such as automotive, industrial, and in the cloud, the semiconductor industry is searching for the best ways to protect these systems. The big question is whether it is better to build security and safety into hardware, into software, or both.
In the early days of embedded systems development, software was rather minimal, and often something of an afterthought, said Colin Walls, embedded software technologist at Mentor, a Siemens Business. “Commonly, it was developed by the same engineer(s) who had designed the hardware, and naturally their code interacted very closely with the electronics. They understood all the nuances of the hardware’s behavior, so it was not seen as a particular challenge.”
As systems became more sophisticated, software specialists began to get involved. These specialists tended to be engineers with a significant knowledge and understanding of hardware, so they were quite happy programming close to the hardware. But rising complexity has made this much more difficult.
“As complexity increased, the single software engineer became a team,” Walls said. “Different team members would have different types of expertise. Those with good hardware knowledge would encapsulate that expertise in software modules, which provided a clean interface and concealed the complexity of hardware interaction. These modules were termed drivers.”
With increasingly powerful microprocessors/microcontrollers and larger memories, the need for a rational program structure drove the adoption of real-time operating systems (RTOSes) that enabled the use of a multi-tasking model. It was a natural progression for the drivers to become part of the RTOS.
Fig. 1: Software stack. Source: Mentor
Bare metal software
When developing an embedded system, an early decision to make is whether to employ an RTOS or not. Many engineers give this very little thought because they are used to coding on top of an operating system. An RTOS is code written on bare metal, and it’s an important choice for design teams.
The simplest structure for an embedded application is an infinite loop―do something, do something else, do something else, then repeat.
“This simplicity has real value, as the behavior of the code is quite predictable,” Walls said. “The issue is that each part of the code is dependent on other parts of the code for its opportunity to run. This becomes a problem if the code is modified/updated and the equilibrium thus disturbed. The code structure does not scale. The (perhaps obvious) way to restructure the software to reduce the interdependency is to unload some of the hardware responsive code into interrupt service routines (ISRs). The ISRs should be small and fast, primarily concerned with queueing up work to be done in the main loop. This structure is more scalable, but still ultimately depends on all the application code being ‘well behaved.'”
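The structure Walls describes can be sketched in a few lines. The example below is only a simulation in Python, not embedded code: the "ISRs" are ordinary functions called by hand, and the names `uart_rx_isr`, `timer_isr` and `main_loop_iteration` are hypothetical. It shows the division of labor, with short handlers that only queue work and a main loop that drains the queue.

```python
from collections import deque

# Shared queue: filled by the simulated ISRs, drained by the main loop.
work_queue = deque()

def uart_rx_isr(byte):
    """Hypothetical ISR: do the minimum and defer the real work."""
    work_queue.append(("uart_rx", byte))

def timer_isr(tick):
    """Hypothetical ISR for a periodic timer."""
    work_queue.append(("tick", tick))

def main_loop_iteration(log):
    """One pass of the super loop: handle queued work, then background work."""
    while work_queue:
        kind, value = work_queue.popleft()
        log.append(f"handled {kind}={value}")
    log.append("background")

log = []
uart_rx_isr(0x41)          # simulate two interrupts arriving
timer_isr(1)
main_loop_iteration(log)   # first pass handles both queued items
main_loop_iteration(log)   # second pass finds the queue empty
```

On real hardware the queue would have to be safe against an interrupt arriving mid-update, which this sketch does not model.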
Here, the most flexible and scalable program structure is a multi-tasking (multi-threading) model, where each piece of software functionality is coded as an independent program that is allocated CPU time by a scheduler (see Fig. 2). That, in turn, is part of an RTOS.
Fig. 2: Multi-tasking model. Source: Mentor
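As a rough illustration of the multi-tasking model, the sketch below runs two independent tasks under a tiny round-robin scheduler. It is a simplification under stated assumptions: the tasks are Python generators that give up the CPU voluntarily, whereas an RTOS scheduler would typically preempt them, and the names `task` and `run_round_robin` are invented for this example.

```python
from collections import deque

def task(name, steps, trace):
    """An independent piece of functionality, written as its own program."""
    for i in range(steps):
        trace.append(f"{name}:{i}")
        yield  # hand the CPU back to the scheduler

def run_round_robin(tasks):
    """Give each ready task one time slice in turn until all have finished."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run the task until its next yield
            ready.append(current)  # still has work, so requeue it
        except StopIteration:
            pass                   # task finished, drop it

trace = []
run_round_robin([task("sensor", 2, trace), task("comms", 3, trace)])
```

Neither task knows about the other. Adding a third task means adding one entry to the list rather than reworking a shared loop.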
Increasingly, there is interest in creating SoC monitoring systems that simply ignore things like run control, which is the classic debug of software running on a processor. Instead, they non-intrusively observe a system in real time, without affecting the behavior of the system. Working at the bare metal layer, i.e., exclusive of the operating system, can be an option.
Programming challenges and options
Although the largest proportion of modern embedded software designs are implemented utilizing an OS of some kind, there are a couple of circumstances when doing without―programming on bare metal―may be a reasonable decision. This could include situations where the application is extremely simple and is implemented, perhaps, on a low-end processor. It also could include situations where there is a need to extract every last cycle of CPU power for the application, and the overhead introduced by an OS is unacceptable.
In both cases, thought must be given to possible future enhancements to the software. If further development is likely, starting out with a scalable program structure is a worthwhile investment, Walls said.
There seems to be growing interest in this approach. While programming on bare metal is not mainstream today, a number of companies are kicking the tires for in-life analytics, said Gajinder Panesar, CTO of UltraSoC. The goal is to observe and detect anomalies while a system is running, which is essential in autonomous vehicles if the anomaly can cause a safety-related malfunction.
“There are people moving toward that, to be able to use the metrics or the rich data that bare metal monitors generate, and they want to chew that data and then decide if that’s anomalous or not,” Panesar said. “The next step would be to take that data and say, ‘Ah, this is why it happened. It was because somebody did this seconds earlier, or nanoseconds earlier.’ It’s primarily the safety and high integrity systems, where it will be used for things like making sure the system is performing and functioning as well as expected, and then to make sure the system is continuously behaving.”
This can be extremely useful in both safety and security applications. “Simple cases could be the observation of how a set of things within the system are playing―the orchestration of software and hardware and how that’s going,” he said. “You can look at this by stepping back a bit and saying, ‘The way the system behaves is that this set of things must talk to this other set of things, and there should be this interaction.’ If this pattern or tune changes slightly or is off pitch, we can detect that. So we can detect things that should happen but haven’t happened, or things that have happened that shouldn’t happen. Also, we can watch when things start drifting. If you think about it as a tune or a regular set of things, when there’s a blip or when the tune changes, the words are still the same but the tune is different. One example is a stuck pixel for the automotive app, where by observing what’s happening in the SoC and the communication between things like a camera input and the memory, we can make a judgment call about whether that camera has got some stuck pixels or not.”
This can be done purely in software, but it would require software running in the stack to detect this. The big concern there is latency and the time it takes to detect an anomaly, and software closer to the metal reacts more quickly than software way up in the stack.
“Interestingly, you don’t necessarily know what you’re looking for to begin with,” Panesar said. “You realize that this SoC is going to go into, say, the engine management of a car, and you know the set of accesses or sequences of transactions that should take place, and off you go. But then you realize that it’s actually connected to something like a CAN bus or automotive Ethernet, so it hasn’t got an interface. And by the way, the other end of the Ethernet there is a user console for infotainment, and why is it accessing the engine management system? Is that sensible? So at runtime, you actually can make sure that only these communications can access any part of the SoC. You can incrementally build this without re-spinning the SoC, without having to change the application software running.”
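One way to picture the runtime check Panesar describes is as an allowlist of permitted initiator-to-target communications that can be extended while the system runs. The sketch below is a software model only. The class `TransactionMonitor` and the block names are hypothetical, and it says nothing about how a hardware monitor would actually be built.

```python
class TransactionMonitor:
    """Toy model of a monitor that checks who is talking to whom."""

    def __init__(self, allowed):
        self.allowed = set(allowed)   # permitted (initiator, target) pairs
        self.violations = []          # everything observed outside the policy

    def allow(self, initiator, target):
        """Extend the policy at runtime, without touching the application."""
        self.allowed.add((initiator, target))

    def observe(self, initiator, target):
        """Return True if the access is permitted, and record it if not."""
        permitted = (initiator, target) in self.allowed
        if not permitted:
            self.violations.append((initiator, target))
        return permitted

monitor = TransactionMonitor({("engine_cpu", "engine_ctrl")})
expected = monitor.observe("engine_cpu", "engine_ctrl")      # permitted
suspicious = monitor.observe("infotainment", "engine_ctrl")  # flagged
monitor.allow("diagnostics", "engine_ctrl")                  # policy grows later
```

The point of the sketch is the incremental part: the policy is data, so it can be refined as new connections come to light.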
Market drivers
In the automotive world, standards such as ISO 26262 are the gatekeepers. If you don’t follow those standards, you can’t sell your chip into a specific system.
“That’s really where the need for bare metal programming in automotive is coming from today,” said Frank Schirrmeister, senior group director for product management and marketing for emulation, FPGA-based prototyping and hardware/software enablement at Cadence. “It stems from the failure rates you see for certain components in the system. If you look into the car, there are certain rates for how often things are allowed to fail. That trickles down into the components underneath―how often they are allowed to fail. And then it’s all about the multiplication of the different probabilities. The problem is the more you multiply, the bigger the probability that one of them fails.”
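The arithmetic behind that remark is easy to check. Assuming component failures are independent (a simplification), the chance that every component survives is the product of the individual survival probabilities, and the chance that at least one fails is one minus that product. The function name and the figures below are illustrative only and are not taken from ISO 26262.

```python
import math

def prob_any_failure(component_failure_probs):
    """Probability that at least one component fails, assuming independence."""
    survive_all = math.prod(1.0 - p for p in component_failure_probs)
    return 1.0 - survive_all

# Illustrative numbers: each component fails with probability 0.001.
ten_parts = prob_any_failure([0.001] * 10)
hundred_parts = prob_any_failure([0.001] * 100)
```

With these made-up figures, going from 10 to 100 components raises the system-level failure probability roughly tenfold, which is why a system-level budget forces much tighter limits on each part.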
Many engineering teams look at the safety-related aspects at the chip level where they examine whether the system will still behave safely if this bit is stuck at a certain level.
“In that context, we also are checking for items that involve the software at that level,” Schirrmeister said. “And then it’s really bare metal. This is the first layer of contact to the functional safety in the chip through an extension of fault simulation tools, which test to see what the system will do if a certain node is stuck at zero or stuck at one. So in the automotive case, it’s all about the ISO 26262-type definitions. Software plays a role in that it runs on a processor at the bare metal level. Then you will want to figure out if the system will go back into a safe state. The main problem there becomes the planning of the fault campaigns, which are the things you really want to test, because you want to test if this particular part of my chip fails, will my system go into the safe state or not?”
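A fault campaign of this kind can be illustrated on a toy circuit. The sketch below is not how commercial fault simulators work; it is a minimal model with invented names. The circuit computes `a AND b` twice (`n1` and `n2`), drives the output from `n1`, and raises an alarm when the two copies disagree. Each node is then forced to stuck-at-0 and stuck-at-1, and the campaign checks whether any wrong output goes unnoticed.

```python
from itertools import product

NODES = ("n1", "n2", "out")  # nodes targeted by the campaign

def evaluate(a, b, fault=None):
    """Evaluate the toy circuit; fault is (node, stuck_value) or None."""
    def drive(node, value):
        # A faulty node ignores its logic and holds the stuck value.
        if fault is not None and fault[0] == node:
            return fault[1]
        return value

    n1 = drive("n1", a & b)    # main computation
    n2 = drive("n2", a & b)    # redundant copy
    out = drive("out", n1)     # output stage
    alarm = n1 ^ n2            # comparator: the two copies disagree
    return out, alarm

def run_campaign():
    """Classify every stuck-at fault as 'safe' or 'undetected'."""
    results = {}
    for node, stuck in product(NODES, (0, 1)):
        dangerous = False
        for a, b in product((0, 1), repeat=2):
            golden, _ = evaluate(a, b)
            out, alarm = evaluate(a, b, fault=(node, stuck))
            if out != golden and not alarm:
                dangerous = True   # wrong output and nobody noticed
        results[(node, stuck)] = "undetected" if dangerous else "safe"
    return results

results = run_campaign()
```

In this toy case the comparator catches faults on `n1` and `n2`, but a fault on `out` sits after the comparator and slips through. Gaps of that kind are what a fault campaign is planned to expose.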
And while this needs to be accounted for at the architectural level or very early in the design process, there are also mechanisms that allow the system to fall back into a safe state. Those are implemented a bit lower down.
“For a software system, you want it to not just crash,” Schirrmeister said. “You want it to get into a safe state, and that’s where the bare metal layer of software may be helpful to basically identify, ‘What happens if this routine fails? Or if the hardware fails at this point I’ll trap into an interrupt routine or what have you.’ And that needs to get the system into a safe, predictable state.”
Safety risks and benefits
While the benefits of visibility at the bare metal level are clear, there are valid concerns about providing different levels of access to a chip. Some industry experts wonder if vulnerabilities may be introduced along the way.
This is one area where formal verification can play a vital role, because it can identify potential problems across a complex system that may not be obvious.
“You are looking at the unknown use cases, and most of functional verification is built with use cases,” said Sergio Marchese, technical marketing manager at OneSpin Solutions. “I once found a bug in an Arm core. The instruction was being marked as valid, when it was not. The designer who is, of course, very, very busy tells me this is a crazy scenario that is not going to happen. ‘This is not the recommended use case. It’s not something that a normal human being would use, so it’s safe. I don’t have time to deal with this. I need to fix bugs that are gonna mess up my use cases.’ But when it comes to security, let’s say this kind of bug leaks. It could potentially lead to a vulnerability, because that’s exactly what an adversary is looking for. The adversary is looking not for normal use cases. It’s looking for funny things that can compromise the security of the chip. So that’s one aspect of it security. Then I think in terms of problems, and there are two categories. One is security itself, which means, ‘Let’s say, we’ll never build this through genuine mistakes, so to speak, or mistakes that can be at the architecture level, at the implementation level, functional bugs, whatever.’ And then there are vulnerabilities perhaps due to malicious mistakes.”
Panesar stressed the intention is not to replace conventional security methods. Rather, it is to augment those methods. “The likes of public key encryption, etc., that should all be in place,” he said. “In a typical example, maybe an SoC has been hacked somehow, and someone’s managed to download some crypto mining software. How do you detect that? You can detect this by anomalous CPU loading. You can detect this by knowing or observing. There are a number of ways you can observe CPU utilization, even during idle periods when there’s no activity. Even when a car isn’t moving, this information can be transmitted over a secure channel, maybe an SSH channel, to some supervisor system.”
This approach also works for identifying ransomware. “You have to detect this anomalous activity when the system is potentially idle and get that across to people,” he said. “This data is sent periodically in systems that are always connected. The automotive industry will be doing vehicle-to-vehicle communication, and they’ll always be connected just to make sure that the car hasn’t suddenly broken down and they’ve not heard anything. So this can exploit that connection. You can be periodically sending some sort of heartbeat. And from that you can see all of a sudden a CPU has gone to 90% loading, when actually it’s stuck in a car park.”
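On the supervisor side, the check Panesar outlines could be as simple as flagging heartbeats that report high CPU load while the vehicle claims to be idle. The sketch below assumes a made-up heartbeat format (a dict with `t`, `idle` and `cpu_load` fields) and an arbitrary 20% threshold. A real deployment would choose both to suit the system being monitored.

```python
def find_idle_anomalies(heartbeats, idle_load_limit=0.20):
    """Return heartbeats that show high CPU load while the system is idle.

    idle_load_limit is an assumed threshold, not a recommended value.
    """
    return [
        hb for hb in heartbeats
        if hb["idle"] and hb["cpu_load"] > idle_load_limit
    ]

# Hypothetical heartbeat stream from a parked car.
heartbeats = [
    {"t": 0, "idle": False, "cpu_load": 0.70},  # driving: high load is normal
    {"t": 60, "idle": True, "cpu_load": 0.03},  # parked and quiet
    {"t": 120, "idle": True, "cpu_load": 0.90}, # parked but 90% loaded
]
anomalies = find_idle_anomalies(heartbeats)
```

Only the last heartbeat is flagged. The same load while driving is not treated as suspicious, because the check is tied to the idle state rather than to load alone.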
Conclusion
Clearly, there is work yet to be done, and solutions are still evolving―especially in security.
“This area still is much less established,” Marchese said. “How do you trade off the security architecture, so to speak, with power, with area, with complexity, with the extra design and engineering work? Safety, in a sense, is easier because you have a visible adversary. You model your random faults to say, ‘I want to see these types of faults in this type of logic.’ You can quantify it. It’s rather hard work, but at least you know exactly what kind of adversary you are defending against. With security, that’s not the case. You have some known facts, but ultimately the tricky things you don’t know about are the things you want to defend against, so everything becomes more complicated. Even when you add new logic, you need to be careful not to add new vulnerability because with security, things are pretty crazy.”
Bare metal programming may be the ultimate compromise between hardware and software, but it requires a deep understanding of both at the very outset of the design process. So while there are clear benefits, this stuff isn’t easy.