You're Doing IoT Security RNG: The Crack in the…

There’s a crack in the foundation of Internet of Things (IoT) security, one that affects 35 billion devices worldwide. Basically, every IoT device with a hardware random number generator (RNG) contains a serious vulnerability whereby it fails to properly generate random numbers, which undermines security for any upstream use.

See Dan and Allan discuss their research at DEF CON 29:

In order to perform most security-relevant operations, computers need to generate secrets via an RNG. These secrets then form the basis of cryptography, access controls, authentication, and more. The details of exactly how and why these secrets are generated vary for each use, but the canonical example is generating an encryption key:


FIGURE 1: Alice and Bob trying to have a private conversation using RNG

In order for Alice and Bob to communicate secretly, away from the prying eyes of Eve, they need to produce a shared secret by using an RNG. The fact that Eve does not know this number is the only thing keeping her from compromising the secrecy of the communications. This same story plays out for other aspects of security: whether it’s SSH keys for authentication or session tokens for authorization, random numbers are one of the bedrock foundations of computer security.

But it turns out that these “randomly” chosen numbers aren’t always as random as you’d like when it comes to IoT devices. In fact, in many cases, devices are choosing encryption keys of 0 or worse. This can lead to a catastrophic collapse of security for any upstream use.

As of 2021, most new IoT systems-on-a-chip (SoCs) have a dedicated hardware RNG peripheral that’s designed to solve exactly this problem. But unfortunately, it’s not that simple. How you use the peripheral is critically important, and the current state of the art in IoT can only be aptly described as “doing it wrong.”

INCORRECTLY CALLING THE HARDWARE RNG

One of the more glaring pitfalls happens when developers fail to check error code responses, which often results in numbers that are decidedly less random than required for a security-relevant use.

When an IoT device requires a random number, it makes a call to the dedicated hardware RNG either through the device’s SDK or increasingly through an IoT operating system. What the function call is named varies, of course, but it takes place in the hardware abstraction layer (HAL). This is an API created by the device manufacturer and is designed so you can more easily interface with the hardware through C code and not need to mess around with setting and checking specific registers unique to the device. The HAL function looks something like this:

u8 hal_get_random_number(u32 *out_number);

There are two parts that we care about:

An output parameter called out_number. This is where the function will put the random number; it’s a pointer to an unsigned 32-bit integer.
A return value to specify any error cases. Depending on the device, it could be Boolean or any number of enumerated error conditions.

So, the first question you might be asking is, “How many people out there in the wild actually check this error code?” Unfortunately, the answer is almost nobody.

For instance, just look at GitHub results for use of the MediaTek 7697 SoC HAL function (seen on GitHub here):

Or even FreeRTOS’s (a popular IoT operating system) abstraction layer (seen on GitHub here):

example of FreeRTOS’s abstraction layer with code pervasively not checked

Notice that the return code is pervasively not checked – though this isn’t unique to these two examples. This is just how the IoT industry does it. You’ll find this behavior across basically every SDK and IoT OS.

WHAT'S THE WORST THAT COULD HAPPEN?

Okay, so devices aren’t checking the error code of the RNG HAL function. But how bad is it really? It depends on the specific device, but potentially bad. Very bad. Let’s take a look.

The HAL function to the RNG peripheral can fail for a variety of reasons, but by far the most common (and exploitable) is that the device has run out of entropy. Hardware RNG peripherals pull entropy out of the universe through a variety of means (such as analog sensors or EMF readings) but don’t have it in infinite supply. They’re only capable of producing so many random bits per second. If you try calling the RNG HAL function when it doesn’t have any random numbers to give you, it will fail and return an error code. Thus, if the device tries to get too many random numbers too quickly, the calls will begin to fail.

But that’s the thing about random numbers; it’s not enough to just have one. When a device needs to generate a new 2048-bit private key, as a conservative example, it will call the RNG HAL function over and over in a loop. This starts to seriously tax the hardware’s ability to keep up, and in practice, it often can’t. The first few calls may succeed, but later calls will typically start to return errors quickly.
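Here is a minimal sketch of that failure mode. It assumes a hypothetical mock HAL, `mock_hal_get_random_number`, that has a fixed budget of eight ready words and never refills; real peripherals refill over time and real HAL signatures differ by vendor. The values it hands out come from a simple counter-style generator and are placeholders, not entropy.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical mock of an RNG HAL with a small budget of ready words.
 * Real hardware refills over time; this mock never does. */
static unsigned mock_words_ready = 8;

/* Returns 0 on success, 1 when no entropy is available.
 * On failure the output is left untouched. */
static uint8_t mock_hal_get_random_number(uint32_t *out_number)
{
    static uint32_t state = 0xACE1u; /* placeholder values, NOT real entropy */
    if (mock_words_ready == 0) {
        return 1;
    }
    mock_words_ready--;
    state = state * 1664525u + 1013904223u;
    *out_number = state;
    return 0;
}

/* Fills a 2048-bit key (64 words) in a tight loop and uses whatever is
 * in each slot, whether or not the call succeeded. The failure count is
 * kept only so the demo can report it. */
static size_t fill_key_ignoring_errors(uint32_t key[64])
{
    size_t failures = 0;
    for (size_t i = 0; i < 64; i++) {
        key[i] = 0; /* makes failed slots visibly zero in this demo */
        if (mock_hal_get_random_number(&key[i]) != 0) {
            failures++;
        }
    }
    return failures;
}
```

With a budget of eight words, the last 56 words of the "key" in this sketch are never written by the mock and stay zero.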

So… what does the HAL function actually give you for a random number when it fails? Depending on the hardware, one of the following:

Partial entropy
The number 0
Uninitialized memory

example of IoT device generating a private key with errors.

FIGURE 2: That shouldn’t be there.

None of those are very good, but uninitialized memory?! How does that happen? Well, recall that the random number is an output pointer. Then consider the following pseudocode (which you can find many examples of on GitHub if you care to look):

u32 random_number;
hal_get_random_number(&random_number);
// Sends over the network
packet_send(random_number);

The random_number variable is declared and lives on the stack but is never initialized. If the HAL function behaves such that it never overwrites the output variable in the event of an error (which is common behavior), then the value in the variable will contain uninitialized RAM. Which we then send out over the network to someone else. Not great.
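For contrast, here is a hedged sketch of a call that cannot leak stack contents. The stand-in `hal_get_random_number` below is hypothetical: it always fails and leaves the output untouched, and we assume a return value of 0 means success, which may not match your vendor's HAL. The wrapper `get_random_checked` is an illustrative name, not part of any SDK.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint8_t  u8;
typedef uint32_t u32;

/* Hypothetical stand-in for the vendor HAL. This one always fails and
 * leaves *out_number untouched on error.
 * Assumption: a return value of 0 means success. */
static u8 hal_get_random_number(u32 *out_number)
{
    (void)out_number;
    return 1;
}

/* Returns true and writes *out only if the HAL reported success. */
static bool get_random_checked(u32 *out)
{
    u32 value = 0; /* initialized, so stale stack data can never escape */
    if (hal_get_random_number(&value) != 0) {
        return false; /* caller must not use *out */
    }
    *out = value;
    return true;
}
```

This only prevents sending garbage; it does not answer what the caller should do when the function returns false.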

These are not contrived or unrealistic scenarios. Devices are out there right now using crypto keys of 0 or worse.

An analysis by Keyfactor back in 2019 of publicly available RSA certificates found that 1 in 172 of all certificates are vulnerable to known attacks. The authors pointed to random number generation in IoT devices as being one of the culprits, but stopped short of identifying the exact cause. While we can't say for sure that the behavior described in our research is responsible for those results, widespread instances of weak RSA keys in IoT devices are exactly what you'd expect to find. It sure seems like this is an exploitable, large-scale issue in practice, not just in theory.

DON'T BLAME THE USER

It’s easy to look at the current state of affairs and conclude that it’s simply user error, but this is not the case. The users have been put in a no-win situation. You see, random numbers are incredibly critical when you need them. Oftentimes you can’t just “handle” the error in an elegant way and move forward.

For example, the MediaTek documentation contains the following example code for developers of the MT7697 SoC:

if(HAL_TRNG_STATUS_OK != hal_trng_get_generated_random_number(&random_number)) {
    //error handle
}

But if you’re the networking stack in the middle of generating a crypto key for secure communications, how are you supposed to “handle” the error? There are really only two options:

Abort, killing the entire process.
Spin loop on the HAL function for an indefinite amount of time until the call completes, blocking all other processes and using 100% CPU in the process.
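The two options can be sketched together as a spin loop with a cap that falls back to aborting. Everything here is illustrative: `mock_hal_get_random_number` is a hypothetical mock that fails three times before succeeding, and `get_random_spin` is a made-up helper name.

```c
#include <stdint.h>

/* Hypothetical mock HAL: fails a fixed number of times, then succeeds. */
static unsigned mock_failures_left = 3;

static uint8_t mock_hal_get_random_number(uint32_t *out_number)
{
    if (mock_failures_left > 0) {
        mock_failures_left--;
        return 1;              /* no entropy ready yet */
    }
    *out_number = 0x12345678u; /* placeholder value, not real entropy */
    return 0;
}

/* Option 2 (spin) with a cap that falls back to option 1 (abort).
 * Returns 0 on success, -1 if the HAL never became ready. */
static int get_random_spin(uint32_t *out, unsigned max_attempts)
{
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        if (mock_hal_get_random_number(out) == 0) {
            return 0;
        }
        /* Busy-wait: on a real device this burns CPU and blocks other work. */
    }
    return -1; /* the caller has to abandon the operation */
}
```

Capping the loop only chooses when to abort; it does not make either outcome more pleasant for the caller.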

Neither is an acceptable solution. This is why developers ignore the error condition: the alternatives are terrible, and the ecosystem around RNG hardware has done them no favors.

Things aren’t much better even when developers have the benefit of time on their side. Some devices, like the STM32, have sizable documentation and even vendor-provided proof of randomness whitepapers, but these are an exception. Few devices have even a basic description of how the hardware RNG is supposed to work, and fewer still have any kind of documentation about basic things like expected operating speed, safe operating temperature ranges, and statistical evidence of randomness.

Anecdotally speaking, we attempted to follow the STM32 documentation carefully and still managed to create code that incorrectly handled error responses. It took multiple attempts and substantial code to block additional calls to the RNG and spin loop properly when there were error responses — and even then, we observed questionable results that made us doubt our code. It’s no wonder developers are doing IoT RNG, well, wrong, but more on that below.

CSPRNG SUBSYSTEM

So you might be wondering, “What makes this unique to the IoT? Does this issue affect laptops and servers too?” The answer is that it’s an issue unique to the IoT world because this sort of low-level device management is usually handled by an operating system, which is notably missing from typical IoT devices.

FIGURE 3: CSPRNG subsystem components

When an application needs a cryptographically secure random number on a Linux server, it doesn’t read from the hardware RNG directly or make a call to some HAL function and fight with error codes. No, it just reads from /dev/urandom. This is a cryptographically secure pseudo-random number generator (CSPRNG) subsystem made available to applications as an API. There are similar subsystems on every major operating system, too: Windows, iOS, macOS, Android, BSD, you name it.

Importantly, calls to /dev/urandom never fail and never block program execution. The CSPRNG subsystem can produce an endless sequence of strong random numbers immediately. This squarely solves the problem of HAL functions that either block program execution or fail.
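On a Linux system, the application side of this can be as small as the sketch below. The helper name `read_urandom` is ours. The checks on `fopen` and `fread` guard against environment problems such as a missing device node, which is a separate matter from the CSPRNG running out of numbers.

```c
#include <stdio.h>
#include <stddef.h>

/* Fills buf with len bytes from the kernel CSPRNG.
 * Returns 0 on success, -1 if the device could not be opened or read. */
static int read_urandom(unsigned char *buf, size_t len)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL) {
        return -1;
    }
    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return (got == len) ? 0 : -1;
}
```

A caller that needs a 256-bit key would pass a 32-byte buffer and use it only when the function returns 0.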

Another critical design feature of a CSPRNG subsystem is the entropy pool. This is designed to take in entropy from a variety of sources, hardware RNGs (HWRNGs) included. Due to the magic of the xor operation, all of these individually weak sources of entropy can be combined into one strong one. The entropy pool also removes any single points of failure among the entropy sources: in order to break the RNG, you’d need to predict every entropy source simultaneously.
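The XOR idea can be shown in a few lines. This is a deliberate simplification: as far as we know, production entropy pools mix their inputs with a cryptographic hash or cipher rather than a bare XOR, and the property only holds when the sources are independent of each other. `mix_sources` is a hypothetical helper.

```c
#include <stdint.h>
#include <stddef.h>

/* XOR-combines n source words into one output word.
 * Simplification: real CSPRNG pools use stronger mixing than a bare XOR. */
static uint32_t mix_sources(const uint32_t *sources, size_t n)
{
    uint32_t pool = 0;
    for (size_t i = 0; i < n; i++) {
        pool ^= sources[i];
    }
    return pool;
}
```

If one source fails and contributes all zeros, the output is still the XOR of the remaining sources, so a single dead source does not zero out the result.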

This is the right way to generate cryptographically secure random numbers and is already industry standard everywhere … except the IoT.

Why you really do need a CSPRNG Subsystem

Designing an entire CSPRNG subsystem sounds really hard, especially when your gadget isn’t using one of those new IoT operating systems. Maybe it’s enough to just bite the bullet and spin loop on the RNG HAL function. That way you’re always getting good random numbers, right?

This section will disabuse you of the idea that hardware RNGs are safe to use on their own.

VULNERABLE REFERENCE CODE

Nobody writes source code entirely from scratch, especially in the world of IoT devices. There’s always some reference code or example doc that developers start from. Interfacing with hardware is tricky to get right for any device, let alone one as finicky as a hardware RNG peripheral.

When the device’s reference code contains a vulnerability, it propagates down to every device using it. After all, how is a developer supposed to know the difference between a vulnerable reference implementation and a quirky one? Here are some examples:

Using the HWRNG to Seed an Insecure PRNG

PRNGs such as libc rand() are wildly insecure since the numbers they produce reveal information about the internal state of the RNG. They’re fine for non-security-relevant contexts because they’re fast and easy to implement. But using them for things like encryption keys leads to catastrophic collapse of the device’s security, as all of the numbers are predictable.

Unfortunately, many SDKs and operating systems that support hardware RNGs use an insecure PRNG under the hood. The Contiki-ng IoT operating system for Nordic Semiconductor’s nrf52840 SoC does precisely this by seeding the insecure libc rand() function with the hardware RNG:

https://contiki-ng.readthedocs.io/en/release-v4.6/_api/arch_2cpu_2nrf52840_2dev_2random_8c_source.html

void
 random_init(unsigned short seed)
 {
   (void)seed;
   unsigned short hwrng = 0;
   NRF_RNG->TASKS_START = 1;
 
   NRF_RNG->EVENTS_VALRDY = 0;
   while(!NRF_RNG->EVENTS_VALRDY);
   hwrng = (NRF_RNG->VALUE & 0xFF);
 
   NRF_RNG->EVENTS_VALRDY = 0;
   while(!NRF_RNG->EVENTS_VALRDY);
   hwrng |= ((NRF_RNG->VALUE & 0xFF) << 8);
 
   NRF_RNG->TASKS_STOP = 1;
 
   srand(hwrng);
 }

FIGURE 4: Seeding libc rand() with the hardware RNG entropy

unsigned short random_rand(void) { return (unsigned short)rand(); }

FIGURE 5: Future calls to random_rand() call the insecure libc rand()

To be clear, there is NO secure way to use libc rand() to generate secure values. The fact that the seed is created using the hardware is irrelevant since an attacker can just derive or enumerate it using untwister. What’s important is that when the user calls random_rand(), the output will come from the insecure libc rand() call.
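To make the enumeration point concrete, here is a sketch of plain brute force over a 16-bit seed like the one built in FIGURE 4. This is not how untwister works internally; it is the simplest possible search. It assumes the attacker observes a few full rand() outputs and runs the same libc as the device, and the helper names are hypothetical.

```c
#include <stdlib.h>

#define OBSERVED 4 /* number of outputs the attacker has seen (assumption) */

/* Records the first OBSERVED outputs of libc rand() for a given seed. */
static void rand_outputs(unsigned seed, int out[OBSERVED])
{
    srand(seed);
    for (int i = 0; i < OBSERVED; i++) {
        out[i] = rand();
    }
}

/* Tries every 16-bit seed and returns the first one whose outputs match
 * what was observed, or -1 if none does. */
static long recover_seed(const int observed[OBSERVED])
{
    for (unsigned seed = 0; seed <= 0xFFFFu; seed++) {
        int candidate[OBSERVED];
        rand_outputs(seed, candidate);
        int match = 1;
        for (int i = 0; i < OBSERVED; i++) {
            if (candidate[i] != observed[i]) {
                match = 0;
                break;
            }
        }
        if (match) {
            return (long)seed;
        }
    }
    return -1;
}
```

The seed that comes back may not be numerically equal to the original if the libc maps several seeds to the same sequence, but it reproduces the same outputs, which is all an attacker needs.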

You can also see identical vulnerable behavior in the MediaTek Arduino code (on GitHub here):

/**
* @brief       This function is to get random seed.
* @param[in]   None.
* @return      None.
*/
static void _main_sys_random_init(void)
{
#if defined(HAL_TRNG_MODULE_ENABLED)
    uint32_t            seed;
    hal_trng_status_t   s;

    s = hal_trng_init();

    if (s == HAL_TRNG_STATUS_OK) {
        s = hal_trng_get_generated_random_number(&seed);
    }

    if (s == HAL_TRNG_STATUS_OK) {
        srand((unsigned int)seed);
    }

FIGURE 6: Insecure libc rand() usage in MediaTek SDK

So even if your device has a hardware RNG peripheral, and even if you think that you’re using it, you might not be.

Usage Quirks

Sometimes, devices work in a very quirky way that is not at all immediately obvious. Failing to properly account for these quirks can lead to a catastrophic collapse of a device’s security. One example of this is the LPC 54628.

When testing the LPC 54628, we noticed that we were getting extremely poor-quality random numbers from the hardware RNG — so bad that we suspected there might be something wrong with our tooling. Turns out we were right. If you read the user manual carefully, on page 1,106 (of 1,152), you’ll notice the following instructions:

"The quality of randomness (entropy) of the numbers generated by the Random Number Generator relies on the initial states of internal logic. If a 128 bit or 256 bit random number is required, it is not recommended to concatenate several words of 32 bits to form the number. For example, if two 128 bit words are concatenated, the hardware RNG will not provide 2 times 128 bits of entropy.

…omitted for brevity…

To constitute one 128 bit number, a 32 bit random number is read, then the next 32 numbers are read but not used. The next 32 bit number is read and used and so on. Thus 32 32-bit random numbers are skipped between two 32-bit numbers that are used."

So in order to actually use the RNG peripheral properly, you’re supposed to get a random number and then throw away the next 32. Then keep doing that in a loop! In our testing code, we were just calling the RNG HAL function and using the results, which is a fairly sensible way to write code. How many people bother to read that user manual? Or even if they find the “correct” code that discards numbers, do they remove it because it seems unnecessary?
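A sketch of the read-one, discard-32 pattern from the manual excerpt might look like this. The reader `mock_rng_read_word` is a hypothetical mock that returns 0, 1, 2, and so on, so you can see which reads are kept; it is not the NXP SDK API, and `rng_read_128` is an illustrative name.

```c
#include <stdint.h>

/* Hypothetical mock of the RNG read: returns 0, 1, 2, ... so it is easy
 * to see which reads were kept. Real hardware returns random words. */
static uint32_t mock_reads = 0;

static uint32_t mock_rng_read_word(void)
{
    return mock_reads++;
}

#define SKIP_BETWEEN_WORDS 32 /* per the manual excerpt quoted above */

/* Builds one 128-bit value from four kept words, discarding
 * SKIP_BETWEEN_WORDS reads between consecutive kept words. */
static void rng_read_128(uint32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        if (i > 0) {
            for (int skip = 0; skip < SKIP_BETWEEN_WORDS; skip++) {
                (void)mock_rng_read_word(); /* read but not used */
            }
        }
        out[i] = mock_rng_read_word();
    }
}
```

Four kept words cost 100 reads in this sketch. A real driver would presumably also need to discard between successive calls, which this version does not do.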

Clearly, it’s unacceptable for security to rely on such behavior. Even if some developers get it right some of the time, having this kind of an API is guaranteed to produce vulnerable devices given a large enough audience.

STATISTICAL ANALYSIS

Okay, so suppose that you’ve gone through the trouble of auditing your device’s code to make sure that it’s actually using the hardware RNG, and you made sure to check any error conditions and spin until the device was ready, and you laboriously read through your device’s 1,000-page user manual to make sure you caught any quirks…. Surely you must be safe, right? Not even close.

It turns out that the raw entropy of IoT hardware RNG peripherals varies wildly in quality. Most devices fail statistical analysis tests, which we will summarize below. But keep in mind that these results can often depend on the individual device itself, as small manufacturing defects (or the relative positions of Jupiter and Saturn for all we know) produce diverging results for even the same make and model of a device.

MediaTek 7697

One of the things we tested for when analyzing entropy is the relative distribution of bytes produced by the RNG. In a perfect world, we should expect each byte to be equally likely, so the distribution of bytes ought to be a flat line (modulo any minor random blips, of course). But that’s not what we see for the MT7697:

FIGURE 7: Histogram of the frequency of each byte 0 to 255 on the MediaTek 7697 SoC

The diagram above is a histogram showing the relative frequency of each possible byte (0 -> 255) empirically measured from a MediaTek 7697 SoC’s hardware RNG. Note that sawtooth pattern: That’s definitely “not random”. There shouldn’t be patterns of any kind. How comfortable do you feel about using that directly as your crypto key?
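If you want to run a similar check on your own captured bytes, a byte histogram is only a few lines. The chi-square statistic below is our addition for putting a number on "not flat"; it is not necessarily the method used to produce the figure. Both function names are hypothetical, and `len` must be greater than zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts how often each byte value 0..255 occurs in data. */
static void byte_histogram(const uint8_t *data, size_t len, size_t counts[256])
{
    for (int i = 0; i < 256; i++) {
        counts[i] = 0;
    }
    for (size_t i = 0; i < len; i++) {
        counts[data[i]]++;
    }
}

/* Pearson chi-square statistic against a uniform byte distribution.
 * Uniform data tends to land near 255 (the degrees of freedom); values
 * far above that suggest a skewed distribution. len must be > 0. */
static double chi_square_uniform(const size_t counts[256], size_t len)
{
    double expected = (double)len / 256.0;
    double chi2 = 0.0;
    for (int i = 0; i < 256; i++) {
        double diff = (double)counts[i] - expected;
        chi2 += diff * diff / expected;
    }
    return chi2;
}
```

A perfectly flat histogram scores 0, while 256 bytes that are all zero score 65280.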

Nordic Semiconductor nrf52840

The Nordic Semiconductor nrf52840 SoC’s hardware RNG exhibited a repeating 12-bit pattern of 0x000, occurring every 0x50 bytes:


FIGURE 8: Repeating 0x000 in the nrf52840 SoC

The fact that this was 12 bits, and not 8 or 16, was quite curious. We don’t have a great explanation for that, but it likely depends on the internal workings of the black-box RNG hardware. Update: We worked with the Nordic Semiconductor team to track down the eventual source of this pattern, and it turned out to be a bug in our byte-gathering code, not a problem in the underlying RNG hardware. As it happens, even security folks aren't immune from calling APIs incorrectly.

STM32-L432KC

Unfortunately, we don’t have a nice diagram to show you for this one, but here are the results from dieharder, an industry-standard statistical randomness testing tool:

#=============================================================================#
#            dieharder version 3.31.1 Copyright 2003 Robert G. Brown          #
#=============================================================================#
   rng_name    |           filename             |rands/second|
 file_input_raw|STM32L432randData.4gb.32768trans.fullblocking.bin|  1.72e+07  |
#=============================================================================#
        test_name   |ntup| tsamples |psamples|  p-value |Assessment
#=============================================================================#
rgb_minimum_distance|   0|     10000|    1000|0.00000000| FAILED

FIGURE 9: Sample of dieharder statistical analysis results for the STM32-L432KC SoC

As you can see, this fails the dieharder “RGB Minimum Distance” test. What this test does is use the RNG to plot numbers randomly in a large grid and then calculate the smallest distance between any two points. This test is especially good at identifying subtle correlations between numbers, such as repeats. Failing this test likely indicates that the numbers produced by the RNG are not independent of each other or that they repeat in some way.
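The core idea of a minimum-distance check can be sketched as follows. This is a toy version, not dieharder's actual test: it treats each 32-bit word as one point on a 65536 x 65536 grid (high half as x, low half as y) and brute-forces the smallest squared distance. dieharder goes further and compares the observed minimum distances against the distribution expected from truly random points. The function name is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Treats each word as a point (x = high 16 bits, y = low 16 bits) and
 * returns the smallest squared distance between any two points.
 * Brute force, O(n^2). n must be at least 2. */
static uint64_t min_squared_distance(const uint32_t *words, size_t n)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int64_t dx = (int64_t)(words[i] >> 16) - (int64_t)(words[j] >> 16);
            int64_t dy = (int64_t)(words[i] & 0xFFFFu) - (int64_t)(words[j] & 0xFFFFu);
            uint64_t d2 = (uint64_t)(dx * dx + dy * dy);
            if (d2 < best) {
                best = d2;
            }
        }
    }
    return best;
}
```

A repeated word shows up immediately as a distance of 0, which is why this style of test is sensitive to repeats.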

While many hardware RNGs we tested seemed fine, such as the LPC54628 (with 32*32-bit discarding) and the ESP32, the only reasonable conclusion to make here is that raw entropy taken from the hardware RNG should not be trusted in isolation. The key-stretching and entropy pooling that a CSPRNG subsystem offers are fundamentally required to avoid doing IoT RNG.

Conclusions

One of the hard parts about this vulnerability is that it’s not a simple case of “you zigged where you should have zagged” that can be patched easily. In order to remediate this issue, a substantial and complex feature has to be engineered into the IoT device.

This affects the entire IoT industry. The core vulnerability here doesn’t lie in a single device’s SDK or in any particular SoC implementation.
The IoT needs a CSPRNG subsystem. This issue can’t be fixed by just changing the documentation and blaming users. The most elegant place for such a CSPRNG subsystem is in one of the increasingly popular IoT operating systems. If you’re designing a new device from scratch, we’d recommend implementing a CSPRNG in an operating system.
RNG code should be considered dangerous to write on your own, just like crypto code. It doesn’t matter how clever you are, never write your own code to interface with the RNG hardware. You will almost certainly get it wrong. You should instead use a CSPRNG subsystem made available by a lower abstraction layer.
Never use entropy directly from RNG hardware. Across the board, hardware RNGs are not suitable for (immediate) cryptographic use. Weak entropy can and should be fixed through software, via a CSPRNG.

In the short term, here’s what you can do:

DEVICE OWNERS

Keep an eye out for updates and make sure to apply them when they become available. This is an issue that can be solved with software, but it may take some time. In the meantime, be careful about trusting your IoT gadgets too much. For home devices that require an internet connection, place them in a dedicated network segment that can only reach out externally. This will help contain any breach from spreading to the rest of your network.

IOT DEVICE DEVELOPERS

If possible, select IoT devices that include a CSPRNG API seeded from a variety of entropy sources, including hardware RNGs. If there’s no CSPRNG available and you have no other choice, carefully review both the libraries you’re relying on and your own code to ensure you’re not working with code that reads from uninitialized memory, ignores hardware RNG peripheral registers or error conditions, or fails to block when no more entropy is available. Carefully consider the implications for real-time situations where blocking isn’t a viable option.

DEVICE MANUFACTURERS / IOT OPERATING SYSTEMS

Deprecate and/or disable any direct use of the RNG HAL function in your SDK. Instead, include a CSPRNG API that is seeded using robust and diverse entropy sources with proper hardware RNG handling. The Linux kernel’s implementation of /dev/urandom can serve as an excellent reference.
