Benchmarking and Comparison of Two Open-source RTOSs for Embedded Systems Based on ARM Cortex-M4 MCU

Objective: To evaluate and compare the performance of two open-source real-time operating systems (RTOSs), Keil RTX5 and FreeRTOS, on four timing metrics: task switching time, pre-emption time, semaphore shuffling time, and inter-task messaging latency. All tests were performed on an STM32F429 discovery board based on an ARM Cortex-M4 MCU. Methods: The timing metrics were measured with the ARM cycle counter register implemented in the DWT unit. Findings: The DWT cycle counter captures the number of cycles spent executing a section of code. The measurements show that FreeRTOS has the lowest task switching time, pre-emption time, and semaphore shuffling time, whereas Keil RTX5 has the faster message passing. Novelty: The study provides an evaluation and comparison of the latest versions of the most used open-source RTOSs, Keil RTX5 and FreeRTOS v10.2.0. Furthermore, the timing metrics were measured with the ARM cycle counter register, without any additional hardware or GPIO pin that may disturb the measurement. The comparison is based on four critical timing metrics that strongly affect the performance of any RTOS and define its timing capability. Finally, the tests were run on a low-power ARM MCU.
Keywords: real-time operating system; embedded system; FreeRTOS; Keil RTX5; benchmarking metric; ARM MCU


Introduction
Since the first modern embedded system, the Apollo Guidance Computer used for the guidance system of the Apollo mission in 1967, embedded systems have attracted strong interest from many industrial domains such as automotive, aerospace, telecommunication devices, and home appliances that serve our daily life, work, and gaming demands. This demand imposes adaptability to dynamic operating conditions, fast time to market, conformity to industrial standards, reliability, and safety. Moreover, embedded systems account for 99 % of all processors manufactured in the world (1). The growth of sophisticated, low-cost, and high-performance processors designed for embedded systems (2) offers developers a good infrastructure for almost all of their applications. However, even in simple projects or projects with ever-shorter schedules, it is becoming impossible to control all these devices without advanced software (SW). Conventionally, embedded programs are written using super-loop programming (also called bare-metal or background programming), in which all tasks run sequentially inside an infinite loop. The microcontroller leaves this loop only when an interrupt occurs or when power is removed. This kind of program also has start-up code that initializes the hardware, defines the interrupt service routines and exception handlers, initializes the stack pointer and allocates memory for the stack, and so on.
Unfortunately, the rising complexity of embedded SW consumes more resources such as power (3), (4), memory (5), size, and number of processors. Consequently, a Real-Time Operating System (RTOS) (6) is a suitable solution for implementing such complex embedded SW and fitting it to the target architecture. As illustrated in Figure 1, the architecture of an embedded system contains a software layer and a hardware layer, each with its own components, and the RTOS sits in the system software layer. The name RTOS combines two terms: real-time and operating system. A real-time system is one that processes its inputs and provides its outputs in a deterministic time, so it must satisfy the physical interaction with the real world and the timing requirements of these interactions. Based on these timing requirements, two types of real-time system can be identified: a hard real-time system guarantees that critical tasks complete their execution on time, whereas in a soft real-time system a critical real-time task gets priority over other tasks and retains this priority until it completes. For both types, the delay from the retrieval of the stored data to the completion of the request made to the operating system must be bounded.
The second term is the Operating System (OS), which is simply defined as a software program that interfaces between the user and the hardware components of the computer. Windows, Linux, and Unix are examples of general-purpose operating systems (GPOSs); such systems cannot run on small microcontrollers because of their resource requirements and their use of virtual memory.
A real-time operating system (RTOS) is a piece of software with a set of APIs that lets users develop applications in which the actions are divided into separate threads or tasks. Like any OS, it is responsible for hardware resource management, application scheduling, memory management, etc. However, when the OS must handle numerous events and ensure that the response to these events occurs within a finite and strict time (its timing requirements), it is called an RTOS. In addition to the wide variety of services they offer, such as scheduling, I/O operations, and communication between processes, these systems efficiently manage the power consumption of the whole embedded system (7). Many RTOSs exist today, divided into commercial, open-source, and proprietary RTOSs, and available for high-performance microcontrollers (8) and small microcontrollers (9), (10). Examples include FreeRTOS, RTX, OSEK, VxWorks, µC/OS-II, µC/OS-III, LynxOS, MbedOS, RT-Thread, and Nucleus. These RTOSs are implemented on small microcontrollers and widely used in several domains such as automotive (11)(12)(13), avionics (14), mobile, the internet of things (IoT) (15,16), and robotics (17).
To develop a multitasking application, RTOSs offer several services and features, such as task priorities, which assign each task an individual priority level so that the RTOS can identify which task must enter the running state. The selection of the highest priority task is performed by the task management service, which is built on two fundamental types of scheduling: pre-emptive and non-preemptive. In pre-emptive scheduling, the running task can be pre-empted by a higher priority task that enters the ready state. Stack and interrupt management are very important in this case to ensure correct context switching, since pre-emption can occur at any time; when the pre-empted task resumes its execution, its stack values are restored. In contrast, in non-preemptive (or cooperative) scheduling all tasks cooperate to perform the context switch: the running task voluntarily gives up the CPU. Although the non-preemptive policy is simple, it is less common in RTOS scheduling because it cannot always guarantee deterministic timing. Pre-emptive scheduling is therefore the most used in RTOSs, since it gives a short response time for critical actions. Other services offered by an RTOS include inter-task communication and synchronization mechanisms such as mutexes, semaphores, message queues, mailboxes, and signal events.
Before selecting a specific RTOS for an application, several criteria must be taken into account, such as predictability, the Worst-case Response Time (WCRT) (18), the response time of a task to an external or internal event, inter-task communication, the response to an interrupt service routine, and the synchronization mechanisms. In this paper, we aim to measure, compare, and analyze the timing of some of those criteria for two open-source RTOSs: FreeRTOS and Keil RTX5. All the benchmarking tests were run on a small ARM Cortex®-M4 based MCU.
The paper contributes new timing comparison results for two of the most used open-source RTOSs (19) on small microcontrollers. Moreover, the test codes were run on an actual MCU (STM32F429 discovery board based on an ARM Cortex®-M4 MCU). The tests cover four micro-benchmarking parameters: task switching time, pre-emption time, semaphore shuffling time, and inter-task messaging latency. Compared to other works, where an oscilloscope is used to perform the time measurements on a GPIO (General-purpose Input/Output) pin selected from the board, the time measurement in this paper relies on the DWT cycle counter of the ARM core. This paper is organized as follows: Section 2 gives a general insight into real-time operating systems, and Section 3 presents related work and the purpose of our work. Section 4 then defines the performance metrics to be measured and the experimental setup. Experimental results and a discussion of the benchmark results are given in Section 5. The paper finishes with a conclusion in Section 6.
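The pre-emptive selection rule described above can be illustrated with a minimal Python sketch. It is only a host-side model, not code from FreeRTOS or RTX5: the `Task` class, the `pick_next` and `should_preempt` helpers, and the convention that a larger number means a higher priority are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int        # assumption of this sketch: larger number = higher priority
    ready: bool = True   # True when the task is in the ready state


def pick_next(tasks):
    """Return the ready task with the highest priority, or None if none is ready."""
    ready = [t for t in tasks if t.ready]
    return max(ready, key=lambda t: t.priority) if ready else None


def should_preempt(running, tasks):
    """A pre-emptive scheduler switches as soon as a higher priority task is ready."""
    candidate = pick_next(tasks)
    return candidate is not None and candidate.priority > running.priority
```

In a cooperative policy, by contrast, `should_preempt` would never be consulted; the switch would happen only when the running task yields.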

Real-Time Operating System
Every RTOS has a kernel, which is the supervisor core software that provides protection, scheduling, resource management, and other services; it also combines several modules, as shown in Figure 2, including network protocols, file systems, and other devices. RTOS execution refers to multiple tasks sharing a processor (i.e., multitasking), performed through a set of APIs.
Before giving details about the RTOSs selected for the benchmarking test and comparison, let us look at some statistics published in 2019 by the Embedded Market study (19) about the use of RTOSs in current and ongoing projects. The availability of source code and the cost of the RTOS software are among the biggest factors in deciding whether to use it in an embedded application: the study shows that 42% of programmers use an open-source RTOS without any commercial support. The reason for this choice, according to the study, is that current open-source solutions work fine for them and the commercial alternatives are too expensive. For those who choose a commercial OS, real-time capability is the most influential feature in their choice.
For the RTOSs selected in this paper, 19% of programmers use FreeRTOS and 4% use Keil RTX. These numbers will increase for the ongoing projects to 27% and 6%, respectively.

FreeRTOS
FreeRTOS (20) is a fully open-source RTOS distributed under the free MIT license. It has been available for over 15 years and is now the market leader among RTOSs (19) for microcontrollers and small microprocessors.
FreeRTOS supports more than 40 MCU architectures and more than 15 toolchains, including the newest RISC-V (21) and ARMv8-M (Martin, 2016) (22) . In addition to its simple and easy use, it provides an official demo project for each new MCU architecture, which can be downloaded from the FreeRTOS website.

Keil RTX5
Designed for ARM and Cortex architectures, Keil RTX version 5 (RTX5) (23), based on the CMSIS-RTOS v2 APIs, allows embedded system developers to create powerful, well-structured applications with multiple functions that are easy to maintain. It comes with a royalty-free license. The CMSIS2 package for Keil MDK includes the RTX5 kernel together with the necessary files and libraries.
This advanced RTOS provides various benefits, such as memory management, Interrupt Service Routine (ISR) system management, and many kinds of inter-task communication. Additionally, the RTX5 kernel scheduler supports cooperative, round-robin, and pre-emptive multitasking policies.

Related work
Besides the statistical study carried out by "EETimes and Embedded.com" (19) in 2019, which reflects how the audience uses real-time operating systems in their embedded projects, there are several research works and studies on RTOSs, including their portability to small microcontrollers, timing evaluations, and comparisons based on selected metrics. Using the verification system Hip/Sleek, the authors of (24) present the process used to automatically verify the memory safety and functional correctness of the task scheduler component of the FreeRTOS kernel. In (25), Xing et al. describe the process of building an embedded system based on FreeRTOS on a Cortex-M3 MCU; the paper also demonstrates through several tests that the porting method is feasible and reasonable. Likewise, the authors of (26) propose a real-time robot operating system based on FreeRTOS combined with the publisher/subscriber communication mechanism of ROS (Robot Operating System). A qualitative and quantitative comparison between FreeRTOS V8.0.0 and µC/OS-III is made in (27). The results illustrated in Table 2 show that µC/OS-III becomes somewhat unpredictable due to the latency of the tick interrupt, while FreeRTOS is stable in some metrics. In brief, µC/OS-III slightly outperforms FreeRTOS in most of the measured metrics. The results of the study in (28) show that CMSIS-RTOS carries no performance penalty when the kernel is designed to follow the standard APIs. The authors therefore propose an adaptation layer for the X real-time kernel, which can impose a significant performance penalty if the types of services of the kernel are different.
In (29), the authors compare the context switch time between a low priority task and a high priority task for four RTOSs (FreeRTOS, Keil RTX, µC/OS-II, RT-Thread). The results of the experimental tests performed on ARM Cortex®-M0+ and ARM Cortex®-M4 based MCUs show that FreeRTOS has the largest context switching time, whereas Keil RTX has the best performance. The same authors add FreeRTOS version 10.2.0 and the µC/OS-III RTOS to the experiment in (30) and extend it to measure and compare the performance achieved by those six RTOSs for the send/receive latencies of an event, mailbox, and semaphore, as shown in Table 3. Additionally, they measure the time for task switching from a high priority task to a low priority task. The results of the experiment show that µC/OS-III, Keil RTX, and RT-Thread have the best performance, in contrast to FreeRTOS, which performs poorly even though it is the most used open-source RTOS in embedded applications.
Table legend. Tot: time of task context switch from a lower priority task to a higher priority task. Tot1: time of task context switch from a higher priority task to a lower priority task. Tet: execution time of the primitive that signals. Tet1: execution time of the primitive that sends.
In this study, we focus on just two RTOSs: FreeRTOS 10.2.0 and Keil RTX5. Both are royalty-free open-source RTOSs. RTX5 is integrated into the Keil MDK-ARM software tools, and FreeRTOS is the most used RTOS in embedded applications based on small microcontrollers. These RTOSs were selected because of their leading use in embedded projects based on small microcontrollers such as ARM Cortex®-Mx MCUs. The efficiency of these MCUs is demonstrated by the applications built on them, such as a data acquisition and control system with an embedded web server for industrial monitoring (31) using Keil RTX and TCP/IP Ethernet running on an ARM Cortex®-M3 MCU. Likewise, in (32)

Performance Metrics and Experimental setup
For a specific embedded application with particular system requirements, some parameters are critical and others are insignificant. Benchmarking addresses this issue by offering designers a method to evaluate and compare multitasking RTOSs and helping them choose the appropriate RTOS for their applications. In this section, we describe the experimental setup and explain all the tests carried out for the benchmarking evaluation, which is based on timing measurements. We focus on four micro-benchmarking parameters: task switching time, pre-emption time, semaphore shuffling time, and inter-task messaging latency.

Performance metrics
To measure the timing performance of the RTOS selected, four metrics have been defined for the benchmarking test.

Task Switching Time
The task switching time is the average time the kernel takes to switch from one task to another of the same priority. The tasks must not be suspended or sleeping. This metric is essential for assessing how efficiently the kernel manages its data structures when saving and restoring contexts. Figure 3 demonstrates this performance parameter. The two tasks created for this metric have the same priority and hand the CPU to each other in every iteration. The measured task switching time is also used to determine the net time of the other metrics that involve a task switch.

Preemption Time
The preemption time is the average time a higher priority task (HPT) needs to take the CPU from a lower priority task (LPT) that is currently running. This preemption usually occurs when the HPT responds to an external event and switches from the suspended or blocked state to the ready state. In other words, it is the average time the kernel takes to switch control from a running LPT to an HPT activated by an external event. To measure this time, two tasks were created with different priorities. First, the HPT runs and suspends itself; then the LPT switches to the running state and resumes the HPT. The process is repeated for the defined number of iterations. The pre-emption time is illustrated in Figure 4.

Semaphore Shuffling Time
The semaphore shuffling time is the time elapsed between a task's release of a semaphore and the activation of another, blocked task waiting for this semaphore. It measures the overhead related to mutual exclusion and gives the time the kernel spends handing a non-sharable resource from one task to another. The two tasks used for this metric have the same priority and pass the semaphore between them for a specified number of iterations. From the final measured time, we subtract the time needed for task switching in order to obtain only the net semaphore shuffling time. Figure 5 demonstrates this performance metric.

Inter-task Messaging Latency
The inter-task messaging latency is the time elapsed between the sending of a nonzero-length message by one task and its reception by another. To measure this latency properly, the receiving task should be suspended while waiting for the message, and the sending task should stop executing after sending it. When multiple messages are sent to the same receiving task, the multitasking kernel provides alternatives such as queues and pipes, so that the receiving task can read an old message before the sending task overwrites it with a new one.
Figure 6 demonstrates this parameter. The LPT stays blocked until it receives a message through the queue from the HPT and gets the CPU; it becomes blocked again when it tries to receive another message. This process is also repeated for a specified number of iterations.
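The measurement pattern described above (a receiver blocked on a queue, a sender that waits after each message, and a fixed number of iterations) can be mimicked on a desktop with Python threads. This is only a host-side analogy under stated assumptions: it does not use any FreeRTOS or RTX5 API, the function name `measure_round_trip` is hypothetical, and the numbers it produces reflect the Python interpreter and host OS, not an RTOS kernel.

```python
import queue
import threading
import time


def measure_round_trip(iterations=1000):
    """Average time (ns) per send/receive hand-off, net of an empty loop."""
    q = queue.Queue(maxsize=1)
    received = []

    def receiver():
        for _ in range(iterations):
            received.append(q.get())  # blocks until a message arrives
            q.task_done()             # lets the sender continue

    worker = threading.Thread(target=receiver)
    worker.start()

    start = time.perf_counter_ns()
    for i in range(iterations):
        q.put(i)   # send one message
        q.join()   # sender waits until the receiver has consumed it
    total = time.perf_counter_ns() - start
    worker.join()

    # Baseline: the same loop with no work, subtracted from the total.
    start = time.perf_counter_ns()
    for i in range(iterations):
        pass
    baseline = time.perf_counter_ns() - start

    return (total - baseline) / iterations, received
```

Calling `measure_round_trip(1000)` returns the average hand-off time in nanoseconds together with the list of received messages, which should equal the sequence that was sent.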

Timing measurement process
One of many ways to measure the execution time of a section of code is to use an oscilloscope and GPIO pins. However, some setup is needed before the measurement, such as finding free GPIO pins and configuring them as outputs, which is not always possible in applications where the number of pins is limited. Additionally, the measurement may be affected by errors. To overcome these limitations, the DWT cycle counter was selected for time measurement: the number of CPU cycles can be captured easily and converted to time.
For all the benchmarking metrics measured in this paper, we use the same process to develop the application code. We create two tasks whose priorities depend on the parameter to be measured. To ensure an accurate measurement, the tasks switch back and forth for 50000 iterations. First, we determine the time needed to perform the for loops alone, as illustrated in Figure 7 (with no work and without task switching). Then we subtract this time from the total measured time and divide the result by the number of iterations in order to obtain only the time of the benchmark metric. Figure 8 shows a block diagram of the process used to obtain the execution time of a portion of code. We chose a software measurement that relies on facilities built into the hardware platform. Besides the RTOS tick timer used by the kernel, we use the DWT cycle counter unit for the timing measurements (chapter 9 of (33)). The value of the other timers increases when an interrupt is issued, so if the program enters a portion of code with interrupts disabled, the timing measurement is delayed; choosing the DWT cycle counter solves this problem. Using this counter and running the program in debug mode, we capture the number of cycles at the beginning (T1) and at the end (T2) of the code to be measured. The values of T1 and T2 are displayed in the MDK-ARM debug window. The total number of cycles equals T2 − T1; dividing this number by the clock frequency (80 MHz in our case) gives the time in µs. Let $\Delta_1 = T_4 - T_3$ be the number of cycles needed to perform the empty for loops with 50000 iterations, and $\Delta_2 = T_2 - T_1$ the number of cycles needed to perform 50000 iterations of task switching, preemption, semaphore passing, or inter-task messaging. We then determine $\Delta_2$ for each metric.
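The number of cycles corresponding to one iteration is:
$$N_{\text{iter}} = \frac{\Delta_2 - \Delta_1}{50000}$$
Hence, the time of each benchmark metric is obtained in microseconds (µs) as below, where $f_{\text{clk}}$ is the clock frequency in MHz (80 MHz in our case):
$$t = \frac{N_{\text{iter}}}{f_{\text{clk}}}$$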
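The calculation above can be sketched in a few lines of Python for post-processing the counter values read from the debugger. The function names and the sample counter values in the usage note are hypothetical. The modulo step is an addition of this sketch, not something the paper describes: it relies on the DWT cycle counter being a 32-bit register, so a single wrap-around between two readings is handled.

```python
CYCCNT_MODULUS = 1 << 32  # the DWT cycle counter is a 32-bit free-running register


def elapsed_cycles(start, end):
    """Cycles between two counter readings, tolerating one 32-bit wrap-around."""
    return (end - start) % CYCCNT_MODULUS


def time_per_iteration_us(t1, t2, t3, t4, iterations=50_000, clock_mhz=80.0):
    """Net time of one benchmark iteration in microseconds.

    t1, t2: counter before/after the measured loop (gives delta 2)
    t3, t4: counter before/after the empty loop (gives delta 1)
    """
    delta2 = elapsed_cycles(t1, t2)
    delta1 = elapsed_cycles(t3, t4)
    cycles_per_iteration = (delta2 - delta1) / iterations
    return cycles_per_iteration / clock_mhz  # cycles / MHz = microseconds
```

As an illustration with made-up readings, `time_per_iteration_us(0, 20_000_000, 0, 4_000_000)` corresponds to 320 net cycles per iteration, that is 4 µs at 80 MHz.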

Software/Hardware Configuration
The configuration of our project includes the kernel and board configuration, such as memory initialization, stack size initialization, and input and output configuration. The target hardware used for the benchmarking tests is an STM32F429ZIT6 Discovery board (34) based on a 32-bit Cortex-M4 core (35) in an LQFP144 package with 2 Mbytes of Flash memory, 256 Kbytes of RAM, a 180 MHz maximum system clock frequency, and an on-board ST-Link/V2-B. Code for each benchmark metric was developed for each selected RTOS (FreeRTOS and RTX5) in Keil uVision5 v5.29, under the same conditions and on the same hardware platform. To read the cycle values counted by the DWT unit, the code is run in debug mode. The system clock frequency is configured at 80 MHz (PLL activated with the 8 MHz high-speed external clock).
In the software, we configured static priority-based scheduling as the multitasking mechanism for both FreeRTOS and Keil RTX5, with a tick rate of 1000 Hz to generate a tick interrupt every 1 ms, which gives a good balance between task switching overhead and task speed. For FreeRTOS, we used timer 6 to generate the RTOS kernel tick; for Keil RTX5, the unified System Tick Timer (SysTick) was used. To run all the tests under the same conditions, we used a minimum stack size of 512 bytes for each task.

Experimental results and discussion
In this section, the results of the benchmarking tests described in the previous sections are presented and discussed.
The comparison of FreeRTOS and RTX5 is based on four parameters: task switching time, pre-emption time, inter-task messaging latency, and semaphore shuffling time. For each parameter, we measure the total number of cycles over a number of iterations (e.g., 50000) and calculate the time taken to perform one iteration. The cycle counter gives the number of cycles needed to execute a portion of code and displays it in the debug window of uVision5 MDK-ARM, as shown in Figure 9.
The time measurements for each RTOS, in microseconds, are illustrated in Figures 10, 11, 12 and 13. Table 4 shows the number of cycles of each benchmark parameter for both FreeRTOS and Keil RTX5. Analyzing the results, we see that RTX5 has the slower task switching (Figure 10). For the preemption time (Figure 11), both RTOSs have approximately the same value. FreeRTOS therefore switches between tasks efficiently in both cases, whether the tasks have the same priority or different priorities. FreeRTOS also performs well in managing shared resources with semaphores, with a semaphore shuffling time (Figure 13) of 3.6093 µs, in contrast to the much larger value of 16.4899 µs for RTX5. For message passing (Figure 12), RTX5 is faster, with an inter-task messaging latency of 4.5129 µs, whereas FreeRTOS takes 19.2067 µs.
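To put the reported figures in perspective, the short Python sketch below converts the two pairs of times quoted above back to cycles at the 80 MHz clock and computes how many times faster the quicker RTOS is for each metric. The dictionary layout and function names are assumptions of this sketch; only the four time values and the clock frequency come from the text.

```python
CLOCK_MHZ = 80.0  # system clock used in the experiments

# Times in microseconds, as quoted in the text.
reported_us = {
    "semaphore shuffling": {"FreeRTOS": 3.6093, "RTX5": 16.4899},
    "inter-task messaging": {"FreeRTOS": 19.2067, "RTX5": 4.5129},
}


def to_cycles(time_us, clock_mhz=CLOCK_MHZ):
    """Convert a time in microseconds to CPU cycles."""
    return time_us * clock_mhz


def faster_rtos(metric):
    """Return (name of the faster RTOS, slow/fast time ratio) for a metric."""
    times = reported_us[metric]
    fast = min(times, key=times.get)
    slow = max(times, key=times.get)
    return fast, times[slow] / times[fast]


for metric in reported_us:
    name, ratio = faster_rtos(metric)
    print(f"{metric}: {name} is about {ratio:.2f}x faster")
```

Under these figures, each RTOS leads by a factor of a little over four on the metric where it is faster.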

Conclusion
This study presents a short overview of RTOSs, their features, and their fundamentals. It also describes the RTOSs selected here, Keil RTX5 and FreeRTOS, and gives a general insight into their main characteristics and differences. Before implementing the benchmarking tests, we described the experimental setup and defined the performance metrics to be measured. The tests were performed on an STM32F429ZI discovery board based on an ARM Cortex®-M4 MCU. The task switching time, pre-emption time, semaphore shuffling time, and inter-task messaging latency were measured using the DWT cycle counter unit of the ARM Cortex®-M4. RTX5 is a highly optimized open-source RTOS, so we expected it to outperform FreeRTOS; however, the results show that FreeRTOS is fast in preemption and task switching and reacts rapidly to unblock a task waiting for a semaphore. RTX5, for its part, has a good communication and message passing time. This study contributes a new evaluation and new benchmarking results for the most used open-source RTOSs (FreeRTOS and Keil RTX5) based on four metrics that reflect the main timing performance of a real-time operating system. The time measurement is done with the DWT cycle counter, which gives an accurate time value, unlike an oscilloscope-based measurement.
This study may serve as a timing indicator to help programmers choose a suitable RTOS, but other criteria must also be considered, such as memory footprint, deadlock break time, interrupt latency, and licensing. These are the goal of our future work, in which the MCU used in this study and other MCUs will be evaluated to see how performance changes with the embedded architecture.