| en:multiasm:paarm:chapter_5_15 [2025/12/05 00:38] – [Barriers(instruction synchronization / data memory / data synchronization / one way BARRIER)] eriks.klavins | en:multiasm:paarm:chapter_5_15 [2025/12/08 12:33] (current) – eriks.klavins |
|---|
| ====== Energy Efficient Coding ====== | ====== Energy Efficient Coding ====== |
| |
| Assembler code is assembled into a single object code. Compilers, instead, take high-level language code and convert it to machine code, and during compilation the code may be optimised in several ways. For example, there are many ways to implement statements, FOR loops or Do-While loops in the assembler. There are some good hints for optimising the assembler code as well, but these are just hints for the programmer. | Some special instructions put the processor into a sleep mode in which it waits for an event to occur. The processor can be woken up by an interrupt or by an event. To use these modes, the code must explicitly initialise the interrupts and events and provide handlers for them. After that, the processor may be put into sleep mode, where it remains until an event or interrupt occurs. The following code example can be used only in bare-metal mode – without an OS. |
| - Take into account the instruction execution time (in cycles). Some instructions take more than one CPU cycle to execute, and there may be other instructions that achieve the desired result faster. | <codeblock code_label> |
| - Try to keep temporary data in registers as much as possible instead of storing it in memory. | <caption>IDLE loop</caption> |
| - Eliminate unnecessary compare instructions by using the appropriate conditional jump instruction based on the flags that are already set by a previous arithmetic instruction. Remember that arithmetic instructions update the status flags only if the suffix ''<fc #008000>S</fc>'' is used in the instruction mnemonic. | <code> |
| - It is essential to align both your code and data to get a good speedup. On ARMv8 (AArch64), the stack pointer must be aligned to a 16-byte boundary whenever it is used to access memory, and data is best aligned to its natural size. Misaligned accesses can be slower and, when alignment checking is enabled, cause the CPU to raise an alignment exception. | .global idle_loop |
| - There are many more hints that can help speed up the computation. In small code examples, the speedup will not be noticeable, because processors can execute millions of instructions per second. | idle_loop: |
| | 1: WFI @ Wait For Interrupt, core goes to low-power |
| | B 1b @ After the interrupt, go back and sleep again |
| | </code> |
| | </codeblock> |
| | <note>Note that interrupt handling and initialisation must also be implemented in the code; otherwise, the CPU may encounter an error that may force a reboot. </note> |
| | The example only waits for interrupts to occur. To wait for events and interrupts, the ''<fc #800000>WFI</fc>'' instruction must be replaced with the ''<fc #800000>WFE</fc>'' instruction. Another CPU core may execute an ''<fc #800000>SEV</fc>'' instruction that signals an event to all cores. |
| |
| Processors today can execute many instructions in parallel using pipelining and multiple functional units. These techniques allow the reordering of instructions internally to avoid pipeline stalls (Out-of-Order execution), branch prediction to guess the branching path, and others. Without speculation, each branch in the code would stall the pipeline until the outcome is known. These situations are among the factors that reduce the processor's computational power. | On a Raspberry Pi 5 running Linux, it is not observable whether the CPU enters these modes, because the OS generates many events between CPU cores and also handles many interrupts from communication interfaces and other Raspberry Pi components. |
| | Another way to save more energy while running the OS on the Raspberry Pi is to reduce the CPU clock frequency. A scheme called dynamic voltage and frequency scaling (DVFS), the same technique used in laptops, reduces power consumption and thereby increases battery life. A paper named “Cooling a Raspberry Pi Device” includes a chapter explaining how to reduce the CPU clock frequency. The Linux OS exposes CPU frequency scaling through sysfs, e.g.: |
| | * “/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor” |
| | * “/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq” |
| |
| ===== Speculative instruction execution ===== | It is possible to use syscalls in assembler to open these files and write specific values into them. |
| | <codeblock code_label> |
| | <caption>Power saving</caption> |
| | <code> |
| | .global _start |
| | .section .text |
| | _start: |
| | @ openat(AT_FDCWD, path, O_WRONLY, 0) |
| | mov x0, #-100 @ AT_FDCWD |
| | ldr x1, =gov_path @ const char *pathname |
| | mov x2, #1 @ O_WRONLY |
| | mov x3, #0 @ mode (unused) |
| | mov x8, #56 @ sys_openat |
| | svc #0 |
| | mov x19, x0 @ save fd |
| |
| Let's start with an explanation of how speculation works. The pipeline breaks down the whole instruction into small microoperations. The first microoperation (first step) is to fetch the instruction from memory. The second step is to decode the instruction; this is the primary step, during which the hardware is prepared for instruction execution. And of course, the next step is instruction execution, and the last one is to resolve and commit the result. The result of the instruction is temporarily stored in the pipeline buffer and waits to be stored either in the processor’s registers or in memory. \\ | @ write(fd, "powersave\n", 10) |
| ''<fc #800000>CMP </fc> <fc #008000>X0</fc>, <fc #ffa500>#0</fc>''\\ | mov x0, x19 |
| ''<fc #800000>B</fc>.<fc #9400d3>EQ </fc> JumpIfZeroLabel''\\ | ldr x1, =gov_value |
| ''<fc #800000>ADD </fc> <fc #008000>X1</fc>, <fc #008000>X1</fc>, <fc #ffa500>#1</fc> <fc #6495ed>@ This executes speculatively while B.EQ remains unresolved</fc>''\\ | mov x2, #10 @ length of "powersave\n" |
| The possible outcomes are shown in the picture below. | mov x8, #64 @ sys_write |
| {{:en:multiasm:paarm:spec_cmp.jpg|}} | svc #0 |
| |
| In the example above, the comparison is made on the ''<fc #008000>X0</fc>'' register. The next instruction branches to the label. While the ''<fc #800000>B</fc>.<fc #9400d3>EQ </fc>'' instruction is still being executed in the pipeline, the next instruction is already prepared for execution, or has even been executed speculatively. When the branch outcome becomes known, the pipeline either commits the results (if the prediction was correct) or flushes the pipeline and re-fetches from the correct address (if the prediction was wrong). Thus, if the branch is predicted as not taken and the register is not equal to zero, speculation gains performance; otherwise, the result of the third instruction is discarded, every microoperation for this instruction is cancelled, and the processor branches to the new location. | @ close(fd) |
| | mov x0, x19 |
| | mov x8, #57 @ sys_close |
| | svc #0 |
| |
| From the architectural point of view, speculatively executed instructions are invisible, as if they never ran. But from a microarchitectural perspective (cache contents, predictors, buffers), speculation leaves traces. Regarding registers, speculative updates remain in internal buffers until commit, and no architectural changes happen until then. Regarding memory, speculative stores are not visible to other cores or devices and remain buffered until committed. Speculative loads, however, can occur, and they may bring data into cache memory even if the result is later discarded. | @ exit(0) |
| | mov x0, #0 |
| | mov x8, #93 @ sys_exit |
| | svc #0 |
| |
| In this example, the ARM processor will perform speculative memory access:\\ | .section .rodata |
| ''<fc #800000>LDR</fc> <fc #008000>X0</fc>, [<fc #008000>X1</fc>]''\\ | gov_path: |
| ''<fc #800000>STR</fc> <fc #008000>X2</fc>, [<fc #008000>X3</fc>]''\\ | .asciz "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor" |
| If the registers ''<fc #008000>X1</fc>'' and ''<fc #008000>X3</fc>'' hold the same address, the store may affect the load result. Instead of stalling until it knows for sure, the processor issues the load speculatively. The processor performs loads early to hide memory latency and keep execution units busy. The value of the X1 register may not yet be known for sure, since previous instructions may still be computing it. If aliasing is detected later, the load result is discarded and the load is reissued. This technique is called speculative load execution. It is one of the primary sources of side-channel leaks (Spectre-class vulnerabilities), because speculative loads can bring data into the cache from locations the program should not have access to. | gov_value: |
| | .asciz "powersave\n" |
| ===== Barriers (instruction synchronization / data memory / data synchronization / one way BARRIER) ===== | </code> |
| | </codeblock> |
| | Similar things can be done with CPU frequencies, or even by turning off a separate core. This is just one example template that can be used to put the processor into a specific power mode by changing the path stored in the //gov_path// variable and the value stored in //gov_value//. The main idea is to use the OS's system call functions. The OS will do the rest. |
| |
| Many processors today can execute instructions out of the programmer-defined order. This is done to improve performance, as instructions can be fetched, decoded, and executed in a single cycle (instruction stage). Meanwhile, memory accesses can also be delayed or reordered to maximise throughput on the data buses. Mostly, this is invisible to the programmer. | |
| |
| The barrier instructions enforce the order between operations. It does not matter whether the processor has a single core or multiple cores; these barrier instructions ensure that data is stored before the next operation that uses it, that the result of the previous instruction is stored before the next instruction is executed, and that the second core (if available) accesses the newest data. ARM has implemented special barrier instructions to do that: ''<fc #800000>ISB</fc>'', ''<fc #800000>DMB</fc>'', and ''<fc #800000>DSB</fc>''. The instruction synchronisation barrier (''<fc #800000>ISB</fc>'') ensures that subsequent instructions are fetched and executed only after the results of the previous instructions are stored (the operations of the previous instructions have finished). A data memory barrier (''<fc #800000>DMB</fc>'') ensures that the order in which data is read from or written to memory is fixed. And the data synchronisation barrier (''<fc #800000>DSB</fc>'') ensures that both the previous instructions and their data accesses are complete and that the order of data accesses is fixed. | |
| |
| Since instructions are prefetched and decoded ahead of time, the instructions that were fetched and executed early might not yet reflect the newest state. The ''<fc #800000>ISB</fc>'' instruction forces the processor to stop fetching the next instruction until the operations of the previous instructions are finished. This type of instruction is required to ensure that changes to a processor's control registers, memory access permissions, exception levels, and other settings take effect properly. These instructions are not necessary to ensure that an equation is computed correctly, unless the equation is complex and requires storing temporary variables in memory.\\ | |
| ''<fc #800000>MRS </fc> <fc #008000>X0</fc>, SCTLR_EL1 <fc #6495ed>@ Read system control register</fc>''\\ | |
| ''<fc #800000>ORR </fc> <fc #008000>X0</fc>, <fc #008000>X0</fc>, <fc #ffa500>#</fc>(<fc #ffa500>1</fc> << <fc #ffa500>0</fc>) <fc #6495ed>@ Set bit 0 (enable MMU)</fc>''\\ | |
| ''<fc #800000>MSR </fc> SCTLR_EL1, <fc #008000>X0 </fc> <fc #6495ed>@ Write back</fc>''\\ | |
| ''<fc #800000>ISB </fc> <fc #6495ed>@ Ensure new control state takes effect</fc>'' | |
| |
| |
| |
| ===== Conditional instructions ===== | |
| |
| ===== Power saving ===== | |