Wednesday, September 15, 2021

Halide Practical Notes 3

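When using Halide, it is easy to get the mistaken impression that Halide::Runtime::Buffer can only be used when linking against libHalide.so or libHalide.a. In fact, Halide::Runtime::Buffer can be used on its own and needs only header files. For a typical program, it is enough to have: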

#include <HalideBuffer.h>

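and to make sure the header is on the include path at compile time. Standalone use requires no libHalide at all. Compared with raw pointers or STL components, this approach has several advantages: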

  1. It is reference counted
  2. It provides a function-like access interface
  3. It can flexibly allocate planar or interleaved data
  4. It can wrap or hand out a raw pointer
  5. It works with halide_image_io.h to load/save jpeg or png images

It is reference counted

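Most modern C++ components have this property. Besides making usage concise and flexible, it reduces the amount of allocation and release code that a program would otherwise need to manage buffers/objects by hand. It effectively avoids memory leaks and double frees, and it also lowers the chance of pointer-handling errors. This property also makes assigning and passing Buffers around more convenient.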
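
As a rough illustration of the reference-counting idea, the sketch below uses std::shared_ptr to build a tiny buffer whose copies share one allocation. SharedPixels and owners() are hypothetical names made up for this example; this is not how Halide::Runtime::Buffer is implemented, only a model of the behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a reference-counted buffer (not Halide code).
// Copying a SharedPixels copies the handle, not the pixels.
class SharedPixels {
public:
    explicit SharedPixels(std::size_t n)
        : storage_(std::make_shared<std::vector<uint8_t>>(n, 0)) {}

    // Element access with a debug-time bounds check.
    uint8_t &operator[](std::size_t i) {
        assert(i < storage_->size());
        return (*storage_)[i];
    }

    // Number of handles currently sharing the allocation.
    long owners() const { return storage_.use_count(); }

private:
    std::shared_ptr<std::vector<uint8_t>> storage_;
};
```

When the last handle goes out of scope the storage is released automatically, so no explicit free call appears anywhere in the calling code.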

It provides a function-like access interface

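Halide::Runtime::Buffer is not merely a Buffer object that exists for Halide to use; every allocated Buffer object can be read and written directly. Unlike STL containers, Halide::Runtime::Buffer uses a function-call-like form to access data. Suppose we declare a Buffer as follows.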

Buffer<uint8_t> rgb_planes(1920, 1080, 3);

Conceptually, this allocates a Buffer with width 1920, height 1080, and 3 channels, that is, sizeof(uint8_t) x 1920x1080x3 bytes of memory, where each channel holds 1920x1080 elements. To access a specific location, use the following:

// write to a pixel of a channel

rgb_planes(100, 100, 0) = 128;

// read a pixel of a channel

uint8_t pix_val = rgb_planes(200, 200, 1);

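This reduces buffer-offset calculation mistakes and makes the code more concise. Because access takes a function-like form, the program logic also feels closer to a functional style. It further makes it easier to verify an implementation in which a table lookup is replaced by computation.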
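
To make the saved offset arithmetic concrete, here is a minimal sketch of what a planar (x, y, c) access amounts to, assuming a dense layout in which x is the fastest-varying index and each channel is a separate plane. PlanarImage is a hypothetical helper written for this example and is not part of Halide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Halide: a planar image stored as
// one plane per channel, each plane holding `height` rows of `width` pixels.
struct PlanarImage {
    int width, height, channels;
    std::vector<uint8_t> data;

    PlanarImage(int w, int h, int c)
        : width(w), height(h), channels(c),
          data(static_cast<std::size_t>(w) * h * c, 0) {}

    // Linear offset of pixel (x, y) in channel c.
    std::size_t offset(int x, int y, int c) const {
        assert(x >= 0 && x < width && y >= 0 && y < height && c >= 0 && c < channels);
        return (static_cast<std::size_t>(c) * height + y) * width + x;
    }

    // Function-style access, mirroring the rgb_planes(x, y, c) form.
    uint8_t &operator()(int x, int y, int c) { return data[offset(x, y, c)]; }
};
```

With this helper, img(100, 100, 0) = 128 touches element 100 * 1920 + 100 of a 1920x1080x3 image, which is the calculation that would otherwise be written by hand at every access.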

It can flexibly allocate planar or interleaved data

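Much image data does not exist purely in basic planar form, and sooner or later you will need to work with interleaved RGB or RGBA data. Following the Halide idea that the argument order reflects both the data layout and the loop nest structure, an interleaved version of the earlier rgb_planes would have to be declared like this: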

Buffer<uint8_t> rgb_planes(3, 1920, 1080);

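This does work, but code that handles planar and interleaved data would then use different index orders. To solve this, Halide::Runtime::Buffer provides the make_interleaved interface for constructing such a Buffer object. Its biggest advantage is that it keeps the same access order as in the previous point while mapping to a different underlying memory layout.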
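
The idea of one index order over two memory layouts can be sketched with per-index strides. The Strides type and the three functions below are hypothetical and written only for this example; they model the concept and are not Halide's own data structures.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stride table (not a Halide type): one stride per index.
struct Strides {
    std::size_t x, y, c;
};

// Planar: x is contiguous, channels are separate planes.
Strides planar_strides(std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    return {1, width, width * height};
}

// Interleaved: the channels of one pixel are contiguous.
Strides interleaved_strides(std::size_t width, std::size_t channels) {
    assert(width > 0 && channels > 0);
    return {channels, width * channels, 1};
}

// The same (x, y, c) order works for both layouts.
std::size_t offset(const Strides &s, std::size_t x, std::size_t y, std::size_t c) {
    return x * s.x + y * s.y + c * s.c;
}
```

Only the stride table differs between the two layouts, so code written against (x, y, c) does not need to change when the storage order does.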

It can wrap or hand out a raw pointer

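To bridge to other implementations or operations, such as the OpenCV library or SIMD-accelerated code, an external pointer can be passed in when a Buffer is declared. Conversely, when the Buffer's underlying pointer is needed, it can be obtained by calling Halide::Runtime::Buffer::data(). This makes it possible to connect, integrate, and verify data-processing functions implemented in different ways.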
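
The wrap-and-hand-back pattern can be sketched without Halide as a non-owning view over memory allocated elsewhere. PixelView is a hypothetical class for this example; it only illustrates the pattern and does not reproduce the Halide::Runtime::Buffer constructors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical non-owning view (not Halide code); it never allocates or
// frees the memory it wraps.
class PixelView {
public:
    PixelView(uint8_t *external, int width, int height)
        : ptr_(external), width_(width), height_(height) {}

    // Function-style access into the externally owned memory.
    uint8_t &operator()(int x, int y) {
        assert(x >= 0 && x < width_ && y >= 0 && y < height_);
        return ptr_[static_cast<std::size_t>(y) * width_ + x];
    }

    // Hand the raw pointer back to code that expects one.
    uint8_t *data() const { return ptr_; }

private:
    uint8_t *ptr_;
    int width_, height_;
};
```

Writes through the view land directly in the caller's array, which is what makes it possible to compare the output of two implementations on the same memory.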

It works with halide_image_io.h to load/save jpeg or png images

The load_image and save_image functions used in the Halide tutorials are very handy in practice. They can also be used on their own after including halide_image_io.h. Note that this requires linking libpng and libjpeg.


Sunday, February 28, 2021

Raspberry Pi Zero W Project Part 3 - Software Interrupt & Handler example

On the ARM architecture, the instruction for a software interrupt is SWI, or SVC (ARMv7).

Using the instruction is very simple:

SVC #imm

When an ARM processor executes the instruction, it switches to SVC/SWI mode and jumps to the SWI/SVC handler entry in the exception vector table located by the Vector Base Address Register (VBAR).

As in the previous parts, please get the code from the repo https://github.com/champyen/rpiz_bare_metal.git

Please check out commit 637086c2:

$ git checkout 637086c2

The SWI/SVC instruction is not difficult to use, but applying it well takes experience. To use the SWI/SVC instruction for system calls, we need to:

  • design a system call table (SWI number, system call pairs)
  • implement the SWI/SVC handler and get the system call number
  • implement the system calls for the corresponding numbers

How to get the SWI (or system call) number is described in ARM's "SWI Handlers" document. The SWI number is encoded in the lowest 24 bits of the SWI/SVC instruction. Therefore, right after entering the SWI handler, we can get the SWI/SVC number by fetching the instruction itself and clearing its most significant 8 bits with the BIC instruction:

ldr     r0, [lr,#-4]
bic     r0, r0, #0xff000000
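
The same masking step can be written as a small host-side sketch for checking encodings by hand. The function name swi_number is made up for this example. It assumes an ARM-state (32-bit) instruction word, where the immediate occupies bits [23:0].

```cpp
#include <cassert>
#include <cstdint>

// Extract the 24-bit immediate from an ARM-state SWI/SVC instruction word.
// This mirrors "bic r0, r0, #0xff000000": the top 8 bits are cleared.
constexpr uint32_t swi_number(uint32_t instruction) {
    return instruction & 0x00FFFFFFu;
}

// 0xEF000012 encodes "SVC #0x12" with the AL (always) condition.
static_assert(swi_number(0xEF000012u) == 0x12u, "immediate is the low 24 bits");
```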

After getting the SWI number, we can pass it to system_handler as an argument in R0. The system_handler is very similar to an ISR, except that there is no need to check hardware status to find out which device interrupted the CPU. We just handle the SWI or system call according to the SWI number. Of course, just as an IRQ has interrupt latency, an SWI/SVC instruction consumes more cycles than normal instructions on most CPUs.

For application processors, the meaning of mode switching and SWI handling is easy to understand. In fact, Cortex-R and Cortex-M processors also have the SWI/SVC instruction, although many embedded / RTOS developers consider it useless or trivial on such processors.

In some embedded systems or RTOSes, mutually exclusive applications can be loaded on demand. This can be done with a well-designed linker script (e.g.: different LAs with same). For such embedded systems / RTOSes, the SWI/SVC instruction is useful for maintaining an API layer between applications and the kernel. An intuitive / naive alternative is to set up and maintain a table of function pointers on the kernel side (of course, an API index is still needed, playing the role of the SWI number). For development this has a drawback: the table is hard to maintain, adjust, or expand. With the SWI/SVC instruction, no table forwarding is needed. It also gives the flexibility to group system calls and reserve numbers for future needs, whereas the table-based method has to reserve a huge table.
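
One possible way to group system calls inside the 24-bit number is sketched below. The group/index split, the group names, and the demo calls are all hypothetical and are not taken from the repo; dispatch_syscall only shows that sparse, reserved ranges cost nothing when the handler switches on the number instead of indexing a table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical numbering: bits [23:16] select a group, bits [15:0] an index.
enum : uint32_t { GROUP_CORE = 0x00, GROUP_IO = 0x01 };

constexpr uint32_t syscall_group(uint32_t swi) { return (swi >> 16) & 0xFFu; }
constexpr uint32_t syscall_index(uint32_t swi) { return swi & 0xFFFFu; }

// Hypothetical dispatcher: returns -1 for unknown or reserved numbers.
int dispatch_syscall(uint32_t swi, int arg) {
    switch (syscall_group(swi)) {
    case GROUP_CORE:
        switch (syscall_index(swi)) {
        case 0: return 0;        // demo call: do nothing
        case 1: return arg + 1;  // demo call: increment
        default: break;
        }
        break;
    case GROUP_IO:
        if (syscall_index(swi) == 0) return arg * 2;  // demo call: double
        break;
    default:
        break;
    }
    return -1;
}
```

Here group 0x7F, for example, can stay reserved for future use without allocating any table entries for it.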


Saturday, February 6, 2021

Raspberry Pi Zero W Project Part 2 - Interrupt Service Routine (ISR) example

After implementing a naive function and printf, let's add an ISR.

Please check out commit 91abc0d3:

$ git checkout 91abc0d3

There are many changes from the previous commit:

head.S

head.S has become more complicated. As we know, its content starts at address 0x8000.

start:
    ldr     pc, reset_target        /* 0x00 mode: svc */
    ldr     pc, undefined_target    /* 0x04 mode: undefined */
    ldr     pc, swi_target          /* 0x08 mode: svc */
    ldr     pc, prefetch_target     /* 0x0c mode: abort */
    ldr     pc, abort_target        /* 0x10 mode: abort */
    ldr     pc, unused_target       /* 0x14 unused */
    ldr     pc, irq_target          /* 0x18 mode: irq */
    ldr     pc, fiq_target          /* 0x1c mode: fiq */

reset_target:           .word   reset_entry

undefined_target:       .word   undefined_entry
swi_target:             .word   syscall_entry
prefetch_target:        .word   prefetch_entry
abort_target:           .word   abort_entry
unused_target:          .word   unused_entry
irq_target:             .word   irq_entry
fiq_target:             .word   fiq_entry

After the binary is loaded, execution jumps to a routine named "reset_entry". Before tracing reset_entry, note that the 8 "ldr pc, XXXXXX" instructions form the so-called Exception Vector Table, which is used to handle system exceptions. (FIQ is a special interrupt mode with one advantage: because it is the last vector, its implementation can start right at that location, so no jump is needed and one jump delay is saved.) Each exception has a corresponding privileged mode. Besides, each mode has dedicated LR and SP registers, which means the OS / firmware implementation must take care of arranging stack space for each mode.


In fact, the main job of reset_entry here is to set up the stack for each mode:

reset_entry:
    /* set VBAR to 0x8000 */
    mov r0, #0x8000
    mcr p15, 0, r0, c12, c0, 0


    /* (PSR_FIQ_MODE|PSR_FIQ_DIS|PSR_IRQ_DIS) */
    mov r0,#0xD1
    msr cpsr_c,r0
    ldr sp, stack_fiq_top

... other 4 modes ...

    /* (PSR_SVC_MODE|PSR_FIQ_DIS|PSR_IRQ_DIS) */
    mov r0,#0xD3
    msr cpsr_c,r0
    ldr sp, stack_svc_top

    cpsie i
    bl  bare_metal_start

In addition to assigning the stacks and jumping to bare_metal_start, there are two key points here:

  1. set up the Vector Base Address Register (VBAR) - on ARMv6 cores with the Security Extensions, such as the ARM1176JZF-S, the exception vectors can be placed somewhere other than 0x00000000 and 0xFFFF0000. This is achieved by setting VBAR; please refer to "3.2.43 c12, Secure or Non-secure Vector Base Address Register" in the ARM1176JZF-S TRM.
  2. enable interrupts - the cpsie instruction
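
The magic numbers 0xD1 and 0xD3 written to cpsr_c can be decomposed as follows. The PSR_* names follow the comments in head.S, while the numeric values are the standard ARM CPSR encodings (mode in bits [4:0], F at bit 6, I at bit 7). The helper cpsr_control is a hypothetical name used only in this sketch.

```cpp
#include <cassert>
#include <cstdint>

// CPSR control field: mode number in bits [4:0], F (FIQ disable) at bit 6,
// I (IRQ disable) at bit 7.
constexpr uint32_t PSR_FIQ_MODE = 0x11;
constexpr uint32_t PSR_IRQ_MODE = 0x12;
constexpr uint32_t PSR_SVC_MODE = 0x13;
constexpr uint32_t PSR_FIQ_DIS  = 0x40;
constexpr uint32_t PSR_IRQ_DIS  = 0x80;

// Hypothetical helper: the value to write with "msr cpsr_c, r0" in order to
// enter `mode` with both interrupt types masked.
constexpr uint32_t cpsr_control(uint32_t mode) {
    return mode | PSR_FIQ_DIS | PSR_IRQ_DIS;
}

static_assert(cpsr_control(PSR_FIQ_MODE) == 0xD1, "matches mov r0,#0xD1");
static_assert(cpsr_control(PSR_SVC_MODE) == 0xD3, "matches mov r0,#0xD3");
```

Since both disable bits stay set while the stacks are being assigned, interrupts remain masked until the final cpsie i.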

Next, let's trace irq_entry:

irq_entry:
    stmfd  sp!, {r0-r12, lr}
    add     lr, pc, #4
    bl      isr_entry
    ldmfd   sp!, {r0-r12, lr}

    subs    pc, lr, #4

For an ISR, it is no surprise that all (non-dedicated) registers are backed up and restored. The most interesting parts are the LR register setting and the instruction used to leave IRQ mode. The LR setting is easy to figure out: the target return address is the instruction after 'bl isr_entry', not the one after 'add lr, pc, #4', and that is the main reason for saving "pc+4" to LR. For how to leave each mode, please refer to "2.12.2 Exception entry and exit summary" of the ARM1176JZF-S TRM.

isr.c

There are three functions in the source file: timer_enable, timer_check and isr_entry. Here we use the "System Timer" in the BCM2835; please refer to Chap 7 and Chap 12 of the "BCM2835 Peripheral specification". Note that the IRQ number of the System Timer is not listed in the document; please refer to the "errata and some additional information" link on the page.

timer_enable enables System Timer 1 or 3 by index, and timer_check clears the IRQ state and sets the next timeout interrupt. isr_entry therefore just checks the status and calls timer_check to clear the IRQ.
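
The clear-and-rearm step can be sketched against a plain struct so that it runs on a host machine. The register order (CS, CLO, CHI, C0..C3) follows the System Timer chapter of the BCM2835 peripheral specification. The function name, its signature, and the interval parameter are assumptions made for this example and are not the repo's timer_check.

```cpp
#include <cassert>
#include <cstdint>

// System Timer register block in specification order: control/status,
// counter low, counter high, then compare registers C0..C3.
struct SystemTimer {
    uint32_t cs, clo, chi;
    uint32_t c[4];
};

// Hypothetical helper: if channel `idx` has a pending match, acknowledge it
// and schedule the next match `interval` ticks from now.
// Returns true when a match was pending.
bool timer_ack_and_rearm(volatile SystemTimer *t, unsigned idx, uint32_t interval) {
    assert(idx < 4);
    const uint32_t match = 1u << idx;
    if ((t->cs & match) == 0) {
        return false;
    }
    t->cs = match;                  // on hardware, writing 1 clears the match bit
    t->c[idx] = t->clo + interval;  // unsigned wrap-around is intended
    return true;
}
```

On the real device `t` would point at the memory-mapped timer block. With the plain struct used here, the write to cs simply stores the bit, so the sketch exercises only the compare-register arithmetic.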

bare_metal.c

To demonstrate the IRQ alongside the main thread's progress, a busy loop with a counter is added. The loop prints a number whenever the specified condition is met. The timer is enabled before the loop, so you can see the timer tick through the ISR while the main thread keeps counting.

void bare_metal_start(void)
{
    int base = 0;
    asm volatile (
        "mov %0, sp\n\t" : "=r" (base)
    );

    printf("\n\n%s:%x: Hello World! %s %s %d\n\n", __func__, base, __DATE__, __TIME__, __LINE__);
    printf("enter busy loop\n");

    timer_enable(1);

    volatile int i = 0;
    while(1){
        if((i++ & 0x00FFFFFF) == 0)
            printf("%d\n", i);
    }

}


Saturday, January 23, 2021

Raspberry Pi Zero W Project Part 1 - bare metal printf implementation

Step 0 - update config.txt and overlay

config.txt:

enable_uart=1
dtoverlay=miniuart-bt # or disable-bt as you want
kernel=bare_metal.bin

Please remember to download miniuart-bt.dtbo or disable-bt.dtbo and place the file in a folder named "overlays" on your SD card.

Now, let's start the lab.

The first step in taking control of a CPU is being able to print any message you want!

$ git clone https://github.com/champyen/rpiz_bare_metal.git

$ cd rpiz_bare_metal

$ git checkout 4e769069

There are 4 major files in the example:

  • bare_metal.c - the main example flow
  • head.S - the glue code for entering the flow
  • printf.c - printf for RPi Zero's PL011 uart
  • bare_metal.ld - linker script of the example

From the "Boot options in config.txt" section of the Raspberry Pi documentation, we know that the default start address is 0x8000.

In bare_metal.ld, you can see the linker script:

OUTPUT_ARCH(arm)
SECTIONS {
    . = 0x8000;
    .text . : {
        *(.text)
    }
    . = ALIGN(4);
    .data . : {
        *(.data)
    }
    . = ALIGN(4);
    .bss . : {
        *(.bss)
        *(COMMON)
    }
    . = ALIGN(4);
    .rodata . : {
        *(.rodata)
        *(.rodata.*)
    }
}

In the Makefile, you can see that the link order is head.o bare_metal.o printf.o.

Therefore, after bootcode.bin, execution jumps to the first instruction in head.S.

In head.S:

.text
_start:
    ldr    sp, stack_top    /* load the word stored at stack_top (0x100000), not the label's address */
    bl    bare_metal_start

stack_top:      .word   0x100000

Before calling the demo function bare_metal_start, head.S does only one thing: it sets the stack pointer to address 0x100000.

In bare_metal.c:

void printf(const char *fmt, ...);
void bare_metal_start(void)
{
    printf("\n\n%s: Hello World! %s %s %d\n\n", __func__, __DATE__, __TIME__, __LINE__);
    printf("enter busy loop\n");
    while(1);
}

The bare_metal_start function calls the 'printf' implemented in printf.c to print two messages.

In printf.c, the fundamental function is _putc.

#define PL011_BASE  0x20201000
#define PL011_DR    (PL011_BASE + 0x00)
#define PL011_FR    (PL011_BASE + 0x18)
/* volatile keeps the compiler from optimizing away the polling read */
#define _REG(x)     (*((volatile unsigned int *)(x)))

void _putc(unsigned char c)
{
    /* wait until FR bit 7 (TXFE, transmit FIFO empty) is set */
    while( (_REG(PL011_FR) & 0x80) == 0);
    _REG(PL011_DR) = c;
}

Please refer to Chapter 13 of the BCM2835 "Peripheral Specification": _putc just checks the transmit buffer status and sends out a char once the transmit buffer/register is empty.
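
To show how formatted output might be layered on a single-character routine, here is a small sketch that prints a string and a "%x"-style hexadecimal number through a character sink. The names PutcFn, put_str, put_hex, and capture_putc are hypothetical and are not taken from printf.c. The sink is a host-side capture buffer, whereas on the board it would be _putc.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical character sink standing in for _putc.
using PutcFn = void (*)(unsigned char);

// Send a NUL-terminated string one character at a time.
void put_str(PutcFn emit, const char *s) {
    while (*s) {
        emit(static_cast<unsigned char>(*s++));
    }
}

// Print an unsigned value as lowercase hexadecimal without leading zeros.
void put_hex(PutcFn emit, uint32_t value) {
    static const char digits[] = "0123456789abcdef";
    bool started = false;
    for (int shift = 28; shift >= 0; shift -= 4) {
        const unsigned nibble = (value >> shift) & 0xFu;
        // Skip leading zeros, but always print the final digit.
        if (nibble != 0 || started || shift == 0) {
            emit(static_cast<unsigned char>(digits[nibble]));
            started = true;
        }
    }
}

// Host-side capture buffer so the sketch can be exercised off-target.
std::string captured;
void capture_putc(unsigned char c) { captured.push_back(static_cast<char>(c)); }
```

Everything above the sink is hardware independent, so the formatting logic can be checked on a PC before it is tried on the board.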





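Using Function Multi-Versioning (FMV) on ARM platforms - with the Android NDK as an example

Function Multi-Versioning (FMV): Over the course of CPU development, the x86 platform has gradually gained different instruction sets in response to various application needs. In addition, possibly because of market segmentation, the number and kinds of extensions supported also vary. On Linux, this CPU information can be obtained through...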