Wednesday, September 15, 2021

Halide Practical Notes 3

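When using Halide, it is easy to get the mistaken impression that Halide::Runtime::Buffer can only be used by linking against libHalide.so or libHalide.a. In fact, Halide::Runtime::Buffer can be used on its own and needs only header files. An ordinary program just has to make sure it has: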

#include <HalideBuffer.h>

and that the header is on the include path at compile time. Standalone use does not need any libHalide. Compared with raw pointers or STL components, this approach has many advantages:

  1. Reference counting
  2. A function-like access interface
  3. Flexible allocation of planar or interleaved data
  4. Can take in or hand out a raw pointer
  5. Works with halide_image_io.h to load/save JPEG or PNG images

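Reference counting

Most modern C++ components have this property. Besides making usage concise and flexible, it reduces the amount of hand-written allocation and release code for buffers/objects scattered through a program. It effectively prevents memory leaks and double frees, and it also reduces errors caused by pointer manipulation. This property also makes assigning and passing Buffers more convenient.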

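A function-like access interface

Halide::Runtime::Buffer is not merely a Buffer object that exists for Halide's use; every allocated Buffer object can be read and written directly. Unlike the STL, Halide::Runtime::Buffer accesses data with a function-call-like syntax. Suppose we declare a Buffer as follows: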

Buffer<uint8_t> rgb_planes(1920, 1080, 3);

Conceptually, this allocates a Buffer with width 1920, height 1080, and 3 channels, i.e. sizeof(uint8_t) x 1920 x 1080 x 3 bytes of memory, where each channel holds 1920x1080 elements. To access a specific position, use the following:

// write to a pixel of a channel

rgb_planes(100, 100, 0) = 128;

// read a pixel of a channel

uint8_t pix_val = rgb_planes(200, 200, 1);

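This reduces buffer-offset calculation mistakes and makes the code more concise. Because the access looks like a function call, the program logic also feels closer to a functional style. It also makes it easier to verify an implementation when a table lookup is converted into a computation.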
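To make the offset arithmetic concrete, here is a minimal sketch of what a function-like accessor hides for a planar layout. PlanarImage and its members are hypothetical names used only for illustration; this is not Halide's implementation, and it assumes a densely packed planar buffer of uint8_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a planar 3-D image buffer; not Halide's class.
class PlanarImage {
public:
    PlanarImage(int width, int height, int channels)
        : width_(width), height_(height), channels_(channels),
          data_(static_cast<std::size_t>(width) * height * channels, 0) {}

    // Function-like access: image(x, y, c) hides the offset arithmetic.
    std::uint8_t &operator()(int x, int y, int c) {
        assert(x >= 0 && x < width_ && y >= 0 && y < height_ &&
               c >= 0 && c < channels_);
        return data_[offset(x, y, c)];
    }

    // Planar layout: all of channel 0, then all of channel 1, and so on.
    std::size_t offset(int x, int y, int c) const {
        return static_cast<std::size_t>(c) * width_ * height_ +
               static_cast<std::size_t>(y) * width_ + x;
    }

    // Raw pointer to the underlying storage.
    const std::uint8_t *data() const { return data_.data(); }

private:
    int width_, height_, channels_;
    std::vector<std::uint8_t> data_;
};
```

With this sketch, `img(100, 100, 0) = 128;` writes to element `100 * 1920 + 100` of the storage for a 1920x1080x3 image, without the caller spelling out that expression.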

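Flexible allocation of planar or interleaved data

Much image data does not exist purely in basic planar form, and sooner or later you need to handle interleaved RGB or RGBA data. In Halide, the argument order represents both the data layout and the loop nesting, so an interleaved version of the earlier rgb_planes has to be declared like this: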

Buffer<uint8_t> rgb_planes(3, 1920, 1080);

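This does work, but code that handles planar and interleaved data then uses different index orders. To solve this problem, Halide::Runtime::Buffer provides the make_interleaved interface for constructing such a Buffer object. Its biggest advantage is that it keeps the same access order as in the previous point while mapping to a different underlying memory layout.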
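The idea of one index order over two memory layouts can be sketched with per-dimension strides. The Strides struct and the helper names below are assumptions made for illustration and are not part of Halide's API; the sketch only shows how the same (x, y, c) triple lands on different offsets.

```cpp
#include <cstddef>

// Hypothetical stride description; names are illustrative, not Halide's API.
struct Strides {
    std::size_t x, y, c;
};

// Planar: x is contiguous, channels are whole planes apart.
constexpr Strides planar_strides(std::size_t width, std::size_t height) {
    return {1, width, width * height};
}

// Interleaved: channels are contiguous, neighbouring pixels are
// `channels` elements apart.
constexpr Strides interleaved_strides(std::size_t width, std::size_t channels) {
    return {channels, width * channels, 1};
}

// Same (x, y, c) index order for both layouts; only the strides differ.
constexpr std::size_t element_offset(const Strides &s, std::size_t x,
                                     std::size_t y, std::size_t c) {
    return x * s.x + y * s.y + c * s.c;
}
```

For a 1920x1080 RGB image, pixel (100, 100) of channel 1 sits one full plane after channel 0 in the planar case, but only one element after it in the interleaved case, while the calling code stays the same.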

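Can take in or hand out a raw pointer

To connect with other implementations or operations, such as the OpenCV library or SIMD-accelerated code, a Buffer can be declared with an external pointer passed in. Conversely, when the pointer behind a Buffer is needed, it can be obtained by calling Halide::Runtime::Buffer::data(). This makes it possible to bridge, integrate, and verify data-processing functions implemented in different ways.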

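Works with halide_image_io.h to load/save JPEG or PNG images

The load_image and save_image functions used in the Halide tutorials are very handy in practice. They can also be used on their own after including halide_image_io.h; note that this requires linking libpng and libjpeg.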


Sunday, February 28, 2021

Raspberry Pi Zero W Project Part 3 - Software Interrupt & Handler example

For ARM architecture, the instruction for software interrupt is SWI or SVC (ARMv7).

Using the instruction is very simple:

SVC #imm

When an ARM processor executes the instruction, it switches to SVC/SWI mode and jumps to the SWI or SVC handler in the exception vector table specified by the Vector Base Address Register (VBAR).

As in the previous parts, please get the code from the repo https://github.com/champyen/rpiz_bare_metal.git

Please check out the commit: 637086c2

$ git checkout 637086c2

It is not difficult to use the SWI/SVC instruction, but it takes experience to apply it well. To use the SWI/SVC instruction for system calls, we need to:

  • design a system call table (SWI number, system call pairs)
  • implement the SWI/SVC handler and get the system call number
  • implement the system calls with the corresponding numbers

How to get the SWI (or system call) number is described in ARM's "SWI Handlers" document. The SWI number is encoded in the lowest 24 bits of the SWI/SVC instruction. Therefore we can get the SWI/SVC number by fetching the instruction itself and clearing the most significant 8 bits with a BIC instruction right after entering the SWI handler:

ldr     r0, [lr,#-4]
bic     r0, r0, #0xff000000

After getting the SWI number, we can pass it to system_handler as an argument in R0. The system_handler is very similar to an ISR, but we don't need to check hardware status to know which hardware interrupted the CPU. We just need to handle the SWI or system call according to the SWI number. Of course, just as an IRQ has interrupt latency, the SWI/SVC instruction consumes more cycles than normal instructions on most CPUs.
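The decode step and a possible dispatch can be sketched in C++ as follows. This assumes an ARM-state (32-bit) SWI/SVC encoding; the system call numbers and the function system_handler_sketch are hypothetical and are not taken from the repository.

```cpp
#include <cstdint>

// The SWI/SVC number is the low 24 bits of the ARM-state instruction word.
// This mirrors: ldr r0, [lr, #-4] ; bic r0, r0, #0xff000000
constexpr std::uint32_t swi_number(std::uint32_t instruction) {
    return instruction & 0x00FFFFFFu;
}

// Hypothetical system call numbers, for illustration only.
constexpr std::uint32_t SYS_PUTC = 1;
constexpr std::uint32_t SYS_GET_TICKS = 2;

// Hypothetical dispatcher: returns 0 for a known call, -1 otherwise.
// A real handler would call the matching kernel routine in each case.
inline int system_handler_sketch(std::uint32_t number) {
    switch (number) {
    case SYS_PUTC:
        return 0;
    case SYS_GET_TICKS:
        return 0;
    default:
        return -1;  // unknown system call
    }
}
```

For example, the instruction word 0xEF000012 (an always-executed SVC with immediate 0x12) decodes to system call number 0x12.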

For application processors, the meaning of mode switching and SWI handling is easy to understand. In fact, Cortex-R and Cortex-M processors also have the SWI/SVC instruction. Most embedded / RTOS developers think it is useless or trivial on such processors.

In some embedded systems or RTOSes, mutually exclusive applications can be loaded on demand. This can be done with a well-designed linker script (e.g.: different LAs with same). For such embedded systems / RTOSes, the SWI / SVC instruction is useful for maintaining an API layer between applications and the kernel. An intuitive / naive way is to set up and maintain a table of function pointers on the kernel side (of course we still need to set up an API index for each function, just like an SWI number). For development this has a drawback: the table is hard to maintain, adjust, or expand. With the SWI/SVC instruction, no table forwarding is needed. Besides, it provides the flexibility to group system calls and reserve numbers for future needs (whereas the table-based method requires reserving a huge table).


Saturday, February 6, 2021

Raspberry Pi Zero W Project Part 2 - Interrupt Service Routine (ISR) example

After implementing a naive function and printf, let's add an ISR.

Please check out the commit: 91abc0d3

$ git checkout 91abc0d3

There are many changes from previous commit:

head.S

It becomes more complicated. As we know, its content starts at address 0x8000.

start:
    ldr     pc, reset_target        /* 0x00 mode: svc */
    ldr     pc, undefined_target    /* 0x04 mode: ? */
    ldr     pc, swi_target          /* 0x08 mode: svc */
    ldr     pc, prefetch_target     /* 0x0c mode: abort */
    ldr     pc, abort_target        /* 0x10 mode: abort */
    ldr     pc, unused_target       /* 0x14 unused */
    ldr     pc, irq_target          /* 0x18 mode: irq */
    ldr     pc, fiq_target          /* 0x1c mode: fiq */

reset_target:           .word   reset_entry

undefined_target:       .word   undefined_entry
swi_target:             .word   syscall_entry
prefetch_target:        .word   prefetch_entry
abort_target:           .word   abort_entry
unused_target:          .word   unused_entry
irq_target:             .word   irq_entry
fiq_target:             .word   fiq_entry

After the binary is loaded, execution jumps to a routine named "reset_entry". Before tracing reset_entry, note that the 8 "ldr pc, XXXXXX" instructions form the so-called Exception Vector Table, which is used to handle system exceptions. (FIQ is a special IRQ mode with one advantage: its implementation can start right at the vector location, so the jump is not necessary and one jump delay is saved.) Each exception has a corresponding privileged mode. Besides, each mode has dedicated LR and SP registers, which means the OS / firmware implementation should take care of the stack space arrangement for each mode.


In fact, for reset_entry here, its major work is setting stack for each mode:

reset_entry:
    /* set VBAR to 0x8000 */
    mov r0, #0x8000
    mcr p15, 0, r0, c12, c0, 0


    /* (PSR_FIQ_MODE|PSR_FIQ_DIS|PSR_IRQ_DIS) */
    mov r0,#0xD1
    msr cpsr_c,r0
    ldr sp, stack_fiq_top

... other 4 modes ...

    /* (PSR_SVC_MODE|PSR_FIQ_DIS|PSR_IRQ_DIS) */
    mov r0,#0xD3
    msr cpsr_c,r0
    ldr sp, stack_svc_top

    cpsie i
    bl  bare_metal_start

In addition to the stack assignment and the jump to bare_metal_start, there are two key points here:

  1. Set up the Vector Base Address Register (VBAR) - From ARMv6, the exception vectors can be placed somewhere other than 0x00000000 and 0xFFFF0000. This is achieved by setting VBAR; please refer to "3.2.43 c12, Secure or Non-secure Vector Base Address Register" in the ARM1176JZF-S TRM.
  2. Enable interrupts - the cpsie instruction
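The immediate values 0xD1 and 0xD3 loaded into cpsr_c in reset_entry can be read as a mode number combined with the two interrupt mask bits. The sketch below spells that out; the constant names follow the comments in head.S, while the helper function name is an assumption made for illustration.

```cpp
#include <cstdint>

// CPSR mode field values (bits [4:0]) and interrupt mask bits.
constexpr std::uint32_t PSR_FIQ_MODE = 0x11;
constexpr std::uint32_t PSR_IRQ_MODE = 0x12;
constexpr std::uint32_t PSR_SVC_MODE = 0x13;
constexpr std::uint32_t PSR_FIQ_DIS  = 0x40;  // F bit: FIQ masked
constexpr std::uint32_t PSR_IRQ_DIS  = 0x80;  // I bit: IRQ masked

// Hypothetical helper: value written to cpsr_c to enter `mode`
// with both IRQ and FIQ masked.
constexpr std::uint32_t mode_with_irqs_masked(std::uint32_t mode) {
    return mode | PSR_FIQ_DIS | PSR_IRQ_DIS;
}
```

So FIQ mode with both interrupts masked is 0x11 | 0x40 | 0x80 = 0xD1, and SVC mode is 0x13 | 0x40 | 0x80 = 0xD3, matching the two values visible in the listing.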

And we have to trace irq_entry:

irq_entry:
    stmfd  sp!, {r0-r12, lr}
    add     lr, pc, #4
    bl      isr_entry
    ldmfd   sp!, {r0-r12, lr}

    subs    pc, lr, #4

For an ISR, it is not surprising to back up and restore all (non-dedicated) registers. The most interesting things are the LR register setting and the instruction used to leave IRQ mode. The LR setting is easy to figure out: the target return address is that of 'bl isr_entry', not 'add lr, pc, #4'. That is the main reason to save "pc+4" to LR. For leaving each mode, please refer to "2.12.2 Exception entry and exit summary" of the ARM1176JZF-S TRM.

isr.c

There are 3 functions in the source file: timer_enable, timer_check and isr_entry. Here we use the "System Timer" in the BCM2835; please refer to Chap 7 and Chap 12 of the "BCM2835 Peripheral specification". Note that the IRQ number of the System Timer is not listed in the document; please refer to the link to "errata and some additional information" on the page.

The timer_enable function enables System Timer 1 or 3 by index, and timer_check clears the IRQ state and sets the next timeout interrupt. Therefore isr_entry just checks the status and calls timer_check to clear the IRQ.
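The clear-and-rearm step described for timer_check can be sketched as below. This is not the code from isr.c: the struct and function names are hypothetical, and the register block is passed in as a pointer so the logic can also be tried against an ordinary variable. The register order (CS, CLO, CHI, C0 to C3) follows the System Timer chapter of the BCM2835 peripheral document.

```cpp
#include <cassert>
#include <cstdint>

// Register layout of the BCM2835 System Timer (offsets 0x00..0x18).
struct SystemTimerRegs {
    volatile std::uint32_t cs;    // control/status: match bits M0..M3
    volatile std::uint32_t clo;   // free-running counter, low 32 bits
    volatile std::uint32_t chi;   // free-running counter, high 32 bits
    volatile std::uint32_t c[4];  // compare registers C0..C3
};

// Hypothetical helper: acknowledge compare channel `index` and schedule the
// next match `interval_ticks` after the current counter value.
inline void timer_rearm_sketch(SystemTimerRegs *regs, int index,
                               std::uint32_t interval_ticks) {
    assert(index >= 0 && index < 4);
    regs->cs = 1u << index;                       // write 1 to clear the match bit
    regs->c[index] = regs->clo + interval_ticks;  // wraps naturally at 32 bits
}
```

On real hardware the pointer would refer to the timer's memory-mapped registers; with a plain struct the write to cs simply stores the bit instead of clearing a pending match.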

bare_metal.c

To demonstrate the IRQ and the main thread's progress, a busy loop with a counter is added. The loop prints a number when the specified condition is met. The timer is enabled before the loop, so you can see the timer tick in the ISR while the main thread keeps counting.

void bare_metal_start(void)
{
    int base = 0;
    asm volatile (
        "mov %0, sp\n\t" : "=r" (base)
    );

    printf("\n\n%s:%x: Hello World! %s %s %d\n\n", __func__, base, __DATE__, __TIME__, __LINE__);
    printf("enter busy loop\n");

    timer_enable(1);

    volatile int i = 0;
    while(1){
        if((i++ & 0x00FFFFFF) == 0)
            printf("%d\n", i);
    }

}

 



 

 

 

 

Saturday, January 23, 2021

Raspberry Pi Zero W Project Part 1 - bare metal printf implementation

Step 0 - update config.txt and overlay

config.txt:

enable_uart=1
dtoverlay=miniuart-bt # or disable-bt as you want
kernel=bare_metal.bin

Please remember to download miniuart-bt.dtbo or disable-bt.dtbo and place the file in a folder named "overlays" on your SD card.

Now, let's start the lab.

The first step to control a CPU is to dump any message you want!

$ git clone https://github.com/champyen/rpiz_bare_metal.git

$ git checkout 4e769069

There are 4 major files in the example:

  • bare_metal.c - the main example flow
  • head.S - the glue code for entering the flow
  • printf.c - printf for RPi Zero's PL011 uart
  • bare_metal.ld - linker script of the example

From the "Boot options in config.txt" section of the Raspberry Pi documentation, we know that the default start address is 0x8000.

In bare_metal.ld, you can see the linker script:

OUTPUT_ARCH(arm)
SECTIONS {
    . = 0x8000;
    .text . : {
        *(.text)
    }
    . = ALIGN(4);
    .data . : {
        *(.data)
    }
    . = ALIGN(4);
    .bss . : {
        *(.bss)
        *(COMMON)
    }
    . = ALIGN(4);
    .rodata . : {
        *(.rodata)
        *(.rodata.*)
    }
}

In the Makefile, you can see that the linking order is head.o bare_metal.o printf.o.


Therefore, after bootcode.bin, it will jump to the first function in head.S.

In head.S:

.text
_start:
    ldr    sp, =stack_top
    bl    bare_metal_start

stack_top:      .word   0x100000

Before calling the demo function bare_metal_start, head.S does only one thing: it sets the stack pointer to address 0x100000.

In bare_metal.c:

void printf(const char *fmt, ...);
void bare_metal_start(void)
{
    printf("\n\n%s: Hello World! %s %s %d\n\n", __func__, __DATE__, __TIME__, __LINE__);
    printf("enter busy loop\n");
    while(1);
}

The bare_metal_start function calls the 'printf' implemented in printf.c to dump two messages.

In printf, the fundamental function is _putc.

#define PL011_BASE  0x20201000
#define PL011_DR    (PL011_BASE + 0x00)
#define PL011_FR    (PL011_BASE + 0x18)
#define _REG(x)        (*((volatile unsigned int *)(x)))

void _putc(unsigned char c)
{
    while( (_REG(PL011_FR) & 0x80) == 0);
    _REG(PL011_DR) = c;
}

Please refer to Chapter 13 of the BCM2835 "Peripheral Specification". The _putc function just checks the transmit buffer status and sends out a char when the transmit buffer/register is empty.
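The 0x80 mask tested in _putc is bit 7 of the PL011 flag register, TXFE (transmit FIFO empty). The sketch below names that bit and takes the register addresses as parameters, which is an assumption made here so the polling logic can be tried against ordinary variables; putc_sketch is a hypothetical name, not the function in printf.c.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t PL011_FR_TXFF = 1u << 5;  // transmit FIFO full
constexpr std::uint32_t PL011_FR_TXFE = 1u << 7;  // transmit FIFO empty (0x80)

// Hypothetical variant of _putc: `fr` and `dr` point at the flag and data
// registers (or at plain variables when trying the logic off-target).
inline void putc_sketch(volatile std::uint32_t *fr,
                        volatile std::uint32_t *dr, unsigned char c) {
    while ((*fr & PL011_FR_TXFE) == 0) {
        // busy-wait until the transmitter reports empty
    }
    *dr = c;  // writing the data register sends the character
}
```

Waiting on TXFE sends one character at a time; polling TXFF instead (wait while the FIFO is full) would be an alternative that lets the FIFO fill up, at the cost of not knowing when the last character has left.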





Tuesday, December 29, 2020

Raspberry Pi Zero W Project Part 0.5 - Documents

Documents

Before starting the project, collecting documents is an important step.

It is also important to study the documents before and during implementation.

The documents needed in the project are:

1. Raspberry Pi's Official Documents

2. Broadcom BCM2835 Peripheral specification

3. ARM1176

BCM2835 Address Space



It is important to know the address view of the CPU and of the VPU/GPU. You have to know the difference between the physical address and the bus address on the BCM2835. Otherwise you can't get the hardware to work correctly.
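For peripherals, the BCM2835 peripheral document states that a register listed at bus address 0x7Ennnnnn is available to the ARM core at physical address 0x20nnnnnn. A minimal sketch of that translation follows; the constant and function names are assumptions made for illustration, and the helper only applies to addresses in the peripheral window.

```cpp
#include <cstdint>

constexpr std::uint32_t PERIPHERAL_BUS_BASE  = 0x7E000000u;  // as listed in the datasheet
constexpr std::uint32_t PERIPHERAL_PHYS_BASE = 0x20000000u;  // as seen by the ARM core

// Bus address 0x7Ennnnnn -> ARM physical address 0x20nnnnnn on BCM2835.
constexpr std::uint32_t peripheral_bus_to_phys(std::uint32_t bus_addr) {
    return (bus_addr - PERIPHERAL_BUS_BASE) + PERIPHERAL_PHYS_BASE;
}
```

For example, the PL011 UART documented at bus address 0x7E201000 is accessed from bare-metal ARM code at 0x20201000.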

Why Raspberry Pi Zero WH?

The reasons for selecting the RPi ZW as my project platform are:

  1. It is cheap and easy to get, and ARMv6 provides enough of the features of modern ARM application processors.
  2. It is single core, which is simple and good for teaching / practice.
  3. It is low power; the RPi ZW can be powered by almost any USB port.
  4. The RPi has a lot of documents / projects / implementations for reference.

What's in Next?

In the next part, simple assembly programming on the RPi Zero W will be introduced. We need assembly to implement some important parts of the OS and bootloader.

Sunday, December 27, 2020

Raspberry Pi Zero W Project Part 0 - u-boot building / testing

 1. Raspberry Pi (Pre-4B) Boot Sequence

Currently, loader.bin is not needed.

2. Boot from SD - from power-on to bootcode.bin


 

3. Toolchain 

GNU Arm Embedded Toolchain - Download

4. u-boot

$ sudo apt-get install flex bison

$ git clone https://github.com/u-boot/u-boot; cd u-boot

$ export PATH=PATH_TO_TOOLCHAIN/gcc-arm-none-eabi-10-2020-q4-major/bin/:$PATH

$ export CROSS_COMPILE=arm-none-eabi-

$ make rpi_0_w_defconfig

$ make -s -j4; cp ./u-boot.bin PATH_TO_SDCARD

5. booting

config.txt:

enable_uart=1
uart_2ndstage=1
kernel=u-boot.bin



Saturday, November 14, 2020

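Traditional Chinese Computing Archaeology - 21世紀多媒體英漢雙向辭典 (21st Century Multimedia English-Chinese Bidirectional Dictionary)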

 

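Today I am sharing a dictionary of considerable historical significance:
"21世紀多媒體英漢雙向辭典"
Unfortunately, by the time I got this copy, the reflective layer of the CD already had scratches beyond repair.
Some pronunciation data files are corrupted, so words beginning with (c, d, i, m, p, y) cannot be pronounced.
Its historical significance lies in the fact that, among many of today's "free" dictionary programs,
there is a widely circulated dictionary file called "21世紀英漢漢英雙向辭典".
It is popular mainly because of its rich vocabulary; many people recommend it because it covers 200,000-odd words, beating other dictionaries with only tens of thousands.
Over the years it has been edited and supplemented by many users.
In recent years, however, many people have looked into where this dictionary actually came from, for example in one discussion thread.
That thread contains some information, but comparing translations as they stood in 2015 is meaningless,
because the dictionary file has been in circulation for a very long time,
and many people have merged in or replaced definitions from other dictionaries to suit their preferences, and added a great many words.
The thread's conclusion, "新世紀英漢辭典", contains only a little over 100,000 entries even in its latest edition, which is an obvious mismatch and also credits the achievement to the wrong party.
What can be confirmed is that the spread of this dictionary is directly tied to
StarDict (星際譯王), released around 2000.
However, its author never explained the source of the dictionary file either.
A feature that distinguishes this dictionary's entries from other dictionaries is that they include syllable break points.
Hence one particular note in this post: this is something not found in other dictionaries.
Here are the definitions of abrupt and saguaro, which came up in the discussion; the 1994 dictionary matches them exactly.
Besides the word definitions matching completely (the "新世紀" ones differ in many places),
there was also no computer dictionary software at the time that used "新世紀" as its dictionary content.
The "B"ictionary on the cover is not a typo; it is a word coined when the dictionary went on the market. A search turns up this description: "A registered trademark of the super-powerful bidirectional computer dictionary developed by 微系電腦股份有限公司, which claims sixteen industry firsts: 1. Largest vocabulary, about 200,000 words 2. Most phrases, about 5,000 3. Most example sentences, about 6,000 4. English-Chinese, Chinese ...", from which we can see that the vocabulary count matches.
The StarDict author presumably did not have the resources to build such a huge dictionary file, so it was most likely converted entirely from the files of existing software.
So in terms of release year, vocabulary size, definition content, and name, this dictionary is the most likely candidate. It is only because the software was released in 1994 and few complete copies circulated that others could not prove it directly.
微系電腦股份有限公司 was founded in June 1990 and dissolved in January 2001.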
 
 





 
