I hope this article is helpful to some readers.
Monday, February 21, 2022
How to make "Dark mode" appear in the Facebook app on an Android tablet
I hope this article is helpful to some readers.
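Wednesday, February 16, 2022
ARM SVE Study Notes Part 1 - An Overview of ARM SVE
Fig. 1 - The relationship between ARM SVE vector registers and NEON registers.
Fig. 3 - Summary of the major features of ARM SVE.
In my own summary, SVE has 1 + 5 main features (the extra one is an item I personally think should be added to the list):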
- Vector Length Agnostic (VLA)
- Gather-load and Scatter-Store
- Per-lane predication
- Predicate-driven loop control and management
- Vector partitioning for software-managed speculation
- Extended floating-point and bitwise horizontal reductions
I will share these notes one by one in later posts. The next post will introduce Vector Length Agnostic (VLA).
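Wednesday, February 9, 2022
Halide Practical Notes 4
Recently a colleague has been working with the Hexagon DSP and needed to implement a Laplacian pyramid to complete the required functionality. Since this is the example explained in the paper that originally introduced Halide, I searched for the file name and suggested that he refer to the Laplacian example in the Halide git repo.
In today's meeting with the algorithm developers, this colleague suggested that they consider using the Halide language and walked them through the downsample function in that example. That was when I noticed that downsample contains something I had not seen before and that is not mentioned in any Halide documentation or tutorial I had read: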
using Halide::_;
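The most interesting part is the downsample function implemented with "_":
// x and y are Halide::Var objects declared elsewhere in the example.
Func downsample(Func f)
{
    using Halide::_;
    Func downx, downy;
    // [1 3 3 1] / 8 filter along x; "_" forwards any remaining dimensions.
    downx(x, y, _) = (
        f(2 * x - 1, y, _) +
        3.0f * (f(2 * x, y, _) + f(2 * x + 1, y, _)) +
        f(2 * x + 2, y, _)
    ) / 8.0f;
    // The same filter along y.
    downy(x, y, _) = (
        downx(x, 2 * y - 1, _) +
        3.0f * (downx(x, 2 * y, _) + downx(x, 2 * y + 1, _)) +
        downx(x, 2 * y + 2, _)
    ) / 8.0f;
    return downy;
}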
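Clearly, "Halide::_" is used to handle functions whose number of arguments / dimensions is not fixed. Its detailed usage should have a formal explanation, though, and after some searching I found that in the Halide documentation for Var this feature is called the Implicit Variable Constructor. The documentation explains it as: "Implicit variables are injected automatically into a function call if the number of arguments to the function are fewer than its dimensionality and a placeholder ("_") appears in its argument list. Defining a function to equal an expression containing implicit variables similarly appends those implicit variables, in the same order, to the left-hand-side of the definition where the placeholder ('_') appears." In other words, when "_" is used in the argument list of a function call and the number of arguments is smaller than the function's dimensionality, the missing arguments are filled in automatically. Defining a function with the implicit variable "_" means that the function takes the argument list that "_" stands for at that position.
With the implicit variable constructor, no matter how many arguments the incoming Func f has, the first two arguments can always be used as the coordinates (x, y), as is typical for image processing functions, while the remaining dimensions are filled in automatically. So this downsample applies not only to plain 2D grayscale images but also directly to 3D images (three arguments: width, height, channels) such as RGB / RGBA / YUV444. More importantly, data with more arguments and higher dimensionality can be handled by the same single implementation. It should also be possible to combine this with RDom / RVar: for example, expressing the ranges used by downx / downy in the code above with an RDom should give a more concise Func definition.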
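As a rough illustration of what the "_" placeholder buys, here is a plain C++ sketch that does not use Halide at all. It applies the same [1 3 3 1] / 8 filter along x and folds every dimension after x into a single `outer` count, so one function serves 1D rows, 2D images and multi-channel data. The name downsample_x, the x-fastest memory layout and the clamp-to-edge boundary handling are assumptions made for this sketch; they are not taken from the Halide example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Downsample by 2 along x with the [1 3 3 1] / 8 kernel.
// `in` holds `outer` rows of `width` floats (x is the fastest-varying index).
// `outer` is the product of all remaining dimensions (y, channels, ...), which
// simply pass through, similar in spirit to what "_" does in Halide.
std::vector<float> downsample_x(const std::vector<float>& in,
                                std::size_t width, std::size_t outer) {
    assert(width >= 2 && in.size() == width * outer);
    const std::size_t out_width = width / 2;
    const long last = static_cast<long>(width) - 1;
    std::vector<float> out(out_width * outer);
    for (std::size_t o = 0; o < outer; ++o) {
        const float* row = in.data() + o * width;
        // Clamp-to-edge read: an assumption of this sketch.
        auto at = [&](long x) { return row[std::max(0L, std::min(x, last))]; };
        for (std::size_t x = 0; x < out_width; ++x) {
            const long sx = 2 * static_cast<long>(x);
            out[o * out_width + x] =
                (at(sx - 1) + 3.0f * (at(sx) + at(sx + 1)) + at(sx + 2)) / 8.0f;
        }
    }
    return out;
}
```

Calling downsample_x(data, 8, 1) treats the data as a single row, while downsample_x(data, 8, height * channels) handles a full planar image with the same code path.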
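Wednesday, September 15, 2021
Halide Practical Notes 3
When using Halide, it is easy to get the mistaken impression that Halide::Runtime::Buffer can only be used by linking against libHalide.so or libHalide.a. In fact, Halide::Runtime::Buffer can be used on its own and needs only header files. An ordinary program basically just needs: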
#include <HalideBuffer.h>
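and the header must be on the include path at compile time. Standalone use does not require any libHalide. Compared with raw pointers or STL components, this approach has many benefits:
- Reference counting
- A function-call-like access interface
- Flexible allocation of planar or interleaved data
- Can wrap or expose a raw pointer
- Works with halide_image_io.h to load / save JPEG or PNG image files
Reference counting
Most modern C++ components have this property. Besides making usage concise and flexible, it reduces the amount of hand-written allocation and release code for managing buffers / objects. This effectively prevents memory leaks and double frees, and it also reduces errors caused by pointer manipulation. It also makes assigning and passing Buffers around more convenient.
A function-call-like access interface
Halide::Runtime::Buffer is not merely a Buffer object that exists for Halide's own use: every allocated Buffer object can be read and written directly. Unlike the STL, Halide::Runtime::Buffer accesses data through a function-call-like form. Suppose we declare a Buffer as follows: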
Buffer<uint8_t> rgb_planes(1920, 1080, 3);
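Conceptually, this allocates a Buffer with width 1920, height 1080 and 3 channels, that is, sizeof(uint8_t) x 1920 x 1080 x 3 bytes of memory, with 1920x1080 elements per channel. A specific location can be accessed as follows:
// write to a pixel of a channel
rgb_planes(100, 100, 0) = 128;
// read a pixel of a channel
uint8_t pix_val = rgb_planes(200, 200, 1);
This reduces errors in computing buffer offsets and makes the code more concise. Because the access has a function-like form, the program logic also feels closer to a functional style. It is also convenient when verifying an implementation that replaces a table lookup with computation.
Flexible allocation of planar or interleaved data
Much image data does not exist purely in basic planar form, and sooner or later one needs to operate on interleaved RGB or RGBA data. Following Halide's convention that the argument order reflects the data layout and the loop nesting, the interleaved version of the earlier rgb_planes has to be declared like this: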
Buffer<uint8_t> rgb_planes(3, 1920, 1080);
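This does work, but the code for planar and interleaved data would then use different index orders. To solve this, Halide::Runtime::Buffer provides the make_interleaved interface to construct such a Buffer object. Its biggest advantage is that it keeps the same access order as in the previous point while mapping to a different underlying memory layout.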
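The idea of keeping one access order over two memory layouts can be sketched without Halide. The stand-alone struct below is hypothetical: View3D, make_planar_view and make_interleaved_view are names invented for this sketch and are not the Halide::Runtime::Buffer API. It only shows how per-dimension strides let the same (x, y, c) index address either layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 3-D image that owns its storage and is indexed as img(x, y, c).
struct View3D {
    std::vector<std::uint8_t> storage;
    std::size_t width, height, channels;
    std::size_t stride_x, stride_y, stride_c;  // in elements, not bytes

    std::size_t offset(std::size_t x, std::size_t y, std::size_t c) const {
        assert(x < width && y < height && c < channels);
        return x * stride_x + y * stride_y + c * stride_c;
    }
    std::uint8_t& operator()(std::size_t x, std::size_t y, std::size_t c) {
        return storage[offset(x, y, c)];
    }
};

// Planar: each channel is one contiguous width*height plane (RRR... GGG... BBB...).
View3D make_planar_view(std::size_t w, std::size_t h, std::size_t ch) {
    return View3D{std::vector<std::uint8_t>(w * h * ch), w, h, ch, 1, w, w * h};
}

// Interleaved: the channels of one pixel are adjacent (RGBRGBRGB...).
View3D make_interleaved_view(std::size_t w, std::size_t h, std::size_t ch) {
    return View3D{std::vector<std::uint8_t>(w * h * ch), w, h, ch, ch, w * ch, 1};
}
```

With either view the caller writes img(x, y, c) = value; only the strides, and therefore the position in memory, differ.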
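Can wrap or expose a raw pointer
To interface with other implementations or operations, such as the OpenCV library or SIMD-accelerated code, an external pointer can be passed in when a Buffer is declared. Conversely, the underlying buffer pointer of a Buffer can be obtained by calling Halide::Runtime::Buffer::data(). This makes it possible to connect, integrate and verify data processing functions implemented in different ways.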
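To make the two directions concrete (borrowing memory that something else allocated, and handing the pointer back out), here is a minimal stand-alone sketch. BorrowedImage is a hypothetical class written for this note; it is far simpler than Halide::Runtime::Buffer and does no reference counting.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning 2-D view over memory allocated elsewhere
// (for example by another library). It never frees the pointer.
template <typename T>
class BorrowedImage {
public:
    BorrowedImage(T* data, std::size_t width, std::size_t height)
        : data_(data), width_(width), height_(height) {
        assert(data != nullptr);
    }
    // Function-call-style element access, row-major layout.
    T& operator()(std::size_t x, std::size_t y) const {
        assert(x < width_ && y < height_);
        return data_[y * width_ + x];
    }
    // Expose the raw pointer again so other code can consume the same memory.
    T* data() const { return data_; }

private:
    T* data_;
    std::size_t width_, height_;
};
```

Writes made through the view are visible to the original owner of the memory, which is what makes side-by-side verification of two implementations straightforward.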
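Works with halide_image_io.h to load / save JPEG or PNG image files
The load_image and save_image functions used in the Halide tutorials are very handy tools in practice, and they can also be used on their own after including halide_image_io.h. Note that this requires linking libpng and libjpeg.
Sunday, February 28, 2021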
Raspberry Pi Zero W Project Part 3 - Software Interrupt & Handler example
On the ARM architecture, the software interrupt instruction is SWI, or SVC as it is named in ARMv7.
Using the instruction is very simple:
SVC #imm
When an ARM processor executes the instruction, it switches to Supervisor (SVC) mode and jumps to the SWI/SVC handler through the exception vector table located by the Vector Base Address Register (VBAR).
As in the previous parts, please get the code from the repo https://github.com/champyen/rpiz_bare_metal.git
please checkout the commit: 637086c2
$ git checkout 637086c2
It is not difficult to use the SWI/SVC instruction, but it takes experience to apply it well. To use the SWI/SVC instruction for system calls, we need to:
- design system call table ( SWI number , system call pair)
- implement SWI/SVC handler, get the system call number
- implement system calls with corresponding number
How to get the SWI (or system call) number is described in ARM's "SWI Handlers" document. The SWI number is encoded in the lowest 24 bits of the SWI/SVC instruction. Therefore we can get the SWI/SVC number by fetching the instruction itself and clearing its most significant 8 bits with a BIC instruction right after entering the SWI handler:
ldr r0, [lr,#-4]
bic r0, r0, #0xff000000
After getting the SWI number, we can pass it to system_handler as an argument in R0. system_handler is very similar to an ISR, but we don't need to check the hardware status to find out which device interrupted the CPU. We just need to handle the SWI or system call according to the SWI number. Of course, just as an IRQ has interrupt latency, the SWI/SVC instruction consumes more cycles than normal instructions on most CPUs.
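The decode step and a handler in the role of system_handler can be sketched in C++ as follows. This is only an illustration under stated assumptions: the system call names and numbers are hypothetical, the handler bodies are placeholders, and the mask applies to the 32-bit ARM-state encoding (in Thumb state the SVC immediate is only 8 bits wide).

```cpp
#include <cassert>
#include <cstdint>

// Extract the 24-bit immediate from an ARM-state SWI/SVC instruction word.
// Same effect as: bic r0, r0, #0xff000000
constexpr std::uint32_t swi_number(std::uint32_t instruction) {
    return instruction & 0x00FFFFFFu;
}

// Hypothetical system call numbers, for illustration only.
enum : std::uint32_t { SYS_PUTC = 1, SYS_GET_TICKS = 2 };

// Hypothetical dispatcher in the role of system_handler.
// Returns 0 on success and -1 for an unknown system call number.
int system_handler(std::uint32_t number, std::uint32_t arg) {
    (void)arg;  // the placeholder bodies below do not use the argument
    switch (number) {
    case SYS_PUTC:       // would send `arg` to the UART
        return 0;
    case SYS_GET_TICKS:  // would read a timer
        return 0;
    default:
        return -1;
    }
}
```

For example, the word 0xEF000012 is "SVC #0x12" with the always condition, so swi_number(0xEF000012) yields 0x12.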
For application processors, it is easy to understand the meaning of mode switching and SWI handling. In fact, Cortex-R and Cortex-M processors also have the SWI/SVC instruction, but most embedded / RTOS developers consider it useless or trivial on such processors.
On some embedded systems or RTOSes, mutually exclusive applications can be loaded on demand. This can be done with a well-designed linker script (e.g.: different LAs with same). For such embedded systems / RTOSes, the SWI/SVC instruction is useful for maintaining an API layer between applications and the kernel. An intuitive / naive alternative is to set up and maintain a table of function pointers on the kernel side (of course, we still need an API index for each function, much like an SWI number). For development this has a drawback: the table is hard to maintain, adjust or expand. With the SWI/SVC instruction, no table forwarding is needed. It also provides the flexibility to group system calls and reserve numbers for future needs (with the table-based method, this would require reserving a huge table).
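One possible way to group system calls and reserve number ranges is to split the 24-bit SWI number into fields. The layout below (bits 23:16 select a group, bits 15:0 select the call within the group) and the group names are purely hypothetical; the original project does not prescribe any such scheme.

```cpp
#include <cstdint>

// Hypothetical layout of the 24-bit SWI number: [23:16] group, [15:0] call index.
// Unused group values stay reserved for future subsystems at no memory cost.
enum class Group : std::uint8_t { Kernel = 0x00, File = 0x01, Network = 0x02 };

constexpr std::uint32_t make_swi(Group group, std::uint16_t index) {
    return (static_cast<std::uint32_t>(group) << 16) | index;
}

constexpr Group swi_group(std::uint32_t number) {
    return static_cast<Group>((number >> 16) & 0xFFu);
}

constexpr std::uint16_t swi_index(std::uint32_t number) {
    return static_cast<std::uint16_t>(number & 0xFFFFu);
}

static_assert(make_swi(Group::File, 3) == 0x010003u, "group lives in bits 23:16");
```

A handler can then switch on swi_group(number) first and hand the index to a per-group routine, so adding a group does not disturb existing numbers.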
Saturday, February 6, 2021
Raspberry Pi Zero W Project Part 2 - Interrupt Service Routine (ISR) example
After implementing a naive entry function and printf, let's add an ISR to the example.
Please check out commit 91abc0d3:
$ git checkout 91abc0d3
There are many changes from the previous commit:
head.S
This file becomes more complicated. As we know, its content starts at address 0x8000.
start:
ldr pc, reset_target /* 0x00 mode: svc */
ldr pc, undefined_target /* 0x04 mode: undefined */
ldr pc, swi_target /* 0x08 mode: svc */
ldr pc, prefetch_target /* 0x0c mode: abort */
ldr pc, abort_target /* 0x10 mode: abort */
ldr pc, unused_target /* 0x14 unused */
ldr pc, irq_target /* 0x18 mode: irq */
ldr pc, fiq_target /* 0x1c mode: fiq */
reset_target: .word reset_entry
undefined_target: .word undefined_entry
swi_target: .word syscall_entry
prefetch_target: .word prefetch_entry
abort_target: .word abort_entry
unused_target: .word unused_entry
irq_target: .word irq_entry
fiq_target: .word fiq_entry
After the binary is loaded, execution jumps to a routine named "reset_entry". Before we trace reset_entry, note that the 8 "ldr pc, XXXXXX" instructions form the so-called Exception Vector Table, which is used to handle system exceptions. (FIQ is a special interrupt mode with one advantage: because its vector is the last entry, the handler implementation can start right at that location, so no jump is needed and one jump delay is saved.) Each exception has a corresponding privileged mode. In addition, each mode has its own dedicated LR and SP registers, which means the OS / firmware implementation must arrange stack space for every mode.
In fact, the major work of reset_entry here is setting up the stack for each mode:
reset_entry:
/* set VBAR to 0x8000 */
mov r0, #0x8000
mcr p15, 0, r0, c12, c0, 0
/* (PSR_FIQ_MODE|PSR_FIQ_DIS|PSR_IRQ_DIS) */
mov r0,#0xD1
msr cpsr_c,r0
ldr sp, stack_fiq_top
... other 4 modes ...
/* (PSR_SVC_MODE|PSR_FIQ_DIS|PSR_IRQ_DIS) */
mov r0,#0xD3
msr cpsr_c,r0
ldr sp, stack_svc_top
cpsie i
bl bare_metal_start
In addition to the stack assignment and the jump to bare_metal_start, there are two key points here:
- setting up the Vector Base Address Register (VBAR) - from ARMv6, the exception vectors can be placed at addresses other than 0x00000000 and 0xFFFF0000. This is achieved by setting VBAR; please refer to "3.2.43 c12, Secure or Non-secure Vector Base Address Register" in the ARM1176JZF-S TRM.
- enabling interrupts - the cpsie instruction
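The magic numbers 0xD1 and 0xD3 written to cpsr_c in reset_entry are just a mode value ORed with the two interrupt mask bits. The constants below use the PSR_* names from the assembly comments; the mode values and bit positions are the standard ARM CPSR encoding, and mode_with_irqs_masked is a helper name made up for this sketch.

```cpp
#include <cstdint>

// CPSR mode field values (bits [4:0]).
constexpr std::uint32_t PSR_FIQ_MODE = 0x11;
constexpr std::uint32_t PSR_IRQ_MODE = 0x12;
constexpr std::uint32_t PSR_SVC_MODE = 0x13;
constexpr std::uint32_t PSR_ABT_MODE = 0x17;
constexpr std::uint32_t PSR_UND_MODE = 0x1B;

// CPSR interrupt mask bits.
constexpr std::uint32_t PSR_FIQ_DIS = 0x40;  // F bit: FIQ masked
constexpr std::uint32_t PSR_IRQ_DIS = 0x80;  // I bit: IRQ masked

// Hypothetical helper: value to write to cpsr_c to enter `mode` with IRQ and FIQ masked.
constexpr std::uint32_t mode_with_irqs_masked(std::uint32_t mode) {
    return mode | PSR_FIQ_DIS | PSR_IRQ_DIS;
}

static_assert(mode_with_irqs_masked(PSR_FIQ_MODE) == 0xD1, "value used for FIQ mode");
static_assert(mode_with_irqs_masked(PSR_SVC_MODE) == 0xD3, "value used for SVC mode");
```

The modes elided in the listing would follow the same pattern, for example 0xD2 for IRQ mode.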
And we have to trace irq_entry:
irq_entry:
    stmfd sp!, {r0-r12, lr}
    add lr, pc, #4
    bl isr_entry
    ldmfd sp!, {r0-r12, lr}
    subs pc, lr, #4
For an ISR, it is no surprise that all (non-dedicated) registers are backed up and restored. The most interesting things are the LR register setting and the instruction used to leave IRQ mode. The LR setting is easy to figure out: execution has to come back to the instruction after 'bl isr_entry', not to the one after 'add lr, pc, #4', and that is the main reason for saving "pc+4" to LR. (Note that bl itself also writes its return address into LR.) For leaving each mode, please refer to "2.12.2 Exception entry and exit summary" of the ARM1176JZF-S TRM.
isr.c
There are 3 functions in the source file: timer_enable, timer_check and isr_entry. Here we use the "System Timer" in the BCM2835; please refer to Chap 7 and Chap 12 of the "BCM2835 Peripheral specification". Note that the IRQ number of the System Timer is not listed in the document; please refer to the link to "errata and some additional information" on the page.
timer_enable enables System Timer 1 or 3 by index, and timer_check clears the IRQ state and programs the next timeout interrupt. Therefore isr_entry just checks the status and calls timer_check to clear the IRQ.
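The "clear the IRQ state and program the next timeout" step can be sketched as below. The register layout follows the System Timer chapter of the BCM2835 peripheral document (CS, CLO, CHI, C0 to C3), but the function name timer_rearm, its parameters and the way the interval is passed are assumptions of this sketch and are not copied from isr.c.

```cpp
#include <cassert>
#include <cstdint>

// Register block of the BCM2835 System Timer (offsets 0x00 to 0x18).
struct SystemTimerRegs {
    volatile std::uint32_t cs;          // control/status: bit n is set when compare n matches
    volatile std::uint32_t clo;         // free-running counter, low 32 bits
    volatile std::uint32_t chi;         // free-running counter, high 32 bits
    volatile std::uint32_t compare[4];  // C0..C3
};

// Hypothetical helper: acknowledge compare channel `index` and arm the next timeout.
void timer_rearm(SystemTimerRegs& regs, unsigned index, std::uint32_t interval_ticks) {
    assert(index < 4);
    regs.cs = 1u << index;  // on hardware, writing 1 clears the match bit
    // Unsigned addition wraps modulo 2^32, matching the 32-bit counter.
    regs.compare[index] = regs.clo + interval_ticks;
}
```

On the target the structure would be mapped onto the peripheral, for example at physical address 0x20003000 on the BCM2835; on a host it can be exercised against an ordinary struct instance, although a plain struct does not emulate the write-1-to-clear behaviour of CS.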
bare_metal.c
To demonstrate the progress of both the IRQ and the main thread, a busy loop with a counter is added. The loop prints a number whenever the specified condition is met. The timer is enabled before the loop, so you can see the timer ticks from the ISR while the main thread keeps counting.
void bare_metal_start(void)
{
int base = 0;
asm volatile (
"mov %0, sp\n\t" : "=r" (base)
);
printf("\n\n%s:%x: Hello World! %s %s %d\n\n", __func__, base, __DATE__, __TIME__, __LINE__);
printf("enter busy loop\n");
timer_enable(1);
volatile int i = 0;
while(1){
if((i++ & 0x00FFFFFF) == 0)
printf("%d\n", i);
}
}
Saturday, January 23, 2021
Raspberry Pi Zero W Project Part 1 - bare metal printf implementation
Step 0 - update config.txt and overlay
config.txt:
enable_uart=1
dtoverlay=miniuart-bt # or disable-bt as you want
kernel=bare_metal.bin
Please remember to download miniuart-bt.dtbo or disable-bt.dtbo and place the file in a folder named "overlays" on your SD card.
Now, let's start the lab.
The first step to control a CPU is to dump any message you want!
$ git clone https://github.com/champyen/rpiz_bare_metal.git
$ cd rpiz_bare_metal
$ git checkout 4e769069
There are 4 major files in the example:
- bare_metal.c - the main example flow
- head.S - the glue code for entering the flow
- printf.c - printf for RPi Zero's PL011 uart
- bare_metal.ld - linker script of the example
From the "Boot options in config.txt" section of the Raspberry Pi documentation, we know that the default start address is 0x8000.
In bare_metal.ld, you can see the linker script:
OUTPUT_ARCH(arm)
SECTIONS {
    . = 0x8000;
    .text . : {
        *(.text)
    }
    . = ALIGN(4);
    .data . : {
        *(.data)
    }
    . = ALIGN(4);
    .bss . : {
        *(.bss)
        *(COMMON)
    }
    . = ALIGN(4);
    .rodata . : {
        *(.rodata)
        *(.rodata.*)
    }
}
In the Makefile, you can see that the linking order is head.o, bare_metal.o, printf.o.
Therefore, after bootcode.bin, execution jumps to the first function in head.S.
In head.S:
.text
_start:
    /* load the word stored at stack_top (0x100000) into sp */
    ldr sp, stack_top
    bl bare_metal_start
stack_top: .word 0x100000
Before calling the demo function bare_metal_start, head.S does only one thing: it sets the stack pointer to the address 0x100000.
In bare_metal.c:
void printf(const char *fmt, ...);
void bare_metal_start(void)
{
printf("\n\n%s: Hello World! %s %s %d\n\n", __func__, __DATE__, __TIME__, __LINE__);
printf("enter busy loop\n");
while(1);
}
The bare_metal_start function calls the 'printf' implemented in printf.c to dump two messages.
In printf, the fundamental function is _putc.
#define PL011_BASE 0x20201000
#define PL011_DR (PL011_BASE + 0x00)
#define PL011_FR (PL011_BASE + 0x18)
/* volatile: every access must really reach the memory-mapped register */
#define _REG(x) (*((volatile unsigned int *)(x)))
void _putc(unsigned char c)
{
    /* wait until bit 7 of the flag register (transmit FIFO empty) is set */
    while( (_REG(PL011_FR) & 0x80) == 0);
    _REG(PL011_DR) = c;
}
Please refer to Chapter 13 of the BCM2835 "Peripheral Specification": _putc just checks the transmit buffer status and sends out a character once the transmit buffer/register is empty.
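To show how a printf can be layered on a single character-output routine, here is a minimal host-testable sketch. It is not the code from printf.c: mini_printf, put_unsigned and the PutcFn type are names invented here, only %s, %d, %x and %c are handled, and the output routine is passed in as a function pointer so that the same logic can be checked on a PC before it is pointed at _putc.

```cpp
#include <cassert>
#include <cstdarg>

// Character sink; on the target this role is played by _putc.
using PutcFn = void (*)(unsigned char);

// Print an unsigned value in the given base (2..16), most significant digit first.
static void put_unsigned(PutcFn out, unsigned value, unsigned base) {
    char digits[32];
    int n = 0;
    do {
        digits[n++] = "0123456789abcdef"[value % base];
        value /= base;
    } while (value != 0);
    while (n > 0) out(static_cast<unsigned char>(digits[--n]));
}

// Minimal formatter: supports %s, %d, %x and %c; any other character after '%'
// (including a second '%') is printed literally.
void mini_printf(PutcFn out, const char* fmt, ...) {
    assert(out != nullptr && fmt != nullptr);
    va_list args;
    va_start(args, fmt);
    for (; *fmt != '\0'; ++fmt) {
        if (*fmt != '%') {
            out(static_cast<unsigned char>(*fmt));
            continue;
        }
        if (*++fmt == '\0') break;  // stray '%' at the end of the format string
        switch (*fmt) {
        case 's':
            for (const char* s = va_arg(args, const char*); *s != '\0'; ++s)
                out(static_cast<unsigned char>(*s));
            break;
        case 'd': {
            const int v = va_arg(args, int);
            unsigned u = static_cast<unsigned>(v);
            if (v < 0) { out('-'); u = 0u - u; }  // magnitude without signed overflow
            put_unsigned(out, u, 10);
            break;
        }
        case 'x': put_unsigned(out, va_arg(args, unsigned), 16); break;
        case 'c': out(static_cast<unsigned char>(va_arg(args, int))); break;
        default:  out(static_cast<unsigned char>(*fmt)); break;
        }
    }
    va_end(args);
}
```

Keeping the sink as a parameter is a design choice of this sketch: the formatter never touches hardware itself, so only the one small output routine is platform specific.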