網路黑貓 BlackCat on Net / Champ Yen: 3月 2017

2017年3月28日星期二

"ARM Compute Library for computer vision and machine learning" II - Framework 篇

ARM Compute Library 的使用上的 class 主要有二類分別是針對 data 以及 task/workload
data 類別為: Image/Tensor, TensorInfo
task/workload 類別為: Kernel, Window 與 Function
下列的內容主要為 ARM Compute Library: Documentation 所描述

並且搭配 source code 的內容做特定用途的說明
(取代文件內 MyKernel, MyFunction 的方式)

由於 ARM Computer Library 是做 Computer Vision 與 Machine Learning 應用的
因此主要處理的資料型別為 Image 及 Tensor
在 Compute Library 中基本上只是名稱不同而已

Image, Tensor, TensorInfo

在 NEON 下直接使用 Image

Image src, dst;

而在 CL 下則使用 CLImage

CLImage src, tmp, dst;

Image 宣告後並沒有實際的 buffer 空間, 必須進一步最配置的動作,配置的方法有二, 兩者都需要傳遞 TensorInfo 資訊, TensorInfo 基本上為 Image/Tensor 各維度的大小以及資料格式
配置的第一種方式為直接透過 Allocator 的 init() 方式

src.allocator()->init(TensorInfo(640, 480, Format::U8));

而第二種方式為先呼叫 configure() 在呼叫 Allocator 的 allocate()

TensorInfo dst_tensor_info(src.info()->dimension(0) / scale_factor, src.info()->dimension(1) / scale_factor, Format::U8);
dst.allocator()->init(dst_tensor_info);
dst.allocator()->allocate();

Kernel, Window & Function

空間配置好之後, 就必須透過各種各樣的 Kernel 來套用對應的功能來操作 Image/Tensor
使用上的核心 class 為 Kernel, 各個 Kernel 實作了 IKernel 相關的介面

使用的第一個步驟為宣告想使用的 kernel object, 假設我們想作 image scale

//Create a kernel object:
NEScaleKernel scale_kernel;

在使用之前必須對 kernel 作 input, output 使用參數以及 padding mode 作設定

// Initialize the kernel with the input/output and options you want to use:
Tensor offsets;
const TensorShape shape(dst.info()->dimension(0), dst.info()->dimension(1));
TensorInfo tensor_info_offsets(shape, Format::S32);
tensor_info_offsets.auto_padding();
offsets.allocator()->init(tensor_info_offsets);
scale_kernel.configure( &src, nullptr, nullptr, &offset, &dst, InterpolationPolicy::NEAREST_NEIGHBOR, BorderMode::UNDEFINED);
offsets.allocator()->allocate();
// compute offset
...

這裡使用了 NEAREST Filter, 並且使用 UNDEFINED padding 方式 (對於邊界有缺所需資料的點不做處理)

最後就是呼叫使用該功能,即是呼叫 IKernel 的 run() 介面

// Retrieve the execution window of the kernel:
const Window& max_window = scale_kernel.window();// Run the whole kernel in the current thread:
scale_kernel.run( max_window ); // Run the kernel on the full window

這些即為 Compute Library 基本的使用方法.
對於 CL Kernel 則有稍微不同的 flow (需要特別傳入以操作 cl::CommandQueue)

或許會覺得上面例子的 max_window 很多餘, 但它是有進階應用的
Window 用途在於對 Kernel 指定要套用執行的範圍描述
在官方說明文件是以 Multi-Threading 的方式來說明 Window 的用途

const Window &max_window = scale_kernel->window();
const int num_iterations = max_window.num_iterations(split_dimension);
int num_threads = std::min(num_iterations, _num_threads);

for(int t = 0; t < num_threads; ++t){
    Window win = max_window.split_window(split_dimension, t, num_threads);
    win.set_thread_id(t);
    win.set_num_threads(num_threads);
    if(t != num_threads - 1){
        _threads[t].start(kernel, win);    }else{
        scale_kernel->run(win);    }
}

當下列所有的條件都符合後, Window 可以用來分割 workload 為多個子 Window

max[n].start() <= sub[n].start() < max[n].end()
sub[n].start() < sub[n].end() <= max[n].end()
max[n].step() == sub[n].step()
(sub[n].start() - max[n].start()) % max[n].step() == 0
(sub[n].end() - sub[n].start()) % max[n].step() == 0

至於 Function 的使用則是為了簡化繁雜的 Kernel, Window 的使用流程, Function 實作內部會自行配置所需的暫存 buffer, 甚至能透過上述的方式自行做 Multi-Threading

// Create and initialize a Scale function object:
NEScale scale;scale.configure(&src, &dst, InterpolationPolicy::NEAREST_NEIGHBOR, BorderMode::UNDEFINED);
// Run the scale operation:
scale.run();

若使用的為以 CL 實作的 Kernel 最後還有個確保執行完成的額外同步動作

// Make sure all the OpenCL jobs are done executing:
CLScheduler::get().sync();

如此一來 Function 比起直接使用 Kernel 簡化不少

在下一篇會進入 NEON 與 CL 內部實作方式的說明

2017年3月26日星期日

簡報 - Video Compression Standards - History & Introduction

這份簡報是一年半前為了數位電視課程所準備, 發現 blog 沒有紀錄所以發文分享與 Link 一下

Video Compression Standards - History & Introduction from Champ Yen

"ARM Compute Library for computer vision and machine learning" I - Overview 篇

日前 ARM 官方透過 github 並且以 MIT License 方式再次正式地釋出了 Compute Library 的原始碼(先前提供了 internal evaluation only 的 binary, 詳請請回顧當時的 release note), 這是個 low-level implementation, 而且是 pure function 的形式, Compute Library 提供了涵蓋下列的功能函式:

基本的運算, 數學與布林運算函式 (Basic arithmetic, mathematical, and binary operator functions)
色彩操作, 包含轉換, 頻道擷取與其他 (Color manipulation (conversion, channel extraction, and more))
捲積濾波器 (Convolution filters (Sobel, Gaussian, and more))
Canny Edge, Harris corners, optical flow, and more
Pyramids (such as Laplacians)
HOG (Histogram of Oriented Gradients)
SVM (Support Vector Machines)
半/全精準通用矩陣乘法 (H/SGEMM (Half and Single precision General Matrix Multiply))
捲積類神經網路建構功能區塊 (Convolutional Neural Networks building blocks (Activation, Convolution, Fully connected, Locally connected, Normalization, Pooling, Soft-max))

對於 Compute Library 來說, 它屬於個人介紹過的(請參考SIMD Introduction 簡報) SIMD Programming Model 中的 SIMD Optimized Library, 概念與與提供的效能上, 可以參考 ARM 官方釋出的介紹文, 本系列文會專注於 Compute Library 內部架構與更細節的如何使用, 所提供的能力以及, 以及探討使用這樣的 Library 依然存在什麼樣的限制.

首先 ARM Compute Library 其 github 位置為 https://github.com/ARM-software/ComputeLibrary , 而相關的原文文件在 source 與網站上各有一份

以目錄結構來說下列為主要較重要的目錄:

arm_compute/ - 放置所有 Compute Libraray 的 Headers

core/ - Core library 是由底層演算法的實作所組成

基本共用資料型別 (Types, Window, Coordinates, Iterator, 等等)
基本通用介面 (ITensor, IImage, 等等)
物件 metadata 型別 (ImageInfo, TensorInfo, MultiImageInfo)
backend 目錄

runtime/ - Runtime library 是用來快速 prototyping 用途非常基本的 Core Library 的 wrapper (由於 CL/NEON 的 Programming Model, 這裡提供對應不同的 execution interface)

基本通用物件介面的實作(Array, Image, Tensor, etc.)

以上兩者, 內各自有 CL/CPP/NEON backedn 目錄, 提供對應 backend 定義的 kernel headers

documentation/ - Doxygen 所產生的文件
examples/ - 內有提供的 4 個範例程式
include/ - 基本上只放置 OpenCL 1.2 的 Headers

src/

core/ - 於 arm_compute/core/ 中定義的型別/介面的實作
runtime/ - 於 arm_compute/runtime/ 中定義的型別/介面的實作
以上兩者, 內各自有 CL/CPP/NEON 目錄, 即為該 backend 實作相關原始碼

而值得一提的是在 Compute Library 中提供的功能中, 這些 Kernel 演算所使用的定義規範為 OpenVX 1.1 所制定的

以下為目前提供的 Kernel 列表, 若了解 image processing, DNN 該 function 名稱應解釋了其功用, 即不在此冗文解釋: (注明 NEON-only 表示目前尚未有 CL 實作)
AbsoluteDifferenceKernel
AccumulateKernel
ActivationLayerKernel
ArithmeticAdditionKernel
ArithmeticSubtractionKernel
BitwiseAndKernel
BitwiseNotKernel
BitwiseOrKernel
BitwiseXorKernel
Box3x3Kernel
CannyEdgeKernel
ChannelCombineKernel
ChannelExtractKernel
Col2ImKernel
ColorConvertKernel
ConvolutionKernel
ConvolutionLayerWeightsReshapeKernel
CumulativeDistributionKernel (NEON-only)
DepthConvertKernel
DerivativeKernel
DilateKernel
ErodeKernel
FastCornersKernel
FillArrayKernel (NEON-only)
FillBorderKernel
FillInnerBorderKernel (NEON-only)
Gaussian3x3Kernel
Gaussian5x5Kernel
GaussianPyramidKernel
GEMMInterleave4x4Kernel
GEMMLowpMatrixMultiplyKernel
GEMMMatrixAccumulateBiasesKernel
GEMMMatrixAdditionKernel
GEMMMatrixMultiplyKernel
GEMMTranspose1xWKernel
HarrisCornersKernel
HistogramKernel
HOGDescriptorKernel (NEON-only)
HOGDetectorKernel (NEON-only)
HOGNonMaximaSuppressionKernel (NEON-only)
Im2ColKernel
IntegralImageKernel
LKTrackerKernel
MagnitudePhaseKernel
MeanStdDevKernel
Median3x3Kernel
MinMaxLocationKernel
NonLinearFilterKernel
NonMaximaSuppression3x3Kernel
NormalizationLayerKernel
PixelWiseMultiplicationKernel
PoolingLayerKernel
RemapKernel
ScaleKernel
Scharr3x3Kernel
Sobel3x3Kernel
Sobel5x5Kernel
Sobel7x7Kernel
SoftmaxLayerKernel
TableLookupKernel
ThresholdKernel
TransposeKernel
WarpKernel (CL 細分為 WarpAffine, WarpPerspective 兩種)

下一篇將會介紹 Compute Library 中使用所需了解的基本型別, 介面, 執行方式以及範例

2017年3月19日星期日

OpenCL Programming Tips for Qualcomm Adreno GPU 導讀

目前手機多半都有內建 OpenCL Runtime
(除了 Google 無謂地堅持 RenderScript, Do Evil 地阻礙標準的採用)
對於 OpenCL 有所了解的人, 多半清楚 OpenCL 是 function portability, 而無法做到 performance portability, 這其中的緣由主要還是在於各家的 GPU Architecture 的差異, 因此多半各家 GPU vendor 都會提供自家 OpenCL Optimization Guide, 以利開發者對自家平台優化應用性能

目前手機可能內建的 GPU 主要為三

ARM Mali
Imagination PowerVR
Qualcomm Adreno

由於個人 OpenCL 經驗多半與前兩者有關, 由於未曾接觸, 因此對於 Adreno 部分有很大的興趣, 因而選擇研讀與撰文. 這篇主要的內容主要來自下列公開的 Adreno OpenCL Programming Guide 文件. 有機會也會撰文介紹 ARM Mali 與 Imagination PowerVR 的 Programming Guide

Adreno OpenCL Programming Tips

Memory

其第一個章節即是 "Memory", 對於計算架構來說 Memory 幾乎是效能上的關鍵, 對於 GPU 亦不例外, 而在這份文件中 Memory 章節佔了一半的份量, 可見其重要性, 對於 Adreno GPU 而言, 其考量點有五:

Vectorization and coalescing

對於 memory access 而言, Coalescing 是常用的方式, 儘管一些 compiler 會透過 auto-vectorization 嘗試去優化 workgroup 內更有效率的存取, 但是 programmer 自行作 vectorization 並控制 access pattern 對於後續的 finetune 是重要的. 一旦 vectorization 後, OpenCL 提供兩個介面做 vector loading, 一者為 vector-based pointer arithmetic, 另一為 vloadn 的方式, 這裡 Adreno 建議以 vloadn 並且不建議 n > 4. (這必定也只是一個 common rule, 程式的流程與記憶體的行為也會有差異)

Image vs. buffer memory objects

OpenCL 的記憶體使用分為兩種 abstract object, 一者為 Buffer 另一為 Image Object, 對於 Image 所提供的優點如下:

主要是相較於 Buffer 多了 L1 cache (這是 Texture Unit 的特性所增加), 另外就是硬體加速的線性內插的計算, 以及最後是能夠透過硬體自動處理邊界的問題(以 Buffer 而言需要增加 GPU 所不擅長的 if/else 的 code)

Global memory (GM) vs. local memory (LM)

在 Global Memory(GM) 與 Local Memory(LM) 間應注意的事情有三, barrier 的使用, 再者為搬移資料需要考量的 cost, 最後是將資料存放在 LM 的條件
對於 Kernel 撰寫實作熟悉者, 應該對於 barrier 的使用會有相當的 overhead 不陌生, 但是對於一些有 data dependency stages 的實作這又是必要的, 減少 barrier 的使用幾乎是所有 GPU 平台一致要納入考量的點.
儘管 LM 有著較低的 latency 但對於 Adreno GPU 來說 GM 到 LM 的途徑中需要使用 GPU 內部的暫存器, 隨著需要搬移的資料數目耗費的暫存器可能會引起 register spilling 問題, 另外 Adreno GPU 對於 GM 有 L2 cache, 因此並不是直接將資料放置於 LM 就能獲得效益.
對於需要放置到 LM 的條件, 這裡建議 LM 存放介於兩個 stage 中間的資料, 或是會被使用至少三次的 input data.

Private memory

對於多數的 GPU 架構 private memory 對應到的地方是暫存器, 對於 Adreno 亦不例外, 由於 general register 數目有限, 若不良的 coding style 會造成 register spilling, 而一旦這樣的情況發生, Adreno 會嘗試自 LM 調度, 若 LM 不足則最後會調度到 GM, 這樣的流程若頻繁發生, 可以預期的是效能會大幅降低. 因此對於想要宣告為 array 的變數, 建議直接使用 LM. 其實除了這樣之外, 可以做 in-function 的 multi-stages 方式, 積極地將 kernel 透過多個 stage subroutine 來實作, 最後手段是透過精準的 variable life-scope 控制(也就是加入 {} 大括號, 主動提供 compiler 資訊) 來減少 register 的使用.

Constant memory

Adreno 提供了相對特別的 constant memory (經驗上來說常看到架構會使用 LM 作為存放 constant 的地方), 然而大小為 3KB, 超過的部分系統會放在 GM 中, 對於透過 Kernel Argument 傳入的 constant buffer 需要透過屬性的設置來預期會被放置於 constant memory.(詳細請參考 1. 文件)

Zero memory copy

如同其他 OpenCL Runtime, 若需要同時 CPU/GPU 能夠存取, Buffer/Image 的配置要透過 CL_MEM_ALLOC_HOST_PTR 這個 flag 來取得能夠讓 CPU/GPU 無需 data copy 的空間, 在透過Map/Unmap 的方式來使用. 然而對於除了 CPU, GPU 外的硬體需要使用, 需適當地選擇 cl_ion_qcom_host_ptr 或 cl_qcom_android_native_buffer_host_ptr 這兩個 flag 來使用

Work group size and shape

儘管多數 OpenCL 教科書建議在 Kernel 執行時輸入將 local work size 參數傳入 NULL, 讓 Runtime 自動配置最適當的 WorkGroup size, 但是 Adreno 還是提醒這樣的作法其實並不總是最佳的(事實上所有的平台都不是), programmer 應該嘗試尋找適當的 WorkGroup size.

Data type and bit width

資料型別的使用上 Adreno 上建議使用較短的資料型別, 除了能夠減少資料存放的大小與減少頻寬, 像是對於 half (16位元浮點數) 與 float 而言, half 還提供了兩倍的計算能力.

Math functions

這裡 Adreno 上應避免除法與餘數的計算, 此外若 mul24/mad24 (24位元乘法/乘加) 足夠精準度的話, 應該儘量使用, 這主因是一個32位元的乘法計算在 Adreno 內部是透過3個指令來組成的.

後續會再探討

訂閱：文章 (Atom)

2017年3月28日 星期二