網路黑貓 BlackCat on Net / Champ Yen: You Can Write SIMD Faster Than You Can Write a Web Page

Friday, January 6, 2017

You Can Write SIMD Faster Than You Can Write a Web Page - II

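Part I did not go into how clang's OpenCL-kernel-style vectors are actually used. Some of the basic type declarations already appeared in that post, so this one starts straight from usage. First, initializing a vector: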

typedef int int4 __attribute__((ext_vector_type(4)));

typedef int int8 __attribute__((ext_vector_type(8)));

int a = 3;

int4 va = {1, 2, 3, 4};

int4 vb = 3;

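Next comes access to individual components. The forms below are well suited to rearranging data layouts. The two sides of an assignment may use different notations; they do not have to match: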

va.x = a;

// the two lines below are equivalent (x, y, z, w are the same lanes as s0, s1, s2, s3)

vb.xyw = va.xzw;

vb.s013 = va.s023;

// repeating a component on the right-hand side is fine

va = vb.xyyz;

// high and low part

va.hi = va.lo;

va.lo = vb.hi;

// even and odd

va.odd = vb.even;

Simple vector arithmetic looks like this; the table in clang's "Vectors and Extended Vectors" documentation lists the supported operators:

// scalar multiply

va = a * vb;

// element-wise multiply

va = va * vb;

va = va | vb;

va = -va;

Converting between vectors of different types:

typedef float float4 __attribute__((ext_vector_type(4)));

// value conversion: each int element becomes the equivalent float

float4 vfa = __builtin_convertvector(va, float4);

// bit-for-bit reinterpretation (a separate variable, since vfa is already declared above)

float4 vfb = (float4)va;

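Now for a real example. The function below averages each group of four GRBG pixels in a Bayer-format image, producing a grayscale image one quarter the size: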

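typedef unsigned char uchar8 __attribute__((ext_vector_type(8)));

typedef unsigned short ushort8 __attribute__((ext_vector_type(8)));

// assumes w is a multiple of 16, h is even, and both buffers are aligned for 8-byte vector loads and stores

void bayer_convert(char *src, char *gray, int w, int h)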

{

    for(int y = 0; y < h; y+=2){

        for(int x = 0; x < w; x+=16){

            //bayer raw data fetch

            ushort8 pix00 = __builtin_convertvector(*((uchar8*)(src+x)), ushort8);

            ushort8 pix01 = __builtin_convertvector(*((uchar8*)(src+x+8)), ushort8);

            ushort8 pix10 = __builtin_convertvector(*((uchar8*)(src+x+w)), ushort8);

            ushort8 pix11 = __builtin_convertvector(*((uchar8*)(src+x+w+8)), ushort8);

            ushort8 pix_g0;

            pix_g0.lo = pix00.even;

            pix_g0.hi = pix01.even;

            ushort8 pix_r;

            pix_r.lo = pix00.odd;

            pix_r.hi = pix01.odd;

            ushort8 pix_b;

            pix_b.lo = pix10.even;

            pix_b.hi = pix11.even;

            ushort8 pix_g1;

            pix_g1.lo = pix10.odd;

            pix_g1.hi = pix11.odd;

            // average the four samples; the 16-bit lanes hold sums up to 1020 without overflow

            ushort8 pix_out = (pix_g0 + pix_r + pix_b + pix_g1) / 4;

            // write out

            *((uchar8*)(gray+(x/2))) = __builtin_convertvector(pix_out, uchar8);

        }

        src += w*2;

        gray += w / 2;

    }

}

How about that? Much clearer and simpler than reading intrinsic functions, isn't it?
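When experimenting with a vectorized routine like this, a plain scalar version is handy for comparing outputs. The following is only a sketch of such a reference: the name `bayer_convert_ref`, the explicit `gray_stride` parameter (bytes between output rows) and the `demo_pixel` helper are assumptions made for this example, not part of the original code.

```c
/* Scalar reference (hypothetical): average each 2x2 GRBG cell into one gray pixel.
   Assumes w and h are even; src holds w*h bytes with a row stride of w;
   gray_stride is the distance in bytes between consecutive output rows. */
void bayer_convert_ref(const unsigned char *src, unsigned char *gray,
                       int w, int h, int gray_stride) {
    for (int y = 0; y < h; y += 2) {
        const unsigned char *row0 = src + y * w;   /* G R G R ... */
        const unsigned char *row1 = row0 + w;      /* B G B G ... */
        unsigned char *out = gray + (y / 2) * gray_stride;
        for (int x = 0; x < w; x += 2) {
            /* Sum of four 8-bit samples is at most 1020, so unsigned is ample. */
            unsigned sum = (unsigned)row0[x] + row0[x + 1] + row1[x] + row1[x + 1];
            out[x / 2] = (unsigned char)(sum / 4);
        }
    }
}

/* Tiny demo: run the reference on a 4x2 image and return output pixel i (0 or 1). */
int demo_pixel(int i) {
    const unsigned char src[8] = { 10, 20, 100, 200,
                                   30, 40,  60,  40 };
    unsigned char gray[2] = { 0, 0 };
    bayer_convert_ref(src, gray, 4, 2, 2);
    return gray[i];
}
```

In the demo, the first cell sums to 100 and the second to 400, so the two output pixels are 25 and 100.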

If you want to learn more about OpenCL vector operations, see this article.

If you are curious, remember to compile with --save-temps to see which instructions are generated! The next post will cover more advanced usage.
