BlackCat on Net (網路黑貓) / Champ Yen: A real-world example of SIMD acceleration - NEON acceleration of tiling in VC4 of RPi3

Monday, January 16, 2017

A real-world example of SIMD acceleration - NEON acceleration of tiling in VC4 of RPi3

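Today, after adding GCC support and the corresponding performance numbers to my Matrix Multiplication example
(in short, GCC's vector optimization is weaker than Clang's),
a piece of NEON news caught my attention:
a Broadcom developer has used NEON to speed up the OSS driver.
Part of the original blog post is quoted below: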
"My first hope was that I could load a full cacheline with NEON's VLD. VLD1 only loads up to 1 "quadword" (16 bytes) at a time, so that doesn't seem like much help. VLD4 can load 64 bytes like we want, but it also turns AOS data into SOA in the process, and there's no corresponding "SOA-back-to-AOS store 8 or 16 bytes at a time" like we need to do to get things back into the CPU's strided representation. I tried VLD4+VST4 into a temporary, then doing my old untiling path on the cached temporary, but that still left me a few percent slower on loads than not doing any of this work at all.

Finally, I hit on using the VLDM instruction. It seems to be intended for stack loads/stores, but we can also use it to get 64 bytes of data in from memory untouched into NEON registers, and then I can use 4 (32bpp) or 8 (8 or 16bpp) VST1s to store it to the CPU side. With this, we get a 208.256% +/- 7.07029% (n=10) improvement to GetTexImage performance at 1024x1024. Doing the same NEON code for stores gave a 41.2371% +/- 3.52799% (n=10) improvement, probably mostly due to not calling into memcpy and having it go through its size/alignment-based memcpy path choosing process."

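According to the post, NEON's vector load instruction VLD1 can "only" read 16 bytes at a time, so the developer first planned to use VLD4 to pull in a larger amount of data at once. VLD4, however, performs something like an AOS (Array of Structures) to SOA (Structure of Arrays) conversion (strictly speaking, a de-interleave), and there is no matching store that turns SOA back into AOS 8 or 16 bytes at a time. The developer tried VLD4+VST4 into a temporary buffer and then ran the old untiling path on that temporary, but loads were still a few percent slower than doing none of this work. (Using VLD4 this way also raises register-spilling concerns. In addition, VLD4 still goes through the load/store unit, so given how ARM implements the NEON unit it is generally no faster than VLD1.)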
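To make the de-interleaving concrete, here is a minimal sketch in plain Python rather than NEON code. It only models byte order. The function names `vld4q_u8_model`, `vst4q_u8_model`, and `plain_load_model` are hypothetical, and the model assumes 8-bit elements and four 16-byte registers, similar to what the `vld4q_u8` / `vst4q_u8` intrinsics do with 64 bytes.

```python
# Plain-Python model of byte order only; this is NOT real NEON code.
# All function names here are hypothetical and exist just for illustration.

def vld4q_u8_model(mem: bytes) -> list:
    """De-interleave 64 bytes into four 16-byte 'registers' (AOS -> SOA)."""
    if len(mem) != 64:
        raise ValueError("expected exactly 64 bytes")
    # Register k receives bytes k, k+4, k+8, ... of the input.
    return [mem[k::4] for k in range(4)]


def vst4q_u8_model(regs: list) -> bytes:
    """Re-interleave four 16-byte 'registers' into 64 bytes (SOA -> AOS)."""
    out = bytearray(64)
    for k, reg in enumerate(regs):
        out[k::4] = reg  # raises ValueError if reg is not 16 bytes long
    return bytes(out)


def plain_load_model(mem: bytes) -> list:
    """A load that keeps memory order: register k holds bytes 16k..16k+15."""
    if len(mem) != 64:
        raise ValueError("expected exactly 64 bytes")
    return [mem[16 * k:16 * (k + 1)] for k in range(4)]


if __name__ == "__main__":
    cacheline = bytes(range(64))
    soa = vld4q_u8_model(cacheline)
    print(list(soa[0][:4]))                    # [0, 4, 8, 12]
    print(list(plain_load_model(cacheline)[0][:4]))  # [0, 1, 2, 3]
    print(vst4q_u8_model(soa) == cacheline)    # True
```

In this model the only way back to memory order after a VLD4-style load is a full 64-byte re-interleaving store, which matches the quoted complaint that there is no way to write the data back out in 8- or 16-byte strided pieces.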

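In the end he settled on the VLDM instruction (see this post for details on the instruction). Its advantage is that the data arrives in the NEON registers untouched, with no de-interleaving, and a single instruction can move a longer run of data. The developer ended up with a 208.256% improvement in GetTexImage, the read path, and a 41.2371% improvement on the store path. He attributes the store-side gain mostly to no longer calling memcpy, which avoids memcpy's size/alignment-based path selection.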
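The data movement behind the VLDM-plus-VST1 approach can be sketched in plain Python as follows. This is a model, not the driver's code: `store_utile` is a hypothetical name, and the layout is an assumption read off the quote, namely that the 64-byte block holds its rows back to back and is written out as either 4 rows of 16 bytes or 8 rows of 8 bytes, each row landing one CPU stride after the previous one.

```python
# Plain-Python model of scattering one 64-byte tile block into a linear
# image. store_utile is a hypothetical helper, not a Mesa/vc4 function.

def store_utile(dst: bytearray, dst_offset: int, cpu_stride: int,
                utile: bytes, row_bytes: int = 16) -> None:
    """Copy a contiguous 64-byte block into strided rows of dst."""
    if len(utile) != 64:
        raise ValueError("expected exactly 64 bytes")
    if row_bytes not in (8, 16):
        raise ValueError("row_bytes must be 8 or 16 in this sketch")
    rows = 64 // row_bytes  # 4 rows of 16 bytes, or 8 rows of 8 bytes
    for r in range(rows):
        start = dst_offset + r * cpu_stride
        if start + row_bytes > len(dst):
            raise IndexError("row would fall outside the destination")
        # One row-sized store per iteration, like one VST1 per row.
        dst[start:start + row_bytes] = utile[r * row_bytes:(r + 1) * row_bytes]


if __name__ == "__main__":
    stride = 32                      # two 16-byte blocks side by side
    image = bytearray(stride * 4)    # 4 rows of 32 bytes
    store_utile(image, 0, stride, bytes(range(64)))
    store_utile(image, 16, stride, bytes(range(64, 128)))
    print(list(image[:4]))           # [0, 1, 2, 3]
    print(list(image[16:20]))        # [64, 65, 66, 67]
    print(list(image[32:36]))        # [16, 17, 18, 19]
```

The point of the sketch is that the source bytes are never reordered: each row is a straight copy of a slice of the block, which is why a load that leaves the data in memory order is enough.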

2 comments:

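jeunder said...: Hi, vld1 can load 32 bytes at a time, i.e. two Q registers, in 2 cycles. vldm needs (number of Q registers + 1) cycles, so it doesn't seem to gain anything?; November 14, 2020, 10:10 PM
jeunder said...: Oh, by the way, when I say it doesn't gain anything, that is based on the article you linked: http://armneon.blogspot.com/2013/07/neon-tips-tricks-part-1-using-stack.html?m=1
In the end its author went back to vst1 and vld1, and reckons that saved 1 cycle.; November 15, 2020, 12:21 AM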
