
Why Deep Learning Needs Assembler Hackers


Photo by Daniel Lopez

Take a look at this function:

for (j = 0; j < n; j++) {
  for (i = 0; i < m; i++) {
    float total(0);
    for (l = 0; l < k; l++) {
      const size_t a_index = ((i * a_i_stride) + (l * a_l_stride));
      const float a_value = a[a_index];
      const size_t b_index = ((j * b_j_stride) + (l * b_l_stride));
      const float b_value = b[b_index];
      total += (a_value * b_value);
    }
    const size_t c_index = ((i * c_i_stride) + (j * c_j_stride));
    c[c_index] = total;
  }
}

For something so simple, it turns out it’s amazingly hard for compilers to speed up without a lot of human intervention. This is the heart of the GEMM matrix multiply function, which powers deep learning, and every fast implementation I know of has come from old-school assembler jockeys hand-tweaking instructions!

When I first started looking at the engineering side of neural networks, I assumed that I’d be following the path I’d taken through the rest of my career: getting most of my performance wins from improving the algorithms, writing clean code, and generally getting out of the way so the compiler could do its job of optimizing it. Instead, I spend a large amount of my time worrying about instruction dependencies and all the other hardware details that we were supposed to be able to escape in the 21st century. Why is this?

Matrix multiplies are a hard case for modern compilers to handle. The inputs used for neural networks mean that one function call may require millions of operations to complete, which magnifies the latency impact of any small changes to the code. The access patterns are entirely predictable for a long period, but not purely linear, which doesn’t fit well with cache-line algorithms when the code is written in the naive way above. There are lots of choices about how to accumulate intermediate results and reuse memory reads, which will have different outcomes depending on the sizes of the matrices involved.
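To make those trade-offs concrete, here is a minimal sketch of the cache-blocking (tiling) transformation a tuned library applies, assuming plain row-major matrices and illustrative, untuned tile sizes; real kernels go much further, with packing, vectorization, and register tiling written in assembler.

#include <algorithm>
#include <cstddef>

// A minimal sketch of cache blocking (tiling), assuming plain row-major
// matrices: C[m x n] = A[m x k] * B[k x n]. The tile sizes are illustrative
// guesses, not tuned values; real libraries pick them to match the caches
// and register file of a specific CPU.
void gemm_blocked(const float* a, const float* b, float* c,
                  std::size_t m, std::size_t n, std::size_t k) {
  const std::size_t tile_i = 64;  // rows of A/C kept hot per block
  const std::size_t tile_j = 64;  // columns of B/C kept hot per block
  const std::size_t tile_l = 64;  // depth of each partial accumulation

  for (std::size_t i = 0; i < m * n; ++i) c[i] = 0.0f;

  for (std::size_t i0 = 0; i0 < m; i0 += tile_i) {
    for (std::size_t j0 = 0; j0 < n; j0 += tile_j) {
      for (std::size_t l0 = 0; l0 < k; l0 += tile_l) {
        const std::size_t i_end = std::min(i0 + tile_i, m);
        const std::size_t j_end = std::min(j0 + tile_j, n);
        const std::size_t l_end = std::min(l0 + tile_l, k);
        // Within a block, the small tiles of A and B are reused many times
        // while they are still resident in cache.
        for (std::size_t i = i0; i < i_end; ++i) {
          for (std::size_t j = j0; j < j_end; ++j) {
            float total = c[(i * n) + j];
            for (std::size_t l = l0; l < l_end; ++l) {
              total += a[(i * k) + l] * b[(l * n) + j];
            }
            c[(i * n) + j] = total;
          }
        }
      }
    }
  }
}

Which blocking strategy wins depends on the matrix shapes, which is exactly why the best versions end up hand-tuned per architecture rather than left to the compiler.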

All this means that an endangered species, the hand-coding assembler expert, writes all of the best implementations. GotoBlas (which evolved into OpenBlas) showed how much speed could be gained on Intel CPUs. Eigen has had a lot of work put into it to run well on both x86 and ARM with float, and gemmlowp is optimized for eight-bit on ARM. Even if you’re running on a GPU, Scott Gray (formerly at Nervana, now at OpenAI) has shown how much faster hand-coded solutions can be.
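As a sketch of what using one of those tuned libraries looks like in practice, here is the same kind of multiply handed off through the standard CBLAS interface (this assumes OpenBLAS or another CBLAS implementation is installed and linked, for example with -lopenblas); the point is that all the hand-tuned work hides behind a single call.

#include <vector>

#include <cblas.h>  // standard CBLAS header, provided by OpenBLAS among others

int main() {
  // C[m x n] = A[m x k] * B[k x n], row-major single-precision.
  const int m = 256, n = 256, k = 256;
  std::vector<float> a(m * k, 1.0f), b(k * n, 1.0f), c(m * n, 0.0f);

  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k,
              1.0f,          // alpha
              a.data(), k,   // A and its leading dimension
              b.data(), n,   // B and its leading dimension
              0.0f,          // beta
              c.data(), n);  // C and its leading dimension
  return 0;
}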

This is important because it means that there’s a lot of work involved in getting good performance from new platforms, and there’s often a big gap between existing highly-optimized solutions and those ported from other architectures. This is visible, for example, with gemmlowp on x86, where the hand optimization is still a work in progress, so its speed currently lags behind the float alternatives.

It’s also exciting, because the real-world performance of even the most optimized libraries lags behind the theoretical limits of the hardware, so there are still opportunities to squeeze more speed out of them with some clever hacking. There are also exciting developments in fundamentally different approaches to the problem, like the Winograd algorithm. The good news is that if you’re an old-school assembler hacker there’s still an important place for you in the brave new world of deep learning, so I hope we can pull you in!
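To give a flavor of why approaches like Winograd can win, here is a small sketch of the 1-D F(2,3) minimal-filtering case from Lavin and Gray’s paper: two outputs of a 3-tap filter computed with four multiplications instead of the direct six. Convolution layers build on larger 2-D versions of the same trick, trading multiplications for extra additions and transforms.

#include <array>
#include <cstdio>

// Winograd F(2,3): compute two outputs of a 3-tap filter g over input d
// using 4 multiplications (m1..m4) instead of the direct 6.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
  const float m1 = (d[0] - d[2]) * g[0];
  const float m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) * 0.5f;
  const float m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) * 0.5f;
  const float m4 = (d[1] - d[3]) * g[2];
  return {m1 + m2 + m3, m2 - m3 - m4};
}

int main() {
  const std::array<float, 4> d = {1.0f, 2.0f, 3.0f, 4.0f};
  const std::array<float, 3> g = {0.5f, -1.0f, 2.0f};
  // Direct convolution for comparison: y[i] = d[i]*g[0] + d[i+1]*g[1] + d[i+2]*g[2].
  const float y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2];
  const float y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2];
  const std::array<float, 2> y = winograd_f23(d, g);
  std::printf("direct: %f %f  winograd: %f %f\n", y0, y1, y[0], y[1]);
  return 0;
}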

