Loongson Open Source Community

[Repost] Trying out Intel's new SIMD instructions: AVX

Posted on 2011-1-10 11:52:09
The new AVX improves on SSE. It gives roughly a 10% improvement right out of the box, which looks promising.

http://muizelaar.blogspot.com/2011/01/trying-out-avx.html

Jeff Muizelaar
Saturday, January 8, 2011

Trying out AVX

Intel's new Sandy Bridge CPUs came out this week and they support a new set of instructions called AVX. The AVX instructions are a much bigger change than the usual SSE revisions of the past few microarchitectures. First of all, they widen the 128-bit SSE registers to 256 bits. Second, they introduce an entirely new instruction encoding. The new encoding switches from 2-operand instructions to 3-operand instructions, allowing the destination register to be different from the source registers. For example:

  addps r0, r1       # (r0 = r0 + r1)
         vs.
  vaddps r0, r1, r2  # (r0 = r1 + r2)

This new encoding is used not only for the new 256-bit instructions, but also for the 128-bit AVX versions of all the old SSE instructions. This means that existing SSE code can be improved without requiring a switch to 256-bit registers. Finally, AVX introduces some new data movement instructions, which should help improve code efficiency.
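To make the benefit of the non-destructive form concrete, here is a minimal sketch. The function name `sum_and_diff` is made up for illustration and is not taken from qcms. Both inputs are used twice, so with the 2-operand SSE encoding a compiler generally has to copy one register before the first operation overwrites it, while the 3-operand AVX encoding can write each result to a separate register. The source is identical in both cases; only the compiler flags differ. The scalar fallback exists only so the sketch also builds on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, ... */
#define HAVE_SSE 1
#endif

/* Hypothetical helper: computes a+b and a-b on four floats. */
void sum_and_diff(const float a[4], const float b[4],
                  float sum[4], float diff[4])
{
#ifdef HAVE_SSE
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* va and vb are both still needed after the add. A 2-operand addps
       overwrites one of them, so a register copy is usually required;
       vaddps/vsubps can name a separate destination instead. */
    _mm_storeu_ps(sum,  _mm_add_ps(va, vb));
    _mm_storeu_ps(diff, _mm_sub_ps(va, vb));
#else
    /* Scalar fallback, assumed only for portability of this sketch. */
    for (size_t i = 0; i < 4; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = a[i] - b[i];
    }
#endif
}
```

Comparing the assembly produced with and without -mavx (for example via the compiler's -S option) is one way to check whether the extra copy actually goes away in a given build.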

I decided to see what kind of performance difference using AVX could make in qcms with minimal effort. If you use SSE compiler intrinsics, like qcms does, switching to AVX is very easy; simply recompile with -mavx. In addition to using -mavx, I also took advantage of some of the new data movement instructions by replacing the following:

  vec_r = _mm_load_ss(r);
  vec_r = _mm_shuffle_ps(vec_r, vec_r, 0);

with the new vbroadcastss instruction:

  vec_r = _mm_broadcast_ss(r);

Overall, this change reduces the inner loop by 3 instructions.
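The two broadcast variants above can be put side by side in one small function. This is only a sketch: `splat4` is a hypothetical name, and selecting the path with the `__AVX__` macro (which GCC and Clang define when compiling with -mavx) is an assumption about how one might structure it, not how qcms does it. The scalar branch is there only so the sketch builds on non-x86 targets.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>   /* AVX: _mm_broadcast_ss */
#elif defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE: _mm_load_ss, _mm_shuffle_ps */
#endif

/* Hypothetical helper: copies *src into all four lanes of out. */
void splat4(const float *src, float out[4])
{
#if defined(__AVX__)
    /* One instruction: vbroadcastss loads and replicates the scalar. */
    _mm_storeu_ps(out, _mm_broadcast_ss(src));
#elif defined(__SSE__) || defined(_M_X64)
    /* Two instructions: load into lane 0, then shuffle lane 0 everywhere. */
    __m128 v = _mm_load_ss(src);
    v = _mm_shuffle_ps(v, v, 0);
    _mm_storeu_ps(out, v);
#else
    /* Scalar fallback, assumed only for portability of this sketch. */
    for (size_t i = 0; i < 4; i++)
        out[i] = *src;
#endif
}
```

All three branches produce the same four-lane result, so a caller does not need to know which one was compiled in.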

The performance results were positive, but not what I expected. Here's what the timings were:

  SSE2:                  75798 usecs
  AVX (-mavx):           69687 usecs
  AVX w/ vbroadcastss:   72917 usecs

Switching to the AVX encoding improves performance by more than I expected: nearly 10%. But adding the new vbroadcastss instruction, in addition to the AVX encoding, not only doesn't help, but actually makes things worse. I tried analyzing the code with the Intel Architecture Code Analyzer, but the analyzer also thought that using vbroadcastss should be faster. If anyone has any ideas why vbroadcastss would be slower, I'd love to hear them.
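For readers who want the exact percentages behind the three timings, a small helper is enough. The name `percent_saved` is made up for this sketch; it simply reports how much less time the second measurement took relative to the first.

```c
/* Hypothetical helper: percentage of time saved going from `before`
   to `after`. Positive means faster, negative means slower. */
double percent_saved(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Applied to the numbers in the post, `percent_saved(75798, 69687)` comes out at about 8.1 for SSE2 to AVX, `percent_saved(75798, 72917)` at about 3.8 for SSE2 to AVX with vbroadcastss, and `percent_saved(69687, 72917)` at about -4.6, which is the slowdown from adding vbroadcastss on top of the AVX build.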

Despite this weird performance problem, AVX seems like a good step forward and should provide good opportunities for improving performance beyond what's possible with SSE. For more information, check out this presentation, which gives a good overview of how to take advantage of AVX.

Posted by Jeff Muizelaar at 6:42 PM
