This article title should have "(2004)" added; this is seriously old information.
For modern use, something about ARM CPUs would be much more useful since that's what microcontrollers all use now. No one's doing ASM programming on x86 CPUs these days (and certainly not Pentium4 CPUs).
Perhaps full programs written entirely in assembly are rare, but for performance analysis and optimization I think knowledge of these kinds of tricks (probably updated for the N generations since 2004, of course) is still relevant.
For instance Daniel Lemire's blog [1] is quite often featured here, and very often features very low-level performance analysis and improvements.
[1]: https://lemire.me/blog/
> No one's doing ASM programming on x86 CPUs these days
I take your point, but I think there’s still a fair bit of x86 asm out there. For example, in ffmpeg and zstd.
I don't think anyone really cares about optimizing those codepaths, though.
Try Eigen then, where people were tweaking every last ounce of performance. Even then, it sometimes has trouble matching MKL or NVIDIA's libraries for ultimate performance.
For systems nobody has used in years?
> No one's doing ASM programming on x86 CPUs these days
I don't think that's entirely true... it's still pretty common to write high-performance / performance-sensitive computation kernels in assembly or intrinsics.
A fascinating peek into the fairly deep past (sigh) is Abrash's The Zen of Assembly Language. Time pretty much overtook the planned Volume 2, but Volume 1 is still a fascinating read from a time when tuning for things like the prefetch queue was still worthwhile.
> (Intermediate)1. Adding to memory faster than adding memory to a register
I'm not familiar with the Pentium 4, but my guess is that a memory store is relatively cheaper than a load on many modern (out-of-order) microarchitectures.
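For anyone trying to picture the two instruction forms the article item is contrasting, here's a hypothetical C sketch (function names are mine, not from the article) of the loop shapes they usually come from; which instruction the compiler actually emits depends on the compiler and its aliasing assumptions:

    #include <stddef.h>
    #include <stdint.h>

    /* Update goes through memory every iteration: since *total may alias a[],
       the compiler generally emits a load/add/store (or an add-to-memory
       instruction, "add [mem], reg") per element. */
    void sum_to_memory(const int32_t *a, size_t n, int32_t *total) {
        for (size_t i = 0; i < n; i++)
            *total += a[i];
    }

    /* Running sum stays in a register: the loop body is an add-from-memory
       into a register ("add reg, [mem]"), with a single store at the end. */
    void sum_in_register(const int32_t *a, size_t n, int32_t *total) {
        int32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        *total += sum;
    }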
> (Intermediate)14. Parallelization.
I feel like this is where compilers come in handy, because juggling critical paths and resource pressure at the same time by hand sounds like a nightmare to me.
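The part you can still do by hand in C without too much pain is splitting a long dependency chain into independent accumulators and letting the compiler and CPU schedule them. A hypothetical sketch (names are mine):

    #include <stddef.h>

    /* One accumulator: every add depends on the previous one, so the loop
       is limited by the latency of a single dependency chain. */
    double sum_serial(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Two independent accumulators: the adds can overlap in the pipeline.
       (Assumes n is even for brevity; note this changes FP rounding order.) */
    double sum_two_chains(const double *a, size_t n) {
        double s0 = 0.0, s1 = 0.0;
        for (size_t i = 0; i < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        return s0 + s1;
    }

How many chains actually help depends on instruction latency and execution-port throughput, which is exactly the resource juggling being described above.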
> (Advanced)4. Interleaving 2 loops out of sync
Software pipelining!
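Roughly: start iteration i+1's loads while iteration i's work is still in flight, with a prologue and epilogue around the steady-state loop. A hand-rolled, hypothetical C approximation (modern out-of-order cores and compilers do much of this overlap for you):

    #include <stddef.h>

    /* Software-pipelined transform: the load for element i+1 is issued in
       the same iteration that processes element i. */
    void scale_pipelined(const float *src, float *dst, size_t n, float k) {
        if (n == 0) return;
        float cur = src[0];              /* prologue: first load */
        for (size_t i = 0; i + 1 < n; i++) {
            float next = src[i + 1];     /* load for the next iteration */
            dst[i] = cur * k;            /* work for the current iteration */
            cur = next;
        }
        dst[n - 1] = cur * k;            /* epilogue: last element */
    }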
> If you have a full 32-bit number and you need to divide, you can simply do a multiply and take the top 32-bit half as the result.
Can someone explain how this can work? Obviously, you can't just multiply the same numbers instead of dividing.
Of course not. It's multiplication by the reciprocal in a fixed-point representation. You'd first have to compute the reciprocal as 2**32 / divisor (suitably rounded), which is why it's most often done with constant divisors.
A longer tutorial that goes into more depth: https://homepage.cs.uiowa.edu/~jones/bcd/divide.html
Also, x86 has an instruction that multiplies two 32-bit registers and stores the 64-bit result across two 32-bit registers, so you get the result of the division in the register holding the high half of the product and don't need a separate shift.
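A concrete (hypothetical, not from the article) C sketch of the trick for an unsigned divide by a constant 10, using the high bits of a 32x32->64 multiply; the magic constant here is ceil(2**35 / 10), so this particular divisor needs a few extra shift bits for correctness:

    #include <stdint.h>
    #include <assert.h>

    /* x / 10 for any 32-bit unsigned x: multiply by the fixed-point
       reciprocal ceil(2^35 / 10) = 0xCCCCCCCD and keep the high bits
       of the 64-bit product. */
    static uint32_t div10(uint32_t x) {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }

    int main(void) {
        assert(div10(0) == 0);
        assert(div10(99) == 9);
        assert(div10(4294967295u) == 429496729u);
        return 0;
    }

Modern compilers do this transformation automatically for division by a constant, which is one reason it rarely needs to be written by hand anymore.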
What's a good resource like this for modern CPUs (especially ARM)?
For modern x86, the go-to resource is often Agner Fog's optimisation manuals: https://www.agner.org/optimize/
Vendor microarchitectural and optimization manuals.
Looks like this was written in 2004, or thereabouts.
I was wondering why it said P4. That's an old processor.
Also, the P4 was based on the NetBurst architecture.
"The Willamette and Northwood cores contain a 20-stage instruction pipeline. This is a significant increase in the number of stages compared to the Pentium III, which had only 10 stages in its pipeline. The Prescott core increased the length of the pipeline to 31 stages."
https://en.wikipedia.org/wiki/NetBurst
And many of those tricks actually work for long pipelines.
Many of the tricks don't work the same way anymore because the decoder now breaks instructions down into micro-ops. Hand-expanding things yourself can leave you with worse RISC-style code than what Intel's or AMD's microcode would have produced, and the CPU can optimize better when it sees the original CISC instructions. The lower cache pressure of the compact encodings can still be valuable, though.
Speculation and branch prediction have also gotten vastly better since then.
Compilers themselves have gotten much better since then as well, so you can sometimes get away with just intrinsics.
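As a small illustration of the "just intrinsics" point, here's a hypothetical SSE2 sketch (function name is mine) that compilers will schedule and register-allocate perfectly well on their own:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* Sum 32-bit ints four lanes at a time; assumes n is a multiple of 4
       for brevity. */
    int32_t sum_sse2(const int32_t *a, size_t n) {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(a + i)));
        int32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }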