I parallelized the DCT code differently - I eventually had a DCT function that computed 8 transforms at once, which fits in with my batch system pretty well. This got me to about 3.9-4 Gflops (measured) on the E5200 box I've been playing with... and about 7.2 at home.
Then I realized I did something absolutely boneheaded - I duplicated the same cosine table for each DCT process. I fixed that and now the quad gets 14.6(!) Gflops at peak... and the dual about 7.3.
TL;DR: any extra memory accesses can kill you with SSE code. There's only so much bandwidth to go around - even on an i7 (which would be pretty darn cool to have for this stuff. I'll get one when I can get a nice complete rig for <$500).
Now to actually process pictures - and post some!
P.S. Made a Google Code repo at http://code.google.com/p/chadslab/ - the program won't make much sense, but... there are a few nice fragments.
And other notes while processing images:
- If you're doing anything too complicated to make gcc vectorize better, You're Doing It Wrong.
- Don't worry about tightening rarely run O(n) tasks with fewer than, say, 10,000 items, at least if you're running with current tech.
- If you're not vectorizing, double isn't that much slower than single-precision floats. But it eats bandwidth for breakfast (nom!)
- Give something enough power and a brick really will fly. On the E5200 I can do a 2D DCT+IDCT of a 1400x2100 picture in under 10 seconds. This sounds slow... until one does the math and finds that it does a gazillion multiply+accumulates.
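To put a rough number on "gazillion", here's a back-of-the-envelope sketch. It assumes a naive separable 2D DCT over the whole image (a direct O(n^2) 1D transform on every row, then on every column, no fast DCT), which may not match what the program actually does. The function names are made up for this example.

```c
#include <stdint.h>

/* Multiply-accumulates for ONE naive separable 2D DCT of a w x h image.
 * Row pass: h rows, each costing w*w MACs. Column pass: w columns, each h*h.
 * Assumption: direct O(n^2) 1D transforms over the full image. */
uint64_t separable_dct_macs(uint64_t w, uint64_t h)
{
    return h * w * w + w * h * h;
}

/* Sustained Gflops if `macs` multiply-accumulates (2 flops each)
 * complete in `seconds`. */
double gflops(uint64_t macs, double seconds)
{
    return 2.0 * (double)macs / seconds / 1e9;
}
```

Under those assumptions a 1400x2100 picture costs 10,290,000,000 MACs per transform, so about 2.06e10 for DCT+IDCT. Doing that in 10 seconds works out to roughly 4.1 Gflops, which is at least the same order of magnitude as the measured numbers above.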