Wednesday, July 22, 2009

Ubuntu SSD hints

Preface: Any good CompactFlash card will work as an IDE drive with a very cheap adapter. CF operates in three modes - PC Card memory mode, PC Card I/O mode, and True IDE mode. siliconkit has one good line, monoprice has a decent cheap one as usual, and newegg has some too. For those who never want to buy old 80-wire IDE kit again, you can also get CF->SATA adapters for ~$20-30.

TL;DR flash-based hard drives have existed for years right under your nose, and they are dirt cheap.


I've always liked the idea of SSDs, and now with cheap 4GB CF cards (Costco blew them out for $10/ea a while back) you can fully install Ubuntu, as long as you don't plan to rip DVDs, load Eclipse with every plugin known to man, or things like that.

However, the typical SanDisk Ultra II card is 'only' 15MB/sec and isn't fully optimized for SSD use. One can get the faster UDMA-supporting Extreme IV card, but those tend to cost more than hard drives, so you need a really good reason to have one. So, unless we have the money for that* (let alone an Intel X25!), we gotta tweak settings to make things 'feel' fast.

(* - especially if the target box is a hand-me-down P4)

And for those of you who have money for faster flash - this'll work well on that too. However, if your power is flaky, these settings keep unwritten data sitting in RAM longer, so... er, wait, just get a UPS, your computer will thank you by lasting a bit longer.

First, edit /etc/fstab to make sure the SSD is mounted noatime and nodiratime.
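For anyone unsure what that fstab edit looks like, here is a minimal Python sketch that adds the two options to a single fstab entry. The function name and the sample entry (device, filesystem, existing options) are made up for illustration; check your own /etc/fstab before changing anything.

```python
def with_noatime(fstab_line):
    """Return an fstab entry with noatime and nodiratime added to its options."""
    fields = fstab_line.split()
    if len(fields) < 4 or fields[0].startswith("#"):
        return fstab_line  # blank, comment, or malformed line: leave untouched
    options = fields[3].split(",")  # 4th field is the comma-separated option list
    for opt in ("noatime", "nodiratime"):
        if opt not in options:
            options.append(opt)
    fields[3] = ",".join(options)
    # Note: rejoining with single spaces discards the original column alignment.
    return " ".join(fields)

# Hypothetical entry; the device name and filesystem are assumptions.
example = "/dev/sda1 / ext3 errors=remount-ro 0 1"
print(with_noatime(example))
```

For the sample entry this prints `/dev/sda1 / ext3 errors=remount-ro,noatime,nodiratime 0 1`, which is the shape the edited line should have.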

Next, here are some good tweaks that set entries under /proc/sys and /sys. They can be run from /etc/rc.local, or from better places if your setup has them.

# laptop mode batches up writes - the value is how many seconds to wait after disk I/O before flushing everything dirty in one go.
echo 10 > /proc/sys/vm/laptop_mode
# only swap when absolutely necessary.
echo 0 > /proc/sys/vm/swappiness
# keep 'dirty' pages longer
echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
echo 20 > /proc/sys/vm/dirty_ratio
echo 10 > /proc/sys/vm/dirty_background_ratio
# this scheduler will work better with flash
echo deadline > /sys/block/sda/queue/scheduler
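To confirm the settings actually stuck after a reboot, something like the following Python sketch can read them back. It is only an illustration: the `check_tunables` name and the `root` parameter (there so it can be pointed at a test directory) are my own, and the target values are simply the ones from the echo lines above.

```python
import os

# Targets copied from the echo lines above; sda is the device name used there.
WANTED = {
    "proc/sys/vm/laptop_mode": "10",
    "proc/sys/vm/swappiness": "0",
    "proc/sys/vm/dirty_writeback_centisecs": "1500",
    "proc/sys/vm/dirty_ratio": "20",
    "proc/sys/vm/dirty_background_ratio": "10",
    "sys/block/sda/queue/scheduler": "deadline",
}

def check_tunables(root="/"):
    """Return {path: (wanted, found)} for every setting that does not match."""
    mismatches = {}
    for rel, wanted in WANTED.items():
        try:
            with open(os.path.join(root, rel)) as f:
                found = f.read().strip()
        except OSError:
            found = None  # file missing or unreadable
        # The scheduler file lists every scheduler and brackets the active one,
        # e.g. "noop [deadline] cfq".
        if rel.endswith("scheduler") and found is not None:
            ok = "[" + wanted + "]" in found
        else:
            ok = found == wanted
        if not ok:
            mismatches[rel] = (wanted, found)
    return mismatches

if __name__ == "__main__":
    for path, (wanted, found) in check_tunables().items():
        print("/%s: wanted %s, found %s" % (path, wanted, found))
```

No output means everything matches. If the drive is not sda, change the last key accordingly.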


Sunday, July 05, 2009

More random optimization notes:

I parallelized the DCT code differently - I eventually had a DCT function that performed 8 at once, which fits in with my batch system pretty well. This got me to about 3.9-4 Gflops (measured) on the E5200 box I've been playing with... and about 7.2 at home.

Then I realized I did something absolutely boneheaded - I duplicated the same cosine table for each DCT process. I fixed that and now the quad gets 14.6(!) Gflops at peak... and the dual about 7.3.

TL;DR any extra memory accesses can kill you with SSE code. There's only so much bandwidth to go around - even on an i7 (which would be pretty darn cool to have for this stuff. I'll get one when I can get a nice complete rig for <$500)

Now to actually process pictures - and post some!

P.S. Made a Google Code repo at - the program won't make much sense, but... there are a few nice fragments.

And other notes while processing images:

- If you're doing anything too complicated to make gcc vectorize better, You're Doing It Wrong.

- Don't worry about tightening rarely run O(n) tasks with fewer than say 10,000 items, at least if you're running with current tech.

- If you're not vectorizing, double isn't that much slower than single-precision floats. But it eats bandwidth for breakfast (nom!)

- Give something enough power and a brick really will fly. On the E5200 I can do a 2D DCT+IDCT of a 1400x2100 picture in under 10 seconds. This sounds slow... until one does the math and finds that it does a gazillion multiply+accumulates.
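
To put a rough number on "a gazillion", here is a back-of-the-envelope sketch in Python. It assumes a naive separable transform (a plain 1D DCT over every row, then over every column, each done as a matrix-vector product), which may not be exactly what the real code does. The function name and the 1400-wide by 2100-tall orientation are assumptions too, though the count comes out the same either way round.

```python
def naive_dct2d_macs(width, height):
    """Multiply-accumulates for one naive separable 2D transform.

    A length-n 1D DCT done as a plain matrix-vector product costs n*n MACs.
    """
    row_pass = height * width * width    # `height` rows, each of length `width`
    col_pass = width * height * height   # `width` columns, each of length `height`
    return row_pass + col_pass

width, height = 1400, 2100
total = 2 * naive_dct2d_macs(width, height)   # forward DCT plus inverse DCT
print(total)                                  # 20580000000 MACs
print(2 * total / 10 / 1e9)                   # Gflop/s needed for 10 s, at 2 flops per MAC
```

Under those assumptions the pair of transforms is about 20.6 billion multiply-accumulates, so finishing in 10 seconds needs roughly 4.1 Gflops. That is in the same ballpark as the 3.9-4 Gflops measured on the E5200, though the match could be partly coincidence.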

Friday, July 03, 2009

Image Processing part 2

(code going online later)

Did the code cleanup the other day... was mostly happy with the results unlike last time.

Then I started playing with DCTs just for the heck of it... and I figured out you could sharpen the image by boosting the middle/end coefficients. I haven't added the color enhancing effects back into this version yet - once I do I'll probably start going through recent pictures and posting stuff.
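
The sharpening idea can be sketched in a few lines of Python. This is a 1D toy on a single row rather than the real 2D code, it uses a plain (slow) orthonormal DCT-II, and the choice of "upper half of the spectrum" and a 1.5x boost are placeholder assumptions, not the values actually used.

```python
import math

def cosine_table(n):
    """Orthonormal DCT-II basis: table[k][i] is basis function k at sample i."""
    table = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        table.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return table

def dct(signal, table):
    """Forward transform: project the signal onto each basis function."""
    return [sum(c * x for c, x in zip(row, signal)) for row in table]

def idct(coeffs, table):
    """Inverse transform: for an orthonormal basis this is just the transpose."""
    n = len(coeffs)
    return [sum(coeffs[k] * table[k][i] for k in range(n)) for i in range(n)]

def sharpen(signal, boost=1.5):
    """Scale the upper half of the spectrum by `boost`; DC and low terms are untouched."""
    table = cosine_table(len(signal))   # built once, shared by both transforms
    coeffs = dct(signal, table)
    for k in range(len(coeffs) // 2, len(coeffs)):
        coeffs[k] *= boost
    return idct(coeffs, table)

edge = [10.0] * 4 + [20.0] * 4   # a step edge, standing in for one image row
print([round(v, 2) for v in sharpen(edge)])
```

Because the DC term is left alone, overall brightness stays put; only the detail around the edge gets exaggerated. A flat row comes back unchanged.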

I don't have a 'fast' DCT algorithm, but I do have access to an E5200 box that can sustain 1.4 Gflops. So after tweaking, it takes about 15 sec to do a 2D DCT+IDCT of a 1400x2100 image. When I get back home I'll have the Q8200 again - I bet that could do 2 Gflops. And then there's the Intel compiler to try out...

... but the real win would be porting it to GPU code. The DCT algorithm I have now could be turned into shaders really easily... probably resulting in a 10x+ speedup with a fast video card.

For now - the next step for image processing is to move from RGB to HSL. Having RGB*Y doesn't work very well for extreme adjustments...
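
As a rough illustration of why HSL is handy for this, here is a small Python sketch using the standard library's colorsys module. It only shows a saturation adjustment on one pixel; the function name and the sample color are made up, and the real code will presumably look quite different.

```python
import colorsys

def scale_saturation(rgb, factor):
    """Scale saturation in HSL space; rgb is an (r, g, b) tuple of floats in 0..1."""
    h, l, s = colorsys.rgb_to_hls(*rgb)    # note: colorsys orders it H, L, S
    s = min(1.0, max(0.0, s * factor))     # clamp so extreme factors stay valid
    return colorsys.hls_to_rgb(h, l, s)

dull_red = (0.6, 0.4, 0.4)
print(scale_saturation(dull_red, 2.0))   # more vivid, same hue and lightness
print(scale_saturation(dull_red, 0.0))   # fully desaturated: a grey
```

The point is that an extreme adjustment only moves one channel (saturation here) and gets clamped there, instead of pushing R, G and B out of range independently.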


- Don't bother using SSE intrinsics - setting up the C++ code to vectorize with gcc 4.3 is far easier, even if the results aren't quite as good.

- DCT itself is quite interesting - the 'slow' frequency change covers phase changes pretty well.