Friday, December 25, 2009

Checked hw-working code into SVN...

Finally cracked serial port and clock setup on real hardware tonight. I was also able to set up the PLL to run the chip at 50 MHz - and even overclock it to 100 MHz. :)

Interrupts don't work yet, but eh.

I think I should write something FORTH-like next so I won't have to flash the chip too much, or at least a super-simple interpreter of some sort for HW bringup.
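
As a rough idea of what "super-simple" could look like, here is a tiny RPN (stack-based) evaluator in plain C. It is only a sketch: the name `rpn_eval`, the stack depth and the three operators are made up for illustration, and the hardware-specific words (peek/poke, serial I/O) that would make it useful for bringup are left out.

```c
#include <ctype.h>
#include <stdlib.h>

#define STACK_DEPTH 16   /* arbitrary size for this sketch */

/* Evaluate a space-separated RPN line such as "2 3 + 4 *".
 * Returns 0 and stores the top of the stack in *result on success,
 * -1 on stack overflow/underflow or an unknown word. */
int rpn_eval(const char *line, long *result)
{
    long stack[STACK_DEPTH];
    int sp = 0;                       /* number of items on the stack */

    while (*line) {
        if (isspace((unsigned char)*line)) {
            line++;
        } else if (isdigit((unsigned char)*line)) {
            char *end;
            if (sp == STACK_DEPTH) return -1;
            /* base 0 accepts decimal, octal and 0x-prefixed hex */
            stack[sp++] = strtol(line, &end, 0);
            line = end;
        } else {
            long a, b;
            if (sp < 2) return -1;
            b = stack[--sp];
            a = stack[--sp];
            switch (*line++) {
            case '+': stack[sp++] = a + b; break;
            case '-': stack[sp++] = a - b; break;
            case '*': stack[sp++] = a * b; break;
            default:  return -1;      /* unknown word */
            }
        }
    }
    if (sp < 1) return -1;
    *result = stack[sp - 1];
    return 0;
}
```

Feeding it one line at a time from the serial port would be the obvious next step; `rpn_eval("0x10 1 +", &r)` leaves 17 in `r`.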

Tuesday, December 15, 2009

c++ virtual-functioning runtime in 270 lines, no asm.

Interrupts definitely won't work yet, but I've got virtual functions working and it's dealing with multiple source files even. No assembly needed! (and it does use libgcc for some helper code...)

http://code.google.com/p/chadslab/source/browse/#svn/trunk/m++

Saturday, December 12, 2009

Cortex-M3 + CodeSourcery startup...

This .pdf documents how to write a minimal C program to run on the LM3S811EVB qemu target:

http://www.bravegnu.org/gnu-eprog-handout.pdf

Install CodeSourcery and have fun! I'm going to look into elaborating on this over the weekend. Maybe even blink an LED on my LPC13xx board...

---

Compiling CMSIS w/CodeSourcery w/o IDE...

This is for the LPC13XX target. This also works with Luminary Micro's stuff as long as you change the relevant LPC13xx's to LM3's.

- Get the CodeSourcery tarball for Linux and extract it.
- Set path to [extraction dir]/bin:$PATH to get the toolchain in path

- Get the CMSIS library from onarm.com and extract it
- create a flat cmsis directory outside of that tree
- copy CMSIS_V1P30/CM3/CoreSupport/core_cm3.* to it
- then CMSIS_V1P30/CM3/DeviceSupport/NXP/LPC13xx/startup/gcc/*
- finally CMSIS_V1P30/CM3/Example/Sourcery\ G++Lite/LPC13xx/*

Then make a start.c with an empty _start() function.

To build: arm-none-eabi-gcc -mcpu=cortex-m3 -msoft-float -mthumb -fno-common -Wl,-cref,'-Map=map.txt',-S,'-TLPC13xx.ld' *.c startup_LPC13xx.s

And to de-ELF it so you can write it to flash: arm-none-eabi-objcopy a.out -O binary a.bin

Now to actually build and run code. The next step is to build the LM3 version and run it under qemu's LM3S811 EVB emulation... and not have it crash.

Thursday, December 10, 2009

Some crazy macro-abuse I came up with..

... not terribly useful in and of itself, but it allows for an organized extensible systemcall-y/pseudo-dynamic library setup.


#include <stdio.h>

#define BEGIN_FUNC(name, inm, outm) \
typedef struct name##_in_t { inm } name##_in; \
typedef struct name##_out_t { outm } name##_out; \
name##_out name(name##_in *in) {name##_out out; printf("%zu %zu\n", sizeof(name##_in), sizeof(name##_out));

#define END_FUNC return out;}
#define CALL(output, name, invar ... ) output = name(&(name##_in){invar})
#define CALLA(output, name, invar ... ) name##_out output = name(&(name##_in){invar})

BEGIN_FUNC(test, int a; int ab;, int b;)
printf("%d %d\n", in->a, in->ab);
out.b = in->a * 2;
END_FUNC

int main(void)
{
int x = 2;

CALLA(outpt, test, 20, x);
printf("%d\n", outpt.b);
CALL(outpt, test, 30);
printf("%d\n", outpt.b);

return 0;
}
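
For anyone squinting at the macros, this is roughly what they expand to for the `test` example, written out by hand. The debug printf calls are left out, and `call_both_ways` is a made-up wrapper standing in for main() so the two call styles can be compared.

```c
/* Hand-expanded equivalent of BEGIN_FUNC(test, int a; int ab;, int b;),
 * with the size-printing debug line and the body's printf left out. */
typedef struct test_in_t  { int a; int ab; } test_in;
typedef struct test_out_t { int b; } test_out;

test_out test(test_in *in)
{
    test_out out;
    out.b = in->a * 2;
    return out;
}

/* Hypothetical wrapper showing what the two call macros turn into. */
int call_both_ways(int x)
{
    /* CALLA(outpt, test, 20, x): declare the result variable and call. */
    test_out outpt = test(&(test_in){20, x});
    int first = outpt.b;

    /* CALL(outpt, test, 30): reuse the variable. The member 'ab' is not
     * listed, so the compound literal zero-initializes it. */
    outpt = test(&(test_in){30});
    return first + outpt.b;
}
```

The arguments travel as a C99 compound literal, so any trailing members a caller leaves out come through as zero. That is what makes the scheme extensible: new input fields can be appended without touching existing call sites.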

Tuesday, August 11, 2009

Notes on the PS3 wireless keypad and Linux...

The "PS3" Wireless Keypad is actually a small Bluetooth(r) keyboard+mouse.

To get it into pairing mode with other devices, hold down the blue button, turn the power 'on', and wait for the blinking lights. Then you can pair with it normally: set a PIN and, like most other Bluetooth keyboards, type the PIN and press Enter.

And no, you can't use it as a USB keyboard.

I'm not quite sure how the modifier keys work yet, and how to properly map it. More details will come later as I figure out how to actually use it. ;) I've made a little progress using the "input-events" program (ubuntu: input-utils pkg)

As a general keyboard there are definitely "holes" in the keymap. The blue (left) button works as left shift for the keys it feels like working for, and the right button is right alt.

However, they're only active for some keys, and those keys get remapped.

Therefore, to make this thing *actually* work, one needs to map other keys in as CTRL and ALT. The two buttons next to the keypad select switch (which map as F24 and F23) should work nicely for this.

The Linux input drivers can't do remapping on this, but x.org's event driver can.

xmodmap -e "keycode 202 = Control_L" <- that *should* have worked, but the actual modifier isn't kicking in. Argh. But I'm tired, so that's a problem for another day...

Friday, August 07, 2009

mini 6502 emulator v0.001

I checked my very early 6502 source into google code (project: chadslab) - it's not debugged at all, but it's the smallest emulator I know of.

tl;dr the 6502's logic is really elegant and I felt like trying to write a .c file out of it. :)

Wednesday, July 22, 2009

ubuntu SSD hints

Preface: Any good CompactFlash card will work as an IDE drive with a very very cheap adapter. CF operates in three modes - memory map?, PCMCIA, and true IDE mode. siliconkit has one good line, monoprice has a decent cheap one as usual, and newegg has some too. For those who never want to buy old 80-wire IDE kit again, you can also get CF->SATA adapters for ~$20-30.

TL;DR flash-based hard drives have existed for years right under your nose, and they are dirt cheap.

---

I've always liked the idea of SSD's, and now with cheap 4GB CF's (Costco blew them out for $10/ea a while back) you can fully install ubuntu, as long as you don't plan to rip DVD's, load eclipse with every plugin known to man, or things like that.

However, the typical Sandisk Ultra II card is 'only' 15MB/sec and isn't fully optimized for SSD use. One can get the faster UDMA-supporting Extreme IV card but those tend to cost more than HD's, so you need a really good reason to have one. So, unless we have the money for that* (let alone an intel X25!) we gotta tweak settings to make things 'feel' fast.

(* - especially if the target box is a hand-me-down P4)

And for those of you who have money for faster flashes - this'll work well on those too. However if your power is flaky, these... er, wait, just get a UPS, your computer will thank you by lasting a bit longer.

First, edit /etc/fstab to make sure the SSD is mounted noatime and nodiratime.

Then here are some good tweaks for the /proc/sys and /sys/... tables. They can be run from /etc/rc.local, or set from a better place if you have one.

# laptop mode holds onto writes for a while - up to 10 minutes with this setting.
echo 10 > /proc/sys/vm/laptop_mode
# only swap when absolutely necessary.
echo 0 > /proc/sys/vm/swappiness
# keep 'dirty' pages longer
echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
echo 20 > /proc/sys/vm/dirty_ratio
echo 10 > /proc/sys/vm/dirty_background_ratio
# this scheduler will work better with flash
echo deadline > /sys/block/sda/queue/scheduler

references:

http://ubuntuforums.org/showthread.php?t=1183113
http://www.ocztechnologyforum.com/forum/showthread.php?t=54379&page=2

Sunday, July 05, 2009

More random optimization notes:

I parallelized the DCT code differently - I eventually had a DCT function that performed 8 at once, which fits in with my batch system pretty well. This got me to about 3.9-4 Gflops (measured) on the E5200 box I've been playing with... and about 7.2 at home.

Then I realized I did something absolutely boneheaded - I duplicated the same cosine table for each DCT process. I fixed that and now the quad gets 14.6(!) Gflops at peak... and the dual about 7.3.

TL;DR any extra memory accesses can kill you with SSE code. There's only so much bandwidth to go around - even on an i7 (which would be pretty darn cool to have for this stuff. i'll get one when i can get a nice complete rig for <$500)

Now to actually process pictures - and post some!

P.S. Made a Google Code repo at http://code.google.com/p/chadslab/ - the program won't make much sense, but... there are a few nice fragments.

And other notes while processing images:

- If you're doing anything too complicated to make gcc vectorize better, You're Doing It Wrong.

- Don't worry about tightening rarely run O(1) tasks with less than say 10,000 items, at least if you're running with current tech.

- If you're not vectorizing, double isn't that much slower than single-precision floats. But it eats bandwidth for breakfast (nom!)

- Give something enough power and a brick really will fly. On the E5200 I can do a 2D DCT+IDCT of a 1400x2100 picture in under 10 seconds. This sounds slow... until one does the math and finds that it does a gazillion multiply+accumulates.
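
To put a rough number on "gazillion": assuming a naive separable transform over the whole plane (every row, then every column, no fast algorithm, one channel), the count can be sketched like this. The function name is made up, and the whole-plane assumption is mine; the post does not say how the image is split up.

```c
/* Multiply-accumulates for one naive separable 2D DCT of a w x h plane:
 * each of the h rows costs w*w, and each of the w columns costs h*h. */
unsigned long long dct2d_macs(unsigned long long w, unsigned long long h)
{
    return h * w * w + w * h * h;
}
```

Under those assumptions, `dct2d_macs(1400, 2100)` is 10,290,000,000, so a DCT followed by an IDCT is roughly 20.6 billion multiply+accumulates per channel.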

Friday, July 03, 2009

Image Processing part 2

(code going online later)

Did the code cleanup the other day... was mostly happy with the results unlike last time.

Then I started playing with DCT's just for the heck of it... and I figured out you could sharpen the image by boosting the middle/end coefficients. I still don't have the color enhancing effects added back into this version yet - once I do I'll probably start going through recent pictures and posting stuff.

I don't have a 'fast' DCT algorithm, but I do have access to an E5200 box that can sustain 1.4 GFlops. So after tweaking it takes about 15sec to do a 2D DCT+IDCT of a 1400x2100 image. When I get back home I'll have the Q8200 again - I bet that could do 2GFlops. And then there's the intel compiler to try out...

... but the real win would be transposing it to GPU code. The DCT algorithm I have now could be turned into shaders really easily... probably resulting in a 10x+ speedup w/a fast video card.

For now - the next step for image processing is to move from RGB to HSL. Having RGB*Y doesn't work very well for extreme adjustments...
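
For reference, the textbook RGB to HSL conversion looks like the sketch below. Inputs are assumed to be scaled to [0, 1], and the function name and signature are made up for illustration.

```c
/* Convert r, g, b in [0, 1] to hue in degrees [0, 360) and
 * saturation and lightness in [0, 1]. */
void rgb_to_hsl(double r, double g, double b,
                double *h, double *s, double *l)
{
    double max = r > g ? (r > b ? r : b) : (g > b ? g : b);
    double min = r < g ? (r < b ? r : b) : (g < b ? g : b);
    double delta = max - min;

    *l = (max + min) / 2.0;
    if (delta == 0.0) {               /* gray: hue is undefined, report 0 */
        *h = 0.0;
        *s = 0.0;
        return;
    }

    /* s = delta / (1 - |2l - 1|) */
    double t = 2.0 * *l - 1.0;
    *s = delta / (1.0 - (t < 0.0 ? -t : t));

    if (max == r)
        *h = 60.0 * ((g - b) / delta) + (g < b ? 360.0 : 0.0);
    else if (max == g)
        *h = 60.0 * ((b - r) / delta) + 120.0;
    else
        *h = 60.0 * ((r - g) / delta) + 240.0;
}
```

The appeal for extreme adjustments is that lightness can be pushed around without touching hue, which scaling R, G and B directly does not guarantee once channels start clipping.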

Notes:

- Don't bother using SSE intrinsics - setting up the C++ code to vectorize with gcc 4.3 is far easier, even if the results aren't quite as good.

- DCT itself is quite interesting - the 'slow' frequency change covers phase changes pretty well.
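
As an illustration of the first note, this is the kind of loop shape gcc's autovectorizer tends to handle well: unit stride, no aliasing, and no branches or calls in the body. It is written in C99 here rather than C++ (where `__restrict__` would replace `restrict`), and the function itself is a made-up example.

```c
#include <stddef.h>

/* out[i] = a[i] * gain + b[i]
 * 'restrict' promises the arrays do not overlap, which is usually what
 * the vectorizer needs to know before it will emit SSE code. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float gain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * gain + b[i];
}
```

Building with -O3 should be enough to turn on the vectorizer with gcc 4.3; whether a given loop was actually vectorized is worth checking in the generated assembly.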

Tuesday, June 30, 2009

pseudoHDR - take 0.1.

I've been experimenting with reprocessing RAW files from my camera using dcraw and then taking the RGB data and reprocessing it.

The core idea is that my dSLR has 14 bits of range (and your typical .jpg? 8 bits.) So that gives us extra dynamic range to reprocess the picture and do an HDR-type effect.

Here's my first pass at it. Yes, this code is structured horribly (if you're a future prospective employer, I can do better than this. Really!) I might or might not rewrite it properly once I get back from vacation. I've got other things I wanna throw together, too.

To use it, run dcraw with something like these settings: dcraw -h -a -4 -n 100 .CR2 and then pipe it through this program.

---

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>

double Kb = 0.114, Kr = 0.299;

char line1[512], line2[512], line3[512];
int i = 0, x, y, px, py;
unsigned short int *pic, *out;
double *fpic;

double *Y, *Yorig, *Y2, *Y3, *Cr, *Cg, *Cb;
double minY = 65536, maxY = 0.0;

void process(int inverse, double f1, double f2)
{
#define R 4

for (py = 0; py < y; py++) {
// fprintf(stderr, "%d\n", py);
for (px = 0; px < x; px++) {
double total = 0, peak = 0, mult, factor = 0, tfactor = 0;
int ty;
for (ty = ((py - R) > 0) ? (py - R) : 0; (ty < (y - 1)) && ((ty - py) < R); ty++) {
if (Y[(ty * x) + px] > peak) peak = Y[(ty * x) + px];
factor = 1.0 / ((abs(ty - py)) + 1);
total += (Y[(ty * x) + px] * factor);
tfactor += factor;
}
mult = (65536.0 - (total / tfactor)) / 65536.0;
mult = 1 - ((mult * mult) * f1);
if (mult > 4) mult = 4;
if (mult < 0.001) mult = 0.001;
Y2[(py * x) + px] = Y[(py * x) + px] / mult;
}
}

for (py = 0; py < y; py++) {
// fprintf(stderr, "%d\n", py);
for (px = 0; px < x; px++) {
double total = 0, peak = 0, mult, factor = 0, tfactor = 0;
int tx;
for (tx = ((px - R) > 0) ? (px - R) : 0; (tx < (x - 1)) && ((tx - px) < R); tx++) {
if (Y2[(py * x) + tx] > peak) peak = Y2[(py * x) + tx];
factor = 1.0 / ((abs(tx - px)) + 1);
total += (Y2[(py * x) + tx] * factor);
tfactor += factor;
}
mult = (65536.0 - (total / tfactor)) / 65536.0;
mult = 1 - ((mult * mult) * f2);
if (mult > 4) mult = 4;
if (mult < 0.001) mult = 0.001;
Y3[(py * x) + px] = Y2[(py * x) + px] / mult;
// fprintf(stderr, "%lf %lf %lf %lf\n", Y[(py * x) + px], Y2[(py * x)+ px], (total / tfactor), mult);
}
}
}

void processdark(int inverse, double f1, double f2)
{
#if 0
for (py = 0; py < y; py++) {
// fprintf(stderr, "%d\n", py);
for (px = 0; px < x; px++) {
double total = 0, peak = 0, mult, factor = 0, tfactor = 0;
int ty;
for (ty = ((py - R) > 0) ? (py - R) : 0; (ty < (y - 1)) && ((ty - py) < R); ty++) {
if (Y[(ty * x) + px] > peak) peak = Y[(ty * x) + px];
factor = 1.0 / ((abs(ty - py)) + 1);
total += ((65536 - Y[(ty * x) + px]) * factor);
tfactor += factor;
}
mult = (65536.0 - (total / tfactor)) / 65536.0;
mult = 1 - ((mult * mult) * f1);
if (mult > 4) mult = 4;
if (mult <= 0) mult = 0;
Y2[(py * x) + px] = Y[(py * x) + px] * mult;
}
}
#else
memcpy(Y2, Y, sizeof(double) * x * y);
#endif
for (py = 0; py < y; py++) {
// fprintf(stderr, "%d\n", py);
for (px = 0; px < x; px++) {
double total = 0, peak = 0, mult, factor = 0, tfactor = 0;
int tx;
for (tx = ((px - R) > 0) ? (px - R) : 0; (tx < (x - 1)) && ((tx - px) < R); tx++) {
if (Y2[(py * x) + tx] > peak) peak = Y2[(py * x) + tx];
factor = 1.0 / ((abs(tx - px)) + 1);
total += ((Y[(py * x) + tx]) * factor);
tfactor += factor;
}
mult = (65536.0 - (total / tfactor)) / 65536.0;
mult = 1 - ((mult * mult) * f2);
if (mult > 4) mult = 4;
if (mult <= 0) mult = 0;
Y3[(py * x) + px] = Y2[(py * x) + px] * mult;
// fprintf(stderr, "%lf %lf %lf %lf\n", Y2[(py * x) + px], Y3[(py * x) + px], (total / tfactor), mult);
}
}
}

void processdark2(int inverse, double f1, double f2)
{
double total = 0, avg;

maxY = 0.0; minY = 65536.0;
for (i = 0; i < x * y; i++) {
if (Y[i] > maxY) maxY = Y[i];
if (Y[i] < minY) minY = Y[i];
if (Y[i] > 65535.0)
Y2[i] = 0.0;
else
Y2[i] = 65535.0 - Y[i];
}

maxY = 0.0; minY = 65536.0;
for (i = 0; i < x * y; i++) {
if (Y2[i] > maxY) maxY = Y2[i];
if (Y2[i] < minY) minY = Y2[i];
}

for (i = 0; i < x * y; i++) {
total += Y3[i] = 65535.0 - (Y2[i] * (65535.0 / maxY));
}

avg = total / (x * y);

for (i = 0; i < x * y; i++) {
if (Y3[i] > (avg * 4)) {
Y[i] = avg * 4;
} else {
Y[i] = Y3[i];
}
Y[i] = Y3[i];
}
}

int main(int argc, char *argv[])
{
double p1 = 0.8, p2 = 0.8;

if (argc >= 2) sscanf(argv[1], "%lf", &p1);
if (argc >= 3) sscanf(argv[2], "%lf", &p2);

memset(line1, 0, 512);

/* read the first line */
while (read(0, &line1[i], 1)) {
if (line1[i] == '\n') break;
i++;
}
line1[i + 1] = 0;

memset(line2, 0, 512);
i = 0;
/* read the second line */
while (read(0, &line2[i], 1)) {
if (line2[i] == '\n') break;
i++;
}
line2[i + 1] = 0;

sscanf(line2, "%d %d", &x, &y);

memset(line3, 0, 512);
i = 0;
/* read the third line */
while (read(0, &line3[i], 1)) {
if (line3[i] == '\n') break;
i++;
}
line3[i + 1] = 0;

pic = (unsigned short *)malloc(x * y * 6);
out = (unsigned short *)malloc(x * y * 6);
fpic = (double *)malloc(x * y * 3 * sizeof(double));
read(0, pic, (x * y * 6));
memcpy(out, pic, (x * y * 6));

Y = (double *)malloc(x * y * sizeof(double));
Yorig = (double *)malloc(x * y * sizeof(double));
Y2 = (double *)malloc(x * y * sizeof(double));
Y3 = (double *)malloc(x * y * sizeof(double));
Cr = (double *)malloc(x * y * sizeof(double));
Cg = (double *)malloc(x * y * sizeof(double));
Cb = (double *)malloc(x * y * sizeof(double));

for (i = 0; i < (x * y * 3); i++) {
fpic[i] = ntohs(pic[i]);
}

for (i = 0; i < x * y; i++) {
double r = fpic[(i * 3)], g = fpic[(i * 3) + 1], b = fpic[(i * 3) + 2];

Y[i] = (0.299 * r) + (0.587 * g) + (0.114 * b);
Yorig[i] = (0.299 * r) + (0.587 * g) + (0.114 * b);
Cr[i] = r / Y[i];
Cg[i] = g / Y[i];
Cb[i] = b / Y[i];

// Cr[i] = -(0.168736 * r) - (0.331264 * g) + (0.5 * b);
// Cb[i] = +(0.5 * r) - (0.418688 * g) - (0.081312 * b);

if (Y[i] > maxY) maxY = Y[i];
if (Y[i] < minY) minY = Y[i];
}

fprintf(stderr, "%lf %lf\n", maxY, minY);

processdark2(1, 0.8, 0.8);

process(0, p1, p1);
memcpy(Y, Y3, (x * y * sizeof(double)));
maxY = 0.0; minY = 65536.0;
for (i = 0; i < x * y; i++) {
if (Y[i] > maxY) maxY = Y[i];
if (Y[i] < minY) minY = Y[i];
}
fprintf(stderr, "%lf %lf\n", maxY, minY);
processdark(1, p2, p2);

for (i = 0; i < x * y; i++) {
double r1 = fpic[i * 3], g1 = fpic[(i * 3) + 1], b1 = fpic[(i * 3) + 2];
// fprintf(stderr, "%lf %lf %lf ", fpic[i * 3], fpic[(i * 3) + 1], fpic[(i * 3) + 2]);
fpic[(i * 3)] = r1 * (Y3[i] / Yorig[i]);
fpic[(i * 3) + 1] = g1 * (Y3[i] / Yorig[i]);
fpic[(i * 3) + 2] = b1 * (Y3[i] / Yorig[i]);

double r2 = fpic[i * 3], g2 = fpic[(i * 3) + 1], b2 = fpic[(i * 3) + 2];
double max = r2;
if (g2 > max) max = g2;
if (b2 > max) max = b2;

if (max > 65535.0) {
Y3[i] *= (65535.0 / max);
fpic[(i * 3)] = r1 * (Y3[i] / Yorig[i]);
fpic[(i * 3) + 1] = g1 * (Y3[i] / Yorig[i]);
fpic[(i * 3) + 2] = b1 * (Y3[i] / Yorig[i]);
// fprintf(stderr, "%lf %lf %lf %lf, ", Y3[i], fpic[(i * 3)], fpic[(i* 3) + 1], fpic[(i * 3) + 2]);
// fprintf(stderr, "%lf %lf %lf, %d %d %d\n", fpic[(i * 3)], fpic[(i * 3) + 1], fpic[(i * 3) + 2], pic[i * 3], pic[(i * 3) + 1], pic[(i * 3) + 2]);
}
// if (fabs(g2 - g1) > 33)
// fprintf(stderr, "%lf %lf %lf\n", r2 - r1, g2 - g1, b2 - b1);
}

maxY = 0.0; minY = 65536.0;
for (i = 0; i < x * y; i++) {
if (Y3[i] > maxY) maxY = Y3[i];
if (Y3[i] < minY) minY = Y3[i];
}
fprintf(stderr, "%lf %lf\n", maxY, minY);

for (i = 0; i < (x * y * 3); i++) {
if (fpic[i] < 0) fpic[i] = 0;
if (fpic[i] > 65535) fpic[i] = 65535;
out[i] = htons((unsigned short)fpic[i]);
// fprintf(stderr, "%d %d %lf\n", pic[i], out[i], fpic[i]);
}

write(1, line1, strlen(line1));
write(1, line2, strlen(line2));
write(1, line3, strlen(line3));
write(1, out, x * y * 6);
return 0;
}
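
Stripped of the buffers and debug output, the last stage of main() does one thing per pixel: scale R, G and B by the ratio of new to original luminance, and back the ratio off if a channel would overflow. A standalone sketch of just that step (the function name and the black-pixel guard are additions for illustration; the program above divides by Yorig without checking for zero):

```c
/* Scale one RGB pixel so its luminance becomes y_new, keeping the channel
 * ratios (and so the hue) intact. If any channel would exceed max_val,
 * use the largest ratio that still fits. rgb is modified in place. */
void apply_luma(double rgb[3], double y_new, double max_val)
{
    /* Same luma weights as the program above. */
    double y_old = 0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2];
    if (y_old <= 0.0)
        return;                       /* black pixel: nothing to scale */

    double ratio = y_new / y_old;
    double peak = rgb[0];
    if (rgb[1] > peak) peak = rgb[1];
    if (rgb[2] > peak) peak = rgb[2];
    if (peak * ratio > max_val)
        ratio = max_val / peak;       /* keep hue, give up some brightness */

    for (int c = 0; c < 3; c++)
        rgb[c] *= ratio;
}
```

Backing the ratio off this way is equivalent to the `Y3[i] *= (65535.0 / max)` step above, and it avoids the color shift that clamping each channel separately would cause.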