The massive scalability of modern GPUs has led to many things, from Bitcoin mining to folding@home, seti@home and various other general purpose tasks that involve huge amounts of number crunching. While not as flexible as CPUs, modern GPUs can be hugely scalable thanks to the massive number of stream processors they contain.
I've long been impressed by the technology, and long known that to implement something using it you can use CUDA and/or OpenCL. For about the same amount of time there have been low level Java wrappers too that mean you can execute OpenCL without resorting to native languages - even if it meant you had to write the OpenCL directly, as well as huge amounts of boilerplate code to interface with it. Of course, if you wanted your application to work in the real world, a completely separate load of code for it to fall back on if the device didn't support OpenCL was needed too.
For about the same amount of time, I've thought that what we really need is a near transparent layer in (insert high level language of your choice here) which manages all the boilerplate / fallback code, OpenCL generation etc. for you. Of course, I could have sat down and looked at implementing it myself but I decided to wait for someone to do the hard work for me, thinking that it was inevitable that they would.
Fortunately I was correct, a bit of digging around today turned up Aparapi. It's an initiative started by AMD that enables to bring GPGPU programming to the average Joe Java developer completely transparent (almost) of any OpenCL calls and boilerplate. All it requires is for you to write your code inside a fabricated closure (anonymous inner class at present) and then call an execute method for the number of times you want it to run, in parallel. You can get the index of your essentially parallelised *loop* from within the kernel and crunch away on the GPU.
When lambdas come out of course there will be less boilerplate still - essentially having compresssed the hundreds of lines previously required to do this sort of thing with no more than a couple. Suddenly we've gone from stupidly hard and not worth doing to about as easy as it can get.
I envision this to be the first, but most important (and arguably hardest, but perhaps that's because I'm not a low level guy) step in providing the sort of transparent GPGPU access to Java that I continually dream of. The next step would be for API designers to take advantage of this technology in turn to produce classes like ParallelBufferedImage for instance. When this comes then ordinary Java folk will be able to utilise hugely parallel operations without so much as lifting a finger. We won't have to do anything different and all such API lifting will be done in the background, creating a huge (in some cases) but near invisible performance boost, just using one class rather than another.
Following on from that, the final step if it really takes on would be for it to be included in the API as standard. Possible? Sure. Likely? Who knows, but if so then it'd truly expose GPGPU to the masses.
Of course, that doesn't prohibit playing around until my completely transparent API-level pipe dream comes into play. Me being me I haven't actually got an OpenCL compatible card to try it out (I tend to live in the past a bit on the gaming front) but am thinking of ordering one just to have a play! And next step (naturally) is to see what we can integrate into Quelea to potentially speed that up with GPU hardware. It might even make my idea of keystone correction in software viable, we shall see! Perhaps 2012 will bring us a few more developments. Watch this space!