Thursday, September 30, 2010

Octane Render preparing to smite the competition with its MLT equivalent

Octane Render, the ultra-fast unbiased GPU renderer (made in Belgium just like me :-)) is soon going to introduce a new MLT-like (Metropolis light transport) algorithm, which will make the rendering of certain difficult scenes with small light sources much more efficient: such scenes will converge much faster with less noise, and fireflies (bright pixels caused by long paths from reflective caustics) will be killed off.

MLT is the base rendering algorithm used by unbiased CPU renderers like LuxRender, Maxwell Render, Fryrender, Indigo Renderer and Kerkythea.
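To give a feel for why Metropolis-style sampling helps with small light sources, here is a minimal sketch of textbook Metropolis-Hastings sampling on a toy 1D "scene". This is not Octane's algorithm (which has not been published) and not a full MLT implementation: the function `brightness`, the sampler `metropolis_samples` and the parameters `large_step_prob` and `step` are all made up for illustration.

```python
import random

def brightness(x):
    """Hypothetical 1D 'path space': a tiny bright light in an otherwise dim scene."""
    return 1.0 if 0.50 <= x < 0.51 else 0.001

def metropolis_samples(f, num_samples, rng, large_step_prob=0.1, step=0.005):
    """Draw samples distributed proportionally to f on [0, 1) with Metropolis-Hastings."""
    x = rng.random()
    fx = f(x)
    samples = []
    for _ in range(num_samples):
        if rng.random() < large_step_prob:
            # Large step: an independent uniform sample, so the chain can reach any region.
            y = rng.random()
        else:
            # Small step: a symmetric local mutation (wrapping around) that explores
            # the neighbourhood of the current sample.
            y = (x + rng.uniform(-step, step)) % 1.0
        fy = f(y)
        # Both proposals are symmetric, so the acceptance probability is min(1, f(y)/f(x)).
        if rng.random() < min(1.0, fy / fx):
            x, fx = y, fy
        samples.append(x)
    return samples

rng = random.Random(42)
samples = metropolis_samples(brightness, 100000, rng)
in_light = sum(1 for x in samples if 0.50 <= x < 0.51) / len(samples)
print(f"fraction of Metropolis samples on the light: {in_light:.2f}")
print("fraction of uniform samples on the light:    0.01")
```

Uniform sampling only hits the light 1% of the time, while the Metropolis chain spends most of its samples there (the target fraction is 0.01 / 0.01099, about 91%), because once it finds a bright region it keeps mutating around it. That is the basic reason MLT-like methods converge faster on scenes where the important light paths are hard to find.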

Making Metropolis light transport (or an equivalent) work on current GPUs was thought by many to be impossible and it was one of the main criticisms from GPU rendering skeptics such as Luxology (Modo) and Next Limit (Maxwell Render), who believe that GPUs can only do dumb, inefficient path tracing and nothing more. Luckily there's Octane Render to prove them wrong. The fact that it has taken the developer such a long time to make it work shows that it's quite tricky to develop. Octane Render is currently also the only renderer (to my knowledge) that will utilise a more sophisticated rendering algorithm.

On a sidenote, ERPT (energy redistribution path tracing) is also possible on the GPU, as described in one of my previous posts. It combines the advantages of Monte Carlo path tracing and Metropolis light transport to allow faster convergence with less noise and can achieve fantastic results, which look indistinguishable from the path traced reference. Timo Aila, a graphics researcher at Nvidia and GPU ray tracing genius, is also working on real-time Metropolis light transport.

Octane's MLT-like algorithm has been hinted at by its developer since the unveiling of the software in January 2010, and it should be here very soon (within a couple of weeks; this post will be updated when that happens). I'm very curious to see the first results.

Future GPU architectures, like Kepler and Maxwell, should make the implementation of MLT-like algorithms on the GPU much easier, but it's nice to see at least one developer trying to squeeze the maximum out of current GPUs, bending their compute capability until it breaks.

Saturday, September 25, 2010

Kepler and Maxwell: ray tracing monsters thanks to CPU and GPU cores on the same chip?

At GTC 2010, Nvidia announced their future GPUs named Kepler and Maxwell. One of the more interesting quotes:
"Between now and Maxwell, we will introduce virtual memory, pre-emption, enhance the ability of the GPU to autonomously process, so that it's non-blocking of the CPU, not waiting for the CPU, relies less on the transfer overheads that we see today. These will take GPU computing to the next level, along with a very large speed up in performance," said Jen-Hsun Huang.

Pre-emption was already revealed in a slide from a presentation by Tony Tomasi at Nvision08, depicting a timeline showing pre-emption, full support for function pointers, C++, etc.:

The part about "the ability of the GPU to autonomously process, so that it's non-blocking of the CPU, not waiting for the CPU, relies less on the transfer overheads that we see today" is very interesting and suggests the incorporation of CPU cores on the GPU, as shown in a slide from an Nvidia presentation at SC09.

There's also this live chat with Bill Dally:
We all know that Intel and AMD are looking at merging CPU cores and GPUs on the same die.
In my mind, the future is for hybrid computing, where different kind of processors working together and find their own kind of tasks to work on. Currently, multi-core CPU and many-core GPU are working together, tasks are distributed by software schedulers. Data parallel tasks are assigned to GPUs and task-parallel jobs are assigned to CPUs. However, communication between these two kinds of processors is the performance bottleneck. I hope NVIDIA can provide a solution on their desktop GPU product line too.

Bill Dally:
That's exactly right. The future is heterogeneous computing in which we use CPUs (which are optimized for single-thread performance) for the latency sensitive portions of jobs, and GPUs (which are optimized for throughput per unit energy and cost) for the parallel portions of jobs. The GPUs can handle both the data parallel and the task parallel portions of jobs better than CPUs because they are more efficient. The CPUs are only needed for the latency sensitive portions of jobs - the serial portions and critical sections.

Do you believe a time will come when GPU and CPU are on the same chip or "board" it seems the logical next step to avoid the huge PCI-E latency and have a better GPU-CPU interactivity ? i know there is ongoing research in this area already ...but what is your personal opinion on the possibility and benefits of this ?

Bill Dally:
Our Tegra processors already combine CPUs and a GPU on a single chip. For interactivity what's important is not the integration but rather having a shared memory space and low latency synchronization between the two types of cores.
I don't see convergence between latency-optimized cores and throughput optimized cores. The techniques used to optimize for latency and throughput are very different and in conflict. We will ultimately have a single chip with many (thousands) of throughput cores and a few latency-optimized cores so we can handle both types of code.

According to the above slide, Nvidia expects to have 16 CPU cores on the GPU by 2017. Working backwards from that and assuming the core count doubles with every two-year GPU generation, you would get:

- 2017: GPU with 16 CPU cores
- 2015: GPU with 8 CPU cores
- 2013: Maxwell with 4 CPU cores
- 2011: Kepler with 2 CPU cores
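
The list above is just a backwards extrapolation from the single data point on the slide (16 cores in 2017). As a small sketch, assuming the core count halves for every two-year step back in time (my assumption, not an Nvidia roadmap; the function name and its parameters are made up for this example):

```python
def extrapolate_cpu_cores(end_year=2017, end_cores=16, step_years=2, first_year=2011):
    """Walk backwards from the slide's data point, halving the core count each step."""
    roadmap = {}
    year, cores = end_year, end_cores
    while year >= first_year and cores >= 1:
        roadmap[year] = cores
        year -= step_years   # one GPU generation earlier
        cores //= 2          # assumed: half as many CPU cores per generation back
    return roadmap

for year, cores in extrapolate_cpu_cores().items():
    print(f"{year}: GPU with {cores} CPU cores")
```

Changing `step_years` or the starting point shows how sensitive the guess for Kepler and Maxwell is to that doubling assumption.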

My bet is that Kepler will at least have one and probably two (ARM based) CPU cores and Maxwell will probably have 4 CPU cores on the GPU. The inclusion of true CPU cores on the GPU will make the CPU-GPU bandwidth problem of today obsolete and will enable smarter ray tracing algorithms like Metropolis light transport and bidirectional path tracing on the GPU. Biased rendering methods such as photon mapping and irradiance caching will be easier to implement. It will also give a tremendous performance boost to the (re)building of acceleration structures and to ray tracing of dynamic geometry, which will no longer depend on the slow PCIe bus. Apart from ray tracing, most other general computation tasks will also benefit greatly. I think this CPU/GPU combo chip will be Nvidia's answer to AMD's oft-delayed Fusion and Intel's Sandy Bridge.

Wednesday, September 22, 2010

OTOY at GPU Technology Conference, partnering with Nvidia, Intel and AMD

It's been a long time since OTOY was in the news, but the company will resurface at GTC. Jules Urbach of OTOY will be speaking in a session about emerging companies. Apparently there's no longer an exclusive deal between OTOY and AMD, according to this article at VentureBeat.

OTOY will also make use of CUDA in the future which is great news!!! Hopefully this will speed up adoption of the technology by a factor of 10 to 50x ;-)

UPDATE: here's the full press release:

OTOY to Present Enterprise Cloud Platform at NVIDIA GPU Technology Conference

OTOY will unveil its Enterprise Cloud platform at the GPU Technology Conference this week. The platform is designed to enable developers to leverage NVIDIA CUDA, PhysX and Optix technologies through the cloud.

Santa Clara, CA (PRWEB) September 23, 2010

OTOY announced that it will unveil its Enterprise Cloud platform at the GPU Technology Conference this week. The platform is designed to enable developers to leverage NVIDIA CUDA, PhysX and Optix technologies through the cloud. OTOY's proprietary ORBX GPU codec will enable high performance 3D applications to render on a web server and instantly stream to any thin client.

OTOY is participating in the GTC “Emerging Companies Summit,” a two-day event for developers, entrepreneurs, venture capitalists, industry analysts and other professionals.

OTOY Enterprise Cloud platform
The OTOY Enterprise Cloud platform sandboxes an application or virtual machine image without degrading or limiting GPU performance. CUDA-powered applications, such as Adobe's Creative Suite 5, will be able to take full advantage of GPU acceleration while streaming from a virtual OTOY session.

OTOY bringing GPGPU to the browser
In addition to supporting CUDA through its server platform, OTOY's 4k web plug-in adds CUDA and OpenCL compliant scripting across all major web browsers, including Internet Explorer, Mozilla FireFox, Google Chrome, Apple Safari and Opera. GPU web applets that cannot run locally are executed and rendered using OTOY server side rendering. This ensures that GPU web applets can be viewed on any client, including HTML 4 browsers.

Next generation rendering tools coming to developers
OTOY enables server hosted game engines to render LightStage assets and leverage distributed GPU ray-tracing or path-tracing in the cloud. The OTOY Enterprise Cloud platform can host complete game engine SDKs, making game deployment to Facebook or other web portals simple and instantaneous.

OTOY will add native support for CryEngine content in 2011, starting with Avatar Reality's Blue Mars. Blue Mars is the first virtual world built using the Crytek engine. It is currently in beta testing on the OTOY platform.

About OTOY
OTOY is a leading developer of innovative software solutions for GPU and CPU hardware, as well as a provider of convergence technologies for the video game and film industries. OTOY works with a wide range of movie studios, game developers, hardware vendors and technology companies.

OTOY integrated in CryEngine and supporting distributed GPU ray tracing and path tracing in the cloud!! The dream of real-time ray traced or even path traced games is getting closer every day! I do hope that OTOY will deliver this dream first; they have all the right technology and partners now.

Friday, September 10, 2010

Small update on the Brigade real-time path tracer

Jacco Bikker has released two new videos of the progress with his real-time path tracer named Brigade, demonstrating some kind of game where a truck has to push gold containers or something. Looks fun:

1. direct lighting (32 samples per pixel):
2. one bounce of indirect lighting (16 spp):

There is also an update from Dietger van Antwerpen on the GPU path tracer (a subsystem of the Brigade path tracer) running with the more advanced ERPT (energy redistribution path tracing) algorithm. He has improved the ERPT code to produce virtually identical results to the path traced reference and released a high quality image with it (ERPT on the left and path tracing on the right):

Explanation from Dietger van Antwerpen in the description on YouTube:
"After some complains pointing out that in the movie, ERPT is significantly darker then path tracing , I fixed the darkening effect of the ERPT image filter, solving the difference in lighting quality. I made an image using ERPT for the left half, while using path tracing for the right half and waited until the path tracing noise almost vanished. As you can see, the lighting quality between the left and right half is pretty much the same. (The performance and convergence characteristics remain unchanged)"
It would be interesting to know the time for ERPT and for path tracing to achieve these results.

He also released a new video showing improvements to the GPU ERPT code:

As the videos show, ERPT converges considerably faster than standard path tracing and the noise is significantly reduced. Very cool and very impressive. I wonder if the optimized ERPT code will be used in Brigade for real-time animations and games.

Wednesday, September 8, 2010

VoxLOD: interactive ray tracing of massive polygon models with voxel based LOD + Monte Carlo global illumination!

This is really amazing technology:


Some quotes from the author's blog:
"a real-time massive model visualization engine called VoxLOD, which can handle data sets consisting of hundreds of millions of triangles.

It is based on ray casting/tracing and employs a voxel-based LOD framework. The original triangles and the voxels are stored in a compressed out-of-core data structure, so it’s possible to explore huge models that cannot be completely loaded into the system memory. Data is fetched asynchronously to hide the I/O latency. Currently, the renderer runs entirely on the CPU.


I’ve implemented shadows in VoxLOD, which has thus become a ray tracer. Of course, level-of-detail is applied to the shadow rays too.

While shadows make the rendered image a lot more realistic, the parts in shadow are completely flat, devoid of any details, thanks to the constant ambient light. One possible solution is ambient occlusion, but I wanted to go further: global illumination in real-time.

GI in VoxLOD is very experimental and unoptimized for now. It’s barely interactive: it runs at only 1-2 fps at 640×480 on my dual-core Core i7 notebook. Fortunately, there are lots of optimization opportunities. Now let’s see an example image:

Note that most of the scene is not directly lit, and color bleeding caused by indirect lighting is clearly visible. There are two light sources: a point light (the Sun) and a hemispherical one (the sky). I use Monte Carlo integration to compute the GI with one bounce of indirect lighting. Nothing is precomputed (except the massive model data structure of course).

I trace only two GI rays per pixel, and therefore, the resulting image must be heavily filtered in order to eliminate the extreme noise. While all the ray tracing is done on the CPU, the noise filter runs on the GPU and is implemented in CUDA. Since diffuse indirect lighting is quite low frequency, it is adequate to use low LODs for the GI rays."
The author of this engine has also written a paper entitled: "Interactive Out-of-Core Ray Casting of Massive Triangular Models with Voxel-Based LODs"

There is a very interesting graph in this paper, which shows that when using LOD, the cost of ray casting remains constant once a certain number of triangles (0.5M) is reached:

Quoting the paper,
"by using LODs, significantly higher frame rates can be achieved with minimal loss of image quality, because ray traversals are less deep, intersections with voxels are implicit, and memory accesses are more coherent. Furthermore, the LOD framework can also reduce the amount of aliasing artifacts, especially in case of highly tesselated models."

"Our LOD metric could be also used for several types of secondary rays, including shadow, ambient occlusion, and planar reflection rays. One drawback of this kind of metric is that it works only with secondary rays expressible as linear transformations. Because of this, refraction and non-planar reflection rays are not supported."
Implemented on the GPU, this tech could be the ideal solution for real-time raytracing in games:

- it makes heavy use of LOD for primary, shadow and GI rays which greatly reduces their tracing cost

- LOD is generated automatically by the voxel data structure

- nearby geometry is represented by triangles, so there isn't any voxel blockiness on close inspection

- characters and other dynamic geometry can still be represented as triangles and as such avoid the difficulties with animating voxels

- huge immensely detailed levels are possible thanks to the out-of-core streaming of the voxels and triangles

- it uses Monte Carlo GI, which scales very easily (number of bounces + number of samples per pixel) and can be filtered, while still giving accurate results
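
To illustrate the last point, here is a minimal sketch of a Monte Carlo estimate of the light a surface point receives from a uniform, unoccluded sky. This is not VoxLOD's code: the function `sky_irradiance` and the uniform hemisphere sampling are my own simplifications, chosen because the exact answer is known (pi times the sky radiance), which makes the noise easy to see.

```python
import math
import random

def sky_irradiance(sky_radiance, num_samples, rng):
    """Monte Carlo estimate of irradiance from a uniform, unoccluded sky.

    Assumed setup for this sketch: directions are sampled uniformly over the
    hemisphere above the surface, so cos(theta) is uniform in [0, 1) and the
    sampling density is 1 / (2 * pi) per steradian.
    """
    pdf = 1.0 / (2.0 * math.pi)
    total = 0.0
    for _ in range(num_samples):
        cos_theta = rng.random()                  # cosine between sample ray and normal
        total += sky_radiance * cos_theta / pdf   # standard estimator: f(x) / pdf(x)
    return total / num_samples

rng = random.Random(1)
print("exact:", math.pi)
for n in (2, 64, 4096):
    print(n, "samples:", sky_irradiance(1.0, n, rng))
```

With only two samples per pixel (as in VoxLOD) the estimate is all over the place, which is why heavy filtering is needed. The error shrinks roughly with the square root of the number of samples, so quality can be traded for speed simply by changing the sample count.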

This is certainly something to keep an eye on!

Saturday, September 4, 2010

Nvidia research chats about GPU ray tracing

A couple of days ago there was a live chat with David Luebke and Bill Dally from Nvidia on Nvidia's nTersect blog, with GPU ray tracing as one of the main subjects. Below are some of the questions and answers related to GPU ray tracing and rendering:

Are there any plans to add fixed function raytracing hardware to the GPU?

David Luebke:
Fixed-function ray tracing hardware: our group has definitely done research in this area to explore the "speed of light", but my sense at this time is that we would rather spend those transistors on improvements that benefit other irregular algorithms as well.

ray-triangle intersection maps well to the GPU already, it's basically a lot of dot products and cross products. ray traversal through an acceleration structure is an interesting proxy for lots of irregular parallel computing workloads : there is abundant parallelism but it is branchy and hard to predict. Steps like Fermi's cache and unified memory space are great examples of generic hardware improvements that benefit GPU ray tracing as well as many other workloads (data mining, tree traversal, collision detection, etc)
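
(Not part of the chat: to show what "a lot of dot products and cross products" means in practice, here is a minimal sketch of the well-known Möller-Trumbore ray-triangle test. Luebke doesn't name a specific algorithm, so this choice is mine, and the helper names are made up. On a GPU the same arithmetic would run for thousands of rays in parallel.)

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect_ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance t along the ray to the hit point, or None on a miss."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:
        return None                      # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det        # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det   # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det       # distance along the ray
    return t if t > eps else None        # ignore hits behind the ray origin

triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(intersect_ray_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *triangle))
print(intersect_ray_triangle((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *triangle))
```

The first ray points straight down at the triangle and hits it; the second one passes beside it and returns None. The test itself is branch-light arithmetic, which is why the hard part on the GPU is the traversal of the acceleration structure rather than the intersection.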

When do you think real-time ray tracing of dynamic geometry will become practical for being used in games?
David Luebke:
ray tracing in games: I think Jacopo Pantaleoni's "HLBVH" paper at High Performance Graphics this year will be looked back on as a watershed for ray tracing of dynamic content. He can sort 1M utterly dynamic triangles into a quality acceleration structure at real-time rates, and we think there's more headroom for improvement. So to answer your question, with techniques like these and continued advances in GPU ray traversal, I would expect heavy ray tracing of dynamic content to be possible in a generation or two.

Currently there is a huge interest in high quality raytracing on the GPU. The number of GPGPU renderers has exploded during the last year. At the same time there are critics saying that GPU rendering is still not mature enough to be used in serious work citing a number of limitations such as not enough memory, shaders are too simple and that you can only do brute force path tracing on the GPU, which is very inefficient compared to the algorithms used in CPU renderers. What is your take on this? Do you think that these limitations are going to be solved by future hardware or software improvements and how soon can we expect them?
David Luebke:
re offline renderers - I do think that GPU performance advantages are becoming too great for studios to ignore. You can definitely get way past simple path tracing. I know of a whole bunch of studios that are doing very deep dives. Stay tuned!

Do you think rasterization is still going to be used in 10 years?
David Luebke:
re rasterization: yes, forward rasterization is a very energy-efficient way to solve single-center-of-projection problems (like pinhole cameras and point light shadow maps) which continue to be important problems and subproblems in rendering. So I think these will stick around for at least another 10 years

There have been a lot of papers about reyes style micropolygon rasterizing at past graphics conferences with the feasibility of hardware implementation. Do you think this is a good idea?
David Luebke:
re: micropolygons - I think all the work on upolys is incredibly interesting. I still have some reservations about whether upolys are REALLY the final answer to rendering in the future. They have many attractive attributes, like the fact that samples are glued to parametric space and thus have good temporal coherence, but they seem kind of ... heavyweight to me. There may be simpler approaches.

am I wrong in thinking that game graphics are limited more by the artist than the graphics or are the game companies just trying to reach a broader market?
David Luebke:
you are not wrong! game developers are limited by artists, absolutely. But this translates to graphics - better graphics means simpler, more robust, easier to control and direct graphics. A good example is cascaded shadow maps, used very widely in games today. These are notoriously finicky and artists have to keep a lot of constraints in their head when designing levels etc. Looking forward, increased GPU performance and programmability both combine to make simpler approaches - like ray tracing - practical. On the flip side, graphics is certainly not solved and there are many effects that we can't do at all in real-time today, so you will continue to see games push forward on graphics innovation, new algorithms, new ways to use the hardware

UPDATE: Some tech websites, such as xbitlabs, have taken the comment from the live chat about rasterization out of context, stating that we will have to wait at least another 10 years before ray tracing will be used in games. Apparently, they didn't read the answers from David Luebke very well. The way I understood it, rasterization will be used in conjunction with ray tracing techniques, both integrated in novel rendering algorithms such as image space photon mapping. From the ISPM paper:
Image Space Photon Mapping (ISPM) rasterizes a light-space bounce map of emitted photons surviving initial-bounce Russian roulette sampling on a GPU. It then traces photons conventionally on the CPU. Traditional photon mapping estimates final radiance by gathering photons from a k-d tree. ISPM instead scatters indirect illumination by rasterizing an array of photon volumes. Each volume bounds a filter kernel based on the a priori probability density of each photon path. These two steps exploit the fact that initial path segments from point lights and final ones into a pinhole camera each have a common center of projection.

So ray tracing in games will definitely show up well within 10 years and according to Luebke you can expect "heavy ray tracing of dynamic content to be possible in a generation or two". Considering that Fermi was a little bit behind schedule, I would expect Nvidia's next generation GPU to come around March 2011 (on schedule), and the generation after that around September 2012. So only 2 years before real-time raytracing is feasible in games ;-D.