In general, yes, the GPU can do general-purpose calculations - you can write loops and branches, you can do multi-pass processing, etc. to get the result you want.

But in truth, there are some algorithms that are more "GPU friendly" than others. The GPU likes so-called "embarrassingly parallel" problems, where the input data can be chopped up into independent pieces, each of which can be processed separately. Each shader invocation operates on one of these chunks of data, and lots of such invocations run in parallel on the GPU. Something like this: input1 -> shader run 1 -> output 1, input2 -> shader run 2 -> output 2, and so on. The classic example is rasterization, where each of the "output N" items is (approximately) a pixel on the screen.

However, the encoding you described (essentially a run-length encoding) is not very parallelizable in this way. The typical way you'd process it is more like: handle input1, then input2, then input3. So, at first glance, it doesn't quite "fit" the GPU way of doing things. See, for example, the answers here: StackOverflow: Decoding RLE in CUDA Efficiently, or this article: "Implementing Run-length encoding in CUDA".

The GPU is made of thousands of slow processors that want to march together in groups of 32-256 threads. If you have one big while loop for the whole image, then 1 thread will work on it while the other 9999 sit idle. That'll be way slower than doing it in 1 thread on the CPU.

What you could do is break up the work into rows of the image. Your rows are variable length, so start the data with an array of offsets to the start of each row's data. That way threads 0-1024 know to read slots 0-1024 from that array to find the RLE data for the row each of them is responsible for. From there, each thread can decode one row completely independently of the others (see the sketch below).

The number of rows in most textures is actually pretty small compared to how many threads a GPU wants to run. So, a better approach would be to break the image up into smaller blocks. (Keep going further with this and eventually you'll end up with S3 Texture Compression.)

The second article does not help for decoding RLE AFAICT, and indeed the Thrust implementation of this algorithm (the one linked to in the SO answer) is rather complicated to read. Here is what is going on in the SO answer: figure out the space you need, for example by counting the total number of elements. This can be done with an add-reduce, a basic GPU algorithmic primitive that simply adds all the values (of the counts array) together.
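To make the one-thread-per-row idea concrete, here is a minimal CUDA sketch. It assumes an RLE stream of (count, value) byte pairs and a precomputed rowOffsets table; all the names and the exact pair format are illustrative assumptions, not code from the linked answers.

    // Sketch: one thread decodes one row, assuming the stream is packed
    // (count, value) byte pairs and rowOffsets[row] is the byte offset of
    // each row's RLE data. Names here are illustrative, not from the thread.
    __global__ void decodeRowsKernel(const unsigned char* rle,   // packed (count, value) pairs
                                     const int* rowOffsets,      // where each row's runs begin
                                     unsigned char* out,         // decoded image, width*height bytes
                                     int width, int height)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= height) return;

        int src = rowOffsets[row];   // this thread's input position
        int dst = row * width;       // this thread's output position
        int written = 0;

        while (written < width) {    // each thread loops only over its own row
            int count = rle[src++];
            unsigned char value = rle[src++];
            if (count == 0) break;   // guard against malformed data
            for (int i = 0; i < count && written < width; ++i) {
                out[dst + written++] = value;
            }
        }
    }

You would launch roughly one thread per row, e.g. decodeRowsKernel<<<(height + 255) / 256, 256>>>(d_rle, d_rowOffsets, d_out, width, height); every thread stays busy as long as there are enough rows to fill the machine, which is exactly the limitation noted above about row counts being small.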
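The add-reduce sizing step at the end can be expressed with Thrust in a couple of lines. Here is a minimal sketch, assuming the run counts already sit in a device vector; the function and variable names are mine, and the exclusive scan (which gives each run's output offset, the step the SO answer builds on next) is included as an assumption about the follow-on step, not a quote of that code.

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/scan.h>

    // Sizing step only: the total decoded length is just the sum of all the
    // run counts, which maps onto a single add-reduce on the GPU.
    int decodedSizeAndRunStarts(const thrust::device_vector<int>& counts,
                                thrust::device_vector<int>& runStarts)  // illustrative names
    {
        // Total number of output elements = sum of counts (add-reduce).
        int total = thrust::reduce(counts.begin(), counts.end(), 0);

        // Exclusive scan gives the output offset where each run begins,
        // which a later kernel can use to scatter/fill the decoded values.
        runStarts.resize(counts.size());
        thrust::exclusive_scan(counts.begin(), counts.end(), runStarts.begin());

        return total;
    }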