Neural Rendering · Google Summer of Code

Stream: Google Summer of Code

Topic: Neural Rendering

Sean (Mar 02 2024 at 16:42):

Hi @Senthil Palanisamy, lets continue the discussion here. Can you tell me more about your background and what you understand about this effort? Have you read the AMD paper?

Vidit Jain (Mar 02 2024 at 17:27):

(deleted)

Senthil Palanisamy (Mar 02 2024 at 23:36):

Hi Sean, Thanks for the message.

About my background: I come from a geometric vision background where I worked more than 6 years, to go along with my masters in robotics from Northwestern. I have developed solutions to problems like Stereo visual Odometry, Depth data based SLAM / TSDF fusion of depth maps into a volumetric volume to generate geometry, Extrinsic sensor calibration algorithms (the act of optimizing solutions to where sensors are located). I have done a few deep learning works as well like deep reinforcement policy learning for a knot tieing, weed localization and human character recgonition. I can send you my resume to any mail id of interest, if you want to know about my background further. Though I don't actively have a graphics background, I do understand the broader details and am able to grasp quickly to execute ideas. My programming languages of comfort are C++ and python (though I do think I can pick up any language within a reasonable time)

About the work: I did spent a few hours trying to understand the work. Here is my summary - A classic ray tracing pipeline on BVH for rendering objects is slow. One of the deep learning ways to speed this up is to train a MLP to be a neural intersection function - a network that primarily predicts occupancy as a probability (0-1), but it can be trained to predict other properties as well like shading / reflectence. This network can be trained directly on position and ray directions, but it does not practically work well. So, the solution is to learn some feature embeddings for position and direction, which then feed into MLP. Each of position and direction are points in R^3, but this leads to ray duplicates, so they can be compactly represented as points in R^2, by using spherical co-ordinates and substituting the ray origin with ray intersection. Raycasting is usually done in multi bounces, where secondary rays are traced from the primary ray. So there are two NIFs one (outer NIF) predicting the primary intersection, and the other (inner NIF) predicting occupancy from the secondary intersection. The inner NIF takes ray distance embedding as input in addition to position and direction embeddings. It seems like a classic tracing tracing scheme employs two hierarchies in BVH to trace rays. The top BVH tracing is cheap while the bottom BVH tracking is expensive. NIFs replace the bottom BVH part, while the top BVH tracing is done through classical means. The outer NIF just predicts occupancy while the inner NIF predicts other properties in addition to occupancy.

Some open questions I am still trying to find answers to

What is the feature embedding scheme for position and direction? I know that sine and cosine bases (fourier analysis motivated) is used a lot in works like nerfs, but I have not got to the bottom of what embedding is used here
It seems like the networks used here are AMD specific C++ libraries. Is this something that we will use in this work? Or is the idea to use a more popular library like pytorch.

My personal motivation for taking this work

I have never contributed to open source and I was scouting for places to begin my open source journey. Contributing to open source increases my visibility and exposes me to learn a lot more than from just contributing to the private repos that I work for.
I want to work on a geometric deep learning project to update myself to the SOTA

Sean (Mar 04 2024 at 15:06):

@Senthil Palanisamy Thank you for the excellent introduction. From your background, it sounds like you could probably propose a couple different projects that would align with needs (like a slam-based object reconstruction using ray tracing), which is all outstanding background for working on neural rendering. Obviously a lot of related concepts. As for your resume, you can just submit that when you submit a proposal (which you can/should do early and then continuously update, whenever it opens). Given the nature of GSoC, resume's and the project write-up itself are only a small fraction of selection criteria. It's communication that matters most (both here and via code).

I think your understanding of the project is right on track and you did a great job summarizing AMD's work. They essentially showed that it can be faster, so now my question is whether it can be faster and more generally applicable for a set of conditions like expensive geometries and reasonably precise hit points. Can we actually use it as an acceleration structure in practice, or is the training phase entirely impractical; what sort of net is needed to capture all the necessary detail; is the two-layer network adequate; are the two NIFs actually necessary; ... lots of open questions many of which are hand-waved in the paper and proved challenging in our previous investigations by a team at TAMU.

As to your questions, if I'm understanding you correctly, then 1) is really just a dimensionality reduction so lookups can be fast and fewer parameters in the latent space are needed for encoding. Instead of using 6 float/double values for the input layer (xyz pos + ijk dir), they use 4 float values (az/el pos + az/el dir). There may be more nuance implied for feature embedding, though, so there may be work needed to understand it if changes are to be made. I'm a little surprised I didn't see residual linkages, and that they got away with such simple topology, but then we have yet to reproduce their results either. As for 2) the did rely on libs for performance and they certainly can be used but they're not a focus or requirement. The fundamental question is is this approach even feasible as an acceleration approach to represent a given geometry model. That has both elements of accuracy and performance, but can be proven without making it production-ready. Libs like pytorch can certainly be used. In fact, the previous team fully bridged from C to Py to C during ray-tracing which was of course painfully slow, but they weren't successful in getting accurate hit points so the question is still an open one.

Senthil Palanisamy (Mar 04 2024 at 17:23):

Thanks for the response. The grid look up makes sense now. I was intially under the misunderstanding that some network converts position, direction -> position, direction embedding. But it does make sense that it is just a grid that is trained as well in the training process. I have a few follow up questions

What kind of training practically are we talking of? Since this is a CAD product, is the idea that the network has to train real time as the user possibly edits the model to update the rendering in real time? Or is the idea the training can be done once, in a somewhat offline fashion (like a few minutes or few 10s of minutes) and then can be used for rendering (where the user is assumed to not edit)
As for the implementation is concerned, the network itself looks simple and innocuous, each NIF is a MLP - the outer is a 2 layer MLP with 64 units in each layer, while the inner NIF is a two layer MLP with 48 units in each layer, I am just trying to get my head around how the training data is generated and from what sources, to set up this training problem. I should be able to code this up once I figure out the training data.
You mentioned something about another project - "slam-based object reconstruction using ray tracing", Can you point me to any thread or resources to understand this better? I would love to know details about this.

Sean (Mar 05 2024 at 05:59):

It's not really a grid lookup. It's just a different way to express a vector. For the neural net, it's less input data and thus less latent space nodes are needed to encode. It's still just a vector though, a clever optimization. We even coincidentally have an image depicting both (the vector can point outwards or inwards): https://brlcad.org/gallery/picture.php?/4

The latter. A given model will get "baked" such that a NN is trained, ideally in just a couple minutes, and then is available+valid for use until the geometry changes.
The training data should come directly from the ray tracer (either in advance or streamed real-time). Imagine shooting a million rays at an object from it's bounding sphere, some hit, some miss. Each hit is not just the first hit but all the possible hits along the path. That's the training corpus. Rays input, hit list output. We can probably shoot rays faster than the net can train an epoch, so you'd probably want to just stream rays as fast as it can train. Alternatively, shoot a million, train, shoot another million, train, etc.
rppvalidation_dirs.png
there's no threads on the topic. general idea though would be to propose something CAD-related involving slam like making a portable 3D CAD scanner app. e.g., try to output CAD instead of just point clouds or meshes. maybe hook into iphone lidar sensor. could also be slam and point cloud based, but then imports into BRL-CAD on the fly for processing. lots of possibilities.

You'd have to really make a strong case for your project regardless. Needs to have a compelling benefit that ties into CAD specifically, not just 3D or vision or graphics-related.

Senthil Palanisamy (Mar 07 2024 at 17:26):

Thanks for the clarification. I went back and spent more time. I think I am getting closer to the true understanding. Its funny my understanding about the work keeps getting refined and is probably starting to converge. So it seems like the grid encoding is done on a per object basis - the inputs for both for the outer and inner network, while the networks itself are object agnostic occupancy predictors. So we would end up having two encoded grids (one for outer and one for innter) for each unique object and two network that are sort of object agnostic, but acts on any given object to predict occupancy.

In order to accomplish this, I would need to modify the ray tracing pipeline on our system - record input data - for both outer and inner network and then attempt to train those networks

So my two action items are

Modify the existing pipeline to record data
Work on the network that can train on this

I think I can do action 2 pretty independently (the networks are simple) but to do 1, I would need some directions on how I can do this. What's the fastest way to get there? Are there any documents /sections of code in our repository that I can go through to better understand how to do this? Also, I will try to setup BRL CAD (https://github.com/BRL-CAD/brlcad) in the first place, which I will get started with

I know we are interested in the core hypothesis of - is this fast enough? If there is a way to get this data, without having to collect it ourselves (may be any opensource datasets that can be directly plugged in), if any such options exist, the core hypothesis on speed can be validated quicker? Any thoughts on this

Sean (Mar 27 2024 at 06:42):

@Matteo Balice and others, I have uploaded information and materials relevant to the neural rendering project to https://brlcad.org/design/neural

Sean (Mar 28 2024 at 11:41):

@Matteo Balice I looked over your proposal update and it looks really pretty good. I like that you identified some potential errors in the tamu team's approach. I think overall you have a good plan.

That said, I think you could make your proposal even better by more clearly identifying what the results of your project will be exactly. I love that you dedicated time to writing up results -- it's reasonable to spend time getting a paper out of this work given the nature of the work. In additional to a technical paper, though, what precisely will be the resulting products of your work in addition to things learned, which will be documented and which you underscored in your writeup.

Sean (Mar 28 2024 at 11:43):

Is the goal to implement a non-grid sampling method, train on that, and demonstrate the efficacy of using that method with "rt" or something similar? The tamu team resorted to a fixed view silhouette rendering via "rt" as their output target given they couldn't fully achieve generalization to 3D.

Sean (Mar 28 2024 at 11:45):

You mention nerf and potentially using different networks -- could definitely write more on what you mean there. I do suspect that the simple 2-layer FCN is inadequate for full generalization, but AMD's results certainly suggest otherwise may be possible.

Matteo Balice (Mar 28 2024 at 11:52):

@Sean Thank you for reading. Essentially, the goal is to achieve generalization, so we will try an approach aimed at improving the bounding box approach. However, even by adding this, I suspect that the neural network implemented by the students may not be sufficient to ensure good generalization for any view and thus for any arbitrary ray. For this reason, I believe (but it needs to be verified after obtaining the results of the first part) that it is reasonable to consider modifying the neural network with others more powerful to encode more information about the geometry of the 3D object. In any case, I will provide a better explanation in the proposal. Thank you.

Sean (Mar 28 2024 at 11:54):

An approach I think would be considerably more effective is generating training data based on spherical sampling. That in practice does a very good job to sample the volume unbiased and converges through potential shotlines. It's pretty simple to generate -- I just did that in our new "rtsurf" application.

Sean (Mar 28 2024 at 11:55):

Ends up looking a bit like this:
rppvalidation_dirs.png

Matteo Balice (Mar 28 2024 at 11:58):

Sean ha scritto:

Ends up looking a bit like this:
rppvalidation_dirs.png

yes i think it's the same of the bounding box approach used by tamu students or am I wrong?
image.png

Matteo Balice (Mar 28 2024 at 11:58):

I don't know why they call it bounding box even though it seems more like a sphere.

Matteo Balice (Mar 28 2024 at 12:04):

So it seems that Tamu students have previously employed this method to generate training data, but encountered issues with its generalization. I believe that the neural network might be lacking in its ability to extract all pertinent information from the geometry.

Sean (Mar 28 2024 at 12:07):

I have not looked into their code to see whether they're actually evaluating the bounding box or bounding sphere, but I do recall them saying all rays sample through the origin so it's not an unbiased sampling regardless.

Sean (Mar 28 2024 at 12:08):

that image there could also simply be hits on a sphere in a bounding box, heh. would have to double check that as well.

Sean (Mar 28 2024 at 12:13):

That is, I believe they were sampling the geometry like this: samples.png

The general assumption being they were sampling and trying to reconstruct simple shapes like a box, sphere, torus, etc.

Sean (Mar 28 2024 at 12:14):

i.e., box.png

Matteo Balice (Mar 28 2024 at 14:19):

@Sean you were right. I studied better what they did as sampling methods and it seems they used these two approaches: one is a grid approach (and ok we all agree it can't be used for generalization), the second is a mixed bounding box exactly like you said (ray origins were selected from a bounding sphere around the geometry, as well as from within the bounding sphere itself).
But there was another approach they never tested, the "pure" bounding sphere approach: this
method find a random point at radius distance away from the center of the geometry. This would serve as the ray origin. Then, it would find a different, random point at radius distance away from the origin. It would then determine the direction between the two points to determine the direction vector of the ray. It's pretty clear that this is the approach you suggested.
This morning I had understood that they had used this latest approach (so I assumed that neither with this sampling they could generalize), but I was wrong. So, this changes everything because there might not be a need to implement any more sophisticated neural network (but it all depends on the results we will have on thanks to this different spherical sampling).

Vidit Jain (Mar 28 2024 at 15:00):

@Daniel Rossberg I have updated my proposal and submitted it. I am looking forward for hearing from you.
Thanks

fall Rainy (May 23 2024 at 11:36):

While looking through previous work(https://brlcad.org/design/neural/), I noticed this file to what looks like a python version of the NIF implementation. Perhaps I could start by converting this job to a C++ version?Then in the meantime, I'll try to optimize it.

Vidit Jain (May 23 2024 at 12:28):

(deleted)

fall Rainy (Jun 09 2024 at 15:34):

I have some thoughts on why AMD chose a simple neural network. Recently, while deploying a Transformer model, I found that when there are too many parameters, the model's speed also noticeably decreases. Therefore, overly complex neural networks may sacrifice some efficiency.

fall Rainy (Jun 12 2024 at 12:22):

I have a question about '''rt_shootray()". When calling rt with default parameters, does it call rt_shootray only once for each pixel to calculate the RGB value, and does it not use the Monte Carlo algorithm at all during the process?

Sean (Jun 12 2024 at 15:27):

fall Rainy said:

I have some thoughts on why AMD chose a simple neural network. Recently, while deploying a Transformer model, I found that when there are too many parameters, the model's speed also noticeably decreases. Therefore, overly complex neural networks may sacrifice some efficiency.

Yes, and I mentioned something to that effect -- it was absolutely made that simple in order to achieve their realtime performance goal. What's still particularly amazing though is that it achieved such precise matching output on such a complex scene with so few parameters.

Sean (Jun 12 2024 at 15:31):

fall Rainy said:

I have a question about '''rt_shootray()". When calling rt with default parameters, does it call rt_shootray only once for each pixel to calculate the RGB value, and does it not use the Monte Carlo algorithm at all during the process?

It will call rt_shootray one for each primary ray -- which typically results in secondary rays as well for reflection, specular, light/shadow samples, etc. Rt is not a montecarlo renderer, but there are options like -H for hypersampling where there will be multiple samples per pixel. There are also different lighting modes and different renderers that employ different methods. For example -l7 uses photon mapping and the 'art' ray tracer is a PBR renderer based on appleseed.

fall Rainy (Jun 18 2024 at 06:14):

After the rendering is complete, I want to plot some sampling points. I think I should operate on this object. Is there any related function?

struct fb *fbp = FB_NULL; /* Framebuffer handle */

Sean (Jun 18 2024 at 20:56):

If rendering is complete do you mean 2d plotting over the image?? Or are you wanting to plot 3D points and render them as well?

Sean (Jun 18 2024 at 20:58):

If you’re just wanting to visualize some diagnostic info for debugging, you can make geometry (eg point cloud or spheres) and view that in mged or with rt, you can plot to 3D with annotation points, you could manually project 3D points to 2d and draw them

fall Rainy (Jun 19 2024 at 02:19):

Sean said:

If you’re just wanting to visualize some diagnostic info for debugging, you can make geometry (eg point cloud or spheres) and view that in mged or with rt, you can plot to 3D with annotation points, you could manually project 3D points to 2d and draw them

Sorry for not being clear. I just want to visualize these points, and I will try to plot them in a 3D view.

Erik (Jun 21 2024 at 22:54):

Happy Friday! did you figure out how to plot the points? Is there enough progress for a little show&tell?

fall Rainy (Jun 23 2024 at 04:37):

Erik said:

Happy Friday! did you figure out how to plot the points? Is there enough progress for a little show&tell?

I'm sorry for the late reply. I completed the drawing in a strange way...

fall Rainy (Jun 23 2024 at 04:37):

I do have some results to report, and I would like to know if my direction is correct. Is next Wednesday or Thursday okay?

fall Rainy (Jun 25 2024 at 07:22):

where is the center of model bounding sphere?

$0.5(mdl\_min+mdl\_max)?$

struct rt_i{
...
/* THESE ITEMS ARE AVAILABLE FOR APPLICATIONS TO READ */
point_t             mdl_min;        /**< @brief  min corner of model bounding RPP */
point_t             mdl_max;        /**< @brief  max corner of model bounding RPP */
point_t             rti_pmin;       /**< @brief  for plotting, min RPP */
point_t             rti_pmax;       /**< @brief  for plotting, max RPP */
double              rti_radius;     /**< @brief  radius of model bounding sphere */
struct db_i *       rti_dbip;       /**< @brief  prt to Database instance struct */

fall Rainy (Jun 25 2024 at 07:24):

And please let me know when will you're available to meet. I will be free after Tuesday.

fall Rainy (Jun 25 2024 at 14:09):

I have finished a few sample methods , this is uniform sphere sample:
sample.png

Erik (Jun 26 2024 at 00:52):

I can make time on Wednesday, or on Thursday until 1630EDT, or Friday after 0830EDT. But I think Sean is more aware of what's going on and it'd be more valuable to have him present?
the center of the bounding sphere is the same as the bounding box, um, I think the AABB is used more than the sphere. iirc, when I did the metaball primitive, I did a bounding box, then just said the bounding sphere was the same center and had a radius equal to the distance from a corner of the bounding box to the center?

fall Rainy (Jun 26 2024 at 04:16):

Okay, I'll ask Sean when he's available.

Erik said:

I can make time on Wednesday, or on Thursday until 1630EDT, or Friday after 0830EDT. But I think Sean is more aware of what's going on and it'd be more valuable to have him present?
the center of the bounding sphere is the same as the bounding box, um, I think the AABB is used more than the sphere. iirc, when I did the metaball primitive, I did a bounding box, then just said the bounding sphere was the same center and had a radius equal to the distance from a corner of the bounding box to the center?

fall Rainy (Jun 26 2024 at 09:15):

There looks to be a small error in rt/worker.c: there is no need to statistic colorsum for normal sample, just do it in hyper sample

if (hypersample == 0) {
...
VADD2(colorsum, colorsum, a.a_color);

} else {
/* hypersampling, so iterate */

Vidit Jain (Jun 26 2024 at 11:37):

@Daniel Rossberg @Himanshu I have created a new pull request addressing the selectPrimitive feature's bug here

fall Rainy (Jun 27 2024 at 06:26):

@Erik Is June 28 11.30(EDT) OK for you?

fall Rainy (Jun 30 2024 at 09:14):

I finished the whole process of neural network rendering and generated a not-so-good rgb map：
render.png

fall Rainy (Jun 30 2024 at 10:18):

why the direction of ray keep const when rendering? According to ray tracing algorithm, each ray should have a different direction:
RaysViewportSchema.png

fall Rainy (Jun 30 2024 at 14:46):

Is there a function to calculate the intersection of a ray and a sphere?

fall Rainy (Jul 01 2024 at 05:06):

fall Rainy said:

why the direction of ray keep const when rendering? According to ray tracing algorithm, each ray should have a different direction:
RaysViewportSchema.png

I get it , the default is to use parallel projection

fall Rainy (Jul 01 2024 at 14:01):

fall Rainy said:

Is there a function to calculate the intersection of a ray and a sphere?

I implemented the algorithm myself

fall Rainy (Jul 03 2024 at 13:15):

I found an interesting parper:NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis(https://dl.acm.org/doi/abs/10.1145/3503250)

Matteo Balice (Jul 03 2024 at 13:41):

fall Rainy ha scritto:

I found an interesting parper:NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis(https://dl.acm.org/doi/abs/10.1145/3503250)

Why do you think it is interesting?

fall Rainy (Jul 03 2024 at 14:24):

The Positional encoding, according to Rahaman's work On the Spectral Bias of Neural Networks, deep networks are biased towards learning lower frequency functions. They use positional encoding methods to solve this problem.

Matteo Balice (Jul 03 2024 at 16:38):

IMG_1921.jpeg
this is the ground truth

Matteo Balice (Jul 03 2024 at 16:39):

IMG_1922.jpeg
And this is with neural network prediction (bounding box). There are some issues to resolve

Matteo Balice (Jul 03 2024 at 16:41):

Probably they are mainly with rendering and not with the NN itself

Matteo Balice (Jul 03 2024 at 16:49):

Or maybe the number of samples in the dataset are too low.

Matteo Balice (Jul 05 2024 at 12:09):

I solved the issue with the rendering and got improvements

Matteo Balice (Jul 05 2024 at 12:09):

IMG_1929.jpeg

Matteo Balice (Jul 05 2024 at 12:10):

With n=100'000 rays I got 0.94% of accuracy which is not bad but the major problem is that with bounding sphere approach we need to sample a lot of rays in order to achieve a good accuracy for every angle of the camera.

Matteo Balice (Jul 05 2024 at 12:13):

The problem isn't with the Neural Network because I have a 0.99 accuracy with Training Set, but since we have an infinite number of rays that goes in all the directions in the bounding sphere, I think it is not possible to achieve the same results with the rendering without changing something because otherwise we need to sample an infinite number of rays to give to the NN (and of course it is not possible).

Matteo Balice (Jul 05 2024 at 12:15):

So we must be smarter than this and try to change representations of the input features or trying different architectures of NN

Matteo Balice (Jul 05 2024 at 12:16):

But first I want to try with n=1 million samples to see how good are the performances.

Matteo Balice (Jul 05 2024 at 12:20):

Matteo Balice ha scritto:

IMG_1921.jpeg
this is the ground truth

Remember that this is the ground truth.

Matteo Balice (Jul 05 2024 at 13:55):

IMG_1930.jpeg

Matteo Balice (Jul 05 2024 at 13:56):

This is with 1 million rays, I got the best epoch with 0.986 % of accuracy. We are on the right path :)

fall Rainy (Jul 05 2024 at 14:58):

I use a deep resnet to learn rgb value. The left is the result of neural rendering and the right is normal rendering
res_net.png

Matteo Balice (Jul 05 2024 at 14:59):

fall Rainy ha scritto:

I use a deep resnet to learn rgb value. The left is the result of neural rendering and the right is normal rendering
res_net.png

Which sampling approach have you used?

fall Rainy (Jul 05 2024 at 15:00):

Random sample, with 1million rays which have the same direction and hit the bounding sphere

fall Rainy (Jul 05 2024 at 15:01):

I want to improve network first and then sample method

Matteo Balice (Jul 05 2024 at 15:02):

Mh ok so it cannot be generalized with every angle at the moment... Predicting rgb is much more difficult, I see...

fall Rainy (Jul 05 2024 at 15:02):

The current network fits a continuous function, but the objective function is not actually continuous..

fall Rainy (Jul 05 2024 at 15:02):

We need a new structure...

Matteo Balice (Jul 05 2024 at 15:10):

I don't know if it can improve the results, but have you tried giving to the network only the rays that hit the object and not all the rays?
This can be done because first we predict with "my" network if it has hitted or not, so "your" network should predict only the ones that hit the object

fall Rainy (Jul 05 2024 at 15:13):

This a direction. I will try it later

fall Rainy (Jul 05 2024 at 15:13):

Can you tell which object you've hit?

fall Rainy (Jul 05 2024 at 15:15):

This would be very helpful to me, maybe I could train two networks

Matteo Balice (Jul 05 2024 at 15:19):

fall Rainy ha scritto:

Can you tell which object you've hit?

Regarding this, we must decide if we have to train one network for each object or not... I think deciding all this in a meeting would be more appropriate.

fall Rainy (Jul 05 2024 at 15:20):

OK, let's decide this later

fall Rainy (Jul 07 2024 at 06:42):

I got a great result with gridnet:
grid_methos
here is the loss:
loss

Matteo Balice (Jul 07 2024 at 06:59):

fall Rainy ha scritto:

I got a great result with gridnet:
grid_methos
here is the loss:
loss

Very good. This is grid + resnet?

fall Rainy (Jul 07 2024 at 07:24):

Just grid net. I will add resnet later.

Matteo Balice (Jul 07 2024 at 07:29):

fall Rainy ha scritto:

Just grid net. I will add resnet later.

What is the mean absolute error?I think it would be easier to understand how good is this model.

Matteo Balice (Jul 07 2024 at 07:30):

fall Rainy ha scritto:

I got a great result with gridnet:
grid_methos
here is the loss:
loss

And why do you think there are those white dots (noise) ?

Matteo Balice (Jul 07 2024 at 07:31):

Anyway, seeing your results, I think I will also add this grid method, I think it would improve my network too.

Matteo Balice (Jul 08 2024 at 10:03):

I tried with grid encoding but my network does not seem to improve any further. I will try different sampling approach.

Matteo Balice (Jul 08 2024 at 10:21):

I think that grid encoding is appropriate only when the direction of the rays is fixed, because if you divide your sphere in 256*256 = 65536 cells, then each cell will have an average number of 15 rays to be trained (if your training set has 1 million rays). The problem is that if your direction is fixed, then 15 rays for each cell can be acceptable but if the direction isn't fixed, then 15 is a very small number of rays to train.
I tried also with a different size of grid (20x20, 50x50, 100x100) and the NN performs like before more or less (98% accuracy with 20x20).
I also implemented different grid versions in the same NN (20x20, 50x50, 100x100, 256x256) and combined them together with some weights, but I got the same results...
So this makes me think that grid encoding isn't appropriate when rays are not fixed.

Matteo Balice (Jul 08 2024 at 17:38):

Today I improved results to 0.991 (test set) using a different optimizer.

Sean (Jul 08 2024 at 18:49):

fall Rainy said:

why the direction of ray keep const when rendering? According to ray tracing algorithm, each ray should have a different direction:
RaysViewportSchema.png

There are two different kinds of cameras -- perspective and orthogonal. Orthogonal is default for most engineering purposes and perspective is what you want for visualizations approximating human vision. Perspective rays (typically) diverge. Ortho are parallel grids.

For volumetric encoding, you almost certainly will want ortho or unbiased random.

Sean (Jul 08 2024 at 18:50):

fall Rainy said:

I finished the whole process of neural network rendering and generated a not-so-good rgb map：
render.png

That's actually not "terrible". Not great but not terrible to say the least. It's clearly recognizable.

fall Rainy (Jul 09 2024 at 16:18):

I also use a 4-Dimensions network to train, the inputs are both direction and position:
net.png

Sean (Jul 10 2024 at 13:46):

fall Rainy said:

I also use a 4-Dimensions network to train, the inputs are both direction and position:
net.png

That's clearly "seeing" the model in some sense, even getting some of the surfacing right but in a dream state. How many training epochs is that?

One thing you could try is to reduce the input dimensionality to just azimuth and elevation (2 floats). That's a much smaller space to optimize across and will result in the same centered visual for our current purposes.

fall Rainy (Jul 10 2024 at 13:50):

Oh, I have convert both direction and position to azimuth and elevation, so there are just four float inputs. It takes me one hour to train(with a 4060 GPU)

fall Rainy (Jul 10 2024 at 13:52):

I use a grid which has a shape of 12812864*64 to train. That's the maximum number of ginsengs my GPU can handle.

fall Rainy (Jul 10 2024 at 13:53):

I noticed this paper: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, they used a A100 to train and it costs more than two days.

fall Rainy (Jul 10 2024 at 13:55):

I'm trying to improve my net from the sampling methodology. Since I know more about active learning, I might train another network for sampling

Matteo Balice (Jul 10 2024 at 14:33):

fall Rainy ha scritto:

I noticed this paper: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, they used a A100 to train and it costs more than two days.

Yeah, this paper and its related works are the state of the art about neural rendering.

Sean (Jul 11 2024 at 20:31):

Close to state of the art -- I listened to that paper when it came out back in 2020. There's been a lot of work on NeRFs since then and a lot of advancements. That's still a rather different approach altogether that thus far hasn't generalized very well. That's why the newer AMD research is more the basis for this work.

Matteo Balice (Jul 11 2024 at 20:59):

Sean ha scritto:

Close to state of the art -- I listened to that paper when it came out back in 2020. There's been a lot of work on NeRFs since then and a lot of advancements. That's still a rather different approach altogether that thus far hasn't generalized very well. That's why the newer AMD research is more the basis for this work.

Do you know why was there a need to try a different approach rather than NeRFs?

Sean (Jul 11 2024 at 21:13):

Well fundamentally, nerfs are based on trying to obtain output visualizations of a 3D object based on just having some 2D view points (pictures). They only use the 3D in the test and evaluation phase, but the model is generally unknown or non-existent.

Sean (Jul 11 2024 at 21:15):

We have the 3D models. They aren't the unknown in our case. Our situation is really needing to know precisely where the 3D object is for given (unknown) view points. That's where the optimization and needs are a bit different. It's more about encoding and/or estimating a given model from the known model, but sufficiently abstracted that we can get accurate queries really fast.

Sean (Jul 11 2024 at 21:17):

We could certainly build up and train a NeRF by feeding it a set of renderings for the 3D model, to see if it can generalize it well enough, but every approach I've seen thus far is about achieving optically adequate results, not necessarily something that would pass any fidelity comparison with the ground truth 3D model.

Matteo Balice (Jul 11 2024 at 21:32):

Ok, understood. However, I believe that AMD research still lacks comprehensive volumetric information about 3D models, preventing the NN from predicting rays hit/miss with 100% precision if we wanna achieve full generalization. Perhaps we can draw inspiration from NeRF and its related works. I plan to update my work on GitHub tomorrow so that you can review and test what I have done so far.

Matteo Balice (Jul 12 2024 at 17:46):

@Sean Here there is the repo https://github.com/bralani/rt_volume/tree/neural_rendering. Please be sure you follow in order these steps:

Installation:

1) Download libtorch from here https://pytorch.org/ paying attention to build "preview nightly" and remember the version you have installed (debug or release):
image.png

2) Unzip the folder and put the libtorch folder wherever you want and then go to src\rt\CMakeLists.txt, line 56 and change these three paths with your libtorch path:
image.png

3) Now be sure to build "rt_trainneural" with "Debug" version if you have installed libtorch "Debug" or with "Release" if you have installed "Release" libtorch.

4) This step is not always required but in my case it was essential otherwise there were some issues in the execution: go to path_libtorch\lib, copy all the files inside this folder and paste these files in the build/bin (where there is the .exe) of brl-cad.

Ok now we can finally run the code with two steps:

Go to rt/train_neural.cpp and see these options:
image.png

TRAINING:
1) In the first step we generate the dataset with bounding sphere sampling (so make sure that opts.generate_dataset is true), set your db and obj as you want and set also the num of rays you want to generate (I suggest 1 million). Run the code.
2) You should see a file "train_neural.json" and "test_neural.txt" inside the path you have run the file (build/bin).
3) Go to rt/train.py and set on top your variables:
image.png
4) Run the script, the NN will stop on epoch 200 but on each epoch it validate on test set and if the accuracy of current epoch is higher than all the previous, it overwrites the model "model_sph.pt".

RENDERING:
1) Go back to rt/train_neural.cpp, set opts.generate_dataset = false, set properly opts.model_path with the path of the model trained, adjust the azimuth and elevation as you wish (to perform render) and then set opts.neural_render=0 if you want the ground truth rendering, otherwise set to 1 if you want to perform your neural rendering. :)

Matteo Balice (Jul 12 2024 at 17:52):

I guarantee that it works in Windows, I have not tried with any other OS.

Matteo Balice (Jul 12 2024 at 17:53):

I have also a Mac M1 but I have always a lot of errors when I try to build BRL-CAD...

Sean (Jul 12 2024 at 17:56):

Awesome, thank you @Matteo Balice I'll definitely be taking a deeper look at it later today, and see if I can get it up and running. I'm on M2, so will see if there are issues.

Sean (Jul 12 2024 at 17:56):

@Matteo Balice on Mac, you must enable -DBRLCAD_BUNDLED_LIBS=ON or the build will fail when it tries to use the system Tcl/Tk

Sean (Jul 12 2024 at 17:56):

(during cmake)

Sean (Jul 12 2024 at 17:57):

I already have pytorch installed and a brl-cad build, so should hopefully all just work!

Matteo Balice (Jul 12 2024 at 17:59):

Ahh ok I will try it now so you don't have any issues tomorrow on your M2.
I have also prepared a small script here that tries to find CUDA if you have a GPU or metal acceleration if you have any "M" series of Mac:
image.png

Sean (Jul 12 2024 at 20:35):

Matteo Balice said:

Ok, understood. However, I believe that AMD research still lacks comprehensive volumetric information about 3D models, preventing the NN from predicting rays hit/miss with 100% precision if we wanna achieve full generalization. Perhaps we can draw inspiration from NeRF and its related works. I plan to update my work on GitHub tomorrow so that you can review and test what I have done so far.

I'd like to hear more what you meant by this -- encoding comprehensive volumetric information isn't the goal, but some faithful encoding. A prediction with some general precision assertion (hopefully). Similar in the text space, we want a reasonably accurate response to an input prompt that is more than vague (blurry) writing.

Issue I have with radiance fields is the method isn't exactly aligned well with our available training data. We'd literally throw away information to then try to reconstitute it. The method is really a whole field that's trying to construct 3D where it did not exist previously (i.e., from photos or scans).

Sean (Jul 12 2024 at 20:37):

From my reading, the AMD research is compelling because it is just a bifurcated training of two networks, one for outside, one for inside/near, and lots of reductions made for the sake of adequate performance. My thinking is lets increase the network a bit, and see how well it can generalize.

Sean (Jul 12 2024 at 20:39):

On a related note, here's a nice site that summarizes a lot of the neural field papers -- definitely are concepts that are relevant in some of them: https://radiancefields.com/siggraph-2024-program-announced

Matteo Balice (Jul 12 2024 at 20:40):

I have already made just a little bit more complex the NN of AMD research and it generalizes very well.

Matteo Balice (Jul 12 2024 at 20:40):

This means that for each angle of the object, the NN is able to understand the shape.

Matteo Balice (Jul 12 2024 at 20:41):

But the problem is that on the boundaries of the objects, there are still some artifacts (for example the shape is smooth even though it should be sharp from the ground truth). It is not very precise there.

Matteo Balice (Jul 12 2024 at 20:42):

My idea was to substitute the grid encoding approach they used.

Matteo Balice (Jul 12 2024 at 20:42):

Because for me, the grid encoding was working only because they trained the NN with fixed directions of the rays. But in our case, the rays can have infinite directions from the same origin.

Matteo Balice (Jul 12 2024 at 20:45):

If you think about the grid encoding, every ray that hit a specific cell have a very similar origin since the direction is fixed.

Matteo Balice (Jul 12 2024 at 20:45):

But in our case this is not true.

Matteo Balice (Jul 12 2024 at 20:49):

I know that encoding volumetric informations are not the final goal, but I believe that the NN is not able to capture very well this details due to the lack of volumetric informations of the model itself. We could use for example an encoding like "voxels" because they are invariant to directions. This is just an idea for the moment, I don't know if it can work.

Matteo Balice (Jul 12 2024 at 20:54):

I will read some papers in these days about this.

Sean (Jul 13 2024 at 03:57):

@Matteo Balice That is sort of incorporating some of the concepts of a radiance field (what you're calling a grid encoding), also known as a volumetric encoding. Your intuition about the direction vectors being fixed does certainly sound plausible. While the image space was fixed, the rays themselves were scattering in nearly all directions due to the physically based lighting model they were using (lots of reflection rays, refraction rays, light sampling rays, diffuse surface rays, and more). Still, that is certainly a vast subset for the net to train on, and like I said, you have a reasoned impetus for trying to encode it volumetrically.

Sean (Jul 13 2024 at 03:58):

You'd definitely need something more descriptive than voxels unless we go pixar route to sub-pixel resolution, which is not practical (for lots of reasons).

Sean (Jul 13 2024 at 04:00):

You'd probably need something more like a vdb signed distance field (sdf) where you have voxel occupancy as well as surface direction vectors.

fall Rainy (Jul 14 2024 at 05:39):

@Matteo Balice AMD research doesn't use rays with fixed directions. They encoded both diection and position, then concatenated them to a vector:
AMD.png

fall Rainy (Jul 14 2024 at 05:48):

I think a different strategy could be used for grid coding. Replacing a 2-dimensional network with a 4-dimensional network, this grid coding encodes both positional and directional information in a single vector

Matteo Balice (Jul 14 2024 at 07:02):

Maybe I was not so clear. Yes, they encoded both direction and position BUT they trained with a fixed viewpoint:
Screenshot-2024-07-14-alle-09.00.20.png
This means that all the rays have more or less the same directions each training.
Screenshot-2024-07-14-alle-09.01.26.png

This is very different from our goal and I explained why it can not work in our case (from my hypothesis).

fall Rainy (Jul 14 2024 at 07:06):

I got it. Thanks

Matteo Balice (Jul 14 2024 at 07:06):

fall Rainy ha scritto:

I think a different strategy could be used for grid coding. Replacing a 2-dimensional network with a 4-dimensional network, this grid coding encodes both positional and directional information in a single vector

Can you elaborate more about this?

fall Rainy (Jul 14 2024 at 07:09):

I put my codes here:trainDir. I use a 128×128×64×64 net to encodes both dir and pos.

Matteo Balice (Jul 14 2024 at 07:15):

Mh I cannot quite understand this. Are you using a total of 128x128x64x64=67billions cells?

Matteo Balice (Jul 14 2024 at 07:16):

This is quite overfitting because you will get train error 0 since you have less rays than cells.

Matteo Balice (Jul 14 2024 at 07:17):

Or am I wrong?

fall Rainy (Jul 14 2024 at 07:17):

Yes... I am trying to improve it.

Matteo Balice (Jul 14 2024 at 07:17):

Maybe it can work only if the viewpoint is fixed.

fall Rainy (Jul 14 2024 at 07:17):

There are too many parameters

fall Rainy (Jul 14 2024 at 16:17):

I've been doing a lot of experimenting with grid net lately, and it's hard for him to predict rays in all directions, which may require a lot of parameters. One of the big problems is that the objective function is non-differentiable

fall Rainy (Jul 14 2024 at 16:20):

Instead of using nif directly for rendering, AMD just uses NIF to intersect to speed up the rendering process. The focus is really on intersection, which is effective for rendering complex geometry.

fall Rainy (Jul 14 2024 at 16:25):

Nerf is actually rendered differently than we are, they are using volume rendering while we are doing ray tracing.

fall Rainy (Jul 14 2024 at 16:27):

I'd like to try to modify the objective function later in order to make ray-tracing a differentiable process, and if all the data for the model is known, can I then return exactly which point was hit, and the distance between the hit point and the origin?

Matteo Balice (Jul 14 2024 at 17:45):

@Sean I confirm that on my Mac m1 it works well now.

Sean (Jul 14 2024 at 18:06):

Matteo Balice said:

Maybe I was not so clear. Yes, they encoded both direction and position BUT they trained with a fixed viewpoint:

I certainly got what you meant, and there's certainly truth in both statements. The fact that the training is happening across two networks and with rays very much scattered in all directions does to me indicate that it likely can generalize (with more parameters). They did not pick a simple scene to say the least, and their sampling was not at a low resolution. The fixed viewpoint is what let them achieve their target performance, but I don't think that their approach was really indicative of an overtrained solution. On the contrary, they showed how well it performed on other viewpoints in their talk (they just don't go into that detail on their paper -- that's another paper).

Matteo Balice (Jul 14 2024 at 18:29):

Sean ha scritto:

Matteo Balice said:

Maybe I was not so clear. Yes, they encoded both direction and position BUT they trained with a fixed viewpoint:

I certainly got what you meant, and there's certainly truth in both statements. The fact that the training is happening across two networks and with rays very much scattered in all directions does to me indicate that it likely can generalize (with more parameters). They did not pick a simple scene to say the least, and their sampling was not at a low resolution. The fixed viewpoint is what let them achieve their target performance, but I don't think that their approach was really indicative of an overtrained solution. On the contrary, they showed how well it performed on other viewpoints in their talk (they just don't go into that detail on their paper -- that's another paper).

Yes I agree with you, their approach is certainly able to generalize and I have proved it but we need to add a more complex encoding like we were saying yesterday to reach the maximum accuracy. I will study more about the papers you gave us and expecially with sdf.

Matteo Balice (Jul 15 2024 at 16:06):

Before implementing a more complex encoding (like sdf), today I tried first with positional encoding of Nerf's work. The idea is that MPL neural networks perform poorly at representing high-frequency variation in geometry. Mapping the inputs to a higher dimensional space using high frequency functions before passing them to the network enables better fitting of data that contains high frequency variation.

I got an improvement of 0.02% of accuracy, from 99.1% to 99.3%. These are two renders of the same objects but with different angles (and of course it's the same model without retraining):

output0_ground.png
output0_pred.png
output1_ground.png
output1_pred.png

Matteo Balice (Jul 15 2024 at 16:15):

Left is ground truth, right is prediction.

Matteo Balice (Jul 15 2024 at 16:17):

Major issues concern the boundaries, which are not as sharp as they should be but rather tend to be smooth.

Matteo Balice (Jul 15 2024 at 16:22):

I believe both of these issues can be resolved by using a different and more complex encoding approach, as we discussed. @Sean @fall Rainy

Matteo Balice (Jul 15 2024 at 21:59):

Another idea I was thinking to help the NN is to modify a bit the sampling method. Now I use a totally randomic sampling around the bounding sphere but we can use a smarter approach using an importance sampling to sample more the regions where we are more uncertain.
There are 4 steps:
-first we sample randomly using the same approach as now.
-Second, we can divide in N cells (like a grid) the rays basing on their origin and directions so that very similar rays will be found in the same cell.
-Third, we calculate for each cell the uncertainty (very easy to calculate).
-Lastly, We resemple but this time using the uncertainty in order to gather more samples in regions where we are more uncertain.

Do you think it could work?

Sean (Jul 15 2024 at 22:36):

Think those percentages might be a little misleading? Is it taking all the black background into account also? If we count up expected hits vs predicted hits, that output0 in particular looks considerably more than 1% deviated.

Sean (Jul 15 2024 at 22:41):

Importance sampling would be good, but that's typically done as an optimization -- Would need to see a graph over epochs to see if this actually could converge onto the solution (even if over-trained). Can you make a graph?

That said, I think it might help with some of the perimeter and higher frequency detail, but it's really easy to get the sampling ever so subtle wrong and introduce bias or error. Kind of what to see proven that we can get a robust fit, that a given network topology is capable of precise match before going down that route.

fall Rainy (Jul 16 2024 at 02:57):

@Sean Can we treat the image as a whole, rather than a single pixel, so that we can use the filtering algorithm to do some post-processing?

Matteo Balice (Jul 16 2024 at 06:45):

Sean ha scritto:

Think those percentages might be a little misleading? Is it taking all the black background into account also? If we count up expected hits vs predicted hits, that output0 in particular looks considerably more than 1% deviated.

Yes you are right. The dataset is unbalanced (more black than white) and this is the reason why accuracy is not the best in this case (I print also precision recall and better F1). But I remember that previous students used accuracy as the main metric and I wouldn’t change it. If you are interested in F1, it is 0.988 (a bit lower)

Matteo Balice (Jul 16 2024 at 06:45):

Sure, I can plot a graph over epochs.

Matteo Balice (Jul 16 2024 at 11:14):

I want to discuss just a moment about metrics. Do you think precision, recall or F1 is the most important one in our case?

Just to remember (I know you are familiar with all of these):

precision is the percentage of true and predicted hitted rays among those predicted and hitted.
recall is the pecentage of true predicted hitted rays among all true hitted rays.
F1 is an armonic mean between precision and recall.

I think both precision and recall are important in our case, so I believe that F1 is the most significant one.

Matteo Balice (Jul 16 2024 at 11:15):

Or maybe do you think some others metrics I have not mentioned are more relevant in our case?

Matteo Balice (Jul 16 2024 at 11:38):

This is the plot of the model of yesterday (so without importance sampling):
download.png

As we can notice, as we increase the number of epochs we get an average F1 between 0.98 and 0.99, so we can say that our model has an average error of 1.5%.

Matteo Balice (Jul 16 2024 at 11:39):

Today I will implement importance sampling to see if we get improvements.

Matteo Balice (Jul 16 2024 at 16:22):

And this is the behaviour with importance sampling (a moderate importance sampling):
download.png

Matteo Balice (Jul 16 2024 at 16:24):

I tried also with a more aggressive importance sampling and I got an improvement of F1: 0.991, accuracy: 0.994. So overall:
importance sampling helps the model to understand better the regions more uncertain but it is not sufficient alone to achieve best results.

Matteo Balice (Jul 16 2024 at 16:25):

So it means that we need a more complex encoding or a more complex NN. For the moment I will focus on exploring encoding based on sdf.

Sean (Jul 18 2024 at 03:00):

fall Rainy said:

Sean Can we treat the image as a whole, rather than a single pixel, so that we can use the filtering algorithm to do some post-processing?

To what end, what exactly do you mean? If the end result is pixel approximation, then it will be potentially useful for visualization purposes (only). That has use, but it's definitely a different target. The distinguishing feature that makes this challenge is replacing rt_shootray() with a neural net or the slightly higher level do_pixel(). Going full image robustly might allow for a (real-time) preview.

fall Rainy (Jul 18 2024 at 03:52):

The image rendered with neural network will have noise, if I can use the denoising algorithm after generating the image, the effect will be good, the left side of the image rendered by neural network, the right side is the image
denose.png
after denoising

Matteo Balice (Jul 18 2024 at 12:50):

Ok after studying some papers about sdf/nerf/gaussian splatting I have understood that one key information I do not actually use in the NN is the direction itself of the ray. Currently as input of the NN I use the spherical coordinates of the first and second intersection but not the direction vector.
Before introducing any further and more complex encoding I need to add this new information as input of the NN because all these methods rely on the direction.
I was thinking that we can use a smart idea to help the NN: we are not interested in the orientation of the vector direction but only on its direction.
Direction-bounding-sphere.jpeg.png
Imagine a vector in 2D, the maximum range we can have is from 0 to 180 (red region) since we are interested only on its direction. So, if the vector is in the green area, all we need to do is reflecting its angle to the positive half space.
The same idea can be applied to 3D vectors in spherical coordinates (with a fixed radius).
I think that using these new input features will be beneficial for two reasons:
1) we could use some more complex encoding based on sdf/gaussian that they all rely on the concept of direction.
2) using these input features, we will have as the new input of our NN the first intersection on the bounding sphere and this new vector direction. The advantage is that the range of theta and phi (of the new vector direction) is smaller than the previous input features that used the full ranges in spherical coordinates of the second intersection on the bounding sphere. This should help the NN because it will have less input space to analyse but without loss of informations.

PS: we could even try grid encoding associated to direction and see how it works.

fall Rainy (Jul 18 2024 at 15:54):

Is there any way to know exactly which object the light hit?

Matteo Balice (Jul 19 2024 at 14:32):

Ok I have added also the direction of the ray in the input features. As I was expecting, the NN does not improve adding the direction itself because we are not adding more informations but we are only using different features (I have also tried grid encoding).
But the advantage is that now we can implement a more complex encoding that rely on the concept of direction.

Matteo Balice (Jul 19 2024 at 14:58):

I have 2 ideas about encodings to try:

first idea (and the best one in my opinion) is similar to the work of DeepSDF. The idea is to train latent vectors to predict the sdf for each ray. Then this latent vectors will be the input of our NIF architecture. The problem is that we need to compute the sdf for each ray in the sampling approach. I don't know if there is a fast way to do it. @Sean @Erik
Note that the sdf will be used only to train the latent vectors, so the NIF architecture won't use the sdf but only the latent vectors.
second one (more difficult than the first) is a similar work of Neural pull. In this case we do not have ground truth sdf, so the idea is to train latent vectors to "pull" rays towards the nearest surface by using the predicted signed distance values and their gradients, which the network computes. The movement of each rays is determined by the predicted distance and can be either towards or away from the surface, depending on the sign of the distance.

In both the cases, we need to choose an arbitrary number of latent vectors that will be associated to each direction (and this is the reason why I have added the direction to the input features).

I believe that if we can calculate the sdf for each ray the first encoding will be more efficient. I wait for your opinion. @Sean @Erik

fall Rainy (Jul 22 2024 at 04:46):

I have read a lot of papers, nerf, nerf++,etc...and I decided to use the methodology in this paper:3D Gaussian Splatting for Real-Time Radiance Field Rendering

fall Rainy (Jul 22 2024 at 06:01):

Matteo Balice said:

I have 2 ideas about encodings to try:

first idea (and the best one in my opinion) is similar to the work of DeepSDF. The idea is to train latent vectors to predict the sdf for each ray. Then this latent vectors will be the input of our NIF architecture. The problem is that we need to compute the sdf for each ray in the sampling approach. I don't know if there is a fast way to do it. Sean Erik
Note that the sdf will be used only to train the latent vectors, so the NIF architecture won't use the sdf but only the latent vectors.

second one (more difficult than the first) is a similar work of Neural pull. In this case we do not have ground truth sdf, so the idea is to train latent vectors to "pull" rays towards the nearest surface by using the predicted signed distance values and their gradients, which the network computes. The movement of each rays is determined by the predicted distance and can be either towards or away from the surface, depending on the sign of the distance.

In both the cases, we need to choose an arbitrary number of latent vectors that will be associated to each direction (and this is the reason why I have added the direction to the input features).

I believe that if we can calculate the sdf for each ray the first encoding will be more efficient. I wait for your opinion. Sean Erik

I'm reading some papers on sdf and have some questions, sdf is used to represent geometric objects, how to represent ray with sdf?

Matteo Balice (Jul 22 2024 at 06:29):

I have currently implemented 3d Gaussian ahah

Matteo Balice (Jul 22 2024 at 06:32):

fall Rainy ha scritto:

Matteo Balice said:

I have 2 ideas about encodings to try:

first idea (and the best one in my opinion) is similar to the work of DeepSDF. The idea is to train latent vectors to predict the sdf for each ray. Then this latent vectors will be the input of our NIF architecture. The problem is that we need to compute the sdf for each ray in the sampling approach. I don't know if there is a fast way to do it. Sean Erik
Note that the sdf will be used only to train the latent vectors, so the NIF architecture won't use the sdf but only the latent vectors.

second one (more difficult than the first) is a similar work of Neural pull. In this case we do not have ground truth sdf, so the idea is to train latent vectors to "pull" rays towards the nearest surface by using the predicted signed distance values and their gradients, which the network computes. The movement of each rays is determined by the predicted distance and can be either towards or away from the surface, depending on the sign of the distance.

In both the cases, we need to choose an arbitrary number of latent vectors that will be associated to each direction (and this is the reason why I have added the direction to the input features).

I believe that if we can calculate the sdf for each ray the first encoding will be more efficient. I wait for your opinion. Sean Erik

I'm reading some papers on sdf and have some questions, sdf is used to represent geometric objects, how to represent ray with sdf?

The idea is to find the point on the ray such that it has the minimum distance to the surface. See the ray marching algorithm or sphere tracing.

Matteo Balice (Jul 22 2024 at 13:05):

Matteo Balice ha scritto:

I have currently implemented 3d Gaussian ahah

I am trying with a probabilistic approach because I wasn't sure how to correctly extract the sdf for rays.
My idea is to use an autoencoder (giving as input only the direction) so as to encode in an embedding a number n of gaussians that represent the shape of the object for that direction.

Matteo Balice (Jul 22 2024 at 13:07):

In principle we do not need to encode all the "pixels" for a given direction in an embedding, but we need only to encode those areas which are the most uncertain.

Matteo Balice (Jul 22 2024 at 13:07):

So I think I will merge this idea with my previous network which was good apart from the boundaries of the object.

Matteo Balice (Jul 22 2024 at 13:12):

I made few tries in python so as to take confidence with gaussian splatting (for example I tried approximating an image using n gaussians) and I was really impressed how good is this technique.

fall Rainy (Jul 22 2024 at 14:15):

Matteo Balice said:

So I think I will merge this idea with my previous network which was good apart from the boundaries of the object.

I agree with you. The problem is finding the boundaries.

Matteo Balice (Jul 22 2024 at 14:17):

I have an idea about this. Since I use a sigmoid activation function as last layer to predict hit/miss it is somehow an information of the uncertainty of the prediction.

Matteo Balice (Jul 22 2024 at 14:18):

So if it gives me a number around 0.5 it means that it is uncertain, so the gaussian should have more weight on that area.

fall Rainy (Jul 22 2024 at 14:22):

I think consider using Bayesian optimization

Matteo Balice (Jul 22 2024 at 14:23):

About this, I would also calculate what is the mean probability that the NN gives me to each missclassified ray to check whether it can work.

fall Rainy (Jul 22 2024 at 14:23):

Bayesian networks can output both mean and variance simultaneously

Matteo Balice (Jul 22 2024 at 14:23):

what is your idea

fall Rainy (Jul 22 2024 at 14:24):

I'm going a little slow, and I'm still considering how to integrate 3dgs

fall Rainy (Jul 22 2024 at 14:25):

I've completely given up on grid net.

Matteo Balice (Jul 22 2024 at 14:25):

If you want to know how I want to implemented it, i use an embedding of N gaussians for each direction

Matteo Balice (Jul 22 2024 at 14:26):

each gaussian can have a number of parameter that you want

fall Rainy (Jul 22 2024 at 14:26):

Can I refer to your code?

Matteo Balice (Jul 22 2024 at 14:26):

but for simplicity I use only mean, variance and I think I should add also a weight

Matteo Balice (Jul 22 2024 at 14:26):

fall Rainy ha scritto:

Can I refer to your code?

Ok later I will upload to github

fall Rainy (Jul 22 2024 at 14:27):

Ok, thx

Matteo Balice (Jul 22 2024 at 14:27):

I tried with a small number of rays (10000) and a small number of gaussians for each direction and the NN is perfectly able to discretize all the rays

Matteo Balice (Jul 22 2024 at 14:28):

The good thing is that I do not need to add any grid encoding to separate each direction, because the autoencoder is able to output the embedding in a continuos way since the input I give to it (only direction) is continuos

fall Rainy (Jul 22 2024 at 14:29):

That sounds great. I'd like to replace the whole rendering process with a 3dgs approach, which might turn into a rasterized rendering

Matteo Balice (Jul 22 2024 at 19:54):

fall Rainy ha scritto:

Can I refer to your code?

i have uploaded the code for gaussian splatting.
https://github.com/bralani/rt_volume/blob/neural_rendering2/src/rt/gaussian_splatting.py

Matteo Balice (Jul 23 2024 at 21:48):

Today I made another improvement with gaussian splatting. I decided to associate a single gaussian to each positive hit of ray in the training set with a large variance. Each gaussian has a mean (origin_theta, origin_phi, dir_phi, dir_theta) and a variance (variance_theta, variance_phi, dir_phi, dir_theta). Dir_phi and dir_theta are set to a fixed number of 0.05 because in my opinion we can save some memory in this way and the training process will be faster. The only thing the neural network is supposed to do is to find the maximum variance of each gaussian so that also the negative hits are correctly classified.
The reason why we want to find the largest variance of gaussians is because of overfitting issues.

Matteo Balice (Jul 23 2024 at 21:50):

This works pretty good with 10000 rays but the problem is that if we increase just a bit the number of examples in the training set, the NN will become very complex in the number of parameters

Matteo Balice (Jul 23 2024 at 21:52):

I believe that this approach has a lot of potential

Matteo Balice (Jul 23 2024 at 21:54):

Tomorrow I will focus on reducing the number of gaussians without losing accuracy

Matteo Balice (Jul 24 2024 at 17:16):

There is a bottleneck of the current NN because as I increase the number of examples in the training set, it goes out of memory

Sean (Jul 24 2024 at 17:16):

@Matteo Balice Can you recap the structure of the computations and memory involved with the gaussian approach? How is that related to the Encoder/Decoder networks you have/had in your gaussian_splatting.py

Matteo Balice (Jul 24 2024 at 17:17):

Matteo Balice (Jul 24 2024 at 17:18):

I should update that file, anyway

Matteo Balice (Jul 24 2024 at 17:18):

The network is very simple: there is an encoder (which has the task to produce the embedding) and a decoder which has the task to produce the output (a probability between 0 and 1)

Sean (Jul 24 2024 at 17:19):

I'm seeing a 6-layer fully connected network there, where the layers are pretty hefty number of weights in total, which would explain the memory explosion

Matteo Balice (Jul 24 2024 at 17:20):

Sean ha scritto:

I'm seeing a 6-layer fully connected network there, where the layers are pretty hefty number of weights in total, which would explain the memory explosion

It is the old version, I don't use any layer anymore.

Matteo Balice (Jul 24 2024 at 17:21):

I have uploaded the new version now

Sean (Jul 24 2024 at 17:22):

Okay. In the old, looks like approximately 4MB of memory for that latent_dim=100 construction on just the encoder side.

Sean (Jul 24 2024 at 17:23):

Sorry, way off... that's better.

Matteo Balice (Jul 24 2024 at 17:24):

Ok, in this new version I associate to each gaussian 4 numbers for the mean and 4 for the variance.

Matteo Balice (Jul 24 2024 at 17:24):

And the number of gaussians are proportional to the number of positive hit in the training set.

Matteo Balice (Jul 24 2024 at 17:25):

When I use 10k samples it works fine, with 100k is very slow and with 1 million it goes out of memory

Matteo Balice (Jul 24 2024 at 17:29):

Maybe I should not associate a single gaussian to each positive hit in the training set but I should randomly take a subset of positive hits...

Sean (Jul 24 2024 at 17:33):

So you're using 10k ray samples currently right? Is that your number of embeddings?

Matteo Balice (Jul 24 2024 at 17:34):

With 10k rays, half are positive hits so about 5k are the embeddings

Sean (Jul 24 2024 at 17:34):

Okay, but worst case it's 10k?

Matteo Balice (Jul 24 2024 at 17:35):

yes

Sean (Jul 24 2024 at 17:35):

or is there more on disk? You have it actually using whatever is in the data folder

Sean (Jul 24 2024 at 17:36):

(just need to make sure you don't have a json with 100M lines or something)

Matteo Balice (Jul 24 2024 at 17:36):

my json has 1 million data

Matteo Balice (Jul 24 2024 at 17:36):

but I take randomly only 10k samples

Sean (Jul 24 2024 at 17:37):

Heh, okay .. but you're still creating an Encoder based on embeddings, which is based on how much is in your json, not how many samples, unless I'm reading this differently

Sean (Jul 24 2024 at 17:38):

if you print(embeddings.shape[0]) in get_embeddings, what's that report?

Matteo Balice (Jul 24 2024 at 17:39):

4683

Matteo Balice (Jul 24 2024 at 17:39):

Line 47 i cut the json, so I take only 10k examples

Sean (Jul 24 2024 at 17:39):

Okay, so it is hitting the else case in the constructor

Matteo Balice (Jul 24 2024 at 17:39):

yes

Sean (Jul 24 2024 at 17:43):

So in my back-of-napkin calculations, you're really not using much memory at all, nothing that explains running out.

Matteo Balice (Jul 24 2024 at 17:44):

With 10k or with 1M?

Sean (Jul 24 2024 at 17:44):

The autoencoder network is trivial as you noted, about 0.25MB total (which is dubious for any real model)

Sean (Jul 24 2024 at 17:45):

I don't (yet) see where you're actually accruing memory in the test iterations.

Sean (Jul 24 2024 at 17:45):

Unless pytorch is doing something under the hood that isn't being used but is growing

Matteo Balice (Jul 24 2024 at 17:46):

It does not even start training with 1M

Sean (Jul 24 2024 at 17:48):

You mean all you change is num_epochs = 1000000 and it dies?

Sean (Jul 24 2024 at 17:49):

That doesn't add up

Matteo Balice (Jul 24 2024 at 17:49):

not epochs

Matteo Balice (Jul 24 2024 at 17:49):

I comment the line 47, so I load all the json

Sean (Jul 24 2024 at 17:49):

Oooh oh, gotcha -- so you're chaning the [:10000] to other values, how much data, how many embeddings

Matteo Balice (Jul 24 2024 at 17:50):

yes

Matteo Balice (Jul 24 2024 at 17:51):

In the real paper of gaussian splatting they associate a single embedding to each example but I don't understand how they don't run out of memory

Sean (Jul 24 2024 at 17:55):

I don't see how you're running out of memory.

Sean (Jul 24 2024 at 17:55):

there must be some bug or cleanup issue that is ballooning. Even with 1M samples, that's only about 38MB of data

Sean (Jul 24 2024 at 17:55):

you surely have more than that available :)

Matteo Balice (Jul 24 2024 at 17:56):

how do you calculate it?

Sean (Jul 24 2024 at 17:57):

with double precision, your 5 origin+dir+label tensors consume just 40 bytes

Matteo Balice (Jul 24 2024 at 17:58):

mmmm, maybe the issue is with the gradient of pytorch

Matteo Balice (Jul 24 2024 at 17:59):

I remember I had this problem a long time ago

Matteo Balice (Jul 24 2024 at 17:59):

I must do some checks, thanks

Sean (Jul 24 2024 at 17:59):

at 5000 embeddings, that's not even 1MB for the autoencoding (embeddings + embedding params + proxy vars)

Matteo Balice (Jul 24 2024 at 18:00):

got it

Sean (Jul 24 2024 at 18:00):

even if it scaled linearly, to 500000 embeddings, that'd be 100MB max

Matteo Balice (Jul 24 2024 at 18:01):

The json file is 100MB, so you are right

Sean (Jul 24 2024 at 18:02):

Well that's text, but even as 8-byte double precision floats there's just not enough data

Sean (Jul 24 2024 at 18:02):

I'd suggest adding some print or pause statements and watch the process memory usage, see if some particular operation is increasing usage substantially

Matteo Balice (Jul 24 2024 at 18:03):

Ok thanks 👍🏻

Sean (Jul 24 2024 at 18:03):

got to be something relatively simple. If you were getting to iterations, I would suggest adding a gc.collect() or something to ensure python has opportunity to purge, but clearly something is going on before that even if it doesn't get to iterations.

Sean (Jul 24 2024 at 18:04):

I don't see it yet, but something is consuming gobs of memory

Matteo Balice (Jul 24 2024 at 18:04):

Yes it is really strange

Matteo Balice (Jul 24 2024 at 18:05):

Thanks

Sean (Jul 24 2024 at 18:07):

Like I could totally see if it your constructor data hash was making 1M copies of all those string hash keys... but those are reset on each iteration of self.datas.

Sean (Jul 24 2024 at 18:08):

maybe some linear overhead of torch.Tensor, but that doesn't make sense to me

Sean (Jul 24 2024 at 18:09):

any change if you replace them with:

origin = torch.tensor(data["point1_sph"], dtype=torch.float64)
dir = torch.tensor(data["dir_sph"], dtype=torch.float64)
label = torch.tensor([data["label"]], dtype=torch.float64)

Matteo Balice (Jul 24 2024 at 18:10):

I have not anymore the pc with me

Matteo Balice (Jul 24 2024 at 18:10):

I will try later

Sean (Jul 24 2024 at 18:12):

Okay. I'd try that but then also exit before the training use and see if you can get a pause before exiting to see how much memory the app is actually using before RayDataset, after RayDataset construction, and after Autoencoder construction, see where it balloons out.

Matteo Balice (Jul 24 2024 at 18:13):

Yep it is what I am going to try 👍🏻

Sean (Jul 24 2024 at 18:20):

Came across this interesting high-level article posted today, nice generic tutorial... https://gpuopen.com/learn/deep_learning_crash_course/

Matteo Balice (Jul 24 2024 at 22:02):

Screenshot-2024-07-24-alle-23.59.31.png
Ok I got it. @Sean You were right on all the estimations of the memory. The problem is in the decoder which you hadn't seen, when I calculate the probability of the examples that belongs to each gaussian.
I had implemented broadcasting (to speed up calculations) so I clone the embeddings (gaussian) n times where n is the batch size (number of examples).
In this image the batch size was of 256 examples.

Matteo Balice (Jul 24 2024 at 22:04):

Broadcasting in this case isn't probably a smart way to speed up calculations since a lot of gaussians are totally useless for the calculations of the probability for the current examples, but only the closest one to the current example are relevant.

Matteo Balice (Jul 25 2024 at 13:23):

I have solved the broadcasting issue that run out memory and I have first results. For the moment I am trying with only 10k rays because the calculation of probability is pretty slow.

Here the best epoch of the previous NIF network with only 10k rays:

F1: 0.880218316493941
Accuracy: 0.9283117186132877
Precision: 0.9087676930301201
Recall: 0.8534080886985007

Here, instead the best epoch with this new gaussian splatting network with 10K rays and only 1k gaussians embeddings (I decided to lower the number of gaussians to better explain the power of this model):

F1 Score: 0.9155
Accuracy: 0.9261
Precision: 0.9316
Recall: 0.9000

As you can notice, the gaussian splatting architecture has more power than the NIF architecture BUT it is way slower.
The interesting part is that I achieved this result in only 11 epochs, so the training is not slow but the inference is slow (the action from the start of the neural network to giving back the prediction).

For this reason, I will now focus on speeding up the gaussian splatting architecture. I believe I should read some papers about real-time renderings with gaussian splatting so as to achieve this speed up.

One idea I have is cutting some gaussians basing on the input I want to predict. (ie the furthest ones from the ray I want to predict).

P.S: the results I have showed here are not close to the ones of the previous week (accuracy: 0.994 and F1: 0.991) just because they were achieved with 1 Million examples. Here I use only 10k rays just to compare NIF architecture with Gaussian Splatting.

Matteo Balice (Jul 26 2024 at 18:38):

Today I had implemented parallelization of inference process and the training process is much faster.
However, I had another problem with the covariance matrix. To calculate the pdf of a gaussian we need to invert the covariance matrix but when it is close to singularity, it is not possible to invert it. The problem is that the covariance matrix for all the gaussians has small values due to the nature of the problem (if you imagine each gaussian as an ellipsoid in 3D, it will have very small magnitude).
To mitigate this problem I decided to use double precision (float64) and for the moment it seems to perform better.

Matteo Balice (Jul 26 2024 at 18:42):

Do you know any other method to solve this issue of matrix ill-conditioning?

Matteo Balice (Jul 26 2024 at 18:46):

Apart from this, tomorrow I will train with more examples to see the true results of this model.

Matteo Balice (Jul 27 2024 at 23:13):

Today I had further speed up the inference process by taking only the closest gaussians to the example to predict. I had also tried with more examples (100k) and the metric F1 improves even to 94-95%. However, it seems more difficult to improve this result with the current architecture. My opinion is due to the number of gaussians that now are fixed for all the training. The original paper (3DGS) uses instead a variable number of gaussians that can be lowered during the training process (if some gaussians have a very small variance) or it can be increased if the accuracy is low.

Sean (Jul 28 2024 at 00:10):

That's sounding a lot better. I think it'll still need to get into the 99% realm, but that's a distinct improvement. What's that 94%-95% look like?

Sean (Jul 28 2024 at 00:11):

By the way, had a lovely discussion with one of the authors of AMD's Neural Intersection Function paper today. He said they've made some headway on generalizing themselves, but that's obviously a hard problem. He's probably going to publish on it next year.

Sean (Jul 28 2024 at 00:15):

One thought I had, and it's obviously in a different direction but maybe applicable is what if we try integrating over multiple networks. That is, use the simple network they demonstrated works well for a single view, but then lets create one for 32 (or 32768) views, get estimated in-hit points from all of them, and integrate spatially.

Sean (Jul 28 2024 at 00:16):

That's actually not terribly dissimilar from the NeRF approach, but the goal would not be a radiance field. It would be like a surface occupancy field, or a point cloud with weights.

fall Rainy (Jul 28 2024 at 11:42):

This is really possible. I recently saw some algorithms for interpolation between multiple pictures. Maybe we can predict in key directions and then make differences in other directions.

fall Rainy (Jul 28 2024 at 13:27):

If the direction is contant, network is learning a func like:
2024-07-28-212551.png

fall Rainy (Jul 28 2024 at 13:29):

Interpolated fits between different perspectives have been done before:
2024-07-28-212847.png
from VIINTER: View Interpolation with Implicit Neural Representations of Images

Matteo Balice (Jul 28 2024 at 14:29):

Sean ha scritto:

That's sounding a lot better. I think it'll still need to get into the 99% realm, but that's a distinct improvement. What's that 94%-95% look like?

I have not rendered any frame for gaussian splatting network yet because if the percentage are under 0.97 - 0.98, the predicted image is very far from the true one. I prefer first to achieve better results and then plotting it.

Matteo Balice (Jul 28 2024 at 14:35):

Sean ha scritto:

One thought I had, and it's obviously in a different direction but maybe applicable is what if we try integrating over multiple networks. That is, use the simple network they demonstrated works well for a single view, but then lets create one for 32 (or 32768) views, get estimated in-hit points from all of them, and integrate spatially.

This idea is not far from the idea of grid encoding of direction which I have already tried without any improvement. In that case I used embeddings for each direction, you suggest using a network for each direction. I can try it. :thumbs_up:

Matteo Balice (Jul 28 2024 at 14:54):

fall Rainy ha scritto:

Interpolated fits between different perspectives have been done before:
2024-07-28-212847.png
from VIINTER: View Interpolation with Implicit Neural Representations of Images

This could work but it means that we have to change the sampling to fixed view sampling. I have some doubts about the number of views we should sample because:
If we choose for example 1 million rays in the training set and we choose a reasonable number of 1000 rays for each view this means that we have a total of 1000 views.
If we distribute uniformly these views around the bounding sphere it means that, remember that theta goes from 0 to pi and phi goes from 0 to 2pi:
In my case I can use half of phi because in my case the problem is simpler and the direction is invariant to orientation (so I have theta that goes from 0 to pi and phi that goes from 0 to pi):
Theta x phi = 3,14 x 3,14 = 9,8596 -> total space
Uniform distribution of views:
9,8596/1000 views = 0,009859 rad = 0,56° between two views.
I think it is a reasonable error and it can be easily interpolated.
In your case, however, you can't reduce the range of phi (because rgb is not invariant to orientations), so you will have:
Theta x phi = 3,14 x 6,28 = 19,719 -> total space
19,719/1000 views = 0,0197 rad = 1,128° between two views.
An error of 1° is reasonable also in your case and if my calculations are exact, it could work even in your case but the error is the double of mine.

Matteo Balice (Jul 28 2024 at 14:57):

I believe we should try this approach using NIF architecture and then interpolating them.

Matteo Balice (Jul 28 2024 at 15:10):

@Sean I have a doubt about rays and pixels: if we want to render a frame of 100x100 it means that the algorithm sample for each pixel a ray (for a total of 10k rays) or is it different?

Sean (Jul 28 2024 at 23:13):

Matteo Balice said:

I have not rendered any frame for gaussian splatting network yet because if the percentage are under 0.97 - 0.98, the predicted image is very far from the true one. I prefer first to achieve better results and then plotting it.

That is absolutely not best practice and not recommended to ignore the predicted images solely based on having low percentages. Not looking gives you no information. Looking may give no information, or may provide helpful clues as to what isn't encoding well.

It can be high frequency detail, it can be a straight up bug where values are simply shifted, it can be low frequency undulations, and more. It's not in your interest to ignore them even if 19 times out of 20 it's just a "drunk wet mess".

Sean (Jul 28 2024 at 23:18):

Matteo Balice said:

I believe we should try this approach using NIF architecture and then interpolating them.

I will just reiterate what we'd discussed earlier, that there should be two different approaches being taken (in general), or even better two different goals (e.g., 3d shape vs 2d image). There is value in exploring the same method with different implementation detail (on the off chance there is some detail that matters more than anticipated).

Sean (Jul 28 2024 at 23:22):

Matteo Balice said:

This idea is not far from the idea of grid encoding of direction which I have already tried without any improvement. In that case I used embeddings for each direction, you suggest using a network for each direction. I can try it. :thumbs_up:

It's not far off, but the separate networks is the key. AMD really proved that surface illumination can be almost perfectly encoded for a given view. Maybe if we were to first reproduce their research, that would give more confidence, but lacking that it's not terribly unexpected.

Now that said, I don't think there's a whole lot of difference with random rays in random dirs -- naively I think that can work with the right network and right amount of training. Remains to be proven though. The idea with the 32+ grid views, however, is a compromise, banking on the notion that a single view should converge that view. In other analysis work we're involved with, there's mathematical proofs that lend evidence towards 32 views being a sweet spot approximation for complete random, converging much faster than pure random.

Sean (Jul 28 2024 at 23:25):

Matteo Balice said:

fall Rainy ha scritto:

Interpolated fits between different perspectives have been done before:
2024-07-28-212847.png
from VIINTER: View Interpolation with Implicit Neural Representations of Images

This could work but it means that we have to change the sampling to fixed view sampling. I have some doubts about the number of views we should sample because:
If we choose for example 1 million rays in the training set and we choose a reasonable number of 1000 rays for each view this means that we have a total of 1000 views.
If we distribute uniformly these views around the bounding sphere it means that, remember that theta goes from 0 to pi and phi goes from 0 to 2pi:
In my case I can use half of phi because in my case the problem is simpler and the direction is invariant to orientation (so I have theta that goes from 0 to pi and phi that goes from 0 to pi):
Theta x phi = 3,14 x 3,14 = 9,8596 -> total space
Uniform distribution of views:
9,8596/1000 views = 0,009859 rad = 0,56° between two views.
I think it is a reasonable error and it can be easily interpolated.
In your case, however, you can't reduce the range of phi (because rgb is not invariant to orientations), so you will have:
Theta x phi = 3,14 x 6,28 = 19,719 -> total space
19,719/1000 views = 0,0197 rad = 1,128° between two views.
An error of 1° is reasonable also in your case and if my calculations are exact, it could work even in your case but the error is the double of mine.

Again, I would just caution whether we're following research that is attempting to capture shape in the embedding or whether the goal is capturing the shape just barely enough that color, i.e., a visual image can be constructed that "looks good enough". On quick read, that VINTER paper appears to be the latter, but I'd have to read it in more detail.

Sean (Jul 28 2024 at 23:28):

As for total number of views, I think you could try as coarse as 45-degree increments. Resolution will need to be as fine as the smallest detail, which depends on the model size and detail complexity. I'd personally start at 1024x1024, about 1M per view.

Sean (Jul 28 2024 at 23:29):

That 1024^2 resolution at 45-degree probably means something like 256 or 512 resolution alignment, whatever that cell size resolves to.

Matteo Balice (Jul 29 2024 at 08:52):

Sean ha scritto:

Matteo Balice said:

I have not rendered any frame for gaussian splatting network yet because if the percentage are under 0.97 - 0.98, the predicted image is very far from the true one. I prefer first to achieve better results and then plotting it.

That is absolutely not best practice and not recommended to ignore the predicted images solely based on having low percentages. Not looking gives you no information. Looking may give no information, or may provide helpful clues as to what isn't encoding well.

It can be high frequency detail, it can be a straight up bug where values are simply shifted, it can be low frequency undulations, and more. It's not in your interest to ignore them even if 19 times out of 20 it's just a "drunk wet mess".

Yes, you are totally right. I did not considered to render it because I got that result with only 100K rays and I wanted first to train the NN with at least 1 million rays. Thanks for the advise.

Matteo Balice (Jul 29 2024 at 08:57):

Sean ha scritto:

Matteo Balice said:

This idea is not far from the idea of grid encoding of direction which I have already tried without any improvement. In that case I used embeddings for each direction, you suggest using a network for each direction. I can try it. :thumbs_up:

It's not far off, but the separate networks is the key. AMD really proved that surface illumination can be almost perfectly encoded for a given view. Maybe if we were to first reproduce their research, that would give more confidence, but lacking that it's not terribly unexpected.

Now that said, I don't think there's a whole lot of difference with random rays in random dirs -- naively I think that can work with the right network and right amount of training. Remains to be proven though. The idea with the 32+ grid views, however, is a compromise, banking on the notion that a single view should converge that view. In other analysis work we're involved with, there's mathematical proofs that lend evidence towards 32 views being a sweet spot approximation for complete random, converging much faster than pure random.

I was wondering... Isn't the limit of the previous NN (NIF network which I got 0.994 for accuracy) was simply that the number of examples in the training set was too low? Maybe we should try with even more samples (more than 1 million) to see how it works.

Matteo Balice (Jul 29 2024 at 21:45):

Today I had implemented the adaptive learning for Gaussian splatting architecture. I want first to finish and evaluate this architecture before trying with multiple NIFs.

fall Rainy (Jul 31 2024 at 08:38):

After many optimizations, I got a pretty good result(grid net, fix direction)
2024-07-31-163626.png

Sean (Jul 31 2024 at 16:20):

@fall Rainy please elaborate, (and point to latest code!) what's the resolution of the grid net, what are the layers, how many epochs, how long did training take, how long does lookup take, etc...

Sean (Jul 31 2024 at 22:09):

Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen

Matteo Balice (Jul 31 2024 at 22:12):

Sean said:

Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen

I have some experience with diffusion models for 3D. I did a diffusion network for automatic retopology (for my university).

Matteo Balice (Jul 31 2024 at 22:13):

Regarding Gaussian splatting, I am not so convinced about metrics, tomorrow I will render some frames to see graphical results…

Sean (Aug 01 2024 at 06:02):

Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh

Matteo Balice (Aug 01 2024 at 11:07):

Sean ha scritto:

Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh

This is fantastic, I just tried the software of https://github.com/NVlabs/instant-ngp (which is the base code they use) and the training time is of the order of seconds even with my poor GTX 1060!!

starseeker (Aug 01 2024 at 11:48):

Nuts. instant-ngp code license is non-commercial only

Matteo Balice (Aug 01 2024 at 11:57):

starseeker said:

Nuts. instant-ngp code license is non-commercial only

It’s not so bad, all we need to do is understand their paper and the one that Sean sent. The coding part shouldn’t be difficult even if we have to code from 0.

Sean (Aug 01 2024 at 14:00):

Yeah, the implementation seems pretty straightforward. The main limitation is they didn't get a performance gain. They train for a few minutes, then ray query performance is on par with the ray tracing time. The one big gain they saw was getting that on-par ray tracing time with an order of magnitude less memory use. So exceptional compression in the latent space.

fall Rainy (Aug 01 2024 at 14:08):

Sean said:

fall Rainy please elaborate, (and point to latest code!) what's the resolution of the grid net, what are the layers, how many epochs, how long did training take, how long does lookup take, etc...

There are three resolutions:
first,I give up bilinear interpolation and try to learn a matrix to express the relationship between neighboring vectors
second, I consider neighboring vectors in the range 7×7 instead of 2×2
third, I use a threshold to reduce noise.
here is my codes: https://github.com/Rainy-fall-end/Rendernn/blob/main/networks/gridnet3.py
100,000 points need to be sampled, but the model actually converges when 20,000 points are used，The training will take 2 minutes total.(4060)
The yellow curve is the improved gridnet, the purple one is the original gridnet.
2024-08-01-220614.png

fall Rainy (Aug 01 2024 at 14:09):

Sean said:

Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh

But this one looks so much better than mine....

Sean (Aug 01 2024 at 14:09):

Cool, thanks! I'll take a look in more detail. Looks like it converges pretty quickly? How long did it take to get to step 200?

Sean (Aug 01 2024 at 14:10):

fall Rainy said:

Sean said:

Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh

But this one looks so much better than mine....

Don't worry about that -- this is very ripe area of research.

fall Rainy (Aug 01 2024 at 14:10):

About 30 seconds.

Sean said:

Cool, thanks! I'll take a look in more detail. Looks like it converges pretty quickly? How long did it take to get to step 200?

Sean (Aug 01 2024 at 14:10):

It also means it's worth exploring all avenues as the details matter.

fall Rainy (Aug 01 2024 at 14:15):

Matteo Balice said:

Sean ha scritto:

Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh

This is fantastic, I just tried the software of https://github.com/NVlabs/instant-ngp (which is the base code they use) and the training time is of the order of seconds even with my poor GTX 1060!!

This looks like it's implemented in C++, are you going to reproduce it in pytorch?

Matteo Balice (Aug 01 2024 at 14:16):

fall Rainy ha scritto:

Matteo Balice said:

Sean ha scritto:

Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh

This is fantastic, I just tried the software of https://github.com/NVlabs/instant-ngp (which is the base code they use) and the training time is of the order of seconds even with my poor GTX 1060!!

This looks like it's implemented in C++, are you going to reproduce it in pytorch?

I am still understanding all the ideas of those two papers, but yes probably I will reproduce in pytorch.

Sean (Aug 01 2024 at 15:53):

The N-BVH paper was an outstanding talk -- will see if I can get a copy, but it's a direct response to AMD's NIF paper. They key insight was to not use the ray+dir but to instead use 3 sample points along the ray on the interior of the bounding volume along with a BVH.
https://weiphil.github.io/portfolio/neural_bvh

Sean (Aug 01 2024 at 15:55):

The key insight of using interior points was demonstrated with just 3-10 sample points which of course made training slower as points are added, but achieved essentially perfect occupancy recall even with high frequency detail. Adding in a BVH was a training optimization so they could reduce it back down to just 3 points per BVH node.

Matteo Balice (Aug 01 2024 at 21:59):

I was studying that paper and had a few questions:

They mention using a batch size of 2^18 rays, which is 262,144.
They use a total of 100 batches.

This means their training set consists of 2^18×100, equating to approximately 26 million rays. Moreover, they state that the training time is at most 2-3 minutes.

I just discovered that it is indeed possible to use such a large batch size (I had been training NIF with a batch size of just 512 rays until now...). However, theory suggests that using a large batch size can increase variance error. Likely, with such a large training set, they do not encounter this issue.

What puzzles me most is how they can achieve convergence in just about 2-3 minutes.

To investigate, I increased the batch size for the NIF model (with which I previously achieved an accuracy of 0.994). As expected, the model trains faster (20 minutes for 1 million rays), and the performance metrics remained roughly the same (just a little worse).

Matteo Balice (Aug 01 2024 at 22:02):

However, convergence for NIF is much slower than their approach, meaning that training a simple NIF requires more time and results in lower accuracy. Likely, the bounding volume hierarchies approach they use significantly helps the neural network.

Sean (Aug 02 2024 at 03:17):

That 262144 is almost certainly a 512x512 grid, 100 different views

Sean (Aug 02 2024 at 03:17):

Now have to take performance with a grain of salt. I didn't see the hardware, but it they're on a high-end GPU, that might be an hour of training on a CPU..

Sean (Aug 02 2024 at 03:24):

In their talk the BVH optimization using 3 samples vs 10 samples cut the training time roughly in half (1min)

Matteo Balice (Aug 02 2024 at 08:17):

Sean ha scritto:

Now have to take performance with a grain of salt. I didn't see the hardware, but it they're on a high-end GPU, that might be an hour of training on a CPU..

They used an RTX 3090

Matteo Balice (Aug 02 2024 at 08:18):

How do you estimate that it would be an hour on a CPU?

Matteo Balice (Aug 02 2024 at 12:05):

I decided to implement first the multi-resolution hash grid and validate it, then on top of this I will add the BVH approach so as to integrate rays in the 3D grid.

Matteo Balice (Aug 02 2024 at 17:09):

I found this wonderful repository https://github.com/ashawkey/torch-ngp and I am using this as a base code for the multi-resolution hash grid.

Matteo Balice (Aug 02 2024 at 21:11):

I implemented the multi-resolution hash grid and I tested it on top of my previous NIF network. It's incredible how it converges in just few seconds even on my poor gpu.
It achieves only 96% of F1 but it was expected since this encoding is not appropriated for rays+dir input (as we have in NIF) but they are appropriated for 3D points (like in NVBH).
Tomorrow I will focus on this part :+1:

Matteo Balice (Aug 03 2024 at 18:06):

Figure_1.png

Figure_2.png

Today I tried the multi-resolution hash grid sampling N points along the the ray. In this picture I sampled 70 points along each ray. I want to mention that this is without the BVH approach but only the multi-grid resolution. The metric F1 is about 0.985 and it converges in few seconds. The training set was about 3 million rays and the picture is 512x512. I noticed that increasing the number of the sampling points, the metrics are better (and without the BVH approach I have to use a lot of sampling points).

Matteo Balice (Aug 03 2024 at 18:07):

Before implementing the BVH approach I want to try with more samples (like the paper -> 26 millions).

Matteo Balice (Aug 04 2024 at 08:55):

Figure_1.png
Here there is the prediction of that image training with only this fixed direction just to prove that this model is perfectly able to discretize IF the number of training samples are sufficient.

Matteo Balice (Aug 04 2024 at 08:55):

Now I will train with 26 million samples.

Matteo Balice (Aug 04 2024 at 15:14):

@Sean @Erik Generating 26 million samples takes me several hours. Are there any ways to optimize the ray tracing algorithm in BRL-CAD to speed up the process?

Matteo Balice (Aug 04 2024 at 19:46):

Figure_1.png
This is with 6 millions rays.

Matteo Balice (Aug 05 2024 at 18:34):

Figure_1.png
This is with 6 million rays BUT generated from 22 different and fixed views (512x512x22) instead of using random sampling. Metrics are higher with this sampling: I got 0.997 in both accuracy and F1.

Sean (Aug 05 2024 at 21:28):

Curious, @Matteo Balice why 22? That’s what, 16degrees or so in one axis of rotation?

Matteo Balice (Aug 05 2024 at 21:38):

Well that is not exactly 22 views. I finally successfully sampled 100 views (thanks to my Mac M1) each with 512x512 rays for a total of 26 millions. However when I load all these rays on PyTorch I go out of memory… so I decided to cut only 6 millions rays.

Matteo Balice (Aug 05 2024 at 21:39):

But before cutting I randomly shuffle the 26 millions rays, so it is not right to say that they are 22 views.

Matteo Balice (Aug 05 2024 at 21:40):

Now I am trying a way to load more samples

Matteo Balice (Aug 05 2024 at 21:43):

So it’s better to say that we have 60k rays for each view (100 views)

Matteo Balice (Aug 05 2024 at 22:35):

Figure_1.png
10 millions. Getting better :mechanical_arm:

Matteo Balice (Aug 05 2024 at 22:37):

Even though there is no much difference with 6 millions... But I noticed that with 10 millions I did not converge in 10 epochs like in 6 millions case... Probably I need to train more.

Matteo Balice (Aug 05 2024 at 22:49):

(just to be clear: I always print this view because I noticed it’s one of the most difficult to render, but the model is able to render also all the others views)

fall Rainy (Aug 06 2024 at 03:15):

Matteo Balice said:

Sean Erik Generating 26 million samples takes me several hours. Are there any ways to optimize the ray tracing algorithm in BRL-CAD to speed up the process?

You can consider using multithreading

fall Rainy (Aug 06 2024 at 03:18):

Sean said:

Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen

But I remember brl-cad doesn't seem to be based on b-rep modeling?

fall Rainy (Aug 06 2024 at 14:09):

A great imple implementation of the hash encoding:HashNeRF. It is sad to find that many of my ideas have already been realized, but I'll finish my other ideas on that basis

Matteo Balice (Aug 06 2024 at 20:09):

Figure_4.png
This is with 26 million samples. I had to increase the resolution of hash grid but a lot of white dots appears. It seems that we need to add also the BVH approach so as to delete all these noisy dots.

Matteo Balice (Aug 06 2024 at 20:09):

The F1 metric is improved to 0.998

Sean (Aug 06 2024 at 22:26):

Don't quite understand how you'd run out of memory.. Is that with replicated view information or with offsets?

Sean (Aug 06 2024 at 22:30):

Even with independent 6 doubles (xyz+dir), that should be about 1.1GB, and depending on how that's encoded, that could be reduced to just 4 floats (azel on bounding sphere + azel direction) which is about 400MB for 100x512x512 views.

Matteo Balice (Aug 06 2024 at 22:36):

Because when I loaded the dataset, I computed 70 points xyz along each rays. So I had in memory 70 points xyz for 26 millions rays

Matteo Balice (Aug 06 2024 at 22:36):

Now I compute these 70 points only in the forward method of the neural network (so only for the batch).

Matteo Balice (Aug 06 2024 at 22:36):

For the current batch

Matteo Balice (Aug 06 2024 at 22:37):

Tomorrow I will upload the code on GitHub

Sean (Aug 06 2024 at 22:42):

fall Rainy said:

Sean said:

Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen

But I remember brl-cad doesn't seem to be based on b-rep modeling?

Yes and no, @fall Rainy ... BRL-CAD does have support for BREP models. They import, display, and raytrace. There's even some basic tessellation (conversion) and preliminary export support. There's just not much support yet for editing and we want ray tracing performance to be better before we push it harder.

It's fundamentally no different than all the other primitives, can be used in boolean expressions (which raytrace just fine), can be volumetric/solid or plate-mode like meshes. There's also some direct Boolean evaluation support which is converting BREP used in CSG expressions to BREP without CSG, but that work is incomplete.

What's really cool about that paper is the figured out how to encode solid geometry (in BREP form) into a NNet. Not only is that a general concept that extends to other geometry forms, it's a way to actually encode CAD in the latent space, not just SDFs or volume grids or radiance fields.

Sean (Aug 06 2024 at 22:44):

Matteo Balice said:

Now I compute these 70 points only in the forward method of the neural network (so only for the batch).

Why 70 points?? The paper demonstrated complete convergence with less than 10...

Matteo Balice (Aug 06 2024 at 22:44):

Sean said:

Matteo Balice said:

Now I compute these 70 points only in the forward method of the neural network (so only for the batch).

Why 70 points?? The paper demonstrated complete convergence with less than 10...

Yes but because they use the BVH

Matteo Balice (Aug 06 2024 at 22:44):

I don’t have implemented it yet

Matteo Balice (Aug 06 2024 at 22:45):

This will be the next step

Sean (Aug 06 2024 at 23:13):

If I recall correctly, they did not use BVH in their first iterations -- they went from 3 points to 10 points to get convergence.

Sean (Aug 06 2024 at 23:13):

They introduced a BVH to make the performance of 10 points take less time than the original 3 points.

Sean (Aug 06 2024 at 23:14):

So in theory, it should converge just fine with 10 points, just not quickly. Also means 70 should converge, but in 7x time or more.

Matteo Balice (Aug 07 2024 at 08:36):

Figure_6.png
Here there is the figure with only 10 points... As you can see it's pretty weird and the F1 metric is only about 0.97... Instead, with 70 points I got 0.998 of F1

Matteo Balice (Aug 07 2024 at 08:38):

The accuracy should be higher (according to the paper) IF the points are sampled near the surface. So even with 10 points it should be ok IF they are sampled near the surface. But how can we guarantee that they are sampled there if we do not know the intersection of the ray with the surface?

Matteo Balice (Aug 07 2024 at 08:40):

This is the reason why I uniformly sample along the ray (all the points have the same distance along the ray)... And this is the reason why increasing the number of points it converges with higher accuracy.

Matteo Balice (Aug 07 2024 at 08:42):

If there is some ideas about smarter sampling along the ray it should improve a lot the model...

Matteo Balice (Aug 07 2024 at 08:58):

Matteo Balice ha scritto:

Figure_6.png
Here there is the figure with only 10 points... As you can see it's pretty weird and the F1 metric is only about 0.97... Instead, with 70 points I got 0.998 of F1

Let's think for instance at the torus in the figure. Why is it so ugly? In my opinion, since we have rays that start and end in a bounding sphere, if we sample along these rays there is a medium/high probability that all the points sampled aren't close to the torus since this one is very thin. And this is the reason why the cube and the sphere are better represented (because their volume is larger and so it is easier that the points are closer to the cube/sphere).

Matteo Balice (Aug 07 2024 at 09:01):

In my opinion if we use BVHs that wrap the surface, not only we can use less sampling points but I think that also the accuracy must increase.

Matteo Balice (Aug 07 2024 at 15:14):

@Sean Is the BVH algorithm already implemented in brl-cad?

fall Rainy (Aug 08 2024 at 00:44):

Matteo Balice said:

Figure_4.png
This is with 26 million samples. I had to increase the resolution of hash grid but a lot of white dots appears. It seems that we need to add also the BVH approach so as to delete all these noisy dots.

Add a threshold layer before output may solve this question.

fall Rainy (Aug 08 2024 at 07:54):

I'd like to combine these tricks with the hashencoder to see how much improvement can be gained

fall Rainy said:

Sean said:

fall Rainy please elaborate, (and point to latest code!) what's the resolution of the grid net, what are the layers, how many epochs, how long did training take, how long does lookup take, etc...

There are three resolutions:
first,I give up bilinear interpolation and try to learn a matrix to express the relationship between neighboring vectors
second, I consider neighboring vectors in the range 7×7 instead of 2×2
third, I use a threshold to reduce noise.
here is my codes: https://github.com/Rainy-fall-end/Rendernn/blob/main/networks/gridnet3.py
100,000 points need to be sampled, but the model actually converges when 20,000 points are used，The training will take 2 minutes total.(4060)
The yellow curve is the improved gridnet, the purple one is the original gridnet.
2024-08-01-220614.png

fall Rainy (Aug 08 2024 at 08:02):

Of course, there are some other things that need to be improved

most of the current hashencoder are for Cartesian coordinates, and I'd like to implement it with spherical coordinates(two version, both dir and pos imported as inputs, only pos imported as inputs)
Modification of the Neighborhood Algorithm. The range of spherical coordinate is [0,pi] and [0,2pi]. 0 and pi actually represent the same point
Improved initialization strategy. During my training, I found that I could initialize the output of the model to 0, 0, 0 instead of [127.5,127.5,127.5] and it would be beneficial for the model to converge

Matteo Balice (Aug 10 2024 at 09:50):

Figure_13.png
I finally achieved the 0.9991 of F1 overtraining with 200 points for each ray and 26 millions total rays. I modified the prediction so that for each ray, I take only the maximum of those 200 points and if it is greater than 0.5 it is a hit, otherwise it's a miss. It's like we have a voxel grid, in which for each voxel we have a probability of a hit/miss.

Matteo Balice (Aug 10 2024 at 09:52):

Of course it's much slower BUT I have an idea. We can train the multi-resolution grid in this way (using a lot of points for each ray) but after the training we can take the grid already trained and build another model on top of this (without editing the grid) trying to predict the right "voxel" for each ray. We can leverage the fact that closest rays have closest "voxels" hit.

Matteo Balice (Aug 10 2024 at 09:54):

And trying to build an hashmap (similar as the grid encoding) so that for each input ray we have an O(1) complexity to retrieve the right voxel and then the inference process would be very fast!

Matteo Balice (Aug 10 2024 at 09:54):

If this works, it could be even faster than the paper of nvbh!

Sean (Aug 10 2024 at 15:43):

Matteo Balice said:

Sean Is the BVH algorithm already implemented in brl-cad?

Yes there is, see src/librt/cut_hlbvh.*

Sean (Aug 10 2024 at 15:45):

See it in use in clt_prep() in src/librt/prep.cpp

Sean (Aug 10 2024 at 15:48):

fall Rainy said:

Improved initialization strategy. During my training, I found that I could initialize the output of the model to 0, 0, 0 instead of [127.5,127.5,127.5] and it would be beneficial for the model to converge

That's one of the optimizations mentioned in the paper -- the model is not only normalized in position, but also scaled/centered/normalized in size also. So values are all 0 to 1 or -1 to 1 for XYZ. That was pretty essential in limiting the training space.

Sean (Aug 10 2024 at 15:56):

Matteo Balice said:

Figure_13.png
I finally achieved the 0.9991 of F1 overtraining with 200 points for each ray and 26 millions total rays. I modified the prediction so that for each ray, I take only the maximum of those 200 points and if it is greater than 0.5 it is a hit, otherwise it's a miss. It's like we have a voxel grid, in which for each voxel we have a probability of a hit/miss.

Please show the code for what you're doing here? That's definitely interesting results, but I'm still not understanding why you need 70 or 200 sample points. I get that it's sampling like a voxel grid, but that shouldn't be necessary (and degenerates to a simple grid query). Implies some fundamental difference in setup or evaluation. What's the network you're using at this point?

Matteo Balice (Aug 10 2024 at 15:57):

Sean ha scritto:

Matteo Balice said:

Figure_13.png
I finally achieved the 0.9991 of F1 overtraining with 200 points for each ray and 26 millions total rays. I modified the prediction so that for each ray, I take only the maximum of those 200 points and if it is greater than 0.5 it is a hit, otherwise it's a miss. It's like we have a voxel grid, in which for each voxel we have a probability of a hit/miss.

Please show the code for what you're doing here? That's definitely interesting results, but I'm still not understanding why you need 70 or 200 sample points. I get that it's sampling like a voxel grid, but that shouldn't be necessary (and degenerates to a simple grid query). Implies some fundamental difference in setup or evaluation. What's the network you're using at this point?

I'm going to upload the code in half an hour more or less.

Matteo Balice (Aug 10 2024 at 16:08):

@Sean @fall Rainy https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh

Matteo Balice (Aug 10 2024 at 16:09):

We can summarize the neural network in this picture:

image.png

Matteo Balice (Aug 10 2024 at 16:10):

In the prediction (forward method), I take the ray, sample n points along the ray, then I pass this to the encoder and finally the embedding are scaled to a probability [0;1]

Matteo Balice (Aug 10 2024 at 16:12):

In the end, for each ray I take the maximum for all the points sampled and this works well because:

if the ray is a miss, all the points must have a probability less than 0.5 so taking the maximum it's ok.
if the ray is a hit, there should be at least one voxel in the ray that has a probability greater than 0.5 and using the maximum it's ok.

Matteo Balice (Aug 10 2024 at 16:15):

This neural network works well in the training of the multi-resolution hash grid. My idea is to use the trained grid of this network as a base for another network much simpler and that uses less points. Do you have some ideas how to achieve this result (I proposed the idea of the hashmap; the paper used the BVH approach for instance)?

Matteo Balice (Aug 10 2024 at 16:31):

I would also remark some differences about the nvbh paper:
image.png

They used an MLP network with 4 hidden layers (each with relu) and a sigmoid as output but I simply use one single sigmoid layer as output without any hidden layer. (better in my network)
They used 8 levels for the grid, I use 4 levels. (better in my network)
They used a very high base resolution (8^3=512), I simply use 32 or 16 as a base resolution. (better in my network)
They used 4 features per level, I use 8 features. I have to try with 4 features. (better in their network)

Matteo Balice (Aug 10 2024 at 16:35):

Overall my network is much faster than theirs (if we do not take into consideration the sampling points part).

Matteo Balice (Aug 11 2024 at 19:29):

Screenshot-2024-08-11-alle-21.27.55.png
Screenshot-2024-08-11-alle-21.28.26.png
Screenshot-2024-08-11-alle-21.28.51.png

I have implemented an hierarchy of bbox like in a tree.

Matteo Balice (Aug 11 2024 at 19:30):

My idea is to leverage this architecture so as to retrieve all the leaf nodes of a given ray.

Screenshot-2024-08-11-alle-21.30.39.png

Matteo Balice (Aug 11 2024 at 19:31):

And moreover, we can precompute all the leaf nodes for all the rays with a given tolerance.

Matteo Balice (Aug 11 2024 at 19:32):

In this way the sampling parts of the NN should be way faster (less points to sample)

fall Rainy (Aug 12 2024 at 15:49):

with hashnet: 1 million data, 1 minute to train(4060)
hashnet.png

Matteo Balice (Aug 12 2024 at 15:53):

fall Rainy ha scritto:

with hashnet: 1 million data, 1 minute to train(4060)
hashnet.png

Does this work with arbitary rays or only with fixed directions?

fall Rainy (Aug 12 2024 at 15:53):

In a small range

Matteo Balice said:

fall Rainy ha scritto:

with hashnet: 1 million data, 1 minute to train(4060)
hashnet.png

Does this work with arbitary rays or only with fixed directions?

Matteo Balice (Aug 12 2024 at 19:37):

I am currently recording predicting times of my network and I got these results (for 1024x1024) with batch size of 8k:

if I sample 200 points I have 0.1004147 s = 100ms
if I sample 3 points (like in the paper) I have 0.123648 s = 124 ms
So it seems that the rendering times are independent from the number of points sampled (and I cannot understand why). It seems pretty strange.

Matteo Balice (Aug 12 2024 at 19:37):

Here in the picture there are the times of the paper:

image.png

Matteo Balice (Aug 12 2024 at 19:39):

In my opinion my network is better in rendering times because I use batch sizes of only 8k. They instead used 260k as batch size.

Matteo Balice (Aug 12 2024 at 19:40):

I have bought an RTX 4070 and in these days I will set up it on my PC

Matteo Balice (Aug 12 2024 at 19:40):

I will re-record times on my new rtx using their same batch size.

fall Rainy (Aug 13 2024 at 10:39):

Matteo Balice said:

I am currently recording predicting times of my network and I got these results (for 1024x1024) with batch size of 8k:

if I sample 200 points I have 0.1004147 s = 100ms

if I sample 3 points (like in the paper) I have 0.123648 s = 124 ms
So it seems that the rendering times are independent from the number of points sampled (and I cannot understand why). It seems pretty strange.

This may be due to the GPU's acceleration in matrix computation

Matteo Balice (Aug 13 2024 at 12:52):

I think I need to compare the same object as in the paper.

Matteo Balice (Aug 13 2024 at 12:52):

And using the same batch size as theirs

Matteo Balice (Aug 13 2024 at 12:53):

Because otherwise results are not comparable

Matteo Balice (Aug 13 2024 at 15:29):

I installed the new RTX 4070, but I discovered today that the power supply is no longer sufficient. I’ve ordered a new power supply, but it hasn’t arrived yet (hopefully it’ll be here by tomorrow), so I’ll be without a PC for a couple of days.

Sean (Aug 14 2024 at 05:12):

It would be very good if you're going to follow their approach @Matteo Balice to see if you can indeed match their results. If you can, then everything you're learning and asserting with different geometry is new insight. If you can't, then that may lead to discovering where there are differences/bugs/issues/assumptions that need to be considered.

Matteo Balice (Aug 14 2024 at 19:30):

img-2025_Ym53QguN.mp4
Today I had implemented a 3D visualizer directly in python.

Matteo Balice (Aug 14 2024 at 19:31):

In this way it's easier to see how much the model is predicting well.

Matteo Balice (Aug 14 2024 at 19:39):

All the frames are rendered with the neural network.

Matteo Balice (Aug 18 2024 at 11:39):

image.png

This is the "statuette" model by the nbvh paper with the memory and render times.
image.png

These are the memory and render times of my network in pytorch:
Memory: 27mb
Render time: 20,68 ms

Regarding the memory, I believe that the grid network I use as base model does not compress a lot the grid itself.
About rendering times, I think they are good because of course python is much slower than C++ (their code is in C++).
Moreover, you have to take into consideration also a small overhead in my rendering time due to the conversion to spherical to cartesian coordinates (because my training set is in spherical coordinates as previous methods), but this can be easily avoided.

Matteo Balice (Aug 18 2024 at 11:41):

About the error, I got the same as the last model (F1 of about 0.9991)

Matteo Balice (Aug 19 2024 at 11:46):

@Sean @fall Rainy I have figured out why in my network using N = 200 has the same speed as N = 3. The difference with the NBVH network is that they encode N = 3 points and then they concatenate all of these 3 points in order so as to include also the direction of the vector. But the main difference is that the input of their network has 3 * number of features for each point.
In my case it is different because I get 200 points BUT I compute in parallel all of these 200 points, so the input of my network is only 1 * number of features for each point because I do not need the direction of the ray (I just have to predict whether the voxel is hit/miss).

photo_2024-08-19_13-41-50.jpg
(sorry for my bad handwriting).

Matteo Balice (Aug 19 2024 at 12:42):

I had a talk on LinkedIn with Philippe Weier, the author of the nbvh paper and he said to me that the inference time of one ray depends on the time of computing the intersection of the ray with the first node + the time of getting the three points + the time of prediction. Then, if it is a miss, you should add the time of computing the intersection of the second node + etc...

I believe that their network can be improved using this approach:
instead of using the bvh, we can simply get N = 201 points and merging them in order in triples of three points (67 triples). In this way we can parallelize all these computations in GPU and take only the informations on the first hit. I think it should be faster than the bvh approach.

fall Rainy (Aug 19 2024 at 12:45):

It sounds a little like volume rendering

Matteo Balice (Aug 19 2024 at 12:46):

Yes, it's like it

Matteo Balice (Aug 19 2024 at 12:47):

Today I was thinking that the multi resolution hash encoding isn't very far from nif network

Matteo Balice (Aug 19 2024 at 12:47):

For example the grid encode they use has the bilinear interpolation

Matteo Balice (Aug 19 2024 at 12:48):

But it is more like voxels

fall Rainy (Aug 19 2024 at 12:48):

Maybe attach more information, like whether the point is visible or not, and then do an integration at the end

fall Rainy (Aug 19 2024 at 12:49):

Matteo Balice said:

Today I was thinking that the multi resolution hash encoding isn't very far from nif network

Yes, it has the advantage of being more efficient in the use of voxels

Matteo Balice (Aug 19 2024 at 12:50):

fall Rainy ha scritto:

Maybe attach more information, like whether the point is visible or not, and then do an integration at the end

Yes

fall Rainy (Aug 19 2024 at 12:50):

I'd like to make some attempts as well, do you currently have the code for that?

Matteo Balice (Aug 19 2024 at 12:53):

No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh

fall Rainy (Aug 19 2024 at 12:54):

Actually, I think it's kind of like a combination of nbvh and nerf.

fall Rainy (Aug 19 2024 at 12:56):

Matteo Balice said:

No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh

Thx

Matteo Balice (Aug 21 2024 at 10:38):

I got another improvement in both rendering time and memory. Talking with the author of nbvh he suggested me to use only 1 number as the dimension of each level (and he was right because I only need to predict visibility ie the shape of the object).
Here it is the grid:
image.png

Moreover, now the statistics are:

memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere, otherwise the time is lower)

These statistics are indipendent from the object encoded. I have tried with moss.g and statuette (one of their object) and the reason is very simple: I do not use any bvh, so every ray will have the same inference time IF the grid is the same between object. The only thing that can change between objects is the metric F1, because some objects can have small details that are more complex to encode but using this grid is ok both for moss.g and statuette.

Matteo Balice (Aug 21 2024 at 10:40):

I have written some metrics here if you are interested:

NBVH
hardware: RTX 4070 12GB
batch size: 2^14 (16'384 rays)

Both models achieve 0.9991 F1 with 26 millions rays in about 10 minutes
memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere)

MOSS MODEL
training set: 1 million rays
training time: 47 seconds
F1: 0.9958

training set: 5 million rays
training time: 4 min 10 sec
F1: 0.9974

STATUETTE MODEL
training set: 1 million rays
training time: 40 seconds
F1: 0.9942

training set: 5 million rays
training time: 4 min
F1: 0.9972

fall Rainy (Aug 21 2024 at 15:06):

Matteo Balice said:

No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh

I read these codes. It still looks like gridnet&hashnet? not nbvh

Matteo Balice (Aug 21 2024 at 15:07):

fall Rainy ha scritto:

Matteo Balice said:

No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh

I read these codes. It still looks like gridnet&hashnet? not nbvh

Yes I call it nbvh but it is only hashnet and gridnet

fall Rainy (Aug 21 2024 at 15:08):

Matteo Balice said:

I got another improvement in both rendering time and memory. Talking with the author of nbvh he suggested me to use only 1 number as the dimension of each level (and he was right because I only need to predict visibility ie the shape of the object).
Here it is the grid:
image.png

Moreover, now the statistics are:

memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere, otherwise the time is lower)

These statistics are indipendent from the object encoded. I have tried with moss.g and statuette (one of their object) and the reason is very simple: I do not use any bvh, so every ray will have the same inference time IF the grid is the same between object. The only thing that can change between objects is the metric F1, because some objects can have small details that are more complex to encode but using this grid is ok both for moss.g and statuette.

what do you mean about 1920×1080?

Matteo Balice (Aug 21 2024 at 15:09):

The resolution of the rendering

Matteo Balice (Aug 21 2024 at 15:11):

I am using your rgb training set now and I have one picture to show you

fall Rainy (Aug 21 2024 at 15:12):

I've actually found out before. And for rgb prediction, dimension=3 is better

Matteo Balice said:

I got another improvement in both rendering time and memory. Talking with the author of nbvh he suggested me to use only 1 number as the dimension of each level (and he was right because I only need to predict visibility ie the shape of the object).
Here it is the grid:
image.png

Moreover, now the statistics are:

memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere, otherwise the time is lower)

These statistics are indipendent from the object encoded. I have tried with moss.g and statuette (one of their object) and the reason is very simple: I do not use any bvh, so every ray will have the same inference time IF the grid is the same between object. The only thing that can change between objects is the metric F1, because some objects can have small details that are more complex to encode but using this grid is ok both for moss.g and statuette.

Matteo Balice (Aug 21 2024 at 15:14):

image.png
The torch.max is not suitable for rgb prediction

fall Rainy (Aug 21 2024 at 15:15):

You can do it with this:

        x = self.output_layers(x)
        x = torch.sigmoid(x) # map output to 0,1
        x = F.threshold(x, 0.1, 0.0) #Reducing noise
        x = x * 255 #map output to 0,255

Matteo Balice (Aug 21 2024 at 15:16):

Is this differentiable?

fall Rainy (Aug 21 2024 at 15:16):

Yes(I think so

Matteo Balice (Aug 21 2024 at 15:16):

Because I tried something similar but it wasnt differentiable

Matteo Balice (Aug 21 2024 at 15:17):

Have you tried rgb with hashnet&gridnet?

fall Rainy (Aug 21 2024 at 15:17):

This is my res:
91aa8fe67e981e4ed47dc11c6bda546.png

Matteo Balice (Aug 21 2024 at 15:18):

Does it work with all directions?

fall Rainy (Aug 21 2024 at 15:18):

with 1million rays to train

fall Rainy (Aug 21 2024 at 15:18):

Just for fixed direction

Matteo Balice (Aug 21 2024 at 15:19):

Ah ok

fall Rainy (Aug 21 2024 at 15:19):

For all directions, I'm getting similar results to you.

fall Rainy (Aug 21 2024 at 15:20):

For all directions, my predictions for binary classification actually turned out pretty well

fall Rainy (Aug 21 2024 at 15:20):

I'll organize the results tomorrow.

Matteo Balice (Aug 21 2024 at 15:20):

ok good

fall Rainy (Aug 21 2024 at 15:21):

Matteo Balice said:

image.png
The torch.max is not suitable for rgb prediction

This image doesn't look like it's been rendered by brl-cad

Matteo Balice (Aug 21 2024 at 15:21):

it's in python

fall Rainy (Aug 21 2024 at 15:21):

OK, got it

Matteo Balice (Aug 21 2024 at 15:21):

they are rgb colors of the pygame renderer

Matteo Balice (Aug 21 2024 at 15:22):

it's a python library

fall Rainy (Aug 21 2024 at 15:22):

I will show all of my results here: neural rendering

fall Rainy (Aug 21 2024 at 15:23):

Matteo Balice said:

they are rgb colors of the pygame renderer

The resulting graph is very nice.

Matteo Balice (Aug 21 2024 at 15:23):

which graph

fall Rainy (Aug 21 2024 at 15:23):

Matteo Balice said:

image.png
The torch.max is not suitable for rgb prediction

this one

fall Rainy (Aug 21 2024 at 15:24):

Looks more like a point cloud map.

Matteo Balice (Aug 21 2024 at 15:25):

well, this is because at the moment if the ray intersects 2 or more surfaces, my network does not know which color to show

Matteo Balice (Aug 21 2024 at 15:25):

so it seems weird

fall Rainy (Aug 21 2024 at 15:26):

OK, got it

Matteo Balice (Aug 21 2024 at 18:50):

img-2044_owThMErC.mp4

Matteo Balice (Aug 21 2024 at 18:51):

@fall Rainy I believe it's just a matter of choosing the right loss function and the right hyperparameters.

Matteo Balice (Aug 21 2024 at 18:53):

If we add also the position of where the ray intersects the surface, I believe we will achieve a great accuracy.

Matteo Balice (Aug 21 2024 at 19:17):

(The training set is 1 million)

fall Rainy (Aug 22 2024 at 08:17):

for all direction?

Matteo Balice (Aug 22 2024 at 08:17):

yes

fall Rainy (Aug 22 2024 at 08:18):

That's amazing

fall Rainy (Aug 22 2024 at 08:18):

just hashnet/gridnet?

Matteo Balice (Aug 22 2024 at 08:18):

yes

fall Rainy (Aug 22 2024 at 08:19):

Wow

Matteo Balice (Aug 22 2024 at 08:19):

well it is not a surprise

Matteo Balice (Aug 22 2024 at 08:19):

it was already done in the nbvh paper

fall Rainy (Aug 22 2024 at 08:20):

Matteo Balice said:

If we add also the position of where the ray intersects the surface, I believe we will achieve a great accuracy.

I've created such a dataset before, but it didn't work well and I gave up on it

Matteo Balice (Aug 22 2024 at 08:20):

on which network?

fall Rainy (Aug 22 2024 at 08:20):

Matteo Balice said:

it was already done in the nbvh paper

But I don't think you're using the bvh?

fall Rainy (Aug 22 2024 at 08:21):

Matteo Balice said:

on which network?

gridnet

Matteo Balice (Aug 22 2024 at 08:21):

the bvh is useful only to decrease the number of points but the network is exactly hashnet + gridnet

Matteo Balice (Aug 22 2024 at 08:22):

As I said, I use a different approach as a sampling approach (I parallelize the points)

fall Rainy (Aug 22 2024 at 08:23):

OK, got it. What did you do?

Matteo Balice (Aug 22 2024 at 08:23):

I just extendend my previous network with rgb

Matteo Balice (Aug 22 2024 at 08:23):

I take the first hit along the 200 points

Matteo Balice (Aug 22 2024 at 08:23):

And I take the embedding of that voxel to show the rgb color

fall Rainy (Aug 22 2024 at 08:24):

wait a minute. you just pridict hit points?

Matteo Balice (Aug 22 2024 at 08:25):

Well something similar, the network is able to predict itself the voxels hitted or not

Matteo Balice (Aug 22 2024 at 08:25):

Because all the rays that miss the object will put 0 on all the voxels along their path

Matteo Balice (Aug 22 2024 at 08:27):

Instead, if we are talking about a ray that is a hit, there must be at least one voxel along the path that it is a hit.

fall Rainy (Aug 22 2024 at 08:27):

How many parameters did you enter? Spherical or Cartesian coordinates?

Matteo Balice (Aug 22 2024 at 08:27):

All the points along the path are in cartesian coordinates

Matteo Balice (Aug 22 2024 at 08:28):

But they are scaled like they are in a sphere of radius 1.

Matteo Balice (Aug 22 2024 at 08:28):

I need this to make hashnet work between range -1,1

Matteo Balice (Aug 22 2024 at 08:28):

Do you want to see the code?

fall Rainy (Aug 22 2024 at 08:28):

yes

Matteo Balice (Aug 22 2024 at 08:29):

Ok some minutes, I am going to upload it.

fall Rainy (Aug 22 2024 at 08:29):

I'm a little confused.

Matteo Balice (Aug 22 2024 at 08:40):

https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh
The file nvbh_rgb.py

fall Rainy (Aug 22 2024 at 08:42):

OK, thx.

fall Rainy (Aug 22 2024 at 08:44):

I'll try it after I submit my final evaluation.

Matteo Balice (Aug 22 2024 at 08:48):

As you can see, it understands the volume by itself

epoch 1
image.png
image.png

epoch 5
image.png

Matteo Balice (Aug 22 2024 at 08:53):

There are still a lot of improvements to do, first I do not encode the direction of the ray

fall Rainy (Aug 22 2024 at 08:56):

Matteo Balice said:

There are still a lot of improvements to do, first I do not encode the direction of the ray

I'm getting more and more confused, wait until I read your code

Matteo Balice (Aug 22 2024 at 08:56):

haha ok

fall Rainy (Aug 22 2024 at 08:57):

If you want to get more message, like distance, you can find it here:

(void)rt_shootray(&a);
*dist = a.a_dist;
*hit_ = a.a_user;

Matteo Balice (Aug 23 2024 at 12:31):

IMG_2047.mp4

Matteo Balice (Aug 23 2024 at 12:33):

@fall Rainy it’s better with the intersection point

Matteo Balice (Aug 23 2024 at 12:34):

This is trained with 10 millions rays but only in 5 epochs

Matteo Balice (Aug 23 2024 at 12:34):

It can be even better

Matteo Balice (Aug 23 2024 at 12:36):

I train in parallel the shape of the object and also the rgb. Probably it is better to separate the two networks because the second one (rgb) depends on the first (shape-> hit/miss)

Matteo Balice (Aug 23 2024 at 12:37):

(Don’t see the fact that the object stretches when I rotate, it’s a bug of the camera…)

Matteo Balice (Aug 23 2024 at 12:39):

If you see below the cube there is an error of the color, probably there aren’t any rays in the training set that go below the cube :joy:

Matteo Balice (Aug 23 2024 at 22:19):

img-2048_FQBUrdcC 2.mp4

Matteo Balice (Aug 23 2024 at 22:20):

I have added also the direction of the ray and as you can see it is essential in the prediction of the color because the same intersection point can have a different color if the direction is different (see how the light on the surface change).

Matteo Balice (Aug 23 2024 at 22:21):

Again, don’t see the fact that the shape stretches.

fall Rainy (Aug 25 2024 at 08:06):

@Matteo Balice I just read your code. Any example for your dataset?

Matteo Balice (Aug 25 2024 at 08:06):

Wait a moment

Matteo Balice (Aug 25 2024 at 08:08):

(deleted)

Matteo Balice (Aug 25 2024 at 08:08):

https://drive.google.com/file/d/1G-HR5PSxtaXoZCaQYEKGXHrx1sTcfSwh/view?usp=drive_link

fall Rainy (Aug 25 2024 at 08:08):

I don't have access to this link.

fall Rainy (Aug 25 2024 at 08:09):

Just sent an access request

Matteo Balice (Aug 25 2024 at 08:09):

I have accepted

fall Rainy (Aug 25 2024 at 08:10):

OK thx

Matteo Balice (Aug 25 2024 at 08:11):

The code you read is not updated. In that code I did not managed with the intersection point.

fall Rainy (Aug 25 2024 at 08:14):

Can you update it now?

Matteo Balice (Aug 25 2024 at 08:14):

But in the dataset there are (for each example i.e each row):

2 values for intersection 1 (spherical coord)
2 values for intersection 2 (spherical coord)
1 value for label (hit/miss 1 or 0)
3 values for rgb
1 value for the intersection point: it is the distance from intersection1 to intersection2 (scaled to 0-1) -> if it is 1 it means that the intersection is exactly on intersection2

Matteo Balice (Aug 25 2024 at 08:14):

fall Rainy ha scritto:

Can you update it now?

Yes just some minutes

Matteo Balice (Aug 25 2024 at 08:22):

Done

fall Rainy (Aug 25 2024 at 08:24):

Got it, Thx

fall Rainy (Aug 25 2024 at 09:58):

What do these two spherical coordinates represent?

Matteo Balice (Aug 25 2024 at 09:58):

the first and second intersection on the bounding sphere

fall Rainy (Aug 25 2024 at 10:05):

It looks like you've trained two networks, one for determining if it's a hit or not, and the other for predicting rgb values

Matteo Balice (Aug 25 2024 at 10:05):

yes

fall Rainy (Aug 25 2024 at 10:15):

I see how you're controlling the direction, the two intersections actually determine the direction of light

Matteo Balice (Aug 25 2024 at 10:24):

yes but it is controlled also by the order of the points sampled close to the hit surface

Matteo Balice (Aug 25 2024 at 10:25):

which they are sampled thanks to the two intersections as you said

fall Rainy (Aug 25 2024 at 10:54):

How do you get this loss function:

def loss_fn(output_label, labels, dist, all_outputs):

    mask = (labels > 0.5).squeeze()

    # Applichiamo la soglia morbida
    indices_first = find_index_with_exponential_decay(all_outputs.view(-1, int(n_points))).view(-1, 1) / (int(n_points))

    # L1 Loss tra gli indici soft e dist
    loss1 = nn.L1Loss()(indices_first[mask], dist[mask])
    loss0 = nn.BCELoss()(output_label, labels)

    return loss0 + loss1

Matteo Balice (Aug 25 2024 at 10:57):

loss0 is simply the loss for hit/miss

Matteo Balice (Aug 25 2024 at 10:57):

loss1 instead is the loss for the intersection point on the surface

Matteo Balice (Aug 25 2024 at 11:00):

These are the steps:
1) I sample n_points on the segment between intersection1 and intersection2
2) I calculate the probability of hit for each point
3) I need to get the first hit point on these n_points but I need to get it in a differentiable way to calculate the loss function, so the function find_index_with_exponential_decay cover this issue

Matteo Balice (Aug 25 2024 at 11:00):

4) Thanks to the hit point prediction I can calculate the distance to the true hit point.

fall Rainy (Aug 25 2024 at 13:57):

I organized your code, after all, it's not good to put all the code in one file:neural_rendering

Matteo Balice (Aug 25 2024 at 14:36):

fall Rainy ha scritto:

I organized your code, after all, it's not good to put all the code in one file:neural_rendering

This morning I did another improvement: I have merged the two networks into one (now it is faster to train) and I have added the true normalized direction as input of the MLP neural network: training with the full dataset there is an average error of 6 for each RGB channel.

Matteo Balice (Aug 25 2024 at 14:37):

We are very close :mechanical_arm:

Matteo Balice (Aug 25 2024 at 14:48):

video-2024-08-25-16-46-36_R9pM18PD.mp4

Matteo Balice (Aug 26 2024 at 07:19):

true.png
pred.png
@fall Rainy this is the current difference if you train the NN (the prediction is computed directly in python)

Matteo Balice (Aug 26 2024 at 07:23):

It seems the prediction is brighter for some reason

Matteo Balice (Aug 26 2024 at 07:24):

@Sean Are there any post production processes in brlcad before saving the render?

Matteo Balice (Aug 26 2024 at 07:27):

Because I cannot explain why the light is brighter :distrust:

Matteo Balice (Aug 26 2024 at 07:33):

(the prediction seems more realistic than the true one :joy: )

fall Rainy (Aug 26 2024 at 07:37):

Matteo Balice said:

true.png
pred.png
fall Rainy this is the current difference if you train the NN (the prediction is computed directly in python)

This is my results(rendered by brl-cad)
1fba315da2f2ce6c381664d32c767a8.png
I think it may be that the rendering parameters are different

Matteo Balice (Aug 26 2024 at 07:41):

fall Rainy ha scritto:

Matteo Balice said:

true.png
pred.png
fall Rainy this is the current difference if you train the NN (the prediction is computed directly in python)

This is my results(rendered by brl-cad)
1fba315da2f2ce6c381664d32c767a8.png
I think it may be that the rendering parameters are different

Is this the prediction?

fall Rainy (Aug 26 2024 at 07:45):

No, the true result

Matteo Balice (Aug 26 2024 at 07:49):

Have you tried training the NN?

fall Rainy (Aug 26 2024 at 07:49):

Yes, I got the same result with you

fall Rainy (Aug 26 2024 at 07:50):

I'm trying to add some of my previous improvements on hashencoder to your network.

Matteo Balice (Aug 26 2024 at 07:51):

Ok you are more expert in rgb than me, I will wait for you

Matteo Balice (Aug 26 2024 at 07:54):

The L1 Loss says that there is an average error of 6 along each RGB channel. But I think that this value is not uniform for each pixel, because if you see on the shadows it is nearly perfect. So I believe that this error is an average between the bright areas (more than 6) and shadows one (almost perfect).

fall Rainy (Aug 26 2024 at 07:56):

Yes, it is easier to train the shadow parts

Matteo Balice (Aug 26 2024 at 07:59):

true

fall Rainy (Aug 26 2024 at 11:33):

I also merge the two networks into one:https://github.com/Rainy-fall-end/neural_rendering/tree/merge_network

Matteo Balice said:

fall Rainy ha scritto:

I organized your code, after all, it's not good to put all the code in one file:neural_rendering

This morning I did another improvement: I have merged the two networks into one (now it is faster to train) and I have added the true normalized direction as input of the MLP neural network: training with the full dataset there is an average error of 6 for each RGB channel.

Matteo Balice (Aug 26 2024 at 11:34):

Good

Matteo Balice (Aug 26 2024 at 11:35):

I have noticed also that the distance loss(loss1) is essential to predict the rgb color

Matteo Balice (Aug 26 2024 at 11:36):

Maybe the error on the color is due to the small error on the loss1?

Matteo Balice (Aug 26 2024 at 11:36):

I don’t think so

fall Rainy (Aug 26 2024 at 11:51):

I think some weight could be added in front of the loss?

def loss_fn3(output_label,labels,output_dists,dists,output_rgb,rgb):
    mask = (labels > 0.5).squeeze()

    # Applichiamo la soglia morbida
    indices_first = find_index_with_exponential_decay(output_dists.view(-1, int(n_points))).view(-1, 1) / (int(n_points))

    # L1 Loss tra gli indici soft e dist
    loss0 = nn.BCELoss()(output_label, labels)
    loss1 = nn.L1Loss()(indices_first[mask], dists[mask])
    loss2 = loss_fn2(output_rgb,rgb)
    return 0.2*loss0 + 0.2*loss1 + 0.6*loss2

Matteo Balice (Aug 26 2024 at 11:52):

Well I don't think it will change the convergence because the encoders of loss2 and the others two losses are separate

Matteo Balice (Aug 26 2024 at 11:53):

In other words, the gradients of rgb is separate from the the others two losses

Matteo Balice (Aug 26 2024 at 11:53):

I believe there is still something missing

Matteo Balice (Aug 26 2024 at 11:54):

Maybe we need to make more complex the encoder of rgb

Matteo Balice (Aug 26 2024 at 12:06):

@fall Rainy If you see my last commit of yesterday I have changed the input of the rgb encoder. There is no need to give in input 5 points but 1 it's ok if we add directly the direction of the ray.

Matteo Balice (Aug 26 2024 at 12:09):

https://github.com/bralani/rt_volume/blob/neural_rendering/src/rt/nvbh/nvbh_rgb.py

Matteo Balice (Aug 26 2024 at 12:57):

pred2.png
@fall Rainy the problem was the training set

Matteo Balice (Aug 26 2024 at 12:57):

The sampling was not uniform along each direction

Matteo Balice (Aug 26 2024 at 12:57):

https://drive.google.com/file/d/1yiyRmts0hboItRuw9VnC1OP1lriIxIOh/view?usp=drive_link
Try with this training set

Matteo Balice (Aug 26 2024 at 13:06):

:partying_face:

fall Rainy (Aug 26 2024 at 13:16):

Amazing, got it

Matteo Balice (Aug 26 2024 at 13:29):

Now it's all about finding the right combination with quality and inference time/memory. I will focus on the first encoder

Matteo Balice (Aug 26 2024 at 13:29):

So you can focus on the rgb encoder

fall Rainy (Aug 26 2024 at 14:53):

I think moss is a bit too simple to do training, may be we should test some of the more complex models

Matteo Balice (Aug 26 2024 at 15:00):

Sure, this is the uniform sampling I did:

// Funzione per calcolare i parametri azimuth ed elevation
std::vector<std::pair<double, double>> uniform_sphere_sampling(int N) {
    std::vector<std::pair<double, double>> angles;
    double phi = (1.0 + std::sqrt(5.0)) / 2.0;  // Costante aurea

    for (int i = 0; i < N; ++i) {
        // Calcolo dell'angolo di elevazione (phi)
        double elevation = std::acos(1 - 2.0 * (i + 0.5) / N);

        // Calcolo dell'angolo di azimut (theta)
        double azimuth = std::fmod(2.0 * M_PI * i / phi, 2.0 * M_PI);

                // Convert the azimuth to degrees
                azimuth = azimuth * 180 / M_PI;
                // Convert the elevation to degrees
                elevation = elevation * 180 / M_PI;

        // Aggiunge la coppia (azimuth, elevation) alla lista
        angles.push_back(std::make_pair(azimuth, elevation));
    }

    return angles;
}


void generate_renders_test_set(int num_renders)
{

    // make a file to write the test set
    FILE* file = fopen("./test_neural2.txt", "w");
    fclose(file);


    // test set
    set_generate_test_set(1);
    set_type(render_type::neural);

    auto para = uniform_sphere_sampling(num_renders);

    for (int i = 0; i < num_renders; i++)
    {
        printf("Rendering %d\n", i);

        do_ae(para[i].first, para[i].second);

        //outputfile = (char*)"./output.png";
        rt_neu::render();
    }
    set_generate_test_set(0);
}

Matteo Balice (Aug 26 2024 at 15:00):

If you want to generate the training set

Sean (Aug 26 2024 at 16:13):

Matteo Balice said:

Sean Are there any post production processes in brlcad before saving the render?

There are not any happening on an image / frame basis, no. Your ambient level appears to approximately match. What's not matching is the intensity from the sole light source itself, like it's being applied with a different scaling factor.

Matteo Balice (Aug 26 2024 at 16:14):

Sean ha scritto:

Matteo Balice said:

Sean Are there any post production processes in brlcad before saving the render?

There are not any happening on an image / frame basis, no. Your ambient level appears to approximately match. What's not matching is the intensity from the sole light source itself, like it's being applied with a different scaling factor.

Yep we have solved it. The error was on the training set due to the lack of samples for each direction :)

Sean (Aug 26 2024 at 16:14):

I caught up and saw that :)

Matteo Balice (Aug 26 2024 at 16:15):

Now it is almost perfect

Sean (Aug 26 2024 at 16:15):

What happens with the 'havoc' object in 'havoc.g' sample?

Sean (Aug 26 2024 at 16:16):

wildly different model

Matteo Balice (Aug 26 2024 at 16:16):

Is this more complex?

Matteo Balice (Aug 26 2024 at 16:18):

I will generate the training set later :)

Sean (Aug 26 2024 at 16:26):

It's much more complex, by about 3 orders of magnitude compared with moss.

Sean (Aug 26 2024 at 16:27):

Still considered a small model, but it has hard features that will be interesting to observe

Matteo Balice (Aug 26 2024 at 17:33):

pygame-window-2024-08-26-19-15-10.mp4

Matteo Balice (Aug 26 2024 at 17:34):

It's strange why there are those noisy dots below the helicopter.

Matteo Balice (Aug 26 2024 at 17:36):

Maybe the training set is small (it contains only 100 frames in 256x256 for a total of 5/6 millions)

Matteo Balice (Aug 27 2024 at 11:49):

During the sampling part (renderings) I have these warnings (and others similar):

Root solver reported 3 intersections != {0, 2, 4} on s.nos5a.i
shooting point (units mm): (16599.899526, 396.927869, 1211.109774)
shooting direction: (-0.947094, -0.274243, -0.166745)
377.103 366.627 86.6928

OVERLAP1: /havoc_front/cannopy/cannopy_wipers/2_wiper/2_wipe1_rubber/2_r.wipe1
OVERLAP2: /havoc_front/cannopy/cannopy_glass/r.glass1
OVERLAPa: dist=(1.91371, 3.2123) isol=2_s.wipe2 osol=2_s.wipe2
OVERLAPb: depth 1.29859mm at (15095.9, -244.662, 1677.06) x105 y148 lvl0

Matteo Balice (Aug 27 2024 at 11:50):

Could it be for this reason that the model isn't able to predict well?

Matteo Balice (Aug 27 2024 at 14:57):

Does a_dist contain always the first intersection even if there are more than 1 intersections of the ray on the surface?

VJOIN1(hit_point, ap->a_ray.r_pt, ap->a_dist, ap->a_ray.r_dir);

@Sean

Matteo Balice (Aug 27 2024 at 15:27):

This is the shape training only with the loss of hit/miss:
image.png

And this is the shape training also with the loss of the distance:
image.png

So the problem must be with the distance... I think that during the sampling part (in brl-cad), if there are more than 1 intersections, a_dist contains the wrong distance and I need only and always the first intersection. Could it be possible?

Sean (Aug 27 2024 at 21:56):

Matteo Balice said:

Does a_dist contain always the first intersection even if there are more than 1 intersections of the ray on the surface?
VJOIN1(hit_point, ap->a_ray.r_pt, ap->a_dist, ap->a_ray.r_dir);
Sean

No, a_dist isn't even set by librt -- that's an application-specific field so apps (e.g., rt) get to use it however they want. So code you're using/calling has set or used a_dist and that's what you'll have to refer to in order to determine what is there.

Sean (Aug 27 2024 at 21:57):

That shape training doesn't look right -- that looks like just the nose geometry. I mean it's great that it's recognizable, but missing a lot. Did you use the 'havoc' object?

Sean (Aug 27 2024 at 21:58):

Ah, and I see from your output reporting that you're using "havoc_front" .. you want "havoc" instead.

Matteo Balice (Aug 27 2024 at 21:58):

No i

Sean said:

That shape training doesn't look right -- that looks like just the nose geometry. I mean it's great that it's recognizable, but missing a lot. Did you use the 'havoc' object?

No I used only the havoc_front

Matteo Balice (Aug 27 2024 at 21:58):

Ah ok I will train with havoc

Sean (Aug 27 2024 at 21:59):

Matteo Balice said:

Could it be for this reason that the model isn't able to predict well?

No, those warnings are innocuous. They are known issues with the model that just affect a few pixels and do not affect the hit distance values or silhouetting at all.

Matteo Balice (Aug 27 2024 at 22:00):

It’s strange

Sean (Aug 27 2024 at 22:00):

Matteo Balice said:

Ah ok I will train with havoc

Yeah, I think it'll be more interesting to use the whole vehicle, especially the long rotors and complexity in certain parts.

Sean (Aug 27 2024 at 22:03):

Rotors are long and thin, which should approach a worst case for sampling.

Sean (Aug 27 2024 at 22:08):

Matteo Balice said:

It’s strange

One is the root solver saying it got 3 hits, which can happen on a couple primitives when rays just graze a surface. We make it print to eventually see if we can special-case the condition but it doesn't affect ray tracing.

The other is a report of a geometry overlap which is a modeling error, which also doesn't affect the hit results you're using.

Matteo Balice (Aug 27 2024 at 22:11):

I will see whether there are the same errors on the full havoc model

Matteo Balice (Aug 27 2024 at 22:11):

We are certain that the hit informations are right due to the first image here

Matteo Balice said:

This is the shape training only with the loss of hit/miss:
image.png

And this is the shape training also with the loss of the distance:
image.png

So the problem must be with the distance... I think that during the sampling part (in brl-cad), if there are more than 1 intersections, a_dist contains the wrong distance and I need only and always the first intersection. Could it be possible?

Matteo Balice (Aug 27 2024 at 22:12):

It shouldn’t be too difficult to understand the distance in those areas below the helicopter. I have seen the true model and it is pretty smooth

Matteo Balice (Aug 28 2024 at 09:32):

pygame-window-2024-08-28-11-30-51.mp4

Matteo Balice (Aug 28 2024 at 09:33):

This is trained with 100 frames 256x256

Matteo Balice (Aug 28 2024 at 09:34):

We need more frames with even higher resolution...

Matteo Balice (Aug 29 2024 at 11:13):

pygame-window-2024-08-29-13-11-50.mp4

Matteo Balice (Aug 29 2024 at 11:15):

I have trained without the loss function of the distance and it has understood by itself the true model. :joy:

Matteo Balice (Aug 29 2024 at 11:15):

In this case it seems that the loss of the distance gets worse the model

Matteo Balice (Aug 29 2024 at 11:15):

The resolution is always 256x256 but trained with 500 frames this time.

Matteo Balice (Aug 29 2024 at 11:16):

@Sean @fall Rainy

Matteo Balice (Aug 29 2024 at 19:47):

pygame-window-2024-08-29-21-45-53.mp4
I changed the loss function and now it is very good :)

Matteo Balice (Aug 30 2024 at 18:56):

Today I did a big improvement in the number of sample points. Depending on the model, you need a fixed tolerance between each sampling point to ensure good accuracy: for example in moss a good tolerance is 0.01 (remember that we should sample in range 0-1 where 0 means that we are sampling on the origin and 1 means we are sampling on the destination on the bounding sphere).
The trivial solution is using the uniform sampling for each ray (it was the solution I used before today). With uniform sampling you need a number of points that is equal to range/tolerance. This is the reason why in moss I need 1/0.01 = 100 points.

Matteo Balice (Aug 30 2024 at 18:58):

But we can do better than this because I tried to plot the distribution of all the rays and it is like this:
distribution-rays-moss.png

Matteo Balice (Aug 30 2024 at 18:59):

It means that in moss it follows more or less a gaussian distribution with a mean 0.7

Matteo Balice (Aug 30 2024 at 19:00):

In the havoc model it is slightly different and it is a gaussian but with mean 0.5:
distribution-rays-helicopter.png

Matteo Balice (Aug 30 2024 at 19:02):

So a first improvement would be to follow these distribution to sample more samples on the mean and less points far from the mean

Matteo Balice (Aug 30 2024 at 19:25):

But we can do even better: we can divide our bounding sphere into cells (choosing an arbitrary resolution) and precompute the distribution for each pair of cell_origin and cell_destination. This way, the distribution no longer follows a Gaussian pattern but instead aligns with the local distribution in those areas. This represents a significant improvement because, not only do we capture the local distribution, but the range will also be reduced for almost all pairs. This is evident, as similar rays will have similar hit distances, as in this case:

distribution-rays-helicopter-local.png

Matteo Balice (Aug 30 2024 at 19:26):

Or like this
distribution-rays-helicopter-local-2.png

Matteo Balice (Aug 30 2024 at 19:38):

Moreover not all the rays have the same length because in a bounding sphere (of ray 1) we have the maximum length for the rays that have exactly length equal to 2 * radius but all the other rays will have a smaller length so we can even sample less points

Matteo Balice (Aug 30 2024 at 19:39):

In these days I will try to merge all these ideas together and I will update you. If you have any other idea let me know :)

Matteo Balice (Aug 30 2024 at 19:51):

intersection-prediction.pdf
This paper for example use a similar approach to my idea but they use an hashmap based on quantization of input (rays origin + rays destination) instead of cells

Sean (Aug 31 2024 at 04:41):

Matteo Balice said:

pygame-window-2024-08-29-21-45-53.mp4
I changed the loss function and now it is very good :)

That is very good indeed! Very nice shape and color registration all around. That's considerably better than I would have expected for that model.

Sean (Aug 31 2024 at 04:44):

So now is that pygame visualizaiton doing real-time inference / lookups from the NNet or do you export the latent space to some fixed representation like mesh or voxels or lightfields, etc.? If they're real-time lookups, what's the lookup rate? Is there a significant diff between havoc and moss?

Sean (Aug 31 2024 at 04:45):

Either way, this is looking very much like you're approaching publication-worthy if you want to take this work to the next level... we could also probably turn it into a feature, depending on how long training takes, how long it takes to load the net, how long inference takes, etc.

Matteo Balice (Aug 31 2024 at 09:08):

Sean ha scritto:

So now is that pygame visualizaiton doing real-time inference / lookups from the NNet or do you export the latent space to some fixed representation like mesh or voxels or lightfields, etc.? If they're real-time lookups, what's the lookup rate? Is there a significant diff between havoc and moss?

They are real time lookups, you can see the code here https://github.com/bralani/rt_volume/blob/neural_rendering/src/rt/nvbh/camera.py. The inference time for each call of the forward method (ie this instruction output_label, _, _, output_rgb = model(input)) takes an average time of 1-2 ms for the havoc model (remember that in pygame the images are 256x256) but due to the limitations of python I have these metrics:
Time to generate rays: 3.155 ms -> time to calculate the origin ray + direction ray for each pixel of the 256x256 camera
Time to trace rays: 22 ms -> bottleneck due to the limitations of python (this includes the time to transfer data from cpu to gpu + the formula of intersection ray with sphere + the conversion to cartesian to spherical coordinates + the inference time + the time to transfer data from gpu to cpu. Since the inference time is only 1-2 ms, you can understand that all the time is taken by previous calculations + transfer time).
Time to render image: 0.99 ms -> time that pygame takes to show the input image
So summing all together it takes something like 26 ms to render one frame (and it can be much better in C++) so I have an average of 36 FPS (in python).

Matteo Balice (Aug 31 2024 at 09:15):

Instead, with moss model I can use a simpler network (with less resolution) and less sampling points (I have not implemented the smart sampling yet, I am still using the uniform one at the moment) and the times are these ones:
Time to generate rays: 3,15ms (the same as before)
Time to trace rays: 15,6 ms (with an inference time that is still of 1-2 ms)
Time to render image: 0.99 ms
In this case I have a total time of 19,75 ms with an average 50 FPS.

Matteo Balice (Aug 31 2024 at 10:51):

There was an error in the trace rays method, I did all the calculations on the cpu and I moved to the gpu just before the inference call. Now I move the rays after their generation and the times are these:

MOSS:
Time to generate rays: 3,15ms (the same as before)
Time to trace rays: 10,19 ms (with an inference time that is still of 1-2 ms)
Time to render image: 0.99 ms
In this case I have a total time of 14,34 ms with an average 70 FPS.

HAVOC:
Time to generate rays: 3,15ms (the same as before)
Time to trace rays: 14,13 ms (with an inference time that is still of 1-2 ms)
Time to render image: 0.99 ms
In this case I have a total time of 18,28 ms with an average 54 FPS.

Matteo Balice (Aug 31 2024 at 10:57):

I know these times are high but I believe they are due to the python overhead

Matteo Balice (Aug 31 2024 at 11:39):

They aren't so high like I was thinking, I have printed rendering times in brl-cad with my mac m1:

MOSS
65536 pixels (256x256) in 0.05 sec=50ms
HAVOC
65536 pixels (256x256) in 0.23 sec=230ms

There is an improvement even using python instead of c++.

Matteo Balice (Aug 31 2024 at 11:42):

Probably this is because in brl-cad all the rays are traced sequentially or am I wrong?

Matteo Balice (Sep 01 2024 at 18:31):

Today I implemented the first of the proposed optimizations to reduce the number of sampling points: now I follow the global rays distribution.

Matteo Balice (Sep 01 2024 at 18:32):

Here you can see an uniform sampling with 50 points.
uniform-distribution.mp4

Matteo Balice (Sep 01 2024 at 18:33):

And here you can see a sampling following the global rays distribution (always 50 points):
global-distribution.mp4

Matteo Balice (Sep 01 2024 at 18:37):

The difference is evident: following the global distribution we are losing something on the boundaries (because there are less rays in those areas) but in all the other parts the resolution is way better.
Following this sampling alone give us some advantages (but also drawbacks like you can see) but merging together with local distribution I expect to improve even more :)

Matteo Balice (Sep 01 2024 at 18:39):

@Sean Regarding the paper, which conferences/journal do you suggest? I don't have experience in this yet

fall Rainy (Sep 02 2024 at 05:27):

Is it possible to consider CVPR2025？:https://cvpr.thecvf.com/Conferences/2025

Sean (Sep 03 2024 at 20:09):

@fall Rainy & @Matteo Balice CVPR is possible and desirable, but the bar is very high. They are the premier conference right now for CV-related work including NNet reconstruction.

A paper there will need to very clearly demonstrate what the significant contributions are of the work, distinct and compared with other recent papers. Both efforts have demonstrated contributions, but will take some work to show how well or different they're performing on a test/benchmark common with some other paper (at least one).

The novel significant contribution will have to be identified first, which is going to be something like some % quality metric, or training convergence metric, or lookup performance metric, or some specific feature that hasn't been demonstrated (like sharp corners or some other measurable improvement).

Sean (Sep 03 2024 at 20:14):

Matteo Balice said:

They aren't so high like I was thinking, I have printed rendering times in brl-cad with my mac m1:

MOSS
65536 pixels (256x256) in 0.05 sec=50ms
HAVOC
65536 pixels (256x256) in 0.23 sec=230ms

There is an improvement even using python instead of c++.

So @Matteo Balice this is outstanding performance that opens up a potential state-of-the-art contribution. I'm not sure that's faster than the last paper that did 3 points -- you'd need to test the same models to really demonstrate that -- but it is a "new" capability if you can demonstrate something that ray traces slowly being looked up much faster. That is, demonstrate this as a viable preprocess acceleration structure for models that are expensive to raytrace.

Towards exploring that, I'd suggest downloading a 3DM (NURBS) model, running 3dm-g and raytracing it. It'll be slow as that's an unoptimized representation, but then you can show how long training takes, and how long inference/generation takes at the same resolution. Suggest 1024x1024 since that's pretty standard and results in convenient 1M primary rays.

Sean (Sep 03 2024 at 20:36):

Here's the stanford bunny in nurbs format that someone made and posted online where you can see how costly it is:
bunny_nurbs.g

For me, it takes about 35 sec to prep the first time (uncached) and 45sec to render at 1024 on my laptop, around 22k rays/sec.

Sean (Sep 03 2024 at 20:37):

Screenshot-2024-09-03-at-4.34.20-PM.png

Matteo Balice (Sep 04 2024 at 09:14):

pygame-window-2024-09-04-11-04-27.mp4
Here here is the 256x256 frames resolution trained with 100 different 512x512 frames (in pygame I have implemented only 256x256 views with a small range of the camera).

Regarding the 1024x1024, these are the times (in python without optimizations):
Time to generate rays: 49,7 ms
Time to trace rays: 12 ms
Time to render image: 3,9 ms
For a total time of 65,6 ms and an average 15,24 FPS. If you compare 65,6 ms (NN) with 66 seconds (true ray tracing on my mac M1), the NN is about 1006 times faster.

The true bottleneck is not the training time or the loading of the net but the generation of the training set since generating 100 frames in 1024x1024 would require 100 x 66 seconds (on my pc) = almost 2 hours... For this reason I have trained only with 100 frames 512x512 (it has taken something like 30 minutes to generate the training set) and about 10 minutes to train it but I have not implemented the optimizations to reduce the number of sampling points yet (local sampling), so I expect that the training time (and inference) can be even smaller.

Matteo Balice (Sep 04 2024 at 14:33):

I wanted to compare also the "statuette" model as in the nbvh paper and in the neural intersection paper, so I downloaded the stl model, converted to .g and ray-traced it. According to the nbvh paper it should take 4,9ms to ray-trace it (1920x1080):

Screenshot-2024-09-04-alle-16.30.49.png

But in brl-cad it takes several several minutes in 1024x1024...
Using only a 256x256 size it takes about 60 seconds:
outputstatuette.png

Why is there this behaviour?

Matteo Balice (Sep 04 2024 at 14:35):

The model .g weights 600mb, while the stl is 500mb

Matteo Balice (Sep 04 2024 at 14:41):

P.S:
Seeing the times to ray trace all those models (according to the nbvh paper) it seems there is a bottleneck in my python code in the generation of the rays since in 1024x1024 it takes me 49 ms and it cannot be so bad.

Matteo Balice (Sep 04 2024 at 16:05):

Matteo Balice ha scritto:

P.S:
Seeing the times to ray trace all those models (according to the nbvh paper) it seems there is a bottleneck in my python code in the generation of the rays since in 1024x1024 it takes me 49 ms and it cannot be so bad.

I was right, there was a bottleneck in the generation of rays, now it is less than 1ms to generate 1M rays so the bunny model is raytraced in about 16-17 ms.

Sean (Sep 10 2024 at 19:49):

@Matteo Balice can you show a 1024x1024 rendering of Bunny? Looking to directly compare both performance (time0 and deviation (pixdiff) for a given view.

Matteo Balice (Sep 12 2024 at 14:46):

bunny_pred.png
bunny_true.png
The time is an average of 17ms but I'm trying to figure out why the light does not look like the true one

Matteo Balice (Sep 12 2024 at 14:48):

The shape is correctly encoded, it seems I have the same problem as in the moss model (the training set was small)

Sean (Sep 12 2024 at 17:22):

thank you @Matteo Balice that's definitely looking production-viable even with the light source issue. How long did training take? How many samples, how much time, etc?

Sean (Sep 12 2024 at 17:25):

I'm wondering if we could try for actual integration in the time remaining, or if the shift to writing up a paper is more in order. For integration, it would entail modifying the 'rt' application to have a training mode where some command-line flag would tell it to train/use the neural net in place of the do_pixel() function (since you have color). Something like rt -N bunny.weights bunny.g bunny

Sean (Sep 12 2024 at 17:26):

If bunny.weights doesn't exist, it trains. If it does, it loads and uses it.

Matteo Balice (Sep 12 2024 at 19:45):

Sean ha scritto:

thank you Matteo Balice that's definitely looking production-viable even with the light source issue. How long did training take? How many samples, how much time, etc?

The training set consisted of 100 images with a resolution of 512x512 (approximately 26 million in total). The model was trained for approximately 5 minutes. The F1 loss for shape prediction was around 0.995. Additionally, the average error in predicting the hit position along the ray was about 0.006 -> shape and hit position almost perfect.
Regarding the RGB prediction, however, I have an average error of 24 (in "rgb" scale).

Each ray samples 200 points (each point has 3 float32 numbers: x y z) along its path, meaning each ray occupies 200 x (3 x 4 bytes (float32)) = 2,3 KB on the GPU. For training, I use a batch size of 2^13 rays per iteration, resulting in 2^13 x 2,3KB = 19.6 MB in memory.
During the prediction process, the larger the batch size, the better the GPU parallelism can be utilized. Therefore, I load all 1024x1024 rays simultaneously, which requires 1024 x 1024 x 2,3 KB = 2,5 GB (my GPU allows it but it's ok to execute more iterations of the set of rays, it will be just a little bit slower).

I believe it is possible to reduce the number of sampling points but I am still working on it, I have not achieved good results for example sampling locally.

Matteo Balice (Sep 12 2024 at 19:48):

I tried also with 500 frames in 256x256 and in this case the light's direction is encoded well but not the intensity.
bunny_1024.png

Matteo Balice (Sep 12 2024 at 19:52):

Sean ha scritto:

I'm wondering if we could try for actual integration in the time remaining, or if the shift to writing up a paper is more in order. For integration, it would entail modifying the 'rt' application to have a training mode where some command-line flag would tell it to train/use the neural net in place of the do_pixel() function (since you have color). Something like rt -N bunny.weights bunny.g bunny

Regarding the integration there are some points to understand first in order to achieve good results in inference time like I had in python:

First of all we need batch processing of multiple rays in parallel (for examples 2^13 rays in parallel). Is there already implemented batch processing of rays in brl-cad?

Matteo Balice (Sep 12 2024 at 19:53):

Second, I believe that it's better to write the NN in C++ and not binding with python to speed up even more, but this means that we need to write manually the code in CUDA for the GPU (for example I use the function torch.max in pytorch that are optimized for cuda GPUs).

Matteo Balice (Sep 12 2024 at 20:03):

I believe our approach is more suited to GPUs compared to NBVH, as it involves fewer conditional branches. This is because we completely avoid using BVHs; the only intersection we calculate is with the bounding sphere. After that, the inference process remains consistent for every ray, exploiting the true power of GPUs. This is the reason why even using more sampling point we have good results in inference time.

Matteo Balice (Sep 12 2024 at 20:06):

I discovered one drawback: when I render the scene in 1024x1024 in pygame it's good for the first seconds (16-17 ms for each frame) but after that the memory of GPU goes to 100% and the inference time slow down a lot. Probably this is because they do not use well the garbage collector of tensors accumulated for previous frames (I have a RTX 4070 with 13GB of VRAM).

Matteo Balice (Sep 26 2024 at 08:30):

I have written a report for my work, let me know if there are something missing or that I should change.

Neural_rendering_paper.pdf

@Sean @fall Rainy

Sean (Sep 26 2024 at 13:56):

That's outstanding @Matteo Balice

Sean (Sep 26 2024 at 13:57):

I'm still reading it, but I did a quick read through and really great write-up! Love the background and inclusion of your early incremental progress, and of course final results.

Matteo Balice (Sep 26 2024 at 14:18):

I attempted also to integrate the PyTorch code into the rt module, but encountered an issue with the PyTorch library I used for multi-hash resolution. The problem is that the library includes CUDA-specific scripts, preventing me from exporting the neural network...

Sean (Oct 03 2024 at 05:06):

@Matteo Balice and @fall Rainy I'd like to showcase your work at the mentor summit. Do you have a brief video or favorite image(s) that you'd like me to show that you feel best captures the results of your research?

Matteo Balice (Oct 03 2024 at 07:54):

Sean ha scritto:

Matteo Balice and fall Rainy I'd like to showcase your work at the mentor summit. Do you have a brief video or favorite image(s) that you'd like me to show that you feel best captures the results of your research?

bunny_pred.png
bunny_true.png
havoc_pred.png
havoc_true.png
moss_pred.png
moss_true.png

Sean (Oct 03 2024 at 14:20):

Can you make a diagram of your network layers? While a technical audience, they’re not necessarily familiar with graphics or ai at all.

Sean (Oct 03 2024 at 14:24):

Matteo Balice said:

pygame-window-2024-09-04-11-04-27.mp4
Here here is the 256x256 frames resolution trained with 100 different 512x512 frames (in pygame I have implemented only 256x256 views with a small range of the camera).

Regarding the 1024x1024, these are the times (in python without optimizations):
Time to generate rays: 49,7 ms
Time to trace rays: 12 ms
Time to render image: 3,9 ms
For a total time of 65,6 ms and an average 15,24 FPS. If you compare 65,6 ms (NN) with 66 seconds (true ray tracing on my mac M1), the NN is about 1006 times faster.

The true bottleneck is not the training time or the loading of the net but the generation of the training set since generating 100 frames in 1024x1024 would require 100 x 66 seconds (on my pc) = almost 2 hours... For this reason I have trained only with 100 frames 512x512 (it has taken something like 30 minutes to generate the training set) and about 10 minutes to train it but I have not implemented the optimizations to reduce the number of sampling points yet (local sampling), so I expect that the training time (and inference) can be even smaller.

Wonder if we could show bunny spinning orbitally. This video is really good results but the view stays focused on the bunny’s butt.. even a simple 360 orbital would probably do well.

Matteo Balice (Oct 03 2024 at 14:44):

Sean ha scritto:

Matteo Balice said:

pygame-window-2024-09-04-11-04-27.mp4
Here here is the 256x256 frames resolution trained with 100 different 512x512 frames (in pygame I have implemented only 256x256 views with a small range of the camera).

Regarding the 1024x1024, these are the times (in python without optimizations):
Time to generate rays: 49,7 ms
Time to trace rays: 12 ms
Time to render image: 3,9 ms
For a total time of 65,6 ms and an average 15,24 FPS. If you compare 65,6 ms (NN) with 66 seconds (true ray tracing on my mac M1), the NN is about 1006 times faster.

The true bottleneck is not the training time or the loading of the net but the generation of the training set since generating 100 frames in 1024x1024 would require 100 x 66 seconds (on my pc) = almost 2 hours... For this reason I have trained only with 100 frames 512x512 (it has taken something like 30 minutes to generate the training set) and about 10 minutes to train it but I have not implemented the optimizations to reduce the number of sampling points yet (local sampling), so I expect that the training time (and inference) can be even smaller.

Wonder if we could show bunny spinning orbitally. This video is really good results but the view stays focused on the bunny’s butt.. even a simple 360 orbital would probably do well.