Hi @Senthil Palanisamy, let's continue the discussion here. Can you tell me more about your background and what you understand about this effort? Have you read the AMD paper?
Hi Sean, Thanks for the message.
About my background: I come from a geometric vision background where I worked for more than 6 years, to go along with my masters in robotics from Northwestern. I have developed solutions to problems like stereo visual odometry, depth-based SLAM / TSDF fusion of depth maps into a volumetric grid to generate geometry, and extrinsic sensor calibration algorithms (optimizing for where sensors are located). I have done a few deep learning projects as well, like deep reinforcement policy learning for knot tying, weed localization, and human character recognition. I can send my resume to any email address of interest if you want to know more about my background. Though I don't actively have a graphics background, I do understand the broader details and am able to grasp things quickly to execute ideas. My programming languages of comfort are C++ and Python (though I think I can pick up any language within a reasonable time).
About the work: I spent a few hours trying to understand it. Here is my summary - a classic ray tracing pipeline over a BVH for rendering objects is slow. One of the deep learning ways to speed this up is to train an MLP to be a neural intersection function (NIF) - a network that primarily predicts occupancy as a probability (0-1), but which can also be trained to predict other properties like shading / reflectance. This network could be trained directly on positions and ray directions, but that does not work well in practice. So the solution is to learn feature embeddings for position and direction, which then feed into the MLP. Position and direction are each points in R^3, but that representation admits duplicate encodings of the same ray, so they can be compactly represented as points in R^2 by using spherical coordinates and substituting the ray origin with the ray intersection. Ray casting is usually done with multiple bounces, where secondary rays are traced from the primary ray. So there are two NIFs: one (the outer NIF) predicting the primary intersection, and the other (the inner NIF) predicting occupancy for the secondary intersections. The inner NIF takes a ray distance embedding as input in addition to the position and direction embeddings. A classic tracing scheme employs two hierarchies in the BVH to trace rays: the top BVH traversal is cheap while the bottom BVH traversal is expensive. The NIFs replace the bottom BVH part, while the top BVH traversal is done through classical means. The outer NIF just predicts occupancy while the inner NIF predicts other properties in addition to occupancy.
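To check my own understanding, here is a minimal PyTorch sketch of how I picture the two NIFs (layer sizes and names are made up by me, not taken from the paper):

import torch
import torch.nn as nn

class OuterNIF(nn.Module):
    """Predicts occupancy (hit probability) from encoded ray position + direction."""
    def __init__(self, embed_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())  # occupancy probability in [0, 1]

    def forward(self, pos_embed, dir_embed):
        return self.mlp(torch.cat([pos_embed, dir_embed], dim=-1))

class InnerNIF(nn.Module):
    """Like the outer NIF, but also takes a ray-distance embedding as input."""
    def __init__(self, embed_dim=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, pos_embed, dir_embed, dist_embed):
        return self.mlp(torch.cat([pos_embed, dir_embed, dist_embed], dim=-1))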
Some open questions I am still trying to find answers to
My personal motivation for taking this work
@Senthil Palanisamy Thank you for the excellent introduction. From your background, it sounds like you could probably propose a couple of different projects that would align with our needs (like a SLAM-based object reconstruction using ray tracing), and it's all outstanding background for working on neural rendering. Obviously a lot of related concepts. As for your resume, you can just submit that when you submit a proposal (which you can/should do early and then continuously update, whenever it opens). Given the nature of GSoC, resumes and the project write-up itself are only a small fraction of the selection criteria. It's communication that matters most (both here and via code).
I think your understanding of the project is right on track and you did a great job summarizing AMD's work. They essentially showed that it can be faster, so now my question is whether it can be faster and more generally applicable under a set of conditions like expensive geometries and reasonably precise hit points. Can we actually use it as an acceleration structure in practice, or is the training phase entirely impractical? What sort of net is needed to capture all the necessary detail? Is the two-layer network adequate? Are the two NIFs actually necessary? ... lots of open questions, many of which are hand-waved in the paper and proved challenging in our previous investigation by a team at TAMU.
As to your questions, if I'm understanding you correctly, then 1) is really just a dimensionality reduction so lookups can be fast and fewer parameters in the latent space are needed for encoding. Instead of using 6 float/double values for the input layer (xyz pos + ijk dir), they use 4 float values (az/el pos + az/el dir). There may be more nuance implied for the feature embedding, though, so there may be work needed to understand it if changes are to be made. I'm a little surprised I didn't see residual linkages, and that they got away with such a simple topology, but then we have yet to reproduce their results either. As for 2), they did rely on libs for performance, and libs certainly can be used, but they're not a focus or requirement. The fundamental question is whether this approach is even feasible as an acceleration structure to represent a given geometry model. That has elements of both accuracy and performance, but it can be proven without making it production-ready. Libs like pytorch can certainly be used. In fact, the previous team fully bridged from C to Py to C during ray tracing, which was of course painfully slow, but they weren't successful in getting accurate hit points, so the question is still an open one.
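For concreteness, the reduction is roughly this (a sketch; the exact angle convention used in the paper may differ):

import numpy as np

def dir_to_az_el(v):
    """Collapse a 3D unit vector (ijk) to azimuth/elevation -- 2 floats instead of 3."""
    x, y, z = v / np.linalg.norm(v)
    azimuth = np.arctan2(y, x)   # angle in the xy-plane
    elevation = np.arcsin(z)     # angle above/below the xy-plane
    return azimuth, elevation

# A full ray then becomes 4 floats instead of 6:
# (az/el of the position on the bounding sphere, az/el of the direction).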
Thanks for the response. The grid lookup makes sense now. I was initially under the misunderstanding that some network converts position, direction -> position, direction embedding. But it does make sense that it is just a grid that is also trained during the training process. I have a few follow-up questions
It's not really a grid lookup. It's just a different way to express a vector. For the neural net, it's less input data and thus less latent space nodes are needed to encode. It's still just a vector though, a clever optimization. We even coincidentally have an image depicting both (the vector can point outwards or inwards): https://brlcad.org/gallery/picture.php?/4
The latter. A given model will get "baked" such that a NN is trained, ideally in just a couple minutes, and then is available+valid for use until the geometry changes.
The training data should come directly from the ray tracer (either in advance or streamed real-time). Imagine shooting a million rays at an object from its bounding sphere; some hit, some miss. Each hit is not just the first hit but all the possible hits along the path. That's the training corpus: rays in, hit list out. We can probably shoot rays faster than the net can train an epoch, so you'd probably want to just stream rays as fast as it can train. Alternatively, shoot a million, train, shoot another million, train, etc.
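Pseudocode for the streamed variant might look something like this (shoot_rays() and train_epoch() are hypothetical stand-ins for the ray tracer and the training step, not existing APIs):

def stream_training(model, shoot_rays, train_epoch, batch_rays=1_000_000, epochs=100):
    """Alternate between generating ground-truth rays and training on them.

    shoot_rays(n) is assumed to return (ray_inputs, hit_lists) straight from the
    ray tracer; train_epoch(model, data) runs one pass over that batch.
    """
    for epoch in range(epochs):
        data = shoot_rays(batch_rays)   # fresh unbiased samples each round
        train_epoch(model, data)        # the net trains while the next batch could be shot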
rppvalidation_dirs.png
There are no threads on the topic. The general idea though would be to propose something CAD-related involving SLAM, like making a portable 3D CAD scanner app. E.g., try to output CAD instead of just point clouds or meshes. Maybe hook into the iPhone lidar sensor. It could also be SLAM and point cloud based, but then import into BRL-CAD on the fly for processing. Lots of possibilities.
You'd have to really make a strong case for your project regardless. Needs to have a compelling benefit that ties into CAD specifically, not just 3D or vision or graphics-related.
Thanks for the clarification. I went back and spent more time. I think I am getting closer to the true understanding. It's funny, my understanding of the work keeps getting refined and is probably starting to converge. So it seems like the grid encoding is done on a per-object basis - the inputs for both the outer and inner network - while the networks themselves are object-agnostic occupancy predictors. So we would end up having two encoded grids (one for the outer and one for the inner) for each unique object, and two networks that are sort of object-agnostic but act on any given object to predict occupancy.
In order to accomplish this, I would need to modify the ray tracing pipeline in our system to record input data for both the outer and inner networks, and then attempt to train those networks.
So my two action items are
I think I can do action 2 pretty independently (the networks are simple), but to do 1, I would need some direction on how to go about it. What's the fastest way to get there? Are there any documents / sections of code in our repository that I can go through to better understand how to do this? Also, I will try to set up BRL-CAD (https://github.com/BRL-CAD/brlcad) in the first place, which I will get started with.
I know we are interested in the core hypothesis of "is this fast enough?". If there is a way to get this data without having to collect it ourselves (maybe open-source datasets that can be directly plugged in), could the core hypothesis on speed be validated quicker? Any thoughts on this?
@Matteo Balice and others, I have uploaded information and materials relevant to the neural rendering project to https://brlcad.org/design/neural
@Matteo Balice I looked over your proposal update and it looks really good. I like that you identified some potential errors in the TAMU team's approach. I think overall you have a good plan.
That said, I think you could make your proposal even better by more clearly identifying what the results of your project will be, exactly. I love that you dedicated time to writing up results -- it's reasonable to spend time getting a paper out of this work given its nature. In addition to a technical paper, though, what precisely will be the resulting products of your work, beyond the things learned, which will be documented and which you underscored in your writeup?
Is the goal to implement a non-grid sampling method, train on that, and demonstrate the efficacy of using that method with "rt" or something similar? The TAMU team resorted to a fixed-view silhouette rendering via "rt" as their output target given they couldn't fully achieve generalization to 3D.
You mention NeRF and potentially using different networks -- you could definitely write more on what you mean there. I do suspect that the simple 2-layer FCN is inadequate for full generalization, but AMD's results certainly suggest otherwise may be possible.
@Sean Thank you for reading. Essentially, the goal is to achieve generalization, so we will try an approach aimed at improving the bounding box approach. However, even with this addition, I suspect that the neural network implemented by the students may not be sufficient to ensure good generalization for any view and thus for any arbitrary ray. For this reason, I believe (though it needs to be verified after obtaining the results of the first part) that it is reasonable to consider replacing the neural network with a more powerful one that encodes more information about the geometry of the 3D object. In any case, I will provide a better explanation in the proposal. Thank you.
An approach I think would be considerably more effective is generating training data based on spherical sampling. In practice that does a very good job of sampling the volume without bias and converging across potential shotlines. It's pretty simple to generate -- I just did that in our new "rtsurf" application.
Ends up looking a bit like this:
rppvalidation_dirs.png
Sean said:
Ends up looking a bit like this:
rppvalidation_dirs.png
Yes, I think it's the same as the bounding box approach used by the TAMU students, or am I wrong?
image.png
I don't know why they call it bounding box even though it seems more like a sphere.
So it seems that the TAMU students have previously employed this method to generate training data, but encountered issues with its generalization. I believe that the neural network might be lacking in its ability to extract all pertinent information from the geometry.
I have not looked into their code to see whether they're actually evaluating the bounding box or bounding sphere, but I do recall them saying all rays sample through the origin so it's not an unbiased sampling regardless.
that image there could also simply be hits on a sphere in a bounding box, heh. would have to double check that as well.
That is, I believe they were sampling the geometry like this: samples.png
The general assumption being they were sampling and trying to reconstruct simple shapes like a box, sphere, torus, etc.
i.e., box.png
@Sean you were right. I studied what they did as sampling methods more closely, and it seems they used these two approaches: one is a grid approach (and we all agree it can't be used for generalization), the second is a mixed bounding box exactly like you said (ray origins were selected from a bounding sphere around the geometry, as well as from within the bounding sphere itself).
But there was another approach they never tested, the "pure" bounding sphere approach: this method finds a random point at radius distance away from the center of the geometry, which serves as the ray origin. Then it finds a different random point, also at radius distance away, and the direction between the two points determines the direction vector of the ray. It's pretty clear that this is the approach you suggested.
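A minimal numpy sketch of that pure bounding sphere sampling (assuming uniform random points on the sphere; the actual rtsurf code may do it differently):

import numpy as np

def sample_sphere_ray(center, radius, rng=np.random.default_rng()):
    """Pick two uniform random points on the bounding sphere and use the segment
    between them as a ray -- an unbiased shotline through the enclosed volume."""
    def random_point_on_sphere():
        v = rng.normal(size=3)                           # isotropic Gaussian sample ...
        return center + radius * v / np.linalg.norm(v)   # ... normalized onto the sphere
    origin = random_point_on_sphere()
    target = random_point_on_sphere()
    direction = target - origin
    return origin, direction / np.linalg.norm(direction)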
This morning I had understood that they had used this latter approach (so I assumed that even with this sampling they could not generalize), but I was wrong. So this changes everything, because there might not be a need to implement a more sophisticated neural network (but it all depends on the results we will get thanks to this different spherical sampling).
@Daniel Rossberg I have updated my proposal and submitted it. I am looking forward to hearing from you.
Thanks
While looking through previous work (https://brlcad.org/design/neural/), I noticed a file that looks like a Python version of the NIF implementation. Perhaps I could start by converting it to a C++ version? Then, in the meantime, I'll try to optimize it.
I have some thoughts on why AMD chose a simple neural network. Recently, while deploying a Transformer model, I found that when there are too many parameters, the model's speed also noticeably decreases. Therefore, overly complex neural networks may sacrifice some efficiency.
I have a question about "rt_shootray()". When calling "rt" with default parameters, does it call rt_shootray() only once for each pixel to calculate the RGB value, and does it not use the Monte Carlo algorithm at all during the process?
fall Rainy said:
I have some thoughts on why AMD chose a simple neural network. Recently, while deploying a Transformer model, I found that when there are too many parameters, the model's speed also noticeably decreases. Therefore, overly complex neural networks may sacrifice some efficiency.
Yes, and I mentioned something to that effect -- it was absolutely made that simple in order to achieve their realtime performance goal. What's still particularly amazing though is that it achieved such precise matching output on such a complex scene with so few parameters.
fall Rainy said:
I have a question about "rt_shootray()". When calling "rt" with default parameters, does it call rt_shootray() only once for each pixel to calculate the RGB value, and does it not use the Monte Carlo algorithm at all during the process?
It will call rt_shootray() once for each primary ray -- which typically results in secondary rays as well for reflection, specular, light/shadow samples, etc. rt is not a Monte Carlo renderer, but there are options like -H for hypersampling where there will be multiple samples per pixel. There are also different lighting modes and different renderers that employ different methods. For example, -l7 uses photon mapping and the 'art' ray tracer is a PBR renderer based on appleseed.
After the rendering is complete, I want to plot some sampling points. I think I should operate on this object. Is there any related function?
struct fb *fbp = FB_NULL; /* Framebuffer handle */
If rendering is complete, do you mean 2D plotting over the image? Or are you wanting to plot 3D points and render them as well?
If you're just wanting to visualize some diagnostic info for debugging, you can make geometry (e.g. a point cloud or spheres) and view that in mged or with rt, you can plot to 3D with annotation points, or you could manually project the 3D points to 2D and draw them.
Sean said:
If you're just wanting to visualize some diagnostic info for debugging, you can make geometry (e.g. a point cloud or spheres) and view that in mged or with rt, you can plot to 3D with annotation points, or you could manually project the 3D points to 2D and draw them.
Sorry for not being clear. I just want to visualize these points, and I will try to plot them in a 3D view.
Happy Friday! did you figure out how to plot the points? Is there enough progress for a little show&tell?
Erik said:
Happy Friday! did you figure out how to plot the points? Is there enough progress for a little show&tell?
I'm sorry for the late reply. I completed the drawing in a strange way...
I do have some results to report, and I would like to know if my direction is correct. Is next Wednesday or Thursday okay?
Where is the center of the model bounding sphere?
struct rt_i{
...
/* THESE ITEMS ARE AVAILABLE FOR APPLICATIONS TO READ */
point_t mdl_min; /**< @brief min corner of model bounding RPP */
point_t mdl_max; /**< @brief max corner of model bounding RPP */
point_t rti_pmin; /**< @brief for plotting, min RPP */
point_t rti_pmax; /**< @brief for plotting, max RPP */
double rti_radius; /**< @brief radius of model bounding sphere */
struct db_i * rti_dbip; /**< @brief ptr to Database instance struct */
And please let me know when you're available to meet. I will be free after Tuesday.
I have finished a few sampling methods; this is the uniform sphere sample:
sample.png
I can make time on Wednesday, or on Thursday until 1630EDT, or Friday after 0830EDT. But I think Sean is more aware of what's going on and it'd be more valuable to have him present?
the center of the bounding sphere is the same as the bounding box, um, I think the AABB is used more than the sphere. iirc, when I did the metaball primitive, I did a bounding box, then just said the bounding sphere was the same center and had a radius equal to the distance from a corner of the bounding box to the center?
Okay, I'll ask Sean when he's available.
Erik said:
I can make time on Wednesday, or on Thursday until 1630EDT, or Friday after 0830EDT. But I think Sean is more aware of what's going on and it'd be more valuable to have him present?
the center of the bounding sphere is the same as the bounding box, um, I think the AABB is used more than the sphere. iirc, when I did the metaball primitive, I did a bounding box, then just said the bounding sphere was the same center and had a radius equal to the distance from a corner of the bounding box to the center?
There looks to be a small error in rt/worker.c: there is no need to accumulate colorsum for the normal (non-hypersampled) case, only when hypersampling:
if (hypersample == 0) {
...
VADD2(colorsum, colorsum, a.a_color);
} else {
/* hypersampling, so iterate */
@Daniel Rossberg @Himanshu I have created a new pull request addressing the selectPrimitive feature's bug here
@Erik Is June 28, 11:30 (EDT) OK for you?
I finished the whole process of neural network rendering and generated a not-so-good rgb map:
render.png
Why does the direction of the ray stay constant when rendering? According to the ray tracing algorithm, each ray should have a different direction:
RaysViewportSchema.png
Is there a function to calculate the intersection of a ray and a sphere?
fall Rainy said:
Why does the direction of the ray stay constant when rendering? According to the ray tracing algorithm, each ray should have a different direction:
RaysViewportSchema.png
I get it, the default is to use parallel projection.
fall Rainy said:
Is there a function to calculate the intersection of a ray and a sphere?
I implemented the algorithm myself
I found an interesting paper: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (https://dl.acm.org/doi/abs/10.1145/3503250)
fall Rainy said:
I found an interesting paper: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (https://dl.acm.org/doi/abs/10.1145/3503250)
Why do you think it is interesting?
The positional encoding. According to Rahaman's work On the Spectral Bias of Neural Networks, deep networks are biased towards learning lower-frequency functions. They use positional encoding to address this problem.
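For reference, a small sketch of that positional encoding (my own PyTorch version of the idea, not NeRF's reference code):

import torch

def positional_encoding(x, num_freqs=10):
    """Map each input coordinate x to [sin(2^k * pi * x), cos(2^k * pi * x)] for
    k = 0..num_freqs-1, so the MLP can fit higher-frequency variation."""
    freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi   # (num_freqs,)
    angles = x.unsqueeze(-1) * freqs                      # (..., dims, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                      # (..., dims * 2 * num_freqs)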
IMG_1921.jpeg
this is the ground truth
IMG_1922.jpeg
And this is with neural network prediction (bounding box). There are some issues to resolve
Probably they are mainly with the rendering and not with the NN itself.
Or maybe the number of samples in the dataset is too low.
I solved the issue with the rendering and got improvements
With n=100,000 rays I got an accuracy of 0.94, which is not bad, but the major problem is that with the bounding sphere approach we need to sample a lot of rays in order to achieve good accuracy for every angle of the camera.
The problem isn't with the neural network, because I have a 0.99 accuracy on the training set. But since there is an infinite number of rays going in all directions within the bounding sphere, I don't think it is possible to achieve the same results at rendering time without changing something, because otherwise we would need to sample an infinite number of rays to give to the NN (which of course is not possible).
So we must be smarter than this and try to change the representation of the input features or try different NN architectures.
But first I want to try with n=1 million samples to see how good the performance is.
Matteo Balice said:
IMG_1921.jpeg
this is the ground truth
Remember that this is the ground truth.
This is with 1 million rays; I got the best epoch with an accuracy of 0.986. We are on the right path :)
I use a deep ResNet to learn the RGB value. The left is the result of neural rendering and the right is normal rendering:
res_net.png
fall Rainy said:
I use a deep ResNet to learn the RGB value. The left is the result of neural rendering and the right is normal rendering:
res_net.png
Which sampling approach have you used?
Random sampling, with 1 million rays which all have the same direction and hit the bounding sphere.
I want to improve the network first and then the sampling method.
Mh, ok, so it cannot be generalized to every angle at the moment... Predicting RGB is much more difficult, I see...
The current network fits a continuous function, but the objective function is not actually continuous..
We need a new structure...
I don't know if it can improve the results, but have you tried giving the network only the rays that hit the object, and not all the rays?
This can be done because first we predict with "my" network whether it has hit or not, so "your" network should only predict for the rays that hit the object.
That is a direction. I will try it later.
Can you tell which object you've hit?
This would be very helpful to me, maybe I could train two networks
fall Rainy said:
Can you tell which object you've hit?
Regarding this, we must decide whether we have to train one network for each object or not... I think deciding all this in a meeting would be more appropriate.
OK, let's decide this later
I got a great result with gridnet:
grid_methos
here is the loss:
loss
fall Rainy said:
I got a great result with gridnet:
grid_methos
here is the loss:
loss
Very good. This is grid + resnet?
Just grid net. I will add resnet later.
fall Rainy said:
Just grid net. I will add resnet later.
What is the mean absolute error? I think it would make it easier to understand how good this model is.
fall Rainy said:
I got a great result with gridnet:
grid_methos
here is the loss:
loss
And why do you think there are those white dots (noise)?
Anyway, seeing your results, I think I will also add this grid method; I think it would improve my network too.
I tried with grid encoding but my network does not seem to improve any further. I will try different sampling approach.
I think that grid encoding is appropriate only when the direction of the rays is fixed, because if you divide your sphere into 256*256 = 65536 cells, then each cell will have an average of about 15 rays to train on (if your training set has 1 million rays). If your direction is fixed, then 15 rays per cell can be acceptable, but if the direction isn't fixed, then 15 is a very small number of rays to train on.
I tried also with a different size of grid (20x20, 50x50, 100x100) and the NN performs like before more or less (98% accuracy with 20x20).
I also implemented different grid versions in the same NN (20x20, 50x50, 100x100, 256x256) and combined them together with some weights, but I got the same results...
So this makes me think that grid encoding isn't appropriate when rays are not fixed.
Today I improved results to 0.991 (test set) using a different optimizer.
fall Rainy said:
Why does the direction of the ray stay constant when rendering? According to the ray tracing algorithm, each ray should have a different direction:
RaysViewportSchema.png
There are two different kinds of cameras -- perspective and orthogonal. Orthogonal is default for most engineering purposes and perspective is what you want for visualizations approximating human vision. Perspective rays (typically) diverge. Ortho are parallel grids.
For volumetric encoding, you almost certainly will want ortho or unbiased random.
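Schematically, the difference is just in how the per-pixel ray origins and directions are built (a rough sketch, not rt's actual view model):

import numpy as np

def ortho_ray(u, v, eye, forward, right, up):
    """Orthographic: origins move across the view plane, direction is constant."""
    origin = eye + u * right + v * up
    return origin, forward

def perspective_ray(u, v, eye, forward, right, up, focal=1.0):
    """Perspective: all origins share the eye point, directions diverge per pixel."""
    direction = focal * forward + u * right + v * up
    return eye, direction / np.linalg.norm(direction)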
fall Rainy said:
I finished the whole process of neural network rendering and generated a not-so-good rgb map:
render.png
That's actually not "terrible". Not great but not terrible to say the least. It's clearly recognizable.
I also use a network with 4-dimensional input to train; the inputs are both direction and position:
net.png
fall Rainy said:
I also use a network with 4-dimensional input to train; the inputs are both direction and position:
net.png
That's clearly "seeing" the model in some sense, even getting some of the surfacing right but in a dream state. How many training epochs is that?
One thing you could try is to reduce the input dimensionality to just azimuth and elevation (2 floats). That's a much smaller space to optimize across and will result in the same centered visual for our current purposes.
Oh, I have converted both direction and position to azimuth and elevation, so there are just four float inputs. It takes me about an hour to train (with a 4060 GPU).
I use a grid with a shape of 128x128x64x64 to train. That's the maximum number of parameters my GPU can handle.
I noticed this paper: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. They used an A100 to train and it took more than two days.
I'm trying to improve my net via the sampling methodology. Since I know more about active learning, I might train another network for sampling.
fall Rainy said:
I noticed this paper: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. They used an A100 to train and it took more than two days.
Yeah, this paper and its related works are the state of the art about neural rendering.
Close to state of the art -- I listened to that paper when it came out back in 2020. There's been a lot of work on NeRFs since then and a lot of advancements. That's still a rather different approach altogether that thus far hasn't generalized very well. That's why the newer AMD research is more the basis for this work.
Sean said:
Close to state of the art -- I listened to that paper when it came out back in 2020. There's been a lot of work on NeRFs since then and a lot of advancements. That's still a rather different approach altogether that thus far hasn't generalized very well. That's why the newer AMD research is more the basis for this work.
Do you know why there was a need to try a different approach rather than NeRFs?
Well fundamentally, nerfs are based on trying to obtain output visualizations of a 3D object based on just having some 2D view points (pictures). They only use the 3D in the test and evaluation phase, but the model is generally unknown or non-existent.
We have the 3D models. They aren't the unknown in our case. Our situation is really needing to know precisely where the 3D object is for given (unknown) view points. That's where the optimization and needs are a bit different. It's more about encoding and/or estimating a given model from the known model, but sufficiently abstracted that we can get accurate queries really fast.
We could certainly build up and train a NeRF by feeding it a set of renderings for the 3D model, to see if it can generalize it well enough, but every approach I've seen thus far is about achieving optically adequate results, not necessarily something that would pass any fidelity comparison with the ground truth 3D model.
Ok, understood. However, I believe that the AMD research still lacks comprehensive volumetric information about the 3D models, preventing the NN from predicting ray hit/miss with 100% precision if we want to achieve full generalization. Perhaps we can draw inspiration from NeRF and its related works. I plan to update my work on GitHub tomorrow so that you can review and test what I have done so far.
@Sean Here there is the repo https://github.com/bralani/rt_volume/tree/neural_rendering. Please be sure you follow in order these steps:
Installation:
1) Download libtorch from here https://pytorch.org/ making sure to select the "Preview (Nightly)" build and remember which version you have installed (debug or release):
image.png
2) Unzip the folder and put the libtorch folder wherever you want, then go to src\rt\CMakeLists.txt, line 56, and change these three paths to your libtorch path:
image.png
3) Now be sure to build "rt_trainneural" in the "Debug" configuration if you have installed the "Debug" libtorch, or in "Release" if you have installed the "Release" libtorch.
4) This step is not always required, but in my case it was essential, otherwise there were issues at execution time: go to path_libtorch\lib, copy all the files inside this folder, and paste them into the build/bin (where the .exe is) of BRL-CAD.
Ok, now we can finally run the code in two phases:
Go to rt/train_neural.cpp and see these options:
image.png
TRAINING:
1) In the first step we generate the dataset with bounding sphere sampling (so make sure that opts.generate_dataset is true), set your db and obj as you want, and also set the number of rays you want to generate (I suggest 1 million). Run the code.
2) You should see the files "train_neural.json" and "test_neural.txt" in the path you ran from (build/bin).
3) Go to rt/train.py and set on top your variables:
image.png
4) Run the script. The NN will stop at epoch 200, but at each epoch it validates on the test set, and if the accuracy of the current epoch is higher than all the previous ones, it overwrites the model "model_sph.pt".
RENDERING:
1) Go back to rt/train_neural.cpp, set opts.generate_dataset = false, set opts.model_path to the path of the trained model, adjust the azimuth and elevation as you wish (to perform the render), and then set opts.neural_render = 0 if you want the ground-truth rendering, or 1 if you want the neural rendering. :)
I guarantee that it works on Windows; I have not tried it on any other OS.
I also have a Mac M1, but I always get a lot of errors when I try to build BRL-CAD...
Awesome, thank you @Matteo Balice I'll definitely be taking a deeper look at it later today, and see if I can get it up and running. I'm on M2, so will see if there are issues.
@Matteo Balice on Mac, you must enable -DBRLCAD_BUNDLED_LIBS=ON or the build will fail when it tries to use the system Tcl/Tk
(during cmake)
I already have pytorch installed and a brl-cad build, so should hopefully all just work!
Ahh ok I will try it now so you don't have any issues tomorrow on your M2.
I have also prepared a small script here that tries to find CUDA if you have a GPU or metal acceleration if you have any "M" series of Mac:
image.png
Matteo Balice said:
Ok, understood. However, I believe that the AMD research still lacks comprehensive volumetric information about the 3D models, preventing the NN from predicting ray hit/miss with 100% precision if we want to achieve full generalization. Perhaps we can draw inspiration from NeRF and its related works. I plan to update my work on GitHub tomorrow so that you can review and test what I have done so far.
I'd like to hear more about what you meant by this -- encoding comprehensive volumetric information isn't the goal, but some faithful encoding is. A prediction with some general precision assertion (hopefully). Similarly, in the text space, we want a reasonably accurate response to an input prompt that is more than vague (blurry) writing.
Issue I have with radiance fields is the method isn't exactly aligned well with our available training data. We'd literally throw away information to then try to reconstitute it. The method is really a whole field that's trying to construct 3D where it did not exist previously (i.e., from photos or scans).
From my reading, the AMD research is compelling because it is just a bifurcated training of two networks, one for outside, one for inside/near, with lots of reductions made for the sake of adequate performance. My thinking is let's increase the network a bit and see how well it can generalize.
On a related note, here's a nice site that summarizes a lot of the neural field papers -- definitely are concepts that are relevant in some of them: https://radiancefields.com/siggraph-2024-program-announced
I have already made the NN from the AMD research just a little bit more complex and it generalizes very well.
This means that for each angle of the object, the NN is able to understand the shape.
But the problem is that on the boundaries of the objects there are still some artifacts (for example, the shape is smooth even though it should be sharp according to the ground truth). It is not very precise there.
My idea was to substitute the grid encoding approach they used.
Because in my view, the grid encoding was working only because they trained the NN with fixed ray directions. But in our case, the rays can have infinite directions from the same origin.
If you think about the grid encoding, every ray that hits a specific cell has a very similar origin since the direction is fixed.
But in our case this is not true.
I know that encoding volumetric information is not the final goal, but I believe that the NN is not able to capture these details very well due to the lack of volumetric information about the model itself. We could use, for example, an encoding like "voxels" because they are invariant to direction. This is just an idea for the moment; I don't know if it can work.
I will read some papers in these days about this.
@Matteo Balice That is sort of incorporating some of the concepts of a radiance field (what you're calling a grid encoding), also known as a volumetric encoding. Your intuition about the direction vectors being fixed does certainly sound plausible. While the image space was fixed, the rays themselves were scattering in nearly all directions due to the physically based lighting model they were using (lots of reflection rays, refraction rays, light sampling rays, diffuse surface rays, and more). Still, that is certainly a vast subset for the net to train on, and like I said, you have a reasoned impetus for trying to encode it volumetrically.
You'd definitely need something more descriptive than voxels unless we go pixar route to sub-pixel resolution, which is not practical (for lots of reasons).
You'd probably need something more like a vdb signed distance field (sdf) where you have voxel occupancy as well as surface direction vectors.
@Matteo Balice The AMD research doesn't use rays with fixed directions. They encoded both direction and position, then concatenated them into a vector:
AMD.png
I think a different strategy could be used for grid encoding: replacing the 2-dimensional network with a 4-dimensional network, this grid encoding captures both positional and directional information in a single vector.
Maybe I was not so clear. Yes, they encoded both direction and position BUT they trained with a fixed viewpoint:
Screenshot-2024-07-14-alle-09.00.20.png
This means that all the rays have more or less the same directions in each training run.
Screenshot-2024-07-14-alle-09.01.26.png
This is very different from our goal, and I explained why (my hypothesis) it cannot work in our case.
I got it. Thanks
fall Rainy said:
I think a different strategy could be used for grid encoding: replacing the 2-dimensional network with a 4-dimensional network, this grid encoding captures both positional and directional information in a single vector.
Can you elaborate more about this?
I put my code here: trainDir. I use a 128×128×64×64 net to encode both dir and pos.
Mh, I cannot quite understand this. Are you using a total of 128x128x64x64 ≈ 67 million cells?
That would overfit badly, because you will get a training error of 0 since you have fewer rays than cells.
Or am I wrong?
Yes... I am trying to improve it.
Maybe it can work only if the viewpoint is fixed.
There are too many parameters
I've been doing a lot of experimenting with the grid net lately, and it's hard for it to predict rays in all directions, which may require a lot of parameters. One of the big problems is that the objective function is non-differentiable.
Instead of using the NIF directly for rendering, AMD just uses the NIF for intersection to speed up the rendering process. The focus is really on intersection, which is effective for rendering complex geometry.
NeRF actually renders differently than we do: they use volume rendering while we do ray tracing.
I'd like to try to modify the objective function later in order to make ray tracing a differentiable process. If all the data for the model is known, can I then return exactly which point was hit, and the distance between the hit point and the origin?
@Sean I confirm that on my Mac m1 it works well now.
Matteo Balice said:
Maybe I was not so clear. Yes, they encoded both direction and position BUT they trained with a fixed viewpoint:
I certainly got what you meant, and there's certainly truth in both statements. The fact that the training is happening across two networks and with rays very much scattered in all directions does to me indicate that it likely can generalize (with more parameters). They did not pick a simple scene to say the least, and their sampling was not at a low resolution. The fixed viewpoint is what let them achieve their target performance, but I don't think that their approach was really indicative of an overtrained solution. On the contrary, they showed how well it performed on other viewpoints in their talk (they just don't go into that detail on their paper -- that's another paper).
Sean said:
Matteo Balice said:
Maybe I was not so clear. Yes, they encoded both direction and position BUT they trained with a fixed viewpoint:
I certainly got what you meant, and there's certainly truth in both statements. The fact that the training is happening across two networks and with rays very much scattered in all directions does to me indicate that it likely can generalize (with more parameters). They did not pick a simple scene to say the least, and their sampling was not at a low resolution. The fixed viewpoint is what let them achieve their target performance, but I don't think that their approach was really indicative of an overtrained solution. On the contrary, they showed how well it performed on other viewpoints in their talk (they just don't go into that detail on their paper -- that's another paper).
Yes, I agree with you, their approach is certainly able to generalize and I have proved it, but we need to add a more complex encoding like we were saying yesterday to reach maximum accuracy. I will study the papers you gave us more, especially SDFs.
Before implementing a more complex encoding (like SDF), today I first tried the positional encoding from NeRF's work. The idea is that MLP neural networks perform poorly at representing high-frequency variation in geometry. Mapping the inputs to a higher-dimensional space using high-frequency functions before passing them to the network enables better fitting of data that contains high-frequency variation.
I got an improvement of 0.2% in accuracy, from 99.1% to 99.3%. These are two renders of the same object but from different angles (and of course it's the same model without retraining):
output0_ground.png
output0_pred.png
output1_ground.png
output1_pred.png
Left is ground truth, right is prediction.
The major issues concern the boundaries, which are not as sharp as they should be but rather tend to be smooth.
I believe both of these issues can be resolved by using a different and more complex encoding approach, as we discussed. @Sean @fall Rainy
Another idea I had to help the NN is to modify the sampling method a bit. Right now I use totally random sampling around the bounding sphere, but we could use a smarter approach with importance sampling, to sample more in the regions where we are more uncertain.
There are 4 steps:
- First, we sample randomly using the same approach as now.
- Second, we divide the rays into N cells (like a grid) based on their origins and directions, so that very similar rays end up in the same cell.
- Third, we calculate the uncertainty for each cell (very easy to calculate).
- Lastly, we resample, but this time using the uncertainty in order to gather more samples in the regions where we are more uncertain (a rough sketch is below).
Do you think it could work?
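A rough sketch of steps 2-4 (assuming each ray is keyed by its 4 spherical-coordinate features and that a cell's uncertainty is just its misclassification rate; all names here are mine):

import numpy as np

def resample_by_uncertainty(rays, errors, n_new, n_cells=20, rng=np.random.default_rng()):
    """rays: (N, 4) spherical features; errors: (N,) 1.0 if misclassified else 0.0.
    Returns indices of grid cells to draw the next batch of samples from."""
    # Step 2: assign each ray to a cell of a coarse grid over the 4D feature space.
    lo, hi = rays.min(axis=0), rays.max(axis=0)
    idx = np.floor((rays - lo) / (hi - lo + 1e-9) * n_cells).astype(int)
    keys = np.ravel_multi_index(idx.T, (n_cells,) * 4)
    # Step 3: per-cell uncertainty = misclassification rate (plus a small floor so
    # empty or perfectly classified cells still get a few samples).
    counts = np.bincount(keys, minlength=n_cells**4)
    err_sum = np.bincount(keys, weights=errors, minlength=n_cells**4)
    uncertainty = err_sum / np.maximum(counts, 1) + 0.01
    # Step 4: choose cells proportionally to uncertainty; new rays would then be
    # generated inside each chosen cell.
    probs = uncertainty / uncertainty.sum()
    return rng.choice(len(probs), size=n_new, p=probs)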
Think those percentages might be a little misleading? Is it taking all the black background into account also? If we count up expected hits vs predicted hits, that output0 in particular looks considerably more than 1% deviated.
Importance sampling would be good, but that's typically done as an optimization -- Would need to see a graph over epochs to see if this actually could converge onto the solution (even if over-trained). Can you make a graph?
That said, I think it might help with some of the perimeter and higher-frequency detail, but it's really easy to get the sampling ever so subtly wrong and introduce bias or error. I kind of want to see it proven that we can get a robust fit, that a given network topology is capable of a precise match, before going down that route.
@Sean Can we treat the image as a whole, rather than a single pixel, so that we can use the filtering algorithm to do some post-processing?
Sean said:
Think those percentages might be a little misleading? Is it taking all the black background into account also? If we count up expected hits vs predicted hits, that output0 in particular looks considerably more than 1% deviated.
Yes, you are right. The dataset is unbalanced (more black than white) and this is the reason why accuracy is not the best metric in this case (I also print precision, recall, and the more informative F1). But I remember that the previous students used accuracy as the main metric and I wouldn't change it. If you are interested in F1, it is 0.988 (a bit lower).
Sure, I can plot a graph over epochs.
I want to discuss just a moment about metrics. Do you think precision, recall or F1 is the most important one in our case?
Just to remember (I know you are familiar with all of these):
I think both precision and recall are important in our case, so I believe that F1 is the most significant one.
Or do you think some other metrics I have not mentioned are more relevant in our case?
This is the plot of the model of yesterday (so without importance sampling):
download.png
As we can see, as we increase the number of epochs we get an average F1 between 0.98 and 0.99, so we can say that our model has an average error of about 1.5%.
Today I will implement importance sampling to see if we get improvements.
And this is the behaviour with importance sampling (a moderate importance sampling):
download.png
I also tried a more aggressive importance sampling and got an improvement: F1 0.991, accuracy 0.994. So overall:
importance sampling helps the model better understand the more uncertain regions, but it is not sufficient on its own to achieve the best results.
So it means that we need a more complex encoding or a more complex NN. For the moment I will focus on exploring encodings based on SDFs.
fall Rainy said:
Sean Can we treat the image as a whole, rather than a single pixel, so that we can use the filtering algorithm to do some post-processing?
To what end, what exactly do you mean? If the end result is pixel approximation, then it will be potentially useful for visualization purposes (only). That has use, but it's definitely a different target. The distinguishing feature that makes this challenge is replacing rt_shootray() with a neural net or the slightly higher level do_pixel(). Going full image robustly might allow for a (real-time) preview.
The image rendered with the neural network will have noise. If I can apply a denoising algorithm after generating the image, the result should be good. The left side is the image rendered by the neural network, the right side is the image after denoising:
denose.png
Ok, after studying some papers on SDF / NeRF / gaussian splatting, I have understood that one key piece of information I do not actually use in the NN is the direction of the ray itself. Currently, as input to the NN, I use the spherical coordinates of the first and second intersections, but not the direction vector.
Before introducing any further, more complex encoding, I need to add this new information as an input of the NN, because all these methods rely on the direction.
I was thinking that we can use a smart idea to help the NN: we are not interested in the orientation (sign) of the direction vector, only in the line it points along.
Direction-bounding-sphere.jpeg.png
Imagine a vector in 2D: the maximum range we can have is from 0 to 180 degrees (red region), since we are only interested in its direction. So, if the vector is in the green area, all we need to do is reflect its angle into the positive half-space.
The same idea can be applied to 3D vectors in spherical coordinates (with a fixed radius).
I think that using these new input features will be beneficial for two reasons:
1) we could use some more complex encoding based on SDF / gaussians, which all rely on the concept of direction.
2) using these input features, the new input to our NN will be the first intersection on the bounding sphere plus this new direction vector. The advantage is that the range of theta and phi (of the new direction vector) is smaller than that of the previous input features, which used the full spherical-coordinate range of the second intersection on the bounding sphere. This should help the NN because it will have a smaller input space to analyse, but without loss of information.
PS: we could even try grid encoding associated with the direction and see how it works.
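A small sketch of what I mean by reflecting the direction (assuming the direction is kept as a unit vector before converting to spherical coordinates):

import numpy as np

def canonicalize_direction(d):
    """A shotline through the sphere is the same ray in either orientation, so flip
    any direction pointing into the negative half-space (z < 0, breaking ties on y
    and then x). Theta/phi of the result then only span half their full range."""
    d = d / np.linalg.norm(d)
    if d[2] < 0 or (d[2] == 0 and (d[1] < 0 or (d[1] == 0 and d[0] < 0))):
        d = -d
    return d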
Is there any way to know exactly which object the ray hit?
Ok, I have added the direction of the ray to the input features as well. As I was expecting, the NN does not improve by adding the direction itself, because we are not adding more information, only using different features (I have also tried grid encoding).
But the advantage is that now we can implement a more complex encoding that relies on the concept of direction.
I have 2 ideas about encodings to try:
The first idea (and the best one in my opinion) is similar to the work of DeepSDF. The idea is to train latent vectors to predict the SDF for each ray. These latent vectors will then be the input of our NIF architecture. The problem is that we need to compute the SDF for each ray in the sampling approach. I don't know if there is a fast way to do it. @Sean @Erik
Note that the SDF will be used only to train the latent vectors, so the NIF architecture won't use the SDF but only the latent vectors.
The second one (more difficult than the first) is similar to the work of Neural Pull. In this case we do not have a ground-truth SDF, so the idea is to train latent vectors to "pull" rays towards the nearest surface by using the predicted signed distance values and their gradients, which the network computes. The movement of each ray is determined by the predicted distance and can be either towards or away from the surface, depending on the sign of the distance.
In both cases, we need to choose an arbitrary number of latent vectors that will be associated with each direction (and this is the reason why I have added the direction to the input features).
I believe that if we can calculate the SDF for each ray, the first encoding will be more efficient. I await your opinion. @Sean @Erik
I have read a lot of papers (NeRF, NeRF++, etc.) and I decided to use the methodology in this paper: 3D Gaussian Splatting for Real-Time Radiance Field Rendering.
Matteo Balice said:
I have 2 ideas about encodings to try:
first idea (and the best one in my opinion) is similar to the work of DeepSDF. The idea is to train latent vectors to predict the sdf for each ray. Then this latent vectors will be the input of our NIF architecture. The problem is that we need to compute the sdf for each ray in the sampling approach. I don't know if there is a fast way to do it. Sean Erik
Note that the sdf will be used only to train the latent vectors, so the NIF architecture won't use the sdf but only the latent vectors.second one (more difficult than the first) is a similar work of Neural pull. In this case we do not have ground truth sdf, so the idea is to train latent vectors to "pull" rays towards the nearest surface by using the predicted signed distance values and their gradients, which the network computes. The movement of each rays is determined by the predicted distance and can be either towards or away from the surface, depending on the sign of the distance.
In both the cases, we need to choose an arbitrary number of latent vectors that will be associated to each direction (and this is the reason why I have added the direction to the input features).
I believe that if we can calculate the sdf for each ray the first encoding will be more efficient. I wait for your opinion. Sean Erik
I'm reading some papers on SDFs and have a question: an SDF is used to represent geometric objects, so how do you represent a ray with an SDF?
I have currently implemented 3d Gaussian ahah
fall Rainy said:
Matteo Balice said:
I have 2 ideas about encodings to try:
first idea (and the best one in my opinion) is similar to the work of DeepSDF. The idea is to train latent vectors to predict the sdf for each ray. Then this latent vectors will be the input of our NIF architecture. The problem is that we need to compute the sdf for each ray in the sampling approach. I don't know if there is a fast way to do it. Sean Erik
Note that the sdf will be used only to train the latent vectors, so the NIF architecture won't use the sdf but only the latent vectors.second one (more difficult than the first) is a similar work of Neural pull. In this case we do not have ground truth sdf, so the idea is to train latent vectors to "pull" rays towards the nearest surface by using the predicted signed distance values and their gradients, which the network computes. The movement of each rays is determined by the predicted distance and can be either towards or away from the surface, depending on the sign of the distance.
In both the cases, we need to choose an arbitrary number of latent vectors that will be associated to each direction (and this is the reason why I have added the direction to the input features).
I believe that if we can calculate the sdf for each ray the first encoding will be more efficient. I wait for your opinion. Sean Erik
I'm reading some papers on SDFs and have a question: an SDF is used to represent geometric objects, so how do you represent a ray with an SDF?
The idea is to find the point on the ray such that it has the minimum distance to the surface. See the ray marching algorithm or sphere tracing.
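For example, a bare-bones sphere tracing loop looks like this (assuming an sdf(p) function that returns the signed distance of a 3D point p to the surface):

import numpy as np

def sphere_trace(origin, direction, sdf, max_steps=128, eps=1e-4, t_max=1e3):
    """March along the ray, stepping by the SDF value each time; the SDF gives the
    largest step we can safely take without crossing the surface."""
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = sdf(p)
        if d < eps:        # close enough: report the hit point and its distance
            return p, t
        t += d
        if t > t_max:      # left the scene without hitting anything
            break
    return None, None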
Matteo Balice said:
I have currently implemented 3d Gaussian ahah
I am trying a probabilistic approach because I wasn't sure how to correctly extract the SDF for rays.
My idea is to use an autoencoder (giving it only the direction as input) so as to encode, in an embedding, a number n of gaussians that represent the shape of the object for that direction.
In principle we do not need to encode all the "pixels" for a given direction in an embedding; we only need to encode the areas which are the most uncertain.
So I think I will merge this idea with my previous network which was good apart from the boundaries of the object.
I made a few attempts in Python to get comfortable with gaussian splatting (for example, I tried approximating an image using n gaussians) and I was really impressed by how good this technique is.
Matteo Balice said:
So I think I will merge this idea with my previous network which was good apart from the boundaries of the object.
I agree with you. The problem is finding the boundaries.
I have an idea about this. Since I use a sigmoid activation function as the last layer to predict hit/miss, its output is in some sense a measure of the uncertainty of the prediction.
So if it gives me a number around 0.5, it means it is uncertain, so the gaussian should have more weight on that area.
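Something as simple as this is what I have in mind (a sketch; probs here would be the sigmoid outputs of my network):

import torch

def hit_uncertainty(probs):
    """Sigmoid outputs near 0 or 1 are confident, outputs near 0.5 are not.
    Map them to an uncertainty score in [0, 1]: 1 - |2p - 1|."""
    return 1.0 - (2.0 * probs - 1.0).abs()

# e.g. rays with uncertainty > 0.5 could get heavier gaussians (or more samples).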
I think you could consider using Bayesian optimization.
About this, I would also calculate the mean probability that the NN assigns to each misclassified ray, to check whether this can work.
Bayesian networks can output both mean and variance simultaneously
What is your idea?
I'm going a little slow, and I'm still considering how to integrate 3dgs
I've completely given up on grid net.
If you want to know how I plan to implement it: I use an embedding of N gaussians for each direction.
Each gaussian can have whatever number of parameters you want.
Can I refer to your code?
But for simplicity I use only mean and variance, and I think I should also add a weight.
fall Rainy said:
Can I refer to your code?
Ok, later I will upload it to GitHub.
Ok, thx
I tried with a small number of rays (10000) and a small number of gaussians for each direction and the NN is perfectly able to discretize all the rays
The good thing is that I do not need to add any grid encoding to separate each direction, because the autoencoder is able to output the embedding in a continuous way, since the input I give it (only the direction) is continuous.
That sounds great. I'd like to replace the whole rendering process with a 3dgs approach, which might turn into a rasterized rendering
fall Rainy said:
Can I refer to your code?
I have uploaded the code for gaussian splatting.
https://github.com/bralani/rt_volume/blob/neural_rendering2/src/rt/gaussian_splatting.py
Today I made another improvement with gaussian splatting. I decided to associate a single gaussian, with a large variance, with each positive ray hit in the training set. Each gaussian has a mean (origin_theta, origin_phi, dir_phi, dir_theta) and a variance (variance_theta, variance_phi, dir_phi, dir_theta); the dir_phi and dir_theta variances are set to a fixed value of 0.05 because in my opinion we can save some memory this way and the training process will be faster. The only thing the neural network has to do is find the maximum variance of each gaussian such that the negative hits are also correctly classified.
The reason why we want to find the largest variance of gaussians is because of overfitting issues.
This works pretty well with 10000 rays, but the problem is that if we increase the number of examples in the training set even a bit, the NN becomes very complex in the number of parameters.
I believe that this approach has a lot of potential
Tomorrow I will focus on reducing the number of gaussians without losing accuracy
There is a bottleneck in the current NN: as I increase the number of examples in the training set, it goes out of memory.
@Matteo Balice Can you recap the structure of the computations and memory involved with the gaussian approach? How is that related to the Encoder/Decoder networks you have/had in your gaussian_splatting.py
Ok
I should update that file, anyway
The network is very simple: there is an encoder (which has the task to produce the embedding) and a decoder which has the task to produce the output (a probability between 0 and 1)
I'm seeing a 6-layer fully connected network there, where the layers have a pretty hefty number of weights in total, which would explain the memory explosion.
Sean said:
I'm seeing a 6-layer fully connected network there, where the layers have a pretty hefty number of weights in total, which would explain the memory explosion.
It is the old version, I don't use any layer anymore.
I have uploaded the new version now
Okay. In the old, looks like approximately 4MB of memory for that latent_dim=100 construction on just the encoder side.
Sorry, way off... that's better.
Ok, in this new version I associate to each gaussian 4 numbers for the mean and 4 for the variance.
And the number of gaussians are proportional to the number of positive hit in the training set.
When I use 10k samples it works fine, with 100k it is very slow, and with 1 million it goes out of memory
Maybe I should not associate a single gaussian to each positive hit in the training set but I should randomly take a subset of positive hits...
So you're using 10k ray samples currently right? Is that your number of embeddings?
With 10k rays, half are positive hits so about 5k are the embeddings
Okay, but worst case it's 10k?
yes
or is there more on disk? You have it actually using whatever is in the data folder
(just need to make sure you don't have a json with 100M lines or something)
my json has 1 million data
but I take randomly only 10k samples
Heh, okay .. but you're still creating an Encoder based on embeddings, which is based on how much is in your json, not how many samples, unless I'm reading this differently
if you print(embeddings.shape[0]) in get_embeddings, what's that report?
4683
At line 47 I cut the json, so I take only 10k examples
Okay, so it is hitting the else case in the constructor
yes
So in my back-of-napkin calculations, you're really not using much memory at all, nothing that explains running out.
With 10k or with 1M?
The autoencoder network is trivial as you noted, about 0.25MB total (which is dubious for any real model)
I don't (yet) see where you're actually accruing memory in the test iterations.
Unless pytorch is doing something under the hood that isn't being used but is growing
It does not even start training with 1M
You mean all you change is num_epochs = 1000000 and it dies?
That doesn't add up
not epochs
I comment out line 47, so I load all the json
Oooh oh, gotcha -- so you're changing the [:10000] to other values, how much data, how many embeddings
yes
In the original gaussian splatting paper they associate a single embedding with each example, but I don't understand how they don't run out of memory
I don't see how you're running out of memory.
there must be some bug or cleanup issue that is ballooning. Even with 1M samples, that's only about 38MB of data
you surely have more than that available :)
how do you calculate it?
with double precision, your 5 origin+dir+label tensors consume just 40 bytes
mmmm, maybe the issue is with the gradient of pytorch
I remember I had this problem a long time ago
I must do some checks, thanks
at 5000 embeddings, that's not even 1MB for the autoencoding (embeddings + embedding params + proxy vars)
got it
even if it scaled linearly, to 500000 embeddings, that'd be 100MB max
The json file is 100MB, so you are right
Well that's text, but even as 8-byte double precision floats there's just not enough data
I'd suggest adding some print or pause statements and watch the process memory usage, see if some particular operation is increasing usage substantially
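For example, a minimal sketch using psutil to print resident memory at each stage (the dataset/model constructor calls are omitted since they are script-specific):

import os
import psutil

def report_mem(tag):
    # Print the resident set size of this process so we can see which
    # step actually balloons.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"[{tag}] RSS = {rss / 1024 ** 2:.1f} MB")

report_mem("before dataset construction")
# ... build the dataset here ...
report_mem("after dataset construction")
# ... build the model here ...
report_mem("after model construction")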
Ok thanks 👍🏻
got to be something relatively simple. If you were getting to iterations, I would suggest adding a gc.collect() or something to ensure python has opportunity to purge, but clearly something is going on before that even if it doesn't get to iterations.
I don't see it yet, but something is consuming gobs of memory
Yes it is really strange
Thanks
Like I could totally see it if your constructor data hash was making 1M copies of all those string hash keys... but those are reset on each iteration of self.datas.
maybe some linear overhead of torch.Tensor, but that doesn't make sense to me
any change if you replace them with:
origin = torch.tensor(data["point1_sph"], dtype=torch.float64)
dir = torch.tensor(data["dir_sph"], dtype=torch.float64)
label = torch.tensor([data["label"]], dtype=torch.float64)
?
I have not anymore the pc with me
I will try later
Okay. I'd try that but then also exit before the training use and see if you can get a pause before exiting to see how much memory the app is actually using before RayDataset, after RayDataset construction, and after Autoencoder construction, see where it balloons out.
Yep it is what I am going to try 👍🏻
Came across this interesting high-level article posted today, nice generic tutorial... https://gpuopen.com/learn/deep_learning_crash_course/
Screenshot-2024-07-24-alle-23.59.31.png
Ok I got it. @Sean You were right on all the estimations of the memory. The problem is in the decoder, which you hadn't seen, when I calculate the probability of the examples that belong to each gaussian.
I had implemented broadcasting (to speed up calculations), so I clone the embeddings (gaussians) n times, where n is the batch size (number of examples).
In this image the batch size was of 256 examples.
Broadcasting probably isn't a smart way to speed up calculations in this case, since most gaussians are useless for computing the probability of a given example; only the ones closest to the current example are relevant.
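One way to avoid the full broadcast, sketched here under the assumption of diagonal gaussians stored as means/variances tensors, is to evaluate only the k nearest gaussians per example:

import torch

def nearest_gaussian_prob(x, means, variances, k=32):
    # x:         [B, D] query rays
    # means:     [N, D] gaussian means
    # variances: [N, D] diagonal variances
    # Instead of cloning every gaussian for every example, keep only the k
    # closest means per example and evaluate the (unnormalized) pdf on those.
    d2 = torch.cdist(x, means) ** 2               # [B, N] squared distances
    idx = d2.topk(k, largest=False).indices       # [B, k] nearest gaussians
    diff = x.unsqueeze(1) - means[idx]            # [B, k, D]
    logp = -0.5 * (diff ** 2 / variances[idx]).sum(-1)
    return logp.exp().max(dim=1).values           # [B] max response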
I have solved the broadcasting issue that caused the out-of-memory error and I have first results. For the moment I am trying with only 10k rays because the probability calculation is pretty slow.
Here the best epoch of the previous NIF network with only 10k rays:
F1: 0.880218316493941
Accuracy: 0.9283117186132877
Precision: 0.9087676930301201
Recall: 0.8534080886985007
Here, instead the best epoch with this new gaussian splatting network with 10K rays and only 1k gaussians embeddings (I decided to lower the number of gaussians to better explain the power of this model):
F1 Score: 0.9155
Accuracy: 0.9261
Precision: 0.9316
Recall: 0.9000
As you can notice, the gaussian splatting architecture has more power than the NIF architecture BUT it is way slower.
The interesting part is that I achieved this result in only 11 epochs, so the training is not slow; it is the inference that is slow (the forward pass from input to prediction).
For this reason, I will now focus on speeding up the gaussian splatting architecture. I believe I should read some papers about real-time renderings with gaussian splatting so as to achieve this speed up.
One idea I have is pruning some gaussians based on the input I want to predict (i.e. the ones furthest from the ray I want to predict).
P.S: the results I have shown here are not close to the ones of the previous week (accuracy: 0.994 and F1: 0.991) simply because those were achieved with 1 million examples. Here I use only 10k rays just to compare the NIF architecture with Gaussian Splatting.
Today I implemented parallelization of the inference process and the training process is much faster.
However, I had another problem with the covariance matrix. To calculate the pdf of a gaussian we need to invert the covariance matrix, but when it is close to singular it is not possible to invert it. The problem is that the covariance matrices of all the gaussians have small values due to the nature of the problem (if you imagine each gaussian as an ellipsoid in 3D, it has very small magnitude).
To mitigate this problem I decided to use double precision (float64), and for the moment it seems to perform better.
Do you know any other method to solve this issue of matrix ill-conditioning?
Apart from this, tomorrow I will train with more examples to see the true results of this model.
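One common mitigation for the ill-conditioning mentioned above (just a sketch; not necessarily what 3DGS does) is to add a small diagonal jitter and invert via a Cholesky factorization:

import torch

def safe_inverse_cov(cov, jitter=1e-6):
    # cov: [N, D, D] covariance matrices with possibly tiny eigenvalues.
    # Adding jitter * I keeps them invertible, and a Cholesky-based inverse
    # is more stable than torch.linalg.inv on near-singular matrices.
    eye = torch.eye(cov.shape[-1], dtype=cov.dtype, device=cov.device)
    chol = torch.linalg.cholesky(cov + jitter * eye)
    return torch.cholesky_inverse(chol)

Parameterizing the variances in log-space, so they can never collapse to exactly zero, is another common option.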
Today I further sped up the inference process by taking only the gaussians closest to the example to predict. I also tried with more examples (100k) and the F1 metric improves to 94-95%. However, it seems difficult to improve this result further with the current architecture. In my opinion this is due to the number of gaussians, which is now fixed for the whole training. The original paper (3DGS) instead uses a variable number of gaussians, which can be lowered during the training process (if some gaussians have a very small variance) or increased if the accuracy is low.
That's sounding a lot better. I think it'll still need to get into the 99% realm, but that's a distinct improvement. What's that 94%-95% look like?
By the way, had a lovely discussion with one of the authors of AMD's Neural Intersection Function paper today. He said they've made some headway on generalizing themselves, but that's obviously a hard problem. He's probably going to publish on it next year.
One thought I had, and it's obviously in a different direction but maybe applicable is what if we try integrating over multiple networks. That is, use the simple network they demonstrated works well for a single view, but then lets create one for 32 (or 32768) views, get estimated in-hit points from all of them, and integrate spatially.
That's actually not terribly dissimilar from the NeRF approach, but the goal would not be a radiance field. It would be like a surface occupancy field, or a point cloud with weights.
This is really possible. I recently saw some algorithms for interpolation between multiple pictures. Maybe we can predict in key directions and then interpolate for the other directions.
If the direction is constant, the network is learning a function like:
2024-07-28-212551.png
Interpolated fits between different perspectives have been done before:
2024-07-28-212847.png
from VIINTER: View Interpolation with Implicit Neural Representations of Images
Sean wrote:
That's sounding a lot better. I think it'll still need to get into the 99% realm, but that's a distinct improvement. What's that 94%-95% look like?
I have not rendered any frame for the gaussian splatting network yet, because if the metrics are under 0.97-0.98 the predicted image is very far from the true one. I prefer to achieve better results first and then plot it.
Sean wrote:
One thought I had, and it's obviously in a different direction but maybe applicable is what if we try integrating over multiple networks. That is, use the simple network they demonstrated works well for a single view, but then lets create one for 32 (or 32768) views, get estimated in-hit points from all of them, and integrate spatially.
This idea is not far from the idea of grid encoding of direction which I have already tried without any improvement. In that case I used embeddings for each direction, you suggest using a network for each direction. I can try it. :thumbs_up:
fall Rainy wrote:
Interpolated fits between different perspectives have been done before:
2024-07-28-212847.png
from VIINTER: View Interpolation with Implicit Neural Representations of Images
This could work but it means that we have to change the sampling to fixed view sampling. I have some doubts about the number of views we should sample because:
If we choose for example 1 million rays in the training set and we choose a reasonable number of 1000 rays for each view this means that we have a total of 1000 views.
If we distribute uniformly these views around the bounding sphere it means that, remember that theta goes from 0 to pi and phi goes from 0 to 2pi:
In my case I can use half of phi because in my case the problem is simpler and the direction is invariant to orientation (so I have theta that goes from 0 to pi and phi that goes from 0 to pi):
Theta x phi = 3,14 x 3,14 = 9,8596 -> total space
Uniform distribution of views:
9,8596/1000 views = 0,009859 rad = 0,56° between two views.
I think it is a reasonable error and it can be easily interpolated.
In your case, however, you can't reduce the range of phi (because rgb is not invariant to orientations), so you will have:
Theta x phi = 3,14 x 6,28 = 19,719 -> total space
19,719/1000 views = 0,0197 rad = 1,128° between two views.
An error of 1° is reasonable also in your case and if my calculations are exact, it could work even in your case but the error is the double of mine.
I believe we should try this approach using NIF architecture and then interpolating them.
@Sean I have a doubt about rays and pixels: if we want to render a frame of 100x100, does the algorithm sample one ray per pixel (for a total of 10k rays), or is it different?
Matteo Balice said:
I have not rendered any frame for the gaussian splatting network yet, because if the metrics are under 0.97-0.98 the predicted image is very far from the true one. I prefer to achieve better results first and then plot it.
That is absolutely not best practice and not recommended to ignore the predicted images solely based on having low percentages. Not looking gives you no information. Looking may give no information, or may provide helpful clues as to what isn't encoding well.
It can be high frequency detail, it can be a straight up bug where values are simply shifted, it can be low frequency undulations, and more. It's not in your interest to ignore them even if 19 times out of 20 it's just a "drunk wet mess".
Matteo Balice said:
I believe we should try this approach using NIF architecture and then interpolating them.
I will just reiterate what we'd discussed earlier, that there should be two different approaches being taken (in general), or even better two different goals (e.g., 3d shape vs 2d image). There is value in exploring the same method with different implementation detail (on the off chance there is some detail that matters more than anticipated).
Matteo Balice said:
This idea is not far from the idea of grid encoding of direction which I have already tried without any improvement. In that case I used embeddings for each direction, you suggest using a network for each direction. I can try it. :thumbs_up:
It's not far off, but the separate networks is the key. AMD really proved that surface illumination can be almost perfectly encoded for a given view. Maybe if we were to first reproduce their research, that would give more confidence, but lacking that it's not terribly unexpected.
Now that said, I don't think there's a whole lot of difference with random rays in random dirs -- naively I think that can work with the right network and right amount of training. Remains to be proven though. The idea with the 32+ grid views, however, is a compromise, banking on the notion that a single view should converge that view. In other analysis work we're involved with, there's mathematical proofs that lend evidence towards 32 views being a sweet spot approximation for complete random, converging much faster than pure random.
Matteo Balice said:
fall Rainy wrote:
Interpolated fits between different perspectives have been done before:
2024-07-28-212847.png
from VIINTER: View Interpolation with Implicit Neural Representations of Images
This could work but it means that we have to change the sampling to fixed view sampling. I have some doubts about the number of views we should sample because:
If we choose for example 1 million rays in the training set and we choose a reasonable number of 1000 rays for each view this means that we have a total of 1000 views.
If we distribute uniformly these views around the bounding sphere it means that, remember that theta goes from 0 to pi and phi goes from 0 to 2pi:
In my case I can use half of phi because in my case the problem is simpler and the direction is invariant to orientation (so I have theta that goes from 0 to pi and phi that goes from 0 to pi):
Theta x phi = 3,14 x 3,14 = 9,8596 -> total space
Uniform distribution of views:
9,8596/1000 views = 0,009859 rad = 0,56° between two views.
I think it is a reasonable error and it can be easily interpolated.
In your case, however, you can't reduce the range of phi (because rgb is not invariant to orientations), so you will have:
Theta x phi = 3,14 x 6,28 = 19,719 -> total space
19,719/1000 views = 0,0197 rad = 1,128° between two views.
An error of 1° is reasonable also in your case and if my calculations are exact, it could work even in your case but the error is the double of mine.
Again, I would just caution whether we're following research that is attempting to capture shape in the embedding or whether the goal is capturing the shape just barely enough that color, i.e., a visual image, can be constructed that "looks good enough". On quick read, that VIINTER paper appears to be the latter, but I'd have to read it in more detail.
As for total number of views, I think you could try as coarse as 45-degree increments. Resolution will need to be as fine as the smallest detail, which depends on the model size and detail complexity. I'd personally start at 1024x1024, about 1M per view.
That 1024^2 resolution at 45-degree probably means something like 256 or 512 resolution alignment, whatever that cell size resolves to.
Sean wrote:
Matteo Balice said:
I have not rendered any frame for the gaussian splatting network yet, because if the metrics are under 0.97-0.98 the predicted image is very far from the true one. I prefer to achieve better results first and then plot it.
That is absolutely not best practice and not recommended to ignore the predicted images solely based on having low percentages. Not looking gives you no information. Looking may give no information, or may provide helpful clues as to what isn't encoding well.
It can be high frequency detail, it can be a straight up bug where values are simply shifted, it can be low frequency undulations, and more. It's not in your interest to ignore them even if 19 times out of 20 it's just a "drunk wet mess".
Yes, you are totally right. I did not consider rendering it because I got that result with only 100K rays and I wanted to first train the NN with at least 1 million rays. Thanks for the advice.
Sean wrote:
Matteo Balice said:
This idea is not far from the idea of grid encoding of direction which I have already tried without any improvement. In that case I used embeddings for each direction, you suggest using a network for each direction. I can try it. :thumbs_up:
It's not far off, but the separate networks is the key. AMD really proved that surface illumination can be almost perfectly encoded for a given view. Maybe if we were to first reproduce their research, that would give more confidence, but lacking that it's not terribly unexpected.
Now that said, I don't think there's a whole lot of difference with random rays in random dirs -- naively I think that can work with the right network and right amount of training. Remains to be proven though. The idea with the 32+ grid views, however, is a compromise, banking on the notion that a single view should converge that view. In other analysis work we're involved with, there's mathematical proofs that lend evidence towards 32 views being a sweet spot approximation for complete random, converging much faster than pure random.
I was wondering... Wasn't the limit of the previous NN (the NIF network with which I got 0.994 accuracy) simply that the number of examples in the training set was too low? Maybe we should try with even more samples (more than 1 million) to see how it works.
Today I implemented adaptive learning for the Gaussian splatting architecture. I want to first finish and evaluate this architecture before trying with multiple NIFs.
After many optimizations, I got a pretty good result (grid net, fixed direction)
2024-07-31-163626.png
@fall Rainy please elaborate, (and point to latest code!) what's the resolution of the grid net, what are the layers, how many epochs, how long did training take, how long does lookup take, etc...
Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen
Sean said:
Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen
I have some experience with diffusion models for 3D. I did a diffusion network for automatic retopology (for my university).
Regarding Gaussian splatting, I am not so convinced about metrics, tomorrow I will render some frames to see graphical results…
Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh
Sean wrote:
Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh
This is fantastic, I just tried the software of https://github.com/NVlabs/instant-ngp (which is the base code they use) and the training time is of the order of seconds even with my poor GTX 1060!!
Nuts. instant-ngp code license is non-commercial only
starseeker said:
Nuts. instant-ngp code license is non-commercial only
It’s not so bad, all we need to do is understand their paper and the one that Sean sent. The coding part shouldn’t be difficult even if we have to code from 0.
Yeah, the implementation seems pretty straightforward. The main limitation is they didn't get a performance gain. They train for a few minutes, then ray query performance is on par with the ray tracing time. The one big gain they saw was getting that on-par ray tracing time with an order of magnitude less memory use. So exceptional compression in the latent space.
Sean said:
fall Rainy please elaborate, (and point to latest code!) what's the resolution of the grid net, what are the layers, how many epochs, how long did training take, how long does lookup take, etc...
There are three resolutions:
first, I give up bilinear interpolation and instead try to learn a matrix to express the relationship between neighboring vectors
second, I consider neighboring vectors in the range 7×7 instead of 2×2
third, I use a threshold to reduce noise.
here is my codes: https://github.com/Rainy-fall-end/Rendernn/blob/main/networks/gridnet3.py
100,000 points need to be sampled, but the model actually converges when 20,000 points are used. The training takes 2 minutes total (on a 4060).
The yellow curve is the improved gridnet, the purple one is the original gridnet.
2024-08-01-220614.png
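If I read the three changes correctly, the lookup could look roughly like this (a sketch only; the resolution, feature size and the learned 7x7 mixing layer are placeholders, and the output-side noise threshold is omitted):

import torch
import torch.nn as nn

class LearnedNeighborhoodGrid(nn.Module):
    # Sketch: instead of bilinear interpolation over the 2x2 neighborhood,
    # gather a 7x7 neighborhood of grid features and combine them with a
    # learned linear map.
    def __init__(self, resolution=64, feat_dim=8, k=7):
        super().__init__()
        self.res, self.k = resolution, k
        self.grid = nn.Parameter(torch.randn(resolution, resolution, feat_dim) * 0.01)
        self.mix = nn.Linear(k * k * feat_dim, feat_dim)

    def forward(self, uv):
        # uv: [B, 2] continuous coordinates in [0, 1)
        idx = (uv * self.res).long().clamp(0, self.res - 1)
        offs = torch.arange(self.k, device=uv.device) - self.k // 2
        oy, ox = torch.meshgrid(offs, offs, indexing="ij")
        ny = (idx[:, 0:1] + oy.reshape(1, -1)).clamp(0, self.res - 1)   # [B, 49]
        nx = (idx[:, 1:2] + ox.reshape(1, -1)).clamp(0, self.res - 1)   # [B, 49]
        feats = self.grid[ny, nx]                                       # [B, 49, feat_dim]
        return self.mix(feats.flatten(1))                               # [B, feat_dim]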
Sean said:
Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh
But this one looks so much better than mine....
Cool, thanks! I'll take a look in more detail. Looks like it converges pretty quickly? How long did it take to get to step 200?
fall Rainy said:
Sean said:
Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh
But this one looks so much better than mine....
Don't worry about that -- this is very ripe area of research.
About 30 seconds.
Sean said:
Cool, thanks! I'll take a look in more detail. Looks like it converges pretty quickly? How long did it take to get to step 200?
It also means it's worth exploring all avenues as the details matter.
Matteo Balice said:
Sean wrote:
Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh
This is fantastic, I just tried the software of https://github.com/NVlabs/instant-ngp (which is the base code they use) and the training time is of the order of seconds even with my poor GTX 1060!!
This looks like it's implemented in C++, are you going to reproduce it in pytorch?
fall Rainy wrote:
Matteo Balice said:
Sean wrote:
Shame I didn't notice/read this paper sooner, but looks like this siggraph paper is right on track with what we're trying to achieve with impressive multiviewer results: https://weiphil.github.io/portfolio/neural_bvh
This is fantastic, I just tried the software of https://github.com/NVlabs/instant-ngp (which is the base code they use) and the training time is of the order of seconds even with my poor GTX 1060!!
This looks like it's implemented in C++, are you going to reproduce it in pytorch?
I am still working through all the ideas of those two papers, but yes, probably I will reproduce it in pytorch.
The N-BVH paper was an outstanding talk -- will see if I can get a copy, but it's a direct response to AMD's NIF paper. The key insight was to not use the ray origin+dir but to instead use 3 sample points along the ray on the interior of the bounding volume, along with a BVH.
https://weiphil.github.io/portfolio/neural_bvh
The key insight of using interior points was demonstrated with just 3-10 sample points which of course made training slower as points are added, but achieved essentially perfect occupancy recall even with high frequency detail. Adding in a BVH was a training optimization so they could reduce it back down to just 3 points per BVH node.
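For reference, a minimal sketch of "k sample points along the ray inside the bounding volume" (here simply uniform between the ray's entry and exit points; the paper distributes them per BVH node):

import torch

def sample_points_along_ray(p_in, p_out, k=3):
    # p_in, p_out: [B, 3] entry and exit points of each ray on the bounding volume.
    # Returns [B, k, 3] points spaced uniformly along the interior segment.
    t = torch.linspace(0.0, 1.0, k, device=p_in.device, dtype=p_in.dtype)
    return p_in.unsqueeze(1) + t.view(1, k, 1) * (p_out - p_in).unsqueeze(1)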
I was studying that paper and had a few questions:
This means their training set consists of 2^18 × 100 rays, equating to approximately 26 million rays. Moreover, they state that the training time is at most 2-3 minutes.
I just discovered that it is indeed possible to use such a large batch size (I had been training NIF with a batch size of just 512 rays until now...). However, theory suggests that very large batch sizes can hurt generalization. Likely, with such a large training set, they do not encounter this issue.
What puzzles me most is how they can achieve convergence in just about 2-3 minutes.
To investigate, I increased the batch size for the NIF model (with which I previously achieved an accuracy of 0.994). As expected, the model trains faster (20 minutes for 1 million rays), and the performance metrics remained roughly the same (just a little worse).
However, convergence for NIF is much slower than their approach, meaning that training a simple NIF requires more time and results in lower accuracy. Likely, the bounding volume hierarchies approach they use significantly helps the neural network.
That 262144 is almost certainly a 512x512 grid, 100 different views
Now we have to take performance with a grain of salt. I didn't see the hardware, but if they're on a high-end GPU, that might be an hour of training on a CPU..
In their talk the BVH optimization using 3 samples vs 10 samples cut the training time roughly in half (1min)
Sean wrote:
Now we have to take performance with a grain of salt. I didn't see the hardware, but if they're on a high-end GPU, that might be an hour of training on a CPU..
They used an RTX 3090
How do you estimate that it would be an hour on a CPU?
I decided to implement first the multi-resolution hash grid and validate it, then on top of this I will add the BVH approach so as to integrate rays in the 3D grid.
I found this wonderful repository https://github.com/ashawkey/torch-ngp and I am using this as a base code for the multi-resolution hash grid.
I implemented the multi-resolution hash grid and I tested it on top of my previous NIF network. It's incredible how it converges in just a few seconds even on my poor gpu.
It achieves only 96% F1, but that was expected since this encoding is not appropriate for ray+dir input (as we have in NIF); it is appropriate for 3D points (as in N-BVH).
Tomorrow I will focus on this part :+1:
Today I tried the multi-resolution hash grid, sampling N points along the ray. In this picture I sampled 70 points along each ray. I want to mention that this is without the BVH approach, only the multi-resolution grid. The F1 metric is about 0.985 and it converges in a few seconds. The training set was about 3 million rays and the picture is 512x512. I noticed that increasing the number of sampling points improves the metrics (and without the BVH approach I have to use a lot of sampling points).
Before implementing the BVH approach I want to try with more samples (like the paper -> 26 million).
Figure_1.png
Here is the prediction for that image, training with only this fixed direction, just to prove that this model is perfectly able to discretize the scene IF the number of training samples is sufficient.
Now I will train with 26 million samples.
@Sean @Erik Generating 26 million samples takes me several hours. Are there any ways to optimize the ray tracing algorithm in BRL-CAD to speed up the process?
Figure_1.png
This is with 6 millions rays.
Figure_1.png
This is with 6 million rays BUT generated from 22 different and fixed views (512x512x22) instead of using random sampling. Metrics are higher with this sampling: I got 0.997 in both accuracy and F1.
Curious, @Matteo Balice why 22? That's what, 16 degrees or so in one axis of rotation?
Well that is not exactly 22 views. I finally successfully sampled 100 views (thanks to my Mac M1), each with 512x512 rays, for a total of 26 million. However, when I load all these rays in PyTorch I go out of memory… so I decided to keep only 6 million rays.
But before cutting I randomly shuffle the 26 million rays, so it is not right to say that they are 22 views.
Now I am trying a way to load more samples
So it’s better to say that we have 60k rays for each view (100 views)
Figure_1.png
10 million. Getting better :mechanical_arm:
Even though there is not much difference with 6 million... But I noticed that with 10 million it did not converge in 10 epochs like in the 6 million case... Probably I need to train more.
(just to be clear: I always print this view because I noticed it’s one of the most difficult to render, but the model is able to render also all the others views)
Matteo Balice said:
Sean Erik Generating 26 million samples takes me several hours. Are there any ways to optimize the ray tracing algorithm in BRL-CAD to speed up the process?
You can consider using multithreading
Sean said:
Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen
But I remember brl-cad doesn't seem to be based on b-rep modeling?
A great implementation of the hash encoding: HashNeRF. It is sad to find that many of my ideas have already been realized, but I'll finish my other ideas on that basis
Figure_4.png
This is with 26 million samples. I had to increase the resolution of the hash grid, but a lot of white dots appear. It seems that we need to add the BVH approach as well so as to remove all these noisy dots.
The F1 metric is improved to 0.998
Don't quite understand how you'd run out of memory.. Is that with replicated view information or with offsets?
Even with 6 independent doubles (xyz+dir), that should be about 1.1GB, and depending on how that's encoded, that could be reduced to just 4 floats (azel on bounding sphere + azel direction) which is about 400MB for 100x512x512 views.
Because when I loaded the dataset, I computed 70 xyz points along each ray. So I had in memory 70 xyz points for 26 million rays
Now I compute these 70 points only in the forward method of the neural network (so only for the batch).
For the current batch
Tomorrow I will upload the code on GitHub
fall Rainy said:
Sean said:
Really cool paper implementation on how to encode BREP in a NNet... https://github.com/samxuxiang/BrepGen
But I remember brl-cad doesn't seem to be based on b-rep modeling?
Yes and no, @fall Rainy ... BRL-CAD does have support for BREP models. They import, display, and raytrace. There's even some basic tessellation (conversion) and preliminary export support. There's just not much support yet for editing and we want ray tracing performance to be better before we push it harder.
It's fundamentally no different than all the other primitives, can be used in boolean expressions (which raytrace just fine), can be volumetric/solid or plate-mode like meshes. There's also some direct Boolean evaluation support which is converting BREP used in CSG expressions to BREP without CSG, but that work is incomplete.
What's really cool about that paper is they figured out how to encode solid geometry (in BREP form) into a NNet. Not only is that a general concept that extends to other geometry forms, it's a way to actually encode CAD in the latent space, not just SDFs or volume grids or radiance fields.
Matteo Balice said:
Now I compute these 70 points only in the forward method of the neural network (so only for the batch).
Why 70 points?? The paper demonstrated complete convergence with less than 10...
Sean said:
Matteo Balice said:
Now I compute these 70 points only in the forward method of the neural network (so only for the batch).
Why 70 points?? The paper demonstrated complete convergence with less than 10...
Yes but because they use the BVH
I don’t have implemented it yet
This will be the next step
If I recall correctly, they did not use BVH in their first iterations -- they went from 3 points to 10 points to get convergence.
They introduced a BVH to make the performance of 10 points take less time than the original 3 points.
So in theory, it should converge just fine with 10 points, just not quickly. Also means 70 should converge, but in 7x time or more.
Figure_6.png
Here is the figure with only 10 points... As you can see it's pretty weird and the F1 metric is only about 0.97... Instead, with 70 points I got 0.998 F1
The accuracy should be higher (according to the paper) IF the points are sampled near the surface. So even with 10 points it should be ok IF they are sampled near the surface. But how can we guarantee that they are sampled there if we do not know the intersection of the ray with the surface?
This is the reason why I sample uniformly along the ray (all the points are equally spaced along the ray)... And this is the reason why increasing the number of points makes it converge with higher accuracy.
If there are ideas about smarter sampling along the ray, they should improve the model a lot...
Matteo Balice wrote:
Figure_6.png
Here is the figure with only 10 points... As you can see it's pretty weird and the F1 metric is only about 0.97... Instead, with 70 points I got 0.998 F1
Let's think for instance about the torus in the figure. Why is it so ugly? In my opinion, since we have rays that start and end on a bounding sphere, if we sample along these rays there is a medium/high probability that none of the sampled points is close to the torus, since it is very thin. And this is the reason why the cube and the sphere are better represented (their volume is larger, so it is more likely that the sampled points fall close to the cube/sphere).
In my opinion, if we use BVHs that wrap the surface, not only can we use fewer sampling points but I think the accuracy should also increase.
@Sean Is the BVH algorithm already implemented in brl-cad?
Matteo Balice said:
Figure_4.png
This is with 26 million samples. I had to increase the resolution of the hash grid, but a lot of white dots appear. It seems that we need to add the BVH approach as well so as to remove all these noisy dots.
Adding a threshold layer before the output may solve this problem.
I'd like to combine these tricks with the hashencoder to see how much improvement can be gained
fall Rainy said:
Sean said:
fall Rainy please elaborate, (and point to latest code!) what's the resolution of the grid net, what are the layers, how many epochs, how long did training take, how long does lookup take, etc...
There are three resolutions:
first, I give up bilinear interpolation and instead try to learn a matrix to express the relationship between neighboring vectors
second, I consider neighboring vectors in the range 7×7 instead of 2×2
third, I use a threshold to reduce noise.
here is my codes: https://github.com/Rainy-fall-end/Rendernn/blob/main/networks/gridnet3.py
100,000 points need to be sampled, but the model actually converges when 20,000 points are used. The training takes 2 minutes total (on a 4060).
The yellow curve is the improved gridnet, the purple one is the original gridnet.
2024-08-01-220614.png
Of course, there are some other things that need to be improved
Figure_13.png
I finally achieved an F1 of 0.9991, training with 200 points for each ray and 26 million total rays. I modified the prediction so that for each ray I take only the maximum over those 200 points, and if it is greater than 0.5 it is a hit, otherwise it's a miss. It's like we have a voxel grid, in which for each voxel we have a probability of hit/miss.
Of course it's much slower BUT I have an idea. We can train the multi-resolution grid in this way (using a lot of points for each ray), but after the training we can take the already-trained grid and build another model on top of it (without editing the grid) that tries to predict the right "voxel" for each ray. We can leverage the fact that nearby rays hit nearby "voxels".
And we could try to build a hashmap (similar to the grid encoding) so that for each input ray we have O(1) complexity to retrieve the right voxel; then the inference process would be very fast!
If this works, it could be even faster than the N-BVH paper!
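A compact sketch of the max-over-points readout being described (the encoder stands in for the hash-grid + MLP and is assumed to return one hit logit per 3D point; shapes are illustrative):

import torch

def ray_hit_probability(encoder, points):
    # points:  [B, n_points, 3] samples along each ray (e.g. 200 per ray).
    # encoder: any callable mapping [M, 3] points to [M] hit logits.
    B, n, _ = points.shape
    logits = encoder(points.reshape(B * n, 3)).reshape(B, n)
    probs = torch.sigmoid(logits)
    # A ray is a hit if any "voxel" along it is a hit, so take the max.
    return probs.max(dim=1).values                       # [B]

# hit = ray_hit_probability(model, pts) > 0.5            # per-ray decision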
Matteo Balice said:
Sean Is the BVH algorithm already implemented in brl-cad?
Yes there is, see src/librt/cut_hlbvh.*
See it in use in clt_prep() in src/librt/prep.cpp
fall Rainy said:
- Improved initialization strategy. During my training, I found that I could initialize the output of the model to 0, 0, 0 instead of [127.5,127.5,127.5] and it would be beneficial for the model to converge
That's one of the optimizations mentioned in the paper -- the model is not only normalized in position, but also scaled/centered/normalized in size. So values are all 0 to 1 or -1 to 1 for XYZ. That was pretty essential in limiting the training space.
Matteo Balice said:
Figure_13.png
I finally achieved an F1 of 0.9991, training with 200 points for each ray and 26 million total rays. I modified the prediction so that for each ray I take only the maximum over those 200 points, and if it is greater than 0.5 it is a hit, otherwise it's a miss. It's like we have a voxel grid, in which for each voxel we have a probability of hit/miss.
Please show the code for what you're doing here? That's definitely interesting results, but I'm still not understanding why you need 70 or 200 sample points. I get that it's sampling like a voxel grid, but that shouldn't be necessary (and degenerates to a simple grid query). Implies some fundamental difference in setup or evaluation. What's the network you're using at this point?
Sean wrote:
Matteo Balice said:
Figure_13.png
I finally achieved an F1 of 0.9991, training with 200 points for each ray and 26 million total rays. I modified the prediction so that for each ray I take only the maximum over those 200 points, and if it is greater than 0.5 it is a hit, otherwise it's a miss. It's like we have a voxel grid, in which for each voxel we have a probability of hit/miss.
Please show the code for what you're doing here? That's definitely interesting results, but I'm still not understanding why you need 70 or 200 sample points. I get that it's sampling like a voxel grid, but that shouldn't be necessary (and degenerates to a simple grid query). Implies some fundamental difference in setup or evaluation. What's the network you're using at this point?
I'm going to upload the code in half an hour more or less.
@Sean @fall Rainy https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh
We can summarize the neural network in this picture:
In the prediction (forward method), I take the ray, sample n points along the ray, then I pass these to the encoder, and finally the embeddings are scaled to a probability in [0, 1]
In the end, for each ray I take the maximum over all the points sampled, and this works well because:
This neural network works well for training the multi-resolution hash grid. My idea is to use the trained grid of this network as a base for another, much simpler network that uses fewer points. Do you have any ideas on how to achieve this (I proposed the hashmap idea; the paper used the BVH approach, for instance)?
I would also point out some differences with respect to the nbvh paper:
image.png
Overall my network is much faster than theirs (if we do not take into consideration the sampling points part).
Screenshot-2024-08-11-alle-21.27.55.png
Screenshot-2024-08-11-alle-21.28.26.png
Screenshot-2024-08-11-alle-21.28.51.png
I have implemented a hierarchy of bounding boxes, like a tree.
My idea is to leverage this structure so as to retrieve all the leaf nodes intersected by a given ray.
Screenshot-2024-08-11-alle-21.30.39.png
And moreover, we can precompute all the leaf nodes for all the rays with a given tolerance.
In this way the sampling part of the NN should be way faster (fewer points to sample)
with hashnet: 1 million data, 1 minute to train(4060)
hashnet.png
fall Rainy wrote:
with hashnet: 1 million data, 1 minute to train(4060)
hashnet.png
Does this work with arbitrary rays or only with fixed directions?
In a small range
Matteo Balice said:
fall Rainy wrote:
with hashnet: 1 million data, 1 minute to train(4060)
hashnet.png
Does this work with arbitrary rays or only with fixed directions?
I am currently recording the prediction times of my network and I got these results (for 1024x1024) with a batch size of 8k:
Here in the picture there are the times of the paper:
In my opinion my network is better in rendering times because I use batch sizes of only 8k. They instead used 260k as the batch size.
I have bought an RTX 4070 and in the next few days I will set it up on my PC
I will re-record times on my new rtx using their same batch size.
Matteo Balice said:
I am currently recording the prediction times of my network and I got these results (for 1024x1024) with a batch size of 8k:
- if I sample 200 points I have 0.1004147 s = 100ms
- if I sample 3 points (like in the paper) I have 0.123648 s = 124 ms
So it seems that the rendering times are independent of the number of points sampled (and I cannot understand why). It seems pretty strange.
This may be due to the GPU's acceleration in matrix computation
I think I need to compare the same object as in the paper.
And using the same batch size as theirs
Because otherwise results are not comparable
I installed the new RTX 4070, but I discovered today that the power supply is no longer sufficient. I’ve ordered a new power supply, but it hasn’t arrived yet (hopefully it’ll be here by tomorrow), so I’ll be without a PC for a couple of days.
It would be very good if you're going to follow their approach @Matteo Balice to see if you can indeed match their results. If you can, then everything you're learning and asserting with different geometry is new insight. If you can't, then that may lead to discovering where there are differences/bugs/issues/assumptions that need to be considered.
img-2025_Ym53QguN.mp4
Today I implemented a 3D visualizer directly in python.
In this way it's easier to see how much the model is predicting well.
All the frames are rendered with the neural network.
This is the "statuette" model by the nbvh paper with the memory and render times.
image.png
These are the memory and render times of my network in pytorch:
Memory: 27mb
Render time: 20,68 ms
Regarding the memory, I believe that the grid network I use as the base model does not compress the grid itself very much.
About rendering times, I think they are good, because of course python is much slower than C++ (their code is in C++).
Moreover, you have to take into consideration a small overhead in my rendering time due to the conversion from spherical to cartesian coordinates (because my training set is in spherical coordinates like the previous methods), but this can easily be avoided.
About the error, I got the same as the last model (F1 of about 0.9991)
@Sean @fall Rainy I have figured out why in my network using N = 200 has the same speed as N = 3. The difference with the NBVH network is that they encode N = 3 points and then concatenate these 3 points in order, so as to also include the direction of the vector. But the main difference is that the input of their network has 3 * number_of_features per query.
In my case it is different: I take 200 points BUT I compute all of these 200 points in parallel, so the input of my network is only 1 * number_of_features per point, because I do not need the direction of the ray (I just have to predict whether the voxel is hit/miss).
photo_2024-08-19_13-41-50.jpg
(sorry for my bad handwriting).
I had a talk on LinkedIn with Philippe Weier, the author of the nbvh paper, and he told me that the inference time for one ray depends on the time to compute the intersection of the ray with the first node + the time to get the three points + the time of prediction. Then, if it is a miss, you add the time to compute the intersection with the second node + etc...
I believe that their network can be improved using this approach:
instead of using the bvh, we can simply take N = 201 points and merge them, in order, into triples of three points (67 triples). In this way we can parallelize all these computations on the GPU and take only the information from the first hit. I think it should be faster than the bvh approach.
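A sketch of how that triple-based variant could be wired up (my reading of the proposal; the encoder taking a concatenated 9-value triple is an assumption):

import torch

def first_hit_from_triples(encoder, points):
    # points:  [B, 201, 3] ordered samples along each ray, entry to exit.
    # encoder: hypothetical network taking a concatenated triple (9 values)
    #          and returning one hit logit per triple.
    B = points.shape[0]
    triples = points.reshape(B, 67, 3, 3).reshape(B * 67, 9)   # 67 ordered triples
    probs = torch.sigmoid(encoder(triples)).reshape(B, 67)
    hit_mask = probs > 0.5
    # Index of the first hit triple along the ray (67 means the ray misses).
    idx = torch.arange(67, device=points.device)
    return torch.where(hit_mask, idx, torch.tensor(67, device=points.device)).min(dim=1).values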
It sounds a little like volume rendering
Yes, it's like it
Today I was thinking that the multi resolution hash encoding isn't very far from nif network
For example the grid encode they use has the bilinear interpolation
But it is more like voxels
Maybe attach more information, like whether the point is visible or not, and then do an integration at the end
Matteo Balice said:
Today I was thinking that the multi resolution hash encoding isn't very far from nif network
Yes, it has the advantage of being more efficient in the use of voxels
fall Rainy ha scritto:
Maybe attach more information, like whether the point is visible or not, and then do an integration at the end
Yes
I'd like to make some attempts as well, do you currently have the code for that?
No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh
Actually, I think it's kind of like a combination of nbvh and nerf.
Matteo Balice said:
No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh
Thx
I got another improvement in both rendering time and memory. Talking with the author of nbvh, he suggested I use only 1 number as the feature dimension of each level (and he was right, because I only need to predict visibility, i.e. the shape of the object).
Here it is the grid:
image.png
Moreover, now the statistics are:
memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere, otherwise the time is lower)
These statistics are independent of the object encoded. I have tried with moss.g and statuette (one of their objects) and the reason is very simple: I do not use any bvh, so every ray will have the same inference time IF the grid is the same between objects. The only thing that can change between objects is the F1 metric, because some objects have small details that are more complex to encode, but this grid works for both moss.g and statuette.
I have written some metrics here if you are interested:
NBVH
hardware: RTX 4070 12GB
batch size: 2^14 (16'384 rays)
Both models achieve 0.9991 F1 with 26 million rays in about 10 minutes
memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere)
MOSS MODEL
training set: 1 million rays
training time: 47 seconds
F1: 0.9958
training set: 5 million rays
training time: 4 min 10 sec
F1: 0.9974
STATUETTE MODEL
training set: 1 million rays
training time: 40 seconds
F1: 0.9942
training set: 5 million rays
training time: 4 min
F1: 0.9972
Matteo Balice said:
No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh
I read these codes. It still looks like gridnet&hashnet? not nbvh
fall Rainy wrote:
Matteo Balice said:
No, I have only my network to predict hit/miss here but you can easily adapt for your task https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh
I read these codes. It still looks like gridnet&hashnet? not nbvh
Yes I call it nbvh but it is only hashnet and gridnet
Matteo Balice said:
I got another improvement in both rendering time and memory. Talking with the author of nbvh, he suggested I use only 1 number as the feature dimension of each level (and he was right, because I only need to predict visibility, i.e. the shape of the object).
Here it is the grid:
image.png
Moreover, now the statistics are:
memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere, otherwise the time is lower)
These statistics are independent of the object encoded. I have tried with moss.g and statuette (one of their objects) and the reason is very simple: I do not use any bvh, so every ray will have the same inference time IF the grid is the same between objects. The only thing that can change between objects is the F1 metric, because some objects have small details that are more complex to encode, but this grid works for both moss.g and statuette.
what do you mean about 1920×1080?
The resolution of the rendering
I am using your rgb training set now and I have one picture to show you
I've actually found out before. And for rgb prediction, dimension=3 is better
Matteo Balice said:
I got another improvement in both rendering time and memory. Talking with the author of nbvh, he suggested I use only 1 number as the feature dimension of each level (and he was right, because I only need to predict visibility, i.e. the shape of the object).
Here it is the grid:
image.png
Moreover, now the statistics are:
memory usage: 4,3mb
max inference time: 5,2 ms in 1920x1080 (assuming all rays hit the bounding sphere, otherwise the time is lower)
These statistics are independent of the object encoded. I have tried with moss.g and statuette (one of their objects) and the reason is very simple: I do not use any bvh, so every ray will have the same inference time IF the grid is the same between objects. The only thing that can change between objects is the F1 metric, because some objects have small details that are more complex to encode, but this grid works for both moss.g and statuette.
image.png
The torch.max is not suitable for rgb prediction
You can do it with this:
x = self.output_layers(x)
x = torch.sigmoid(x) # map output to 0,1
x = F.threshold(x, 0.1, 0.0) #Reducing noise
x = x * 255 #map output to 0,255
Is this differentiable?
Yes (I think so)
Because I tried something similar but it wasn't differentiable
Have you tried rgb with hashnet&gridnet?
This is my res:
91aa8fe67e981e4ed47dc11c6bda546.png
Does it work with all directions?
with 1million rays to train
Just for fixed direction
Ah ok
For all directions, I'm getting similar results to you.
For all directions, my predictions for binary classification actually turned out pretty well
I'll organize the results tomorrow.
ok good
Matteo Balice said:
image.png
The torch.max is not suitable for rgb prediction
This image doesn't look like it's been rendered by brl-cad
it's in python
OK, got it
they are rgb colors of the pygame renderer
it's a python library
I will show all of my results here: neural rendering
Matteo Balice said:
they are rgb colors of the pygame renderer
The resulting graph is very nice.
which graph
Matteo Balice said:
image.png
The torch.max is not suitable for rgb prediction
this one
Looks more like a point cloud map.
well, this is because at the moment if the ray intersects 2 or more surfaces, my network does not know which color to show
so it seems weird
OK, got it
@fall Rainy I believe it's just a matter of choosing the right loss function and the right hyperparameters.
If we add also the position of where the ray intersects the surface, I believe we will achieve a great accuracy.
(The training set is 1 million)
for all direction?
yes
That's amazing
just hashnet/gridnet?
yes
Wow
well it is not a surprise
it was already done in the nbvh paper
Matteo Balice said:
If we add also the position of where the ray intersects the surface, I believe we will achieve a great accuracy.
I've created such a dataset before, but it didn't work well and I gave up on it
on which network?
Matteo Balice said:
it was already done in the nbvh paper
But I don't think you're using the bvh?
Matteo Balice said:
on which network?
gridnet
the bvh is useful only to decrease the number of points but the network is exactly hashnet + gridnet
As I said, I use a different sampling approach (I parallelize the points)
OK, got it. What did you do?
I just extended my previous network with rgb
I take the first hit along the 200 points
And I take the embedding of that voxel to show the rgb color
wait a minute. you just predict hit points?
Well, something similar; the network is able to figure out by itself which voxels are hit or not
Because all the rays that miss the object will put 0 on all the voxels along their path
Instead, for a ray that is a hit, there must be at least one voxel along the path that is a hit.
How many parameters did you enter? Spherical or Cartesian coordinates?
All the points along the path are in cartesian coordinates
But they are scaled as if they were in a sphere of radius 1.
I need this to make hashnet work within the range [-1, 1]
Do you want to see the code?
yes
Ok some minutes, I am going to upload it.
I'm a little confused.
https://github.com/bralani/rt_volume/tree/neural_rendering/src/rt/nvbh
The file nvbh_rgb.py
OK, thx.
I'll try it after I submit my final evaluation.
As you can see, it understands the volume by itself
epoch 5
image.png
There are still a lot of improvements to do, first I do not encode the direction of the ray
Matteo Balice said:
There are still a lot of improvements to do, first I do not encode the direction of the ray
I'm getting more and more confused, wait until I read your code
haha ok
If you want to get more information, like distance, you can find it here:
(void)rt_shootray(&a);   /* fire the ray */
*dist = a.a_dist;        /* ray distance */
*hit_ = a.a_user;        /* user field (used here as the hit flag) */
@fall Rainy it’s better with the intersection point
This is trained with 10 million rays but only in 5 epochs
It can be even better
I train in parallel the shape of the object and also the rgb. Probably it is better to separate the two networks because the second one (rgb) depends on the first (shape-> hit/miss)
(Ignore the fact that the object stretches when I rotate, it's a bug in the camera…)
If you look below the cube there is a color error; probably there aren't any rays in the training set that go below the cube :joy:
I have also added the direction of the ray, and as you can see it is essential in the prediction of the color, because the same intersection point can have a different color if the direction is different (see how the light on the surface changes).
Again, ignore the fact that the shape stretches.
@Matteo Balice I just read your code. Any example for your dataset?
Wait a moment
(deleted)
https://drive.google.com/file/d/1G-HR5PSxtaXoZCaQYEKGXHrx1sTcfSwh/view?usp=drive_link
I don't have access to this link.
Just sent an access request
I have accepted
OK thx
The code you read is not updated. In that code I did not handle the intersection point.
Can you update it now?
But in the dataset there are (for each example i.e each row):
fall Rainy wrote:
Can you update it now?
Yes just some minutes
Done
Got it, Thx
What do these two spherical coordinates represent?
the first and second intersection on the bounding sphere
It looks like you've trained two networks, one for determining if it's a hit or not, and the other for predicting rgb values
yes
I see how you're controlling the direction, the two intersections actually determine the direction of light
yes but it is controlled also by the order of the points sampled close to the hit surface
which are sampled thanks to the two intersections, as you said
How do you get this loss function:
def loss_fn(output_label, labels, dist, all_outputs):
    mask = (labels > 0.5).squeeze()
    # Apply the soft threshold
    indices_first = find_index_with_exponential_decay(all_outputs.view(-1, int(n_points))).view(-1, 1) / (int(n_points))
    # L1 loss between the soft indices and dist
    loss1 = nn.L1Loss()(indices_first[mask], dist[mask])
    loss0 = nn.BCELoss()(output_label, labels)
    return loss0 + loss1
loss0 is simply the loss for hit/miss
loss1 instead is the loss for the intersection point on the surface
These are the steps:
1) I sample n_points on the segment between intersection1 and intersection2
2) I calculate the probability of hit for each point
3) I need to get the first hit point among these n_points, but I need to get it in a differentiable way to calculate the loss function, so the function find_index_with_exponential_decay covers this issue
4) Thanks to the hit point prediction I can calculate the distance to the true hit point.
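To make step 3 concrete, here is a minimal sketch of one way such a differentiable "first hit" index can be computed (my own illustration of the idea; the actual find_index_with_exponential_decay may differ):

import torch

def soft_first_hit_index(probs, sharpness=10.0):
    # probs: (batch, n_points) per-point hit probabilities in [0, 1].
    # Returns a differentiable estimate of the index of the first confident hit.
    cum_before = torch.cumsum(probs, dim=1) - probs        # sum of p_j for j < i
    weights = probs * torch.exp(-sharpness * cum_before)   # mass decays past the first confident hit
    weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)
    idx = torch.arange(probs.shape[1], device=probs.device, dtype=probs.dtype)
    return (weights * idx).sum(dim=1, keepdim=True)        # expected index, shape (batch, 1)

Dividing the result by n_points gives a value in [0, 1] that can be compared against the normalized hit distance with an L1 loss, as in loss_fn above.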
I organized your code; after all, it's not good to put all the code in one file: neural_rendering
fall Rainy said:
I organized your code; after all, it's not good to put all the code in one file: neural_rendering
This morning I made another improvement: I have merged the two networks into one (now it is faster to train) and I have added the true normalized direction as input to the MLP. Training with the full dataset, there is an average error of 6 for each RGB channel.
We are very close :mechanical_arm:
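For reference, a minimal sketch of what such a merged network could look like (a simplification: the feature encoder, layer sizes and heads are placeholders; the real model uses a multi-resolution hash encoder and its own head layout):

import torch
import torch.nn as nn

class MergedNIF(nn.Module):
    # Single network predicting per-sample-point occupancy and RGB from an
    # encoded point feature plus the normalized ray direction.
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.occupancy_head = nn.Linear(hidden, 1)  # hit/miss probability per point
        self.rgb_head = nn.Linear(hidden, 3)        # color per point

    def forward(self, point_feat, ray_dir):
        h = self.trunk(torch.cat([point_feat, ray_dir], dim=-1))
        return torch.sigmoid(self.occupancy_head(h)), torch.sigmoid(self.rgb_head(h))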
video-2024-08-25-16-46-36_R9pM18PD.mp4
true.png
pred.png
@fall Rainy this is the current difference if you train the NN (the prediction is computed directly in python)
It seems the prediction is brighter for some reason
@Sean Are there any post production processes in brlcad before saving the render?
Because I cannot explain why the light is brighter :distrust:
(the prediction seems more realistic than the true one :joy: )
Matteo Balice said:
true.png
pred.png
fall Rainy this is the current difference if you train the NN (the prediction is computed directly in python)
These are my results (rendered by BRL-CAD)
1fba315da2f2ce6c381664d32c767a8.png
I think it may be that the rendering parameters are different
fall Rainy said:
Matteo Balice said:
true.png
pred.png
fall Rainy this is the current difference if you train the NN (the prediction is computed directly in python)
These are my results (rendered by BRL-CAD)
1fba315da2f2ce6c381664d32c767a8.png
I think it may be that the rendering parameters are different
Is this the prediction?
No, the true result
Have you tried training the NN?
Yes, I got the same result as you
I'm trying to add some of my previous improvements on hashencoder to your network.
OK, you are more of an expert in RGB than me; I will wait for you
The L1 loss says there is an average error of 6 along each RGB channel. But I think this value is not uniform across pixels, because if you look at the shadows it is nearly perfect. So I believe this error is an average between the bright areas (more than 6) and the shadowed ones (almost perfect).
Yes, it is easier to train the shadow parts
true
I also merged the two networks into one: https://github.com/Rainy-fall-end/neural_rendering/tree/merge_network
Matteo Balice said:
fall Rainy said:
I organized your code; after all, it's not good to put all the code in one file: neural_rendering
This morning I made another improvement: I have merged the two networks into one (now it is faster to train) and I have added the true normalized direction as input to the MLP. Training with the full dataset, there is an average error of 6 for each RGB channel.
Good
I have also noticed that the distance loss (loss1) is essential to predict the RGB color
Maybe the error on the color is due to the small error on the loss1?
I don’t think so
I think some weight could be added in front of the loss?
def loss_fn3(output_label, labels, output_dists, dists, output_rgb, rgb):
    mask = (labels > 0.5).squeeze()
    # Apply the soft threshold: differentiable index of the first hit point
    indices_first = find_index_with_exponential_decay(output_dists.view(-1, int(n_points))).view(-1, 1) / (int(n_points))
    # L1 loss between the soft indices and the true distances
    loss0 = nn.BCELoss()(output_label, labels)
    loss1 = nn.L1Loss()(indices_first[mask], dists[mask])
    loss2 = loss_fn2(output_rgb, rgb)
    return 0.2*loss0 + 0.2*loss1 + 0.6*loss2
Well, I don't think it will change the convergence, because the encoder for loss2 and the one for the other two losses are separate
In other words, the gradients of the RGB branch are separate from those of the other two losses
I believe there is still something missing
Maybe we need to make the RGB encoder more complex
@fall Rainy If you look at my last commit from yesterday, I have changed the input of the RGB encoder. There is no need to give 5 points as input; 1 is enough if we add the direction of the ray directly.
https://github.com/bralani/rt_volume/blob/neural_rendering/src/rt/nvbh/nvbh_rgb.py
pred2.png
@fall Rainy the problem was the training set
The sampling was not uniform along each direction
https://drive.google.com/file/d/1yiyRmts0hboItRuw9VnC1OP1lriIxIOh/view?usp=drive_link
Try with this training set
:partying_face:
Amazing, got it
Now it's all about finding the right trade-off between quality and inference time/memory. I will focus on the first encoder
So you can focus on the rgb encoder
I think moss is a bit too simple for training; maybe we should test some more complex models
Sure, this is the uniform sampling I did:
// Function to compute the azimuth and elevation parameters
std::vector<std::pair<double, double>> uniform_sphere_sampling(int N) {
    std::vector<std::pair<double, double>> angles;
    double phi = (1.0 + std::sqrt(5.0)) / 2.0; // Golden ratio
    for (int i = 0; i < N; ++i) {
        // Elevation angle: acos spreads the points uniformly in cos(elevation)
        double elevation = std::acos(1 - 2.0 * (i + 0.5) / N);
        // Azimuth angle: golden-ratio spacing around the sphere
        double azimuth = std::fmod(2.0 * M_PI * i / phi, 2.0 * M_PI);
        // Convert the azimuth to degrees
        azimuth = azimuth * 180 / M_PI;
        // Convert the elevation to degrees
        elevation = elevation * 180 / M_PI;
        // Add the (azimuth, elevation) pair to the list
        angles.push_back(std::make_pair(azimuth, elevation));
    }
    return angles;
}
void generate_renders_test_set(int num_renders)
{
    // make a file to write the test set
    FILE* file = fopen("./test_neural2.txt", "w");
    fclose(file);

    // test set
    set_generate_test_set(1);
    set_type(render_type::neural);
    auto para = uniform_sphere_sampling(num_renders);
    for (int i = 0; i < num_renders; i++)
    {
        printf("Rendering %d\n", i);
        do_ae(para[i].first, para[i].second);
        //outputfile = (char*)"./output.png";
        rt_neu::render();
    }
    set_generate_test_set(0);
}
If you want to generate the training set
Matteo Balice said:
Sean Are there any post production processes in brlcad before saving the render?
There are not any happening on an image / frame basis, no. Your ambient level appears to approximately match. What's not matching is the intensity from the sole light source itself, like it's being applied with a different scaling factor.
Sean said:
Matteo Balice said:
Sean Are there any post production processes in brlcad before saving the render?
There are not any happening on an image / frame basis, no. Your ambient level appears to approximately match. What's not matching is the intensity from the sole light source itself, like it's being applied with a different scaling factor.
Yep, we have solved it. The error was in the training set, due to the lack of samples for each direction :)
I caught up and saw that :)
Now it is almost perfect
What happens with the 'havoc' object in 'havoc.g' sample?
wildly different model
Is this more complex?
I will generate the training set later :)
It's much more complex, by about 3 orders of magnitude compared with moss.
Still considered a small model, but it has hard features that will be interesting to observe
pygame-window-2024-08-26-19-15-10.mp4
It's strange that there are those noisy dots below the helicopter.
Maybe the training set is too small (it contains only 100 frames at 256x256, for a total of 5-6 million rays)
During the sampling part (renderings) I get these warnings (and other similar ones):
Root solver reported 3 intersections != {0, 2, 4} on s.nos5a.i
shooting point (units mm): (16599.899526, 396.927869, 1211.109774)
shooting direction: (-0.947094, -0.274243, -0.166745)
377.103 366.627 86.6928
OVERLAP1: /havoc_front/cannopy/cannopy_wipers/2_wiper/2_wipe1_rubber/2_r.wipe1
OVERLAP2: /havoc_front/cannopy/cannopy_glass/r.glass1
OVERLAPa: dist=(1.91371, 3.2123) isol=2_s.wipe2 osol=2_s.wipe2
OVERLAPb: depth 1.29859mm at (15095.9, -244.662, 1677.06) x105 y148 lvl0
Could it be for this reason that the model isn't able to predict well?
Does a_dist always contain the first intersection, even if the ray intersects the surface more than once?
VJOIN1(hit_point, ap->a_ray.r_pt, ap->a_dist, ap->a_ray.r_dir);
@Sean
This is the shape training only with the loss of hit/miss:
image.png
And this is the shape training also with the loss of the distance:
image.png
So the problem must be with the distance... I think that during the sampling part (in BRL-CAD), if there is more than one intersection, a_dist contains the wrong distance, whereas I only and always need the first intersection. Could that be possible?
Matteo Balice said:
Does a_dist always contain the first intersection, even if the ray intersects the surface more than once?
VJOIN1(hit_point, ap->a_ray.r_pt, ap->a_dist, ap->a_ray.r_dir);
Sean
No, a_dist isn't even set by librt -- that's an application-specific field so apps (e.g., rt) get to use it however they want. So code you're using/calling has set or used a_dist and that's what you'll have to refer to in order to determine what is there.
That shape training doesn't look right -- that looks like just the nose geometry. I mean it's great that it's recognizable, but missing a lot. Did you use the 'havoc' object?
Ah, and I see from your output reporting that you're using "havoc_front" .. you want "havoc" instead.
Sean said:
That shape training doesn't look right -- that looks like just the nose geometry. I mean it's great that it's recognizable, but missing a lot. Did you use the 'havoc' object?
No I used only the havoc_front
Ah ok I will train with havoc
Matteo Balice said:
Could it be for this reason that the model isn't able to predict well?
No, those warnings are innocuous. They are known issues with the model that just affect a few pixels and do not affect the hit distance values or silhouetting at all.
It’s strange
Matteo Balice said:
Ah ok I will train with havoc
Yeah, I think it'll be more interesting to use the whole vehicle, especially the long rotors and complexity in certain parts.
Rotors are long and thin, which should approach a worst case for sampling.
Matteo Balice said:
It’s strange
One is the root solver saying it got 3 hits, which can happen on a couple primitives when rays just graze a surface. We make it print to eventually see if we can special-case the condition but it doesn't affect ray tracing.
The other is a report of a geometry overlap which is a modeling error, which also doesn't affect the hit results you're using.
I will see whether there are the same errors on the full havoc model
We are certain that the hit information is right, given the first image here
Matteo Balice said:
This is the shape training only with the loss of hit/miss:
image.png
And this is the shape training also with the loss of the distance:
image.png
So the problem must be with the distance... I think that during the sampling part (in BRL-CAD), if there is more than one intersection, a_dist contains the wrong distance, whereas I only and always need the first intersection. Could that be possible?
It shouldn’t be too difficult to understand the distance in those areas below the helicopter. I have seen the true model and it is pretty smooth
pygame-window-2024-08-28-11-30-51.mp4
This is trained with 100 frames 256x256
We need more frames with even higher resolution...
pygame-window-2024-08-29-13-11-50.mp4
I have trained without the distance loss and it has figured out the true model by itself. :joy:
In this case it seems that the distance loss makes the model worse
The resolution is always 256x256 but trained with 500 frames this time.
@Sean @fall Rainy
pygame-window-2024-08-29-21-45-53.mp4
I changed the loss function and now it is very good :)
Today I made a big improvement in the number of sample points. Depending on the model, you need a fixed tolerance between sampling points to ensure good accuracy: for example, in moss a good tolerance is 0.01 (remember that we sample in the range 0-1, where 0 means we are sampling at the origin and 1 means we are sampling at the destination on the bounding sphere).
The trivial solution is uniform sampling for each ray (the solution I used before today). With uniform sampling you need a number of points equal to range/tolerance. This is why in moss I need 1/0.01 = 100 points.
But we can do better than this: I plotted the distribution of all the rays and it looks like this:
distribution-rays-moss.png
It means that in moss the hit distances more or less follow a Gaussian distribution with mean 0.7
In the havoc model it is slightly different: it is still a Gaussian, but with mean 0.5:
distribution-rays-helicopter.png
So a first improvement is to follow these distributions, placing more samples near the mean and fewer far from it.
But we can do even better: we can divide our bounding sphere into cells (choosing an arbitrary resolution) and precompute the distribution for each pair of cell_origin and cell_destination. This way, the distribution no longer follows a Gaussian pattern but instead aligns with the local distribution in those areas. This represents a significant improvement because, not only do we capture the local distribution, but the range will also be reduced for almost all pairs. This is evident, as similar rays will have similar hit distances, as in this case:
distribution-rays-helicopter-local.png
Or like this
distribution-rays-helicopter-local-2.png
Moreover, not all rays have the same length: in a bounding sphere (of radius 1), the maximum length, 2 * radius, is reached only by rays passing through the center; all other rays are shorter, so we can sample even fewer points.
In the coming days I will try to merge all these ideas together and I will update you. If you have any other ideas, let me know :)
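As a minimal sketch of the cell-pair idea (the cell resolution, quantile count and data layout are illustrative placeholders): bin each training ray by the cells its two sphere intersections fall into, store the empirical hit-distance quantiles for that pair, and at inference time place the sample points at those quantiles instead of uniformly.

import numpy as np
from collections import defaultdict

def cell_id(azimuth, elevation, res=16):
    # Quantize a point on the bounding sphere (angles in radians) into one of res*res cells.
    a = min(int((azimuth % (2 * np.pi)) / (2 * np.pi) * res), res - 1)
    e = min(int(elevation / np.pi * res), res - 1)
    return a * res + e

def build_cell_tables(rays, res=16, n_quantiles=50):
    # rays: iterable of (az1, el1, az2, el2, hit_t) with hit_t normalized to [0, 1].
    # Returns, per (origin_cell, destination_cell) pair, the sample positions to use.
    buckets = defaultdict(list)
    for az1, el1, az2, el2, t in rays:
        buckets[(cell_id(az1, el1, res), cell_id(az2, el2, res))].append(t)
    qs = np.linspace(0.0, 1.0, n_quantiles)
    return {key: np.quantile(np.asarray(ts), qs) for key, ts in buckets.items()}

A ray whose cell pair never appeared in the training set would fall back to the global distribution or to uniform sampling.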
intersection-prediction.pdf
This paper, for example, uses a similar approach to my idea, but they use a hashmap based on quantization of the input (ray origin + ray destination) instead of cells
Matteo Balice said:
pygame-window-2024-08-29-21-45-53.mp4
I changed the loss function and now it is very good :)
That is very good indeed! Very nice shape and color registration all around. That's considerably better than I would have expected for that model.
So now is that pygame visualization doing real-time inference / lookups from the NNet or do you export the latent space to some fixed representation like mesh or voxels or lightfields, etc.? If they're real-time lookups, what's the lookup rate? Is there a significant diff between havoc and moss?
Either way, this is looking very much like you're approaching publication-worthy if you want to take this work to the next level... we could also probably turn it into a feature, depending on how long training takes, how long it takes to load the net, how long inference takes, etc.
Sean said:
So now is that pygame visualization doing real-time inference / lookups from the NNet or do you export the latent space to some fixed representation like mesh or voxels or lightfields, etc.? If they're real-time lookups, what's the lookup rate? Is there a significant diff between havoc and moss?
They are real-time lookups; you can see the code here https://github.com/bralani/rt_volume/blob/neural_rendering/src/rt/nvbh/camera.py. The inference time for each call of the forward method (i.e. the instruction output_label, _, _, output_rgb = model(input)) averages 1-2 ms for the havoc model (remember that in pygame the images are 256x256), but due to the limitations of Python I have these metrics:
Time to generate rays: 3.155 ms -> time to calculate the origin ray + direction ray for each pixel of the 256x256 camera
Time to trace rays: 22 ms -> bottleneck due to the limitations of Python (this includes the time to transfer data from CPU to GPU + the ray-sphere intersection formula + the conversion from Cartesian to spherical coordinates + the inference time + the time to transfer data from GPU to CPU. Since the inference time is only 1-2 ms, you can see that most of the time goes to the other calculations + transfer time).
Time to render image: 0.99 ms -> time that pygame takes to show the input image
So summing all together it takes something like 26 ms to render one frame (and it can be much better in C++) so I have an average of 36 FPS (in python).
Instead, with the moss model I can use a simpler network (with lower resolution) and fewer sampling points (I have not implemented the smart sampling yet; I am still using the uniform one) and the times are:
Time to generate rays: 3.15 ms (the same as before)
Time to trace rays: 15.6 ms (with an inference time that is still 1-2 ms)
Time to render image: 0.99 ms
In this case I have a total time of 19.75 ms, with an average of 50 FPS.
There was an error in the trace-rays method: I did all the calculations on the CPU and moved data to the GPU only just before the inference call. Now I move the rays to the GPU right after their generation, and the times are:
MOSS:
Time to generate rays: 3.15 ms (the same as before)
Time to trace rays: 10.19 ms (with an inference time that is still 1-2 ms)
Time to render image: 0.99 ms
In this case I have a total time of 14.34 ms, with an average of 70 FPS.
HAVOC:
Time to generate rays: 3.15 ms (the same as before)
Time to trace rays: 14.13 ms (with an inference time that is still 1-2 ms)
Time to render image: 0.99 ms
In this case I have a total time of 18.28 ms, with an average of 54 FPS.
I know these times are high but I believe they are due to the python overhead
They aren't as high as I was thinking; I have printed the rendering times in BRL-CAD on my Mac M1:
MOSS
65536 pixels (256x256) in 0.05 sec = 50 ms
HAVOC
65536 pixels (256x256) in 0.23 sec = 230 ms
There is an improvement even using Python instead of C++.
Probably this is because in BRL-CAD all the rays are traced sequentially, or am I wrong?
Today I implemented the first of the proposed optimizations to reduce the number of sampling points: now I follow the global ray distribution.
Here you can see uniform sampling with 50 points.
uniform-distribution.mp4
And here you can see sampling that follows the global ray distribution (again 50 points):
global-distribution.mp4
The difference is evident: following the global distribution we lose something on the boundaries (because there are fewer rays in those areas), but everywhere else the resolution is much better.
Following this sampling alone gives us some advantages (but also drawbacks, as you can see), but merging it with the local distribution I expect to improve even more :)
@Sean Regarding the paper, which conferences/journals do you suggest? I don't have experience with this yet
Is it possible to consider CVPR 2025? https://cvpr.thecvf.com/Conferences/2025
@fall Rainy & @Matteo Balice CVPR is possible and desirable, but the bar is very high. They are the premier conference right now for CV-related work including NNet reconstruction.
A paper there will need to very clearly demonstrate what the significant contributions are of the work, distinct and compared with other recent papers. Both efforts have demonstrated contributions, but will take some work to show how well or different they're performing on a test/benchmark common with some other paper (at least one).
The novel significant contribution will have to be identified first, which is going to be something like some % quality metric, or training convergence metric, or lookup performance metric, or some specific feature that hasn't been demonstrated (like sharp corners or some other measurable improvement).
Matteo Balice said:
They aren't as high as I was thinking; I have printed the rendering times in BRL-CAD on my Mac M1:
MOSS
65536 pixels (256x256) in 0.05 sec = 50 ms
HAVOC
65536 pixels (256x256) in 0.23 sec = 230 ms
There is an improvement even using Python instead of C++.
So @Matteo Balice this is outstanding performance that opens up a potential state-of-the-art contribution. I'm not sure that's faster than the last paper that did 3 points -- you'd need to test the same models to really demonstrate that -- but it is a "new" capability if you can demonstrate something that ray traces slowly being looked up much faster. That is, demonstrate this as a viable preprocess acceleration structure for models that are expensive to raytrace.
Towards exploring that, I'd suggest downloading a 3DM (NURBS) model, running 3dm-g and raytracing it. It'll be slow as that's an unoptimized representation, but then you can show how long training takes, and how long inference/generation takes at the same resolution. Suggest 1024x1024 since that's pretty standard and results in convenient 1M primary rays.
Here's the stanford bunny in nurbs format that someone made and posted online where you can see how costly it is:
bunny_nurbs.g
For me, it takes about 35 sec to prep the first time (uncached) and 45sec to render at 1024 on my laptop, around 22k rays/sec.
Screenshot-2024-09-03-at-4.34.20-PM.png
pygame-window-2024-09-04-11-04-27.mp4
Here is the 256x256 rendering trained with 100 different 512x512 frames (in pygame I have implemented only 256x256 views with a small camera range).
Regarding the 1024x1024, these are the times (in Python, without optimizations):
Time to generate rays: 49.7 ms
Time to trace rays: 12 ms
Time to render image: 3.9 ms
For a total time of 65.6 ms and an average of 15.24 FPS. If you compare 65.6 ms (NN) with 66 seconds (true ray tracing on my Mac M1), the NN is about 1006 times faster.
The true bottleneck is not the training time or the loading of the net but the generation of the training set, since generating 100 frames at 1024x1024 would require 100 x 66 seconds (on my PC) = almost 2 hours... For this reason I trained with only 100 frames at 512x512 (it took about 30 minutes to generate the training set) and about 10 minutes to train, but I have not implemented the optimizations to reduce the number of sampling points yet (local sampling), so I expect that the training time (and inference) can be even smaller.
I also wanted to compare the "statuette" model, as in the nbvh paper and the neural intersection paper, so I downloaded the STL model, converted it to .g, and ray-traced it. According to the nbvh paper it should take 4.9 ms to ray-trace it (1920x1080):
Screenshot-2024-09-04-alle-16.30.49.png
But in BRL-CAD it takes several minutes at 1024x1024...
Using only a 256x256 size it takes about 60 seconds:
outputstatuette.png
Why does this happen?
The .g model weighs 600 MB, while the STL is 500 MB
P.S.:
Looking at the times to ray-trace all those models (according to the nbvh paper), it seems there is a bottleneck in my Python code in the generation of the rays, since at 1024x1024 it takes me 49 ms and it shouldn't be that slow.
Matteo Balice said:
P.S.:
Looking at the times to ray-trace all those models (according to the nbvh paper), it seems there is a bottleneck in my Python code in the generation of the rays, since at 1024x1024 it takes me 49 ms and it shouldn't be that slow.
I was right, there was a bottleneck in the generation of the rays; now it takes less than 1 ms to generate 1M rays, so the bunny model is ray-traced in about 16-17 ms.
@Matteo Balice can you show a 1024x1024 rendering of Bunny? Looking to directly compare both performance (time) and deviation (pixdiff) for a given view.
bunny_pred.png
bunny_true.png
The time is an average of 17ms but I'm trying to figure out why the light does not look like the true one
The shape is correctly encoded, it seems I have the same problem as in the moss model (the training set was small)
thank you @Matteo Balice that's definitely looking production-viable even with the light source issue. How long did training take? How many samples, how much time, etc?
I'm wondering if we could try for actual integration in the time remaining, or if the shift to writing up a paper is more in order. For integration, it would entail modifying the 'rt' application to have a training mode where some command-line flag would tell it to train/use the neural net in place of the do_pixel() function (since you have color). Something like rt -N bunny.weights bunny.g bunny
If bunny.weights doesn't exist, it trains. If it does, it loads and uses it.
Sean said:
thank you Matteo Balice that's definitely looking production-viable even with the light source issue. How long did training take? How many samples, how much time, etc?
The training set consisted of 100 images at a resolution of 512x512 (approximately 26 million rays in total). The model trained for approximately 5 minutes. The F1 score for shape prediction was around 0.995. Additionally, the average error in predicting the hit position along the ray was about 0.006 -> shape and hit position almost perfect.
Regarding the RGB prediction, however, I have an average error of 24 (in "rgb" scale).
Each ray samples 200 points (each point has 3 float32 numbers: x y z) along its path, meaning each ray occupies 200 x (3 x 4 bytes (float32)) = 2.3 KB on the GPU. For training, I use a batch size of 2^13 rays per iteration, resulting in 2^13 x 2.3 KB = 19.6 MB in memory.
During prediction, the larger the batch size, the better the GPU parallelism can be exploited. Therefore, I load all 1024x1024 rays simultaneously, which requires 1024 x 1024 x 2.3 KB = 2.5 GB (my GPU allows it, but it's fine to run the rays in several smaller batches; it will just be a little slower).
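A minimal sketch of that chunked fallback, assuming for illustration a model whose forward takes an (n_rays, n_points, 3) tensor and returns a single tensor (the real forward returns several outputs):

import torch

def render_in_chunks(model, ray_points, chunk_size=2**16):
    # ray_points: (n_rays, n_points, 3) sample points kept on the CPU.
    # Run inference chunk by chunk so GPU memory stays bounded.
    outputs = []
    with torch.no_grad():
        for start in range(0, ray_points.shape[0], chunk_size):
            chunk = ray_points[start:start + chunk_size].cuda(non_blocking=True)
            outputs.append(model(chunk).cpu())
            del chunk  # release this chunk's GPU memory before the next iteration
    return torch.cat(outputs, dim=0)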
I believe it is possible to reduce the number of sampling points, but I am still working on it; I have not achieved good results yet, for example with local sampling.
I also tried with 500 frames at 256x256, and in this case the light's direction is encoded well but not its intensity.
bunny_1024.png
Sean said:
I'm wondering if we could try for actual integration in the time remaining, or if the shift to writing up a paper is more in order. For integration, it would entail modifying the 'rt' application to have a training mode where some command-line flag would tell it to train/use the neural net in place of the do_pixel() function (since you have color). Something like rt -N bunny.weights bunny.g bunny
Regarding the integration, there are some points to understand first in order to achieve inference times as good as I had in Python:
I believe our approach is better suited to GPUs than NBVH, as it involves fewer conditional branches. This is because we completely avoid using BVHs; the only intersection we calculate is with the bounding sphere. After that, the inference process is identical for every ray, exploiting the true power of GPUs. This is the reason why, even using more sampling points, we get good inference times.
I discovered one drawback: when I render the scene at 1024x1024 in pygame it's fine for the first few seconds (16-17 ms per frame), but after that GPU memory usage hits 100% and the inference time slows down a lot. Probably this is because tensors accumulated for previous frames are not being garbage-collected properly (I have an RTX 4070 with 13 GB of VRAM).
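That symptom usually means GPU tensors from previous frames are still referenced, or autograd graphs are being kept during inference. A minimal sketch of the per-frame hygiene that typically avoids it (render_frame and its arguments are hypothetical names for illustration):

import torch

@torch.no_grad()  # no autograd graph is built or kept during pure inference
def render_frame(model, rays_gpu):
    out = model(rays_gpu)
    frame = out.cpu().numpy()     # move the result off the GPU right away
    del out, rays_gpu             # drop references to this frame's GPU tensors
    torch.cuda.empty_cache()      # optionally return cached blocks to the driver
    return frame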
I have written a report on my work; let me know if anything is missing or should be changed.
@Sean @fall Rainy
That's outstanding @Matteo Balice
I'm still reading it, but I did a quick read through and really great write-up! Love the background and inclusion of your early incremental progress, and of course final results.
I also attempted to integrate the PyTorch code into the rt module, but encountered an issue with the PyTorch library I used for the multi-resolution hash encoding. The problem is that the library includes CUDA-specific scripts, preventing me from exporting the neural network...
@Matteo Balice and @fall Rainy I'd like to showcase your work at the mentor summit. Do you have a brief video or favorite image(s) that you'd like me to show that you feel best captures the results of your research?
Sean said:
Matteo Balice and fall Rainy I'd like to showcase your work at the mentor summit. Do you have a brief video or favorite image(s) that you'd like me to show that you feel best captures the results of your research?
bunny_pred.png
bunny_true.png
havoc_pred.png
havoc_true.png
moss_pred.png
moss_true.png
Can you make a diagram of your network layers? While it's a technical audience, they're not necessarily familiar with graphics or AI at all.
Matteo Balice said:
pygame-window-2024-09-04-11-04-27.mp4
Here is the 256x256 rendering trained with 100 different 512x512 frames (in pygame I have implemented only 256x256 views with a small camera range).
Regarding the 1024x1024, these are the times (in Python, without optimizations):
Time to generate rays: 49.7 ms
Time to trace rays: 12 ms
Time to render image: 3.9 ms
For a total time of 65.6 ms and an average of 15.24 FPS. If you compare 65.6 ms (NN) with 66 seconds (true ray tracing on my Mac M1), the NN is about 1006 times faster.
The true bottleneck is not the training time or the loading of the net but the generation of the training set, since generating 100 frames at 1024x1024 would require 100 x 66 seconds (on my PC) = almost 2 hours... For this reason I trained with only 100 frames at 512x512 (it took about 30 minutes to generate the training set) and about 10 minutes to train, but I have not implemented the optimizations to reduce the number of sampling points yet (local sampling), so I expect that the training time (and inference) can be even smaller.
Wonder if we could show bunny spinning orbitally. This video is really good results but the view stays focused on the bunny’s butt.. even a simple 360 orbital would probably do well.
Sean said:
Matteo Balice said:
pygame-window-2024-09-04-11-04-27.mp4
Here is the 256x256 rendering trained with 100 different 512x512 frames (in pygame I have implemented only 256x256 views with a small camera range).
Regarding the 1024x1024, these are the times (in Python, without optimizations):
Time to generate rays: 49.7 ms
Time to trace rays: 12 ms
Time to render image: 3.9 ms
For a total time of 65.6 ms and an average of 15.24 FPS. If you compare 65.6 ms (NN) with 66 seconds (true ray tracing on my Mac M1), the NN is about 1006 times faster.
The true bottleneck is not the training time or the loading of the net but the generation of the training set, since generating 100 frames at 1024x1024 would require 100 x 66 seconds (on my PC) = almost 2 hours... For this reason I trained with only 100 frames at 512x512 (it took about 30 minutes to generate the training set) and about 10 minutes to train, but I have not implemented the optimizations to reduce the number of sampling points yet (local sampling), so I expect that the training time (and inference) can be even smaller.
Wonder if we could show bunny spinning orbitally. This video is really good results but the view stays focused on the bunny’s butt.. even a simple 360 orbital would probably do well.
Sure no problem.
Interesting paper https://half-potato.gitlab.io/posts/ever/
@Matteo Balice any update on the orbital video? Presentation is tomorrow morning :)
Sean said:
Matteo Balice any update on the orbital video? Presentation is tomorrow morning :)
I am currently generating it
https://drive.google.com/file/d/1ook3N4hRjUqK-7SKuEH-4qV48Sa6kxWO/view?usp=sharing
I don't know why it becomes all white from behind
@Sean How was the presentation?