Hardware Optimisation & Benchmarking Shenanigans
Hi
As we all know, taking as many good pictures as possible is the best way to ensure things go well, and it ultimately saves a lot of time back in the office.
However, time is always against us. As fast as RealityCapture is, waiting is inevitably part of the game, and it can be a long wait: days or more, only to find out your photo shoot was not good enough.
The hardware requirements are listed here https://support.capturingreality.com/hc/en-us/articles/115001524071-OS-and-hardware-requirements however they are very vague. From experience with rendering, video encoding etc., a badly configured system which looks great on paper can perform half as fast as a similarly priced system with carefully selected components, optimised appropriately. Throwing more money at the problem is not always the answer, and at times it can actually slow things down.
The various stages of the calculations stress different parts of the system, but to what degree I am struggling to figure out. How can I/we optimise a system to perform best with the software?
I recently got rid of my dual-Xeon v3 28-core workstation, which was awesome for rendering, but in RealityCapture it was painfully slow. A much higher-clocked, newer-architecture consumer Skylake system is not hugely different in RealityCapture (yes, a little slower), yet it is 4x+ slower for rendering (Cinebench), cost 5x+ less, and has 4 cores versus 28.
Below are the areas which I know can make a difference. Unfortunately, as with many things, we can't have our cake and eat it. Cost has a big influence, and technological restrictions mean you can have 32 GB of very fast RAM or 128 GB+ of slow RAM; a 5.2 GHz 6-core CPU or a 3.6 GHz 16-core CPU.
- CPU clock speed (MHz) and IPC - more is always better.
- Core count (threads) - more is better, to an extent, and not at the cost of IPC. From my experience a dual-CPU system worked awesomely with some applications, but the system architecture did not agree with others such as RealityCapture and underperformed. I have a feeling that, as with GPUs, you get increasingly diminishing returns in RealityCapture when increasing core count, even if the CPUs are maxed at 100%.
- CPU instruction support (AVX etc.) - does RealityCapture take advantage of it? Or will it soon, and to what extent? I see you are looking for an AVX coder. Is AVX enabled when the software is compiled or tested?
I personally am looking to build a new system. AMD finally offer good-value CPUs with Threadripper and EPYC, but they do not support AVX well at all. It would be a disaster to invest in the wrong architecture. I am aware that AMD hardware does not perform ideally with other software due to weaker AVX support. Is this, or will this be, true with RealityCapture?
- GPU count - 3 is the maximum, and as with most things you get diminishing returns.
- GPU speed / CUDA performance - 1080 Ti / Titan / Quadro etc. are the go-to cards, with the Ti being the best bang for the buck. The new Tesla V100s are compute monsters with a cost to match. Soon* we should have the consumer Volta Titans and gaming cards available.
- GPU memory - is 12 GB enough? RealityCapture frequently complains that I do not have enough video memory - maybe this is a bug, as my monitoring software says only around 1 GB is actually in use.
- RAM amount - RealityCapture is fantastic in that, in theory, it doesn't require massive amounts like its competitors; however, it does have its limits. What impact does maxing out the RAM, and forcing swap-file usage, have on performance?
I have encountered out-of-memory errors in RealityCapture many times; is throwing more RAM at the system the best solution?
- RAM speed - 2666 MHz or 4400 MHz?
- RAM latency - this ties into the above; some apps love higher frequency, others tighter timings. From my experience, optimising cache and memory performance for the CPU/RAM can double the speed of certain applications. Has this been tested? There sure is a lot of data being passed about.
- HDD/SSD for cache and virtual memory - latency versus throughput. I expect this is less important, but every bit counts to an extent. I assume it becomes more valuable once RAM limits are hit.
From all the above it's easy to say "choose the best of everything", but you can't: you'll have to sacrifice one area to get maximum performance in another.
So, the solution:
Benchmark datasets - I searched the forum and found others have mentioned the availability of a benchmark, and even stated they would create one; however, that was over a year ago and nothing came of it.
Unless an integrated benchmarking tool is to appear in the software very soon (which would be best), I propose the following.
Have two different datasets available to run, to reflect varying workloads. (I can make some, we could use data provided by Capturing Reality, or maybe someone can suggest something suitable.)
a) Light dataset - fast to run.
b) Heavy dataset - takes longer, but may give more representative results.
Users will then Shift-start the application and hit Start. Theoretically, everyone should then be on a level playing field.
Users will be required to upload the contents of the logs created either to the forum thread or, ideally, to a Google form I create.
The easy part - RealityCapture.log. This is basically a duplicate of the console window and logs the timings for the various stages as they complete. It should be located here: C:\Users\USER\AppData\Local\Temp\
It pumps out the following as an example:
RealityCapture 1.0.2.3008 Demo RC (c) Capturing Reality s.r.o.
Using 8 CPU cores
Added 83 images
Feature detection completed in 11 seconds
Finalizing 1 component
Reconstruction completed in 31.237 seconds
Processing part 1 / 5. Estimated 1225441 vertices
Processing part 3 / 5. Estimated 38117 vertices
Processing part 4 / 5. Estimated 926526 vertices
Processing part 5 / 5. Estimated 538277 vertices
Reconstruction in Normal Detail completed in 232.061 seconds
Coloring completed in 30.105 seconds
Coloring completed in 0.116 seconds
Coloring completed in 30.363 seconds
Creating Virtual Reality completed in 294.092 seconds
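If anyone wants to pull those timings out of the log automatically rather than copy/paste them, a rough sketch along these lines should do it (Python; the regex and the summing of repeated stages are my own assumptions based on the example above, not an official log spec):

import re
from pathlib import Path

# Matches lines such as "Feature detection completed in 11 seconds"
# or "Reconstruction in Normal Detail completed in 232.061 seconds".
STAGE_RE = re.compile(r"^(?P<stage>.+?) completed in (?P<secs>[\d.]+) seconds?$")

def parse_rc_log(path):
    """Collect per-stage timings from a RealityCapture.log-style file.
    Repeated stages (e.g. the several 'Coloring' lines) are summed."""
    timings = {}
    for line in Path(path).read_text(errors="ignore").splitlines():
        m = STAGE_RE.match(line.strip())
        if m:
            stage = m.group("stage")
            timings[stage] = timings.get(stage, 0.0) + float(m.group("secs"))
    return timings

if __name__ == "__main__":
    log = r"C:\Users\USER\AppData\Local\Temp\RealityCapture.log"
    for stage, secs in parse_rc_log(log).items():
        print(f"{stage}: {secs:.3f} s")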
The trickier part - system analysis. There is a nice little freeware tool called HardwareInfo that does not require installation and can spit out a nice little text report, as below; it contains no sensitive info. These two logs combined will, I believe, contain all the information we need to compile a nice comparative dataset. When I say "we" I mean me: I'll have to parse the data into a Google spreadsheet, which will do the calculations, and we can all see the results.
CPU: Intel Core i7-6700K (Skylake-S, R0)
4000 MHz (40.00x100.0) @ 4498 MHz (45.00x100.0)
Motherboard: ASUS MAXIMUS VIII HERO
Chipset: Intel Z170 (Skylake PCH-H)
Memory: 32768 MBytes @ 1599 MHz, 16-18-18-36
Graphics: NVIDIA GeForce GTX 1080 Ti, 11264 MB GDDR5X SDRAM
Drive: Samsung SSD 850 EVO 500GB, 488.4 GB, Serial ATA 6Gb/s @ 6Gb/s
Sound: Intel Skylake PCH-H - High Definition Audio Controller
Sound: NVIDIA GP102 - High Definition Audio Controller
Network: Intel Ethernet Connection I219-V
OS: Microsoft Windows 10 Professional (x64) Build 15063.674 (RS2)
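On the parsing side, something like the sketch below is roughly what I have in mind for turning a report like that, plus the log timings from the earlier sketch, into one row of the shared spreadsheet (again just a sketch; the field list and CSV layout are my own assumptions, not a finished format):

import csv

# Fields to pull from the text report; the prefixes match the example above
# and may need adjusting if the report is laid out differently.
FIELDS = ("CPU", "Motherboard", "Chipset", "Memory", "Graphics", "Drive", "OS")

def parse_system_report(path):
    """Grab the first occurrence of each interesting 'Key: value' line."""
    specs = {}
    with open(path, errors="ignore") as fh:
        for line in fh:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if key in FIELDS and key not in specs:
                specs[key] = value
    return specs

def append_result_row(csv_path, specs, timings):
    """Append one combined row (system specs + stage timings) to a shared CSV."""
    row = {**specs, **timings}
    with open(csv_path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=sorted(row))
        if fh.tell() == 0:  # brand-new file: write the header first
            writer.writeheader()
        writer.writerow(row)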
I'll need your help :)
A) Input on my wall of text above.
B) Suggestions on the proposed benchmark & setup.
C) To run the benchmark and post the results.
If you've read through all that and think, "Yeah, I'd spend 15 minutes running the test files and report back" - please say so.
If you read part of it and fell asleep thinking, "Ain't nobody got time for that" - please say that too :D
What do we get out of all this?
Eventually, when/if enough people with varying hardware post their results, we can determine where to spend our precious money to relieve the areas of RealityCapture where we are bottlenecked: which components and configurations help most with, say, reconstruction or texturing, and which hardware is simply ineffective.
What say you? Do you think this is a worthwhile task, and should I proceed?
-
Hi Benjy
I'm glad it worked this time, thank you for posting the results.
Some applications create profiles on a per-user basis, and some don't. It's fine either way. What it does help rule out is that your user profile is the issue.
Another possibility is that the uninstaller hasn't fully removed everything and has left certain values behind. Again, this is often not a problem, but at times it can be when your software plays up.
The trouble with trying to fix that is that playing with the registry can cause more harm than good.
It may be worth using CCleaner to uninstall, and running its registry cleaner tool (it may need to be run a few successive times). It is a well-respected tool; be sure to get it from the official Piriform site. If in doubt, don't - sometimes it can cause more harm than good.
I'd like to add my appreciation for this project. If I could really follow the detail of the expert discussion, I would be right in there, keen to participate and learn. As it is, I can only watch and be amazed at the amount of effort needed just to prepare to collect the data, let alone to analyze and make sense of it.
And hope and pray that the eventual distillation of guidance will be made public. Because, fast as RC is, speed is going to be make-or-break for an independent operator getting established, initially offering a modest local service via an efficient, reliable workflow, without access to any high-finance render farm. Having the optimum standalone machine will be one of the keys.
-
You are welcome.
Don't be fooled into thinking we are experts or have the remotest clue what we are talking about or doing.
You are more than welcome to add your two cents. Fear not, the intention of the benchmark is exactly as you hope.
There has been a lot of talk from me, and not much evidence of my web-based results page. I have struggled immensely with that part; so many solutions which claimed to offer the ability to upload and then display data turned out not to deliver. I think I have it cracked... mostly.
This is as good a time as any to share where I am. There are currently two parts:
1) the upload, and 2) the public results.
Getting the publicly viewable results shown in a clear, presentable manner that can be analysed and interrogated was an important part.
Here is where I am with that. The data is drawn live from the Google spreadsheet that results are uploaded to, and it updates accordingly. The pie charts etc. are not final, and I will change the metrics displayed/used; it's just a test to get things working, and it will show more useful data for your viewing pleasure.
Note: the contents are fabricated (I changed the uploaded rawresults.txt files each time) and don't represent real results yet.
https://datastudio.google.com/reporting/1LVbEcggzC87TWXaKTDczwks2pRLM51b8
The uploading part is currently not as pretty (which will change). It is here:
https://script.google.com/a/ivanpascoe.com/macros/s/AKfycbwQi8gvGNy83YEhrNZykm_uLJwgUbGOdrSnauWJC1FNrLE8OpJL/exec
I'd very much appreciate anyone trying to upload some data, using the rawresults.txt generated by the benchmark. Please use that file rather than results.txt, as the latter will add garbage to the spreadsheet; I have not yet added code to reject the incorrect file. Yes, the results will be kind of useless since we are all using different datasets for now, but at the moment I need help checking that the upload process works correctly and the results are displayed properly.
Known issues:
1) Results are shown instantly on the upload page, but can take a minute or more to appear on the pretty public results page, and you will need to manually refresh the page for your uploaded data to appear. This is a limitation of the platform: it caches data on the server to save on resources. Poor Google and their lack of resources...
2) It works in Chrome; I do not know about other browsers.
3) rawresults.txt must be selected for the upload, or terrible things may happen (not that terrible, but it will make a mess of the spreadsheet with garbage data).
4) The chart is full of made-up data. For now, the fact that results can be generated, uploaded, displayed and analysed is the important part.
5) You can download the results for your own analytical pleasure; there is a hidden button next to the word "Total".
As always, feedback is really appreciated.
-
Hi Folks,
Here are the results from the benchmark I just ran with 1159 images at 17 MP. I have uploaded the rawresults.txt to the spreadsheet.
Username=RC TEST - THANOS SERVER
Comment=1159 images
Version=1.0.3.3939 Demo RC
Alignment=245
Depth (GPU)=15.272
Model=1.615
Simplify=0.055
Texturing=5.528
Total Time=266
CPU=Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz * 2
GPU=GeForce GTX TITAN X * 2
Cache Drive=INTEL SSDSC2BA400G3 ATA Device
RAM (Bits)=206158430208
Ram Speed=2133
SYSTEM INFO:
CPU1: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
CPU2: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Number Of Processor Packages (Physical): 2
Number Of Processor Cores: 36
Number Of Logical Processors: 72
Motherboard: Supermicro X10DRG-Q
Chipset: Intel C612 (Wellsburg-G)
Memory: 256 GBytes @ 933 MHz, 13-13-13-31, DDR4-2132 / PC4-17000
GPU1: NVIDIA GeForce GTX TITAN X 12288 MBytes of GDDR5 SDRAM [Hynix]
GPU2: NVIDIA GeForce GTX TITAN X 12288 MBytes of GDDR5 SDRAM [Hynix]
Drive: INTEL SSDSC2BA400G3 400GB, 381,554 MBytes, Serial ATA 6Gb/s @ 6Gb/s
Network: Intel Ethernet Server Adapter I340-T4
OS: Microsoft Windows 7 Professional (x64) Build 7601
There is some confusion with the RAM specs here, as the hardware info shows what is physically installed (256 GB), although the OS can only use 192 GB. It is currently clocked at 933 MHz, not the 1066 MHz suggested by the benchmark test.
-
Hello George,
That's one mighty PC you've built there. It's beyond me to extrapolate from your benchmark results to my world (i.e. working with a different image count at 42 MP and very different system resources), so I'm more interested in your general response to some questions geared at evaluating things like dual-socket CPUs and the performance of the Titans.
Ivan (thanks again, Ivan, for putting this conversation and utility on its feet - see how we benefit?) suggested early on that he'd take fewer, faster cores over more, slower ones. I see you went with dual 2.7 GHz CPUs, 18 cores each. Maybe you had other considerations in your choice, e.g. running two CPUs cooler, or the needs of other apps, but did you run any comparisons with a single CPU and/or a faster CPU to inform that perspective?
It's been my impression that a Tesla verges into diminishing returns relative to a 1080 Ti, the cost/value being hard to justify. Time is money, so I assume your projects justify pulling out a sledgehammer - correction, two sledgehammers. What kind of memory pressure are you seeing in Resource Monitor on the first GPU, and on the second?
How about RAM? At a fraction of your whopping 256 GB I rarely see memory pressure in RC, really only after reconstruction and texturing, where GPU memory doesn't support Sweet display quality. I suppose that's at least one area where your Titans rock.
I'm still relatively new to Windows PCs. I came up with Macs in the mid-80s, do miss the construction quality, and in the case of the Mac Pro "trashcan" I've yet to see a slicker approach to thermal regulation. I'm currently running an i7-7820X in an ITX case, the smallest form factor supporting the 980 Ti short of a laptop (didn't want the heat). This ITX is now gasping its last breath, thanks to a Delta agent who performed an atomic body slam when dropping my well-padded storm case onto the belt from chest height. (I later found the GPU loose inside the case, its rear metal mounting plate peeled back and the mounting screws ripped clean out! There were so many loose connections I had to rebuild from scratch three times to finally get video up and the operating system to load.) I'm now getting BSODs; I believe a weakness on the motherboard has broadened into a tiny bridging gap hiding somewhere. So I'm going with a Boxx laptop I can take as carry-on, use in the field to validate a day's capture and for demos, but for the desktop I'm pulling out my tower, ready for parts. I'd like to repurpose my i7-7820X and possibly bring in a second one. Thanks for your thoughts.
Benjy
-
Hello Ivan,
Anybody home? I'd like to revisit this topic. I'm looking at spec'ing a new PC and am curious what you think about any conclusions one could draw from the (albeit limited) participation you worked so hard to make possible. What you did with this benchmark utility is really important work, now gathering dust. Ugh.
It was my impression that the number of cores wasn't valued by RC as much as clock speed, right? What was the actual evidence for that? It would suggest the i7 line, with base frequency topping out at 4.0 GHz (i7-8086K), should win over the fastest of the i9 class, topping out at 3.6 GHz (i9-9900K), though their overclocked speeds are the same at 5.0 GHz. We then need to account for the difference in their respective core/thread counts: 6 cores/12 threads for the i7-8086K versus 8 cores/16 threads for the i9-9900K.
I then wonder about a CPU with a slower base frequency than either of those, the i9-9960X at 3.1 GHz, which sports 16 cores/32 threads. Multiplying the base frequency by the thread count for these three processors you get:
- i7-8086K — 4.0 x 12 = 48
- i9-9900K — 3.6 x 16 = 57.6
- i9-9960X — 3.1 x 32 = 99.2
Those totals roughly reflect the proportions between their Passmark numbers (see the quick check after the list below):
- i7-8086K — 16,684
- i9-9900K — 20,168
- i9-9960X — 30,641
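Just to make that comparison concrete, here's a quick back-of-the-envelope check (my own arithmetic, not something from Ivan's benchmark) that normalises both the naive GHz x threads figure and the Passmark score against the i7-8086K:

# Naive "base GHz x logical threads" metric vs Passmark, normalised to the i7-8086K.
cpus = {
    "i7-8086K": {"ghz": 4.0, "threads": 12, "passmark": 16684},
    "i9-9900K": {"ghz": 3.6, "threads": 16, "passmark": 20168},
    "i9-9960X": {"ghz": 3.1, "threads": 32, "passmark": 30641},
}

base = cpus["i7-8086K"]
base_metric = base["ghz"] * base["threads"]
for name, c in cpus.items():
    metric = c["ghz"] * c["threads"]
    print(f"{name}: GHz x threads = {metric:5.1f} (x{metric / base_metric:.2f}), "
          f"Passmark x{c['passmark'] / base['passmark']:.2f}")

Both measures come out around 1.2x for the i9-9900K, but the naive figure gives 2.07x for the i9-9960X where Passmark gives 1.84x, so the simple multiplication starts to overstate things as core count climbs.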
So, I'm curious about the reality of how these three CPUs stack up in RC. Is there a sweet spot, as suggested before, and would it be the middle guy, the i9-9900K? Do we see diminishing returns with the i9-9960X? Note, there's one model with yet more cores/threads, the i9-9980XE, with a slightly lower base frequency at 3.0 GHz, and even with its 18 cores/36 threads Passmark shows it coming in beneath the i9-9960X, which seems to underscore the diminishing-returns principle. Even so, the roughly 50% jump in Passmark performance between the i9-9900K and the i9-9960X would seem to outweigh whatever is lost, apples for apples, to diminishing returns. No?
Many thanks for your noggin.
Benjy
-
Hi All & Benjy
I have been hiding. I am working on some new fancy tools to help. I have not forgotten you all. My license expired too, which didn't help :D
Your question is tough to answer. From my testing (which isn't concluded), there is no easy answer.
The issue is as follows.
The initial benchmark gives a rough idea of the performance to be had. Of course, more cores and more MHz mean things go faster. However, it isn't so simple.
The software behaves differently depending on the dataset given.
As you know, the computational stages are split into:
a) alignment - a predominantly fast stage
b) point cloud creation - this is where most of the calculations are done
c) texturing
d) mesh export
However, there are multiple stages within point cloud creation, such as depth calculation (GPU accelerated). Some of these interim stages are single-threaded and some are multi-threaded, so some stages win out with core count and others with pure GHz.
Different settings within the application, image resolution, number of images, and image quality (not just sharpness, but whether the software has an easy time with them) can all throw the weighting one way or the other.
Roughly speaking, when dealing with small, low-resolution datasets, MHz is king.
If dealing with many images - 200+ 40 MP images, for example - core count wins.
There are diminishing returns with multi-core systems, and even more so with dual-CPU systems. My old, world-record-breaking dual 14-core Xeon system is slower than my new one with almost half the cores. There are so many variables in system architecture that make a difference.
I have the i9-9980XE with all cores at 4.5 GHz (some motherboards enable all-core turbo by default). However, it was silly expensive, hot and power hungry, and is wasted most of the time - I love it. Enthusiast things have drawbacks.
AMD may be worth considering too. They will be releasing the 7 nm parts this summer and are taking the lead over Intel in price/performance where high core count is key. I'd be cautious of the current generation, but the new ones on the horizon look very interesting.
The GPU processing part does not tax the GPU to the fullest, and you'd be much better off getting a regular gaming card, perhaps a 1080/Ti or better. There may be circumstances where multi-GPU can help; however, it's such a small portion of the overall calculation time that it would depend on the project you're working on. I don't believe RC takes your video RAM into consideration, and the Sweet setting is hard-coded, so extra memory is wasted (I could be wrong here).
When waiting a day or more for a test to come out, 5% here and 10% there starts to make a significant difference.
Tweaking BIOS settings, RAM timings, motherboard choice and Windows setup all make a difference too.
SSDs are essential these days; however, past a certain point they make little difference to performance. There's no need to get some fancy, expensive NVMe thing, but be sure it can sustain long-term writes - many drives fall to about 30 MB/s once their cache is full.
So, to conclude, the best system will be different for different people, depending on the datasets they commonly use.
What type of datasets are you throwing, or planning to throw, at RC?
Ivan
-
Hello Ivan,
All very interesting. You deserve a medal for all this work and sharing it.
I'd like to respond with more questions, but let's begin at the end of your post by answering yours: what am I throwing at RC? Most of my work begins with at least 400 images, so maybe right there you'll say my case benefits from many threads. My projects easily run past 2,500 images, and I've had one up at 10,000. These images are 42 MP and I'm growing increasingly interested in running them at High detail. For the same amount of manual input I get four times the data compared with Normal (the downscale is 2:1, i.e. half the resolution both vertically and horizontally), letting me move a virtual camera as close to surfaces as I want in the case of virtual cinematography, or as the user wants in interactive applications. We enter a virtual environment, look around, and are swept up in the believability. Curiosity urges us to explore; for humans that means we want to approach things and get within arm's reach, proprioception urging us to touch surfaces. Even if we can't, the mirror neurons in our brains, taking in stereoscopic views of finely detailed textures, let us feel those surfaces without actually touching them, simply from memories of the many times we've touched similar surfaces before. Okay, a long way to say I want all the data my system can handle, especially if it's doing the heavy lifting - nothing more from me.
It's hard to imagine my case wouldn't benefit from as much of everything every component can deliver, but that's not to say I don't hear you on the "no easy answers". There are points where values come into contention. For instance, offline rendering for film-grade animations favors a very different GPU than real-time rendering, as required by game engines - a space where big things are happening and one I'm more focused on. To your statement:
"The GPU processing part does not tax the GPU to the fullest, and you'd be much better getting a regular gaming card. Perhaps a 1080/ti or better. There may be circumstances where multi gpu can help, however It's such as small amount of the overall calculation time, It would depend on the project your working on. I don't believe RC takes into consideration your video ram..."
I had suggested wanting the RTX 2080 Ti; I'm not sure what you mean by "regular gaming card". I know the 2080 Ti isn't regular compared to everything downstream, but before its recent release you could have said the same of the GTX 1080 Ti. Both have the same amount of video RAM, 11 GB, but the memory speed differs: GDDR5X at 11 Gbps for the 1080 Ti versus GDDR6 at 14 Gbps for the 2080 Ti. My only thought about the value of video RAM capacity would be how large a clipping box your system can handle after texturing a model. Add to that, Passmark shows 17,100 for the 2080 Ti versus 14,241 for the 1080 Ti, about a 20% boost. I'd think the GPU-intensive depth map calculations would clearly warrant the RTX 2080 Ti, no matter what you're working on in RC. No? That might not be cost-justified, which is another matter, but I'm thinking purely in terms of performance. I hear what you're saying about even small performance improvements acting as a multiplier on larger projects spanning days or even weeks.
I'm still interested to learn what you mean by "no easy answers", never mind my specific case, just generally. We know that during any particular stage there are sub-stages that swing between CPU and GPU; we keep Resource Monitor and GPU-Z open to watch all that. To the extent that camera alignment seems purely CPU-intensive, am I missing something in thinking this could be an easy one? If you have many images with lots of pixels, you'll benefit from fast read/write SSDs, lots of system RAM, and many fast cores, there possibly being a sweet spot where core speed shouldn't be allowed to drop too low relative to core count, as would appear to be the case with Xeon processors. Might this also apply to the case I pointed to, with the i9-9960X being a bit cheaper yet outperforming the i9-9980XE, the former having slightly faster, albeit fewer, cores?
You talk about quality of images, e.g. sharpness, and I would add the roughness of the subject matter, the two combining to pose different problems in finding a solution that keeps projection errors in check when returning a component. I've observed this while feeding my system man-made subject matter - simple planar shapes - versus extremely folded subjects like a plant. Sharp imagery of heavily occluded subject matter takes longer to process on both CPU and GPU, which stands to reason: my brain has an easier time forming a map of a box on a floor than of a pencil jade plant. If that's true, why, in your mind, would this not be simple to keep separate when thinking about optimizing a system for RC? I'm thinking of it this way: if every RC user ran the same two datasets with your benchmark utility - one with a low number of images, not optimally sharp, of simple subject matter, the other the converse - and we all had a nice spread of system specs from low end to high along all the lines you describe, I'd expect the folks running the light dataset on low-end systems to score lower than those running it on high-end systems, that spread being X. I'd then expect the spread to be far greater for the folks running the heavy dataset on low-end systems versus high-end systems. Yes?
I realize that with so many variables in the imagery and the system specs, the final numbers in RC's reporting will be all over the map, but I'm eager to avoid a forest-for-the-trees situation. If as a starting point we can know that the above statement holds true (up to that finer point about diminishing returns on core count), where would you place the monkey wrench in that understanding? So far I'm hearing: heavy datasets benefit from many cores, but not so many that you lose clock speed; heavy datasets benefit from the strongest CUDA performance in the GPU; video RAM is nice but not essential; more than one GPU quickly hits diminishing returns; heavy datasets love fast SSDs and lots of system RAM; and RAM speed doesn't make much of a difference.
I value whatever monkey wrenches you bring, but I also lack understanding of what to look for in a motherboard, power supply, thermal regulation, and the rest of the system. At this stage I'm only able to ensure it's the right socket type for the CPU; I'm oblivious to much of the rest. Big thanks for all you do!
Benjy