Low 'Average Physical Core Utilization' according to VTune when using OpenMP, not sure what the bigger picture is


I have been optimizing a ray tracer, and to get a nice speed up, I used OpenMP generally like follows (C++):



Accelerator accelerator; // Has the data to make tracing way faster
Rays rays;               // Makes the rays so they're ready to go

#pragma omp parallel for
for (int y = 0; y < window->height; y++) {
    for (int x = 0; x < window->width; x++) {
        Ray& ray = rays.get(x, y);
        accelerator.trace(ray);
    }
}




I got a 4.85x speedup on a 6-core/12-thread CPU. I thought I'd get more than that, maybe something like 6-8x... especially since this loop eats up >= 99% of the application's processing time.



I want to find out where my performance bottleneck is, so I opened VTune and profiled. Note that I am new to profiling, so maybe this is normal, but this is the graph I got:



[Screenshot: VTune summary graph showing low Average Physical Core Utilization]



In particular, this is the 2nd biggest time consumer:



[Screenshot: VTune's list of top time consumers]



where the 58% is the microarchitecture usage.



Trying to solve this on my own, I went looking for information, but the most I could find was in Intel's VTune documentation:




Average Physical Core Utilization



Metric Description



The metric shows average physical cores utilization by computations of the application. Spin and Overhead time are not counted. Ideal average CPU utilization is equal to the number of physical CPU cores.




I'm not sure what this is trying to tell me, which leads me to my question:



Is a result like this normal, or is something going wrong somewhere? Is it okay to only see a 4.85x speedup (against a theoretical max of 12x) for something that is embarrassingly parallel? Ray tracing itself can be cache-unfriendly because the rays bounce everywhere, but I have done what I can: compacted the memory layout to be as cache-friendly as possible, used libraries that exploit SIMD for the calculations, implemented countless techniques from the literature to speed things up, avoided branching as much as possible, and used no recursion. I also split the work so that there's no false sharing AFAIK: each row is handled by one thread, so no two threads should write to the same cache line (especially since ray traversal is all const). The framebuffer is also row-major, so I was hoping false sharing wouldn't be an issue from that either.



I do not know whether a profiler simply reports results like this for a main loop threaded with OpenMP and this is expected, or whether I have made some kind of newbie mistake and I'm not getting the throughput that I want. I also checked that OpenMP actually spawns 12 threads, and it does.



I guess, tl;dr: am I screwing up using OpenMP? From what I gathered, the average physical core utilization is supposed to be up near the average logical core utilization, but I almost certainly have no idea what I'm talking about.
































  • Your idea of the theoretical speedup when running more threads than cores seems mistaken. If you don't pin threads to individual cores, your speedup will be limited by the threads not being spread evenly over the cores and losing cached data prematurely. GNU libgomp on Windows doesn't support pinning (e.g. OMP_PLACES), so depending on your application, Linux may pin better than Windows.

    – tim18
    Mar 8 at 9:36
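    For reference, libgomp pinning is typically controlled through environment variables like these (a sketch; `./raytracer` is a hypothetical binary name, not from the question):

    ```shell
    # Pin OpenMP threads before launching the app.
    export OMP_NUM_THREADS=12
    export OMP_PROC_BIND=close   # keep threads near their initial placement
    export OMP_PLACES=cores      # one place per physical core
    # ./raytracer                # hypothetical binary; run with the above set
    echo "$OMP_PROC_BIND,$OMP_PLACES"
    ```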












  • @tim18 I am using Ubuntu, gcc 7.3.0. I don't use Windows at all (at least right now; this wouldn't even compile on Windows currently). I've tried pinning threads to cores, playing with affinity through environment variables, etc... it does nothing to improve performance. Is there something I'm missing?

    – Water
    Mar 8 at 15:31

















multithreading performance optimization openmp vtune






asked Mar 8 at 4:18









Water













1 Answer
































IMHO you're doing it right, and you're overestimating the efficiency of parallel execution. You did not give details about the architecture you're using (CPU, memory, etc.), nor the code... but to put it simply, I suppose that beyond a 4.8x speed increase you're hitting the memory bandwidth limit, so RAM speed is your bottleneck.



Why?



As you said, ray tracing is not hard to run in parallel, and you're doing it right, so if the CPU is not 100% busy, my guess is your memory controller is.
Supposing you're tracing a model (triangles? voxels?) that sits in RAM, your rays need to read bits of the model when checking for hits. Check your maximum RAM bandwidth, then divide it by 12 (threads), then divide it by the number of rays per second... you'll find that even 40 GB/s is "not so much" when you trace a lot of rays. That's why GPUs are a better option for ray tracing.



Long story short, I suggest you try to profile memory usage.





























        answered Mar 25 at 15:38









        L.C.
