How does Linux perf calculate the cache-references and cache-misses events?
I am confused by the perf events cache-misses, L1-icache-load-misses, L1-dcache-load-misses, and LLC-load-misses. When I ran perf stat on all of them, the results didn't seem consistent:
%$: sudo perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches ./my_app
523,288,816 cache-references (22.89%)
205,331,370 cache-misses # 39.239 % of all cache refs (31.53%)
10,163,373,365 cycles (39.62%)
13,739,845,761 instructions # 1.35 insn per cycle (47.43%)
2,520,022,243 branches (54.90%)
20,341 faults
147 migrations
237,794,728 L1-dcache-load-misses # 6.80% of all L1-dcache hits (62.43%)
3,495,080,007 L1-dcache-loads (69.95%)
2,039,344,725 L1-dcache-stores (69.95%)
531,452,853 L1-icache-load-misses (70.11%)
77,062,627 LLC-loads (70.47%)
27,462,249 LLC-load-misses # 35.64% of all LL-cache hits (69.09%)
15,039,473 LLC-stores (15.15%)
3,829,429 LLC-store-misses (15.30%)
The L1-* and LLC-* events are easy to understand, as I can tell they are read from the hardware counters in the CPU.
But how does perf calculate the cache-misses event? From my understanding, if cache-misses counts the number of memory accesses that cannot be served by the CPU cache, then shouldn't it be equal to LLC-load-misses + LLC-store-misses? Clearly, in my case, cache-misses is much higher than the last-level cache miss count.
The same confusion applies to cache-references. It is much lower than L1-dcache-loads and much higher than LLC-loads + LLC-stores.
My Linux kernel and CPU info:
%$: uname -r
4.10.0-22-generic
%$: lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
Stepping: 9
CPU MHz: 885.754
CPU max MHz: 4200.0000
CPU min MHz: 800.0000
BogoMIPS: 7584.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
caching linux-kernel cpu perf
Stack Overflow is for programming questions, not questions about using or configuring Unix and its utilities. Unix & Linux or Super User would be better places for questions like this.
– Barmar
Mar 7 at 2:53
@Barmar The question is not about configuring anything. perf is a tool for measuring performance-related metrics, and the question is about what some of these metrics mean. The Linux tag may not be very relevant to the question, but perf is still a Linux tool, so it's at least marginally relevant.
– Hadi Brais
Mar 7 at 4:04
@HadiBrais I said "using or configuring Unix and its utilities", and he's "using its utilities" (it's a canned comment, I don't tailor it to each question). Actually, the question seems to be more about the design of Linux. But it's not about programming (he didn't post any code).
– Barmar
Mar 7 at 15:34
@Barmar Thanks for providing the links. But I don't think Stack Overflow should be limited to just "programming questions". My question here is about CPU architecture and related tools. It is about how programmers collect performance data, and Linux just happens to be the most popular platform. I believe any good programmer, especially one who programs in C/C++, should be aware of the features provided by the CPU, especially the CPU cache, in order to produce programs with good performance. It is definitely worth posting if any of the related tools is confusing.
– LouisYe
Mar 7 at 19:00
BTW, I made this post because I couldn't find the answer in another related Stack Overflow post.
– LouisYe
Mar 7 at 19:00
asked Mar 7 at 2:51 by LouisYe, edited Mar 7 at 22:47
1 Answer
The built-in perf events that you are interested in map to the following hardware performance monitoring events on your processor:
Count            perf event              Hardware event
523,288,816      cache-references        architectural event: LLC Reference
205,331,370      cache-misses            architectural event: LLC Misses
237,794,728      L1-dcache-load-misses   L1D.REPLACEMENT
3,495,080,007    L1-dcache-loads         MEM_INST_RETIRED.ALL_LOADS
2,039,344,725    L1-dcache-stores        MEM_INST_RETIRED.ALL_STORES
531,452,853      L1-icache-load-misses   ICACHE_64B.IFTAG_MISS
77,062,627       LLC-loads               OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
27,462,249       LLC-load-misses         OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
15,039,473       LLC-stores              OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
3,829,429        LLC-store-misses        OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)
All of these events are documented in the Intel manual, Volume 3. For more information on how to map perf events to native events, see "Hardware cache events and perf" and "How does perf use the offcore events?".
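To make the "MSR bits" column more concrete, here is a minimal sketch that turns the bit lists from the table above into integer masks. The helper name `bits_to_mask` and the dictionary `OFFCORE_MASKS` are hypothetical, introduced only for illustration. The sketch shows the arithmetic only; it makes no claim about the exact constants the kernel programs on any particular CPU.

```python
def bits_to_mask(*bit_specs):
    """Build an integer mask from single bit numbers and inclusive (low, high) ranges."""
    mask = 0
    for spec in bit_specs:
        low, high = spec if isinstance(spec, tuple) else (spec, spec)
        for bit in range(low, high + 1):
            mask |= 1 << bit
    return mask


# Bit lists copied from the table above (hypothetical dictionary name).
OFFCORE_MASKS = {
    "LLC-loads": bits_to_mask(0, 16, (30, 37)),
    "LLC-load-misses": bits_to_mask(0, 17, (26, 29), (30, 37)),
    "LLC-stores": bits_to_mask(1, 16, (30, 37)),
    "LLC-store-misses": bits_to_mask(1, 17, (26, 29), (30, 37)),
}

for name, mask in OFFCORE_MASKS.items():
    print(f"{name:18s} {mask:#x}")
```

Going by the table, the load and store variants differ only in the lowest bits (bit 0 versus bit 1), and the "misses" variants swap bit 16 for bit 17 and add bits 26-29.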
> But how does perf calculate the cache-misses event? From my understanding,
> if cache-misses counts the number of memory accesses that cannot
> be served by the CPU cache, then shouldn't it be equal to
> LLC-load-misses + LLC-store-misses? Clearly, in my case,
> cache-misses is much higher than the last-level cache miss count.
LLC-load-misses and LLC-store-misses count only demand requests, but they include both cacheable and uncacheable ones. cache-misses, on the other hand, counts both demand and speculative requests, but only the cacheable ones. So neither count is necessarily larger than the other.
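The difference between the two counting rules can be illustrated with a toy model. This is only a sketch of the filtering described above, not of how the hardware works: the `Request` class, its fields, and both workloads are hypothetical, and every request in the example is assumed to miss the L3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    demand: bool     # True for a demand request, False for a speculative one
    cacheable: bool  # False for an uncacheable request
    l3_miss: bool    # True if the request missed the L3


def llc_demand_misses(requests):
    """Toy model of LLC-load-misses + LLC-store-misses: demand only, cacheable or not."""
    return sum(r.l3_miss and r.demand for r in requests)


def cache_misses(requests):
    """Toy model of cache-misses: demand or speculative, cacheable only."""
    return sum(r.l3_miss and r.cacheable for r in requests)


# Workload dominated by speculative (for example, prefetch) requests.
prefetch_heavy = [Request(False, True, True)] * 5 + [Request(True, True, True)]
# Workload dominated by uncacheable demand requests.
uncacheable_heavy = [Request(True, False, True)] * 5 + [Request(True, True, True)]

for label, workload in (("prefetch-heavy", prefetch_heavy),
                        ("uncacheable-heavy", uncacheable_heavy)):
    print(label, llc_demand_misses(workload), cache_misses(workload))
```

In the first workload the cache-misses model is larger. In the second, the demand-only model is larger. That is why neither event bounds the other.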
> The same confusion applies to cache-references. It is much lower than
> L1-dcache-loads and much higher than LLC-loads + LLC-stores.
The only guarantee is that cache-references is at least as large as cache-misses, because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references, because core-originated loads usually occur only when the program executes load instructions, and because the cache locality exhibited by many programs lets most of those loads hit before they reach the L3. But this is not always the case, because of hardware prefetches.
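As a quick sanity check, the relationships discussed here can be recomputed from the counts in the question. The numbers below are copied from the posted perf stat output; the dictionary and helper names are hypothetical. Keep in mind that the percentages in parentheses in that output indicate the events were multiplexed, so each count is a scaled estimate and comparisons across events are approximate.

```python
# Counts copied from the perf stat output in the question.
counts = {
    "cache-references": 523_288_816,
    "cache-misses": 205_331_370,
    "L1-dcache-load-misses": 237_794_728,
    "L1-dcache-loads": 3_495_080_007,
    "LLC-loads": 77_062_627,
    "LLC-load-misses": 27_462_249,
    "LLC-stores": 15_039_473,
    "LLC-store-misses": 3_829_429,
}


def ratio(numerator, denominator):
    """Return counts[numerator] / counts[denominator] as a float."""
    return counts[numerator] / counts[denominator]


llc_miss_sum = counts["LLC-load-misses"] + counts["LLC-store-misses"]
llc_access_sum = counts["LLC-loads"] + counts["LLC-stores"]

print(f"cache-misses / cache-references:         {ratio('cache-misses', 'cache-references'):.3%}")
print(f"LLC-load-misses / LLC-loads:             {ratio('LLC-load-misses', 'LLC-loads'):.2%}")
print(f"L1-dcache-load-misses / L1-dcache-loads: {ratio('L1-dcache-load-misses', 'L1-dcache-loads'):.2%}")
print(f"LLC-load-misses + LLC-store-misses:      {llc_miss_sum:,}")
print(f"LLC-loads + LLC-stores:                  {llc_access_sum:,}")
```

The ratios reproduce the percentages perf printed next to each event, and the two sums show how far the demand-only LLC-* counts fall below cache-misses and cache-references in this run.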
> The L1-* and LLC-* events are easy to understand, as I can tell they
> are read from the hardware counters in the CPU.
No, it's a trap. They are not easy to understand.
Thank you for the answer. Now I understand why cache-references is higher than LLC-loads + LLC-stores: the former counts both demand and speculative requests. It looks like you suggest that cache-references doesn't count any L1 cache access, am I right?
– LouisYe
Mar 8 at 0:00
@LouisYe If a cacheable memory access missed in the L1 and the L2, then it will be counted by cache-references. Otherwise, if it hits in the L1, then, no, it will not be counted by cache-references.
– Hadi Brais
Mar 8 at 0:05
Note that there are also longest_lat_cache.miss and longest_lat_cache.reference, which, at least on my system, count exactly the same as cache-misses and cache-references, and offcore_response.demand_data_rd.any_response, which corresponds to LLC-loads.
– Zulan
Mar 8 at 11:01
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55035313%2fhow-does-linux-perf-calculate-the-cache-references-and-cache-misses-events%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
The built-in perf events that you are interested in are mapping to the following hardware performance monitoring events on your processor:
523,288,816 cache-references (architectural event: LLC Reference)
205,331,370 cache-misses (architectural event: LLC Misses)
237,794,728 L1-dcache-load-misses L1D.REPLACEMENT
3,495,080,007 L1-dcache-loads MEM_INST_RETIRED.ALL_LOADS
2,039,344,725 L1-dcache-stores MEM_INST_RETIRED.ALL_STORES
531,452,853 L1-icache-load-misses ICACHE_64B.IFTAG_MISS
77,062,627 LLC-loads OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
27,462,249 LLC-load-misses OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
15,039,473 LLC-stores OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
3,829,429 LLC-store-misses OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)
All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see: Hardware cache events and perf and How does perf use the offcore events?.
But how does perf calculate cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn't it be equal to
LLC-loads-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.
LLC-load-misses and LLC-store-misses count only demand requests but they also count both cacheable and uncacheable requests. On the other hand, cache-misses counts both demand and speculative requests but only the cacheable ones. So it's not necessary that one is larger than the other.
The same confusion goes to cache-reference. It is much lower than
L1-dcache-loads and much higher then LLC-loads+LLC-stores
It's only guaranteed that cache-reference is larger than cache-misses because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-reference because core-originated loads usually occur only when you have load instructions and because of the cache locality exhibited by many programs. But it's not necessarily always the case because of hardware prefetches.
The L1-* and LLC-* events are easy to understand, as I can tell they
are read from the hardware counters in CPU.
No, it's a trap. They are not easy to understand.
thank you for the answer, now I understand whycache-referencesis higher thanllc-loads+llc-stores, as the former counts both demand and speculative requests. It looks like you suggeset thatcache-referencedoesn't count any L1 cache access, am I right?
– LouisYe
Mar 8 at 0:00
@LouisYe If a cacheable memory access missed in the L1 and the L2, then it will be counted bycache-references. Otherwise, if it hits in the L1, then, no, it will not be counted bycache-references.
– Hadi Brais
Mar 8 at 0:05
1
Note that ther's alsolongest_lat_cache.missandlongest_lat_cache.reference- which, at least on my system, count exactly the same ascache-missesandcache-referencesandoffcore_response.demand_data_rd.any_responsecorresponding toLLC-loads.
– Zulan
Mar 8 at 11:01
add a comment |
The built-in perf events that you are interested in are mapping to the following hardware performance monitoring events on your processor:
523,288,816 cache-references (architectural event: LLC Reference)
205,331,370 cache-misses (architectural event: LLC Misses)
237,794,728 L1-dcache-load-misses L1D.REPLACEMENT
3,495,080,007 L1-dcache-loads MEM_INST_RETIRED.ALL_LOADS
2,039,344,725 L1-dcache-stores MEM_INST_RETIRED.ALL_STORES
531,452,853 L1-icache-load-misses ICACHE_64B.IFTAG_MISS
77,062,627 LLC-loads OFFCORE_RESPONSE (MSR bits 0, 16, 30-37)
27,462,249 LLC-load-misses OFFCORE_RESPONSE (MSR bits 0, 17, 26-29, 30-37)
15,039,473 LLC-stores OFFCORE_RESPONSE (MSR bits 1, 16, 30-37)
3,829,429 LLC-store-misses OFFCORE_RESPONSE (MSR bits 1, 17, 26-29, 30-37)
All of these events are documented in the Intel manual Volume 3. For more information on how to map perf events to native events, see: Hardware cache events and perf and How does perf use the offcore events?.
But how does perf calculate cache-misses event? From my understanding,
if the cache-misses counts the number of memory accesses that cannot
be served by the CPU cache, then shouldn't it be equal to
LLC-loads-misses + LLC-store-misses? Clearly in my case, the
cache-misses is much higher than the Last-Level-Cache-Misses number.
LLC-load-misses and LLC-store-misses count only demand requests but they also count both cacheable and uncacheable requests. On the other hand, cache-misses counts both demand and speculative requests but only the cacheable ones. So it's not necessary that one is larger than the other.
> The same confusion goes for cache-references. It is much lower than
> L1-dcache-loads and much higher than LLC-loads + LLC-stores.
The only guarantee is that cache-references is at least as large as cache-misses, because the former counts requests irrespective of whether they miss the L3. It's normal for L1-dcache-loads to be larger than cache-references, because core-originated loads usually occur only when you have load instructions and because many programs exhibit good cache locality, so most loads hit in the L1 and never reach the L3. But that isn't always the case, because hardware prefetches can generate additional requests.
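As a sanity check on a run like this one, the relationships can be computed from the reported counts. In the sketch below only the first check is something the answer says is guaranteed; the other values are observations about this particular run, not invariants. The function and key names are made up for illustration.

```python
# Counts copied from the perf stat output quoted in the answer.
counts = {
    "cache-references": 523_288_816,
    "cache-misses": 205_331_370,
    "L1-dcache-loads": 3_495_080_007,
    "LLC-loads": 77_062_627,
    "LLC-stores": 15_039_473,
}


def summarize(c):
    """Derive the comparisons discussed in the answer from raw counts."""
    return {
        # Expected to hold: references include requests that hit the L3.
        "refs_ge_misses": c["cache-references"] >= c["cache-misses"],
        # Fraction of counted L3 references that missed.
        "miss_ratio": c["cache-misses"] / c["cache-references"],
        # Typical but not guaranteed to be above 1.
        "l1_loads_per_ref": c["L1-dcache-loads"] / c["cache-references"],
        # Demand-only accesses as seen by the LLC-* events.
        "llc_demand_accesses": c["LLC-loads"] + c["LLC-stores"],
    }


summary = summarize(counts)
for key, value in summary.items():
    print(f"{key:20s} {value}")
```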
> The L1-* and LLC-* events are easy to understand, as I can tell they
> are read from the hardware counters in CPU.
No, it's a trap. They are not easy to understand.
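Part of the reason is that names such as LLC-load-misses are generic labels rather than hardware event names. As far as I know, perf passes them to the kernel as a PERF_TYPE_HW_CACHE config value built from a cache id, an operation id and a result id (the encoding described in the perf_event_open man page), and the kernel then selects a model-specific hardware event for that combination. A minimal sketch of the encoding follows; the enum values are written out from linux/perf_event.h as I remember them, so treat them as assumptions and check your own header. The Python class and function names are hypothetical.

```python
from enum import IntEnum


class CacheId(IntEnum):      # assumed to mirror perf_hw_cache_id
    L1D = 0
    L1I = 1
    LL = 2
    DTLB = 3
    ITLB = 4
    BPU = 5
    NODE = 6


class CacheOp(IntEnum):      # assumed to mirror perf_hw_cache_op_id
    READ = 0
    WRITE = 1
    PREFETCH = 2


class CacheResult(IntEnum):  # assumed to mirror perf_hw_cache_op_result_id
    ACCESS = 0
    MISS = 1


def hw_cache_config(cache, op, result):
    """Pack (cache, op, result) as cache | (op << 8) | (result << 16)."""
    return int(cache) | (int(op) << 8) | (int(result) << 16)


# Generic event names from the question, expressed as (cache, op, result).
GENERIC_EVENTS = {
    "L1-dcache-loads":       (CacheId.L1D, CacheOp.READ, CacheResult.ACCESS),
    "L1-dcache-load-misses": (CacheId.L1D, CacheOp.READ, CacheResult.MISS),
    "L1-dcache-stores":      (CacheId.L1D, CacheOp.WRITE, CacheResult.ACCESS),
    "L1-icache-load-misses": (CacheId.L1I, CacheOp.READ, CacheResult.MISS),
    "LLC-loads":             (CacheId.LL, CacheOp.READ, CacheResult.ACCESS),
    "LLC-load-misses":       (CacheId.LL, CacheOp.READ, CacheResult.MISS),
    "LLC-stores":            (CacheId.LL, CacheOp.WRITE, CacheResult.ACCESS),
    "LLC-store-misses":      (CacheId.LL, CacheOp.WRITE, CacheResult.MISS),
}

for name, triple in GENERIC_EVENTS.items():
    print(f"{name:22s} config=0x{hw_cache_config(*triple):06x}")
```

The config value says nothing about which requests the underlying hardware event includes (demand, speculative, cacheable, uncacheable); that depends on the event the kernel picks for your processor, which is why the same generic name can behave differently across microarchitectures.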
Thank you for the answer. Now I understand why cache-references is higher than LLC-loads + LLC-stores: the former counts both demand and speculative requests. It looks like you are suggesting that cache-references doesn't count any L1 cache accesses, am I right?
– LouisYe
Mar 8 at 0:00
@LouisYe If a cacheable memory access misses in both the L1 and the L2, then it will be counted by cache-references. If it hits in the L1, then no, it will not be counted by cache-references.
– Hadi Brais
Mar 8 at 0:05
Note that there are also longest_lat_cache.miss and longest_lat_cache.reference, which, at least on my system, count exactly the same as cache-misses and cache-references, and offcore_response.demand_data_rd.any_response, which corresponds to LLC-loads.
– Zulan
Mar 8 at 11:01
answered Mar 7 at 19:40
Hadi Brais
Stack Overflow is for programming questions, not questions about using or configuring Unix and its utilities. Unix & Linux or Super User would be better places for questions like this.
– Barmar
Mar 7 at 2:53
@Barmar The question is not about configuring anything. perf is a tool for measuring performance-related metrics, and the question is about what some of these metrics mean. The Linux tag may not be very relevant to the question, but perf is still a Linux tool, so it's at least marginally relevant.
– Hadi Brais
Mar 7 at 4:04
@HadiBrais I said "using or configuring Unix and its utilities", and he's "using its utilities" (it's a canned comment, I don't tailor it to each question). Actually, the question seems to be more about the design of Linux. But it's not about programming (he didn't post any code).
– Barmar
Mar 7 at 15:34
@Barmar Thanks for providing the links. But I don't think Stack Overflow should be limited to just "programming questions". My question here is about CPU architecture and related tools. It is about how programmers collect performance data, and Linux just happens to be the most popular platform. I believe any good programmer, especially one who programs in C/C++, should be aware of the features the CPU provides, especially the CPU cache, in order to produce programs with good performance. It is definitely worth posting if any of the related tools is confusing.
– LouisYe
Mar 7 at 19:00
BTW, I made this post because I couldn't find the answer in another related Stack Overflow post.
– LouisYe
Mar 7 at 19:00