Does lock xchg have the same behavior as mfence?



What I'm wondering is whether lock xchg has behavior similar to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. Does it guarantee I get the most up-to-date value? And what about the memory read/write instructions that follow it?



The reason for my confusion is:




8.2.2 “Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.”



-Intel 64 Developers Manual Vol. 3




Does this apply across threads?



mfence states:




Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any SFENCE and LFENCE instructions, and any serializing instructions (such as the CPUID instruction).



-Intel 64 Developers Manual Vol 3A




This sounds like a stronger guarantee, as if mfence almost flushes the write buffer, or at least reaches out to the write buffer and other cores to ensure my future loads/stores are up to date.



When benchmarked, both instructions take on the order of ~100 cycles to complete, so I can't see a big difference either way.



Primarily I am just confused. I see instructions based around lock used in mutexes, but these contain no memory fences. Then I see lock-free programming that uses memory fences, but no locks. I understand AMD64 has a very strong memory model, but stale values can persist in cache. If lock doesn't have the same behavior as mfence, then how do mutexes help you see the most recent value?










  • Possibly a duplicate of: stackoverflow.com/questions/9027590/…

    – hidefromkgb
    Nov 3 '16 at 19:15












  • xchg includes the lock logic, so lock / xchg is redundant.

    – rcgldr
    Nov 3 '16 at 19:16











  • I'm aware, except clang actually emits lock xchg for atomic swapping size_t with Sequential ordering on x86_64. I was kind of copying and pasting.

    – Valarauca
    Nov 3 '16 at 19:17











  • @hidefromkgb this states that instructions cannot be reordered, but it does not answer whether loads/stores are serialized like what happens with mfence.

    – Valarauca
    Nov 3 '16 at 19:20






  • They're both full memory barriers. Don't have time to write a full answer, but see some of the memory-ordering links in the x86 tag wiki. MFENCE may also imply some other semantics about partially serializing the instruction stream, not just memory, at least on AMD CPUs where it's lower throughput than lock add for use as a memory barrier.

    – Peter Cordes
    Nov 4 '16 at 10:03

















multithreading assembly x86 cpu-architecture memory-barriers






edited Jul 15 '18 at 2:14









Peter Cordes

asked Nov 3 '16 at 18:59









Valarauca













1 Answer
I believe your question is the same as asking if mfence has the same barrier semantics as the lock-prefixed instructions on x86, or if it provides fewer1 or additional guarantees in some cases.



My current best answer is that it was Intel's intent and that the ISA documentation guarantees that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.



We know this from processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which tell us that locked instructions don't fence a subsequent non-temporal read from WC memory:




MOVNTDQA From WC Memory May Pass Earlier Locked Instructions



Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an
earlier locked instruction that accesses a different cache line.



Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.



Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA
should insert an MFENCE instruction between the locked instruction
and subsequent (V)MOVNTDQA instruction.




From this, we can determine that (1) Intel probably intended that locked instructions fence NT loads from WC-type memory, or else this wouldn't be an erratum0.5, and (2) locked instructions don't actually do that, Intel wasn't able to or chose not to fix this with a microcode update, and mfence is recommended instead.



In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions - this has pretty much the same text as the lock-instruction errata, but applies to mfence. However, the status of this errata is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".



This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with an apparently less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions work all the way back to Haswell was probably either impossible or expensive given their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.



It doesn't really explain why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked-instruction one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum, however.



One might speculate on what Intel will do in the future. Since they weren't able or willing to change the lock instructions for Haswell through Skylake, representing hundreds of millions (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so they do fence such reads, but as a practical matter you can't rely on this for probably a decade or more, until chips with the current non-fencing behavior are almost out of circulation.



Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.



Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.



The remaining part of this answer is my original answer, before I knew about the errata, and which is left mostly for historical interest.



Old Answer



My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered memory regions (e.g., WC-type memory).



That said, this is just an informed guess and you'll find details of my investigation below.



Details



Documentation



It isn't exactly clear to what extent the memory-consistency effects of mfence differ from those provided by a lock-prefixed instruction (including xchg with a memory operand, which is implicitly locked).



I think it is safe to say that, solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.



What is open for debate is whether mfence differs at all from lock-prefixed instructions when it comes to scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.



For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.



For example, quoting Dr. McCalpin in this thread (emphasis added):




The fence instruction is only needed to be absolutely sure that all of
the non-temporal stores are visible before a subsequent "ordinary"
store. The most obvious case where this matters is in a parallel
code, where the "barrier" at the end of a parallel region may include
an "ordinary" store. Without a fence, the processor might still have
modified data in the Write-Combining buffers, but pass through the
barrier and allow other processors to read "stale" copies of the
write-combined data. This scenario might also apply to a single
thread that is migrated by the OS from one core to another core (not
sure about this case).



I can't remember the detailed reasoning (not enough coffee yet this
morning), but the instruction you want to use after the non-temporal
stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the
SWDM, the MFENCE is the only fence instruction that prevents both
subsequent loads and subsequent stores from being executed ahead of
the completion of the fence.
I am surprised that this is not
mentioned in Section 11.3.1, which tells you how important it is to
manually ensure coherence when using write-combining, but does not
tell you how to do it!




Let's check out the referenced section 8.2.5 of the Intel SDM:




Strengthening or Weakening the Memory-Ordering Model



The Intel 64 and
IA-32 architectures provide several mechanisms for strengthening or
weakening the memory- ordering model to handle special programming
situations. These mechanisms include:



• The I/O instructions, locking
instructions, the LOCK prefix, and serializing instructions force
stronger ordering on the processor.



• The SFENCE instruction
(introduced to the IA-32 architecture in the Pentium III processor)
and the LFENCE and MFENCE instructions (introduced in the Pentium 4
processor) provide memory-ordering and serialization capabilities for
specific types of memory operations.



These mechanisms can be used as follows:



Memory mapped devices and
other I/O devices on the bus are often sensitive to the order of
writes to their I/O buffers. I/O instructions can be used to (the IN
and OUT instructions) impose strong write ordering on such accesses as
follows. Prior to executing an I/O instruction, the processor waits
for all previous instructions in the program to complete and for all
buffered writes to drain to memory. Only instruction fetch and page
tables walks can pass I/O instructions. Execution of subsequent
instructions do not begin until the processor determines that the I/O
instruction has been completed.



Synchronization mechanisms in multiple-processor systems may depend
upon a strong memory-ordering model. Here, a program can use a locking
instruction such as the XCHG instruction or the LOCK prefix to ensure
that a read-modify-write operation on memory is carried out
atomically. Locking operations typically operate like I/O operations
in that they wait for all previous instructions to complete and for
all buffered writes to drain to memory (see Section 8.1.2, “Bus
Locking”).



Program synchronization can also be carried out with
serializing instructions (see Section 8.3). These instructions are
typically used at critical procedure or task boundaries to force
completion of all previous instructions before a jump to a new section
of code or a context switch occurs. Like the I/O and locking
instructions, the processor waits until all previous instructions have
been completed and all buffered writes have been drained to memory
before executing the serializing instruction.



The SFENCE, LFENCE, and
MFENCE instructions provide a performance-efficient way of ensuring
load and store memory ordering between routines that produce
weakly-ordered results and routines that consume that data
. The
functions of these instructions are as follows:



• SFENCE — Serializes
all store (write) operations that occurred prior to the SFENCE
instruction in the program instruction stream, but does not affect
load operations.



• LFENCE — Serializes all load (read) operations that
occurred prior to the LFENCE instruction in the program instruction
stream, but does not affect store operations.



• MFENCE — Serializes
all store and load operations that occurred prior to the MFENCE
instruction in the program instruction stream.



Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




Contrary to Dr. McCalpin's interpretation2, I see this section as somewhat ambiguous as to whether mfence does something extra. The three sections referring to IO, locked instructions, and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They make no exception for weakly-ordered memory, and in the case of the IO instructions one would also assume they need to work consistently with weakly-ordered memory regions, since such regions are often used for IO.



Then the section on the FENCE instructions explicitly mentions weakly-ordered memory: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."



Do we read between the lines and take this to mean that these are the only instructions that accomplish this and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that fence instructions were introduced3 at the same time as weakly-ordered non-temporal store instructions, and by text like that found in 11.6.13 Cacheability Hint Instructions dealing specifically with weakly ordered instructions:




The degree to which a consumer of data knows that the data is weakly
ordered can vary for these cases. As a result, the SFENCE or MFENCE
instruction should be used to ensure ordering between routines that
produce weakly-ordered data and routines that consume the data. SFENCE
and MFENCE provide a performance-efficient way to ensure ordering by
guaranteeing that every store instruction that precedes SFENCE/MFENCE
in program order is globally visible before a store instruction that
follows the fence.




Again, here the fence instructions are specifically mentioned to be appropriate for fencing weakly ordered instructions.



We also find support for the idea that locked instructions might not provide a barrier between weakly-ordered accesses from the last sentence already quoted above:




Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




This basically implies that the FENCE instructions essentially replace functionality previously offered, in terms of memory ordering, by the serializing cpuid instruction. However, if lock-prefixed instructions provided the same barrier capability as cpuid, they would likely have been the previously suggested approach, since they are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly-ordered ones) that lock-prefixed instructions didn't handle, where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.



However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are a performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient ones.



Indeed, sfence, at a few cycles, is much faster than serializing instructions like cpuid or than lock-prefixed instructions, which are generally 20 cycles or more. On the other hand, mfence isn't generally faster than locked instructions4, at least on modern hardware. Still, it could have been faster when introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.



So I can't make a certain assessment based on these sections of the manual: I think you can make a reasonable argument that it could be interpreted either way.



We can further look at documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote:




Because the WC protocol uses a weakly-ordered memory consistency
model, a fencing operation implemented with the SFENCE or MFENCE
instruction should be used in conjunction with MOVNTI instructions if
multiple processors might use different memory types to read/write the
destination memory locations.




The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would expect this rather to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints". Indeed, the actual memory type (e.g., as defined by the MTRRs) probably doesn't even come into play here: the ordering issues can arise solely in WB memory when using weakly ordered instructions.



Performance



The mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs based on Agner Fog's instruction tables, but a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.



If mfence provided barrier semantics no stronger than lock cmpxchg, the latter is doing strictly more work and there is no apparent reason for mfence to take significantly longer. Of course you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. This argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the locked instructions, mfence would simply use the same one, as that's the simplest and easiest to validate.



So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.




0.5 This isn't a water-tight argument. Some things may appear in errata that are apparently "by design" and not a bug, such as popcnt false dependency on destination register - so some errata can be considered a form of documentation to update expectations rather than always implying a hardware bug.



1 Evidently, the lock-prefixed instructions also perform an atomic operation, which isn't possible to achieve solely with mfence, so the lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios, or to perform better.



2 It is also entirely possible that he was reading a different version of the manual where the prose was different.



3 sfence arrived with SSE; lfence and mfence with SSE2.



4 And often it's slower: Agner has it listed at 33 cycles latency on recent hardware, while locked instructions are usually about 20 cycles.


























  • 1





    On Skylake, xchg [shared], eax is a barrier for NT stores. Tested with this code that fills a buffer and stores current output position every cache line to a shared variable with (mfence+)mov or xchg: godbolt.org/g/7Q9xgz (some timing results in comments, from ocperf.py on the whole thing, so the timing includes the time for mmap(MAP_POPULATE)). With just mov but not mfence, we get reordering. But mfence+mov is ok, and so is xchg. The speed of the consumer loop is much different for the two producers, so there's some major difference.

    – Peter Cordes
    May 11 '18 at 0:27







  • 1





    That doesn't rule out locked instructions not fencing movntdqa loads from WC memory; I think I've seen a claim that mfence (not just lfence) is necessary there. The difference when interacting with a consumer thread that spins on reading is interesting and bears further investigation (perhaps with something that profiles producer and consumer separately, and doesn't count the time to mmap(MAP_POPULATE) ~4GiB of RAM. Also, testing on AMD CPUs would be interesting; the x86 docs on paper seem ambiguous, so the fact that xchg is a barrier on Intel doesn't tell us what they mean.

    – Peter Cordes
    May 11 '18 at 0:36






  • 1





    BTW, I compiled with t=nt-produce+consume.xchg; g++ -Wall -std=gnu++17 -march=native -pthread -O2 nt-fence-lock-buffer.cpp -o $t && taskset -c 3,4 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r3 ./"$t" (using gcc7.3.0 on Arch Linux on i7-6700k with DDR4-2666, with the CPU governor running it at ~3.8GHz for most of the test).

    – Peter Cordes
    May 11 '18 at 0:37






  • 1





    Thanks @PeterCordes, I had it on my to-do list for a while to run your tests, but now that this errata information has come to light, I think we can say that is highly likely that locked instructions are intended to, and do actually fence NT stores in the usual way, since we have the NT load errata and NT stores to WB-memory are an order of magnitude or two more common and spread across all kinds of code, so a divergence there would likely have been noted (and the fact that the load behavior deserved an errata means we can understand that Intel likely intended lock to fence).

    – BeeOnRope
    Jul 12 '18 at 21:23













I believe your question is the same as asking if mfence has the same barrier semantics as the lock-prefixed instructions on x86, or if it provides fewer1 or additional guarantees in some cases.



My current best answer is that it was Intel's intent and that the ISA documentation guarantees that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.



We know this because Intel tells us this in processor errata such as HSD162 (Haswell) and SKL155 (Skylake) which tell us that locked instructions don't fence a subsequent non-temporal read from WC-memory:




MOVNTDQA From WC Memory May Pass Earlier Locked Instructions



Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an
earlier locked instruction that accesses a different cache line.



Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.



Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA
should insert an MFENCE instruction between the locked instruction
and subsequent (V)MOVNTDQA instruction.




From this, we can determine that (1) Intel probably intended that locked instructions fence NT loads from WC-type memory, or else this wouldn't be an errata0.5 and (2) that locked instructions don't actually do that, and Intel wasn't able to or chose not to fix this with a microcode update, and mfence is recommended instead.



In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions - this has pretty much the same text as the lock-instruction errata, but applies to mfence. However, the status of this errata is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".



This sequence of errata can perhaps be explained by timing: the Haswell errata only appears in early 2016, years after the the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with apparently a less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions works all the way back to Haswell was probably either impossible or expensive based on their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.



It doesn't really explain why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so much after the identical Haswell errata, however.



One might speculate on what Intel will do in the future. Since they weren't able/willing to change the lock instruction for Haswell through Skylake, representing hundreds of million (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions, so they do fence such reads, but as a practical matter you can't rely on this probably for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.



Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.



Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.



The remaining part of this answer is my original answer, before I knew about the errata, and which is left mostly for historical interest.



Old Answer



My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses weakly-ordered regions (e.g., WC-type memory).



That said, this is just an informed guess and you'll find details of my investigation below.



Details



Documentation



It isn't exactly clear the extent that the memory consistency effects of mfence differs that provided by lock-prefixed instruction (including xchg with a memory operand, which is implicitly locked).



I think it is safe to say that solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as lock-prefixed operation.



What is open for debate is whether mfence differs at all from lock-prefixed instructions when it comes to scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.



For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.



For example, quoting Dr. McCalpin in this thread (emphasis added):




The fence instruction is only needed to be absolutely sure that all of
the non-temporal stores are visible before a subsequent "ordinary"
store. The most obvious case where this matters is in a parallel
code, where the "barrier" at the end of a parallel region may include
an "ordinary" store. Without a fence, the processor might still have
modified data in the Write-Combining buffers, but pass through the
barrier and allow other processors to read "stale" copies of the
write-combined data. This scenario might also apply to a single
thread that is migrated by the OS from one core to another core (not
sure about this case).



I can't remember the detailed reasoning (not enough coffee yet this
morning), but the instruction you want to use after the non-temporal
stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the
SWDM, the MFENCE is the only fence instruction that prevents both
subsequent loads and subsequent stores from being executed ahead of
the completion of the fence.
I am surprised that this is not
mentioned in Section 11.3.1, which tells you how important it is to
manually ensure coherence when using write-combining, but does not
tell you how to do it!




Let's check out the referenced section 8.2.5 of the Intel SDM:




Strengthening or Weakening the Memory-Ordering Model



The Intel 64 and
IA-32 architectures provide several mechanisms for strengthening or
weakening the memory- ordering model to handle special programming
situations. These mechanisms include:



• The I/O instructions, locking
instructions, the LOCK prefix, and serializing instructions force
stronger ordering on the processor.



• The SFENCE instruction
(introduced to the IA-32 architecture in the Pentium III processor)
and the LFENCE and MFENCE instructions (introduced in the Pentium 4
processor) provide memory-ordering and serialization capabilities for
specific types of memory operations.



These mechanisms can be used as follows:



Memory mapped devices and
other I/O devices on the bus are often sensitive to the order of
writes to their I/O buffers. I/O instructions can be used to (the IN
and OUT instructions) impose strong write ordering on such accesses as
follows. Prior to executing an I/O instruction, the processor waits
for all previous instructions in the program to complete and for all
buffered writes to drain to memory. Only instruction fetch and page
tables walks can pass I/O instructions. Execution of subsequent
instructions do not begin until the processor determines that the I/O
instruction has been completed.



Synchronization mechanisms in multiple-processor systems may depend
upon a strong memory-ordering model. Here, a program can use a locking
instruction such as the XCHG instruction or the LOCK prefix to ensure
that a read-modify-write operation on memory is carried out
atomically. Locking operations typically operate like I/O operations
in that they wait for all previous instructions to complete and for
all buffered writes to drain to memory (see Section 8.1.2, “Bus
Locking”).



Program synchronization can also be carried out with
serializing instructions (see Section 8.3). These instructions are
typically used at critical procedure or task boundaries to force
completion of all previous instructions before a jump to a new section
of code or a context switch occurs. Like the I/O and locking
instructions, the processor waits until all previous instructions have
been completed and all buffered writes have been drained to memory
before executing the serializing instruction.



The SFENCE, LFENCE, and
MFENCE instructions provide a performance-efficient way of ensuring
load and store memory ordering between routines that produce
weakly-ordered results and routines that consume that data
. The
functions of these instructions are as follows:



• SFENCE — Serializes
all store (write) operations that occurred prior to the SFENCE
instruction in the program instruction stream, but does not affect
load operations.



• LFENCE — Serializes all load (read) operations that
occurred prior to the LFENCE instruction in the program instruction
stream, but does not affect store operations.



• MFENCE — Serializes
all store and load operations that occurred prior to the MFENCE
instruction in the program instruction stream.



Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




Contrary to Dr. McCalpin's interpretation2, I see this section as somewhat ambiguous as to whether mfence does something extra. The three sections referring to IO, locked instructions and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They don't make any exception for weakly ordered memory and in the case of the IO instructions, one would also assume they need to work in a consistent way with weakly ordered memory regions since such are often used for IO.



Then the section for the FENCE instructions, it explicitly mentions weak memory regions: "The SFENCE, LFENCE, and MFENCE instructions **provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."



Do we read between the lines and take this to mean that these are the only instructions that accomplish this and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that fence instructions were introduced3 at the same time as weakly-ordered non-temporal store instructions, and by text like that found in 11.6.13 Cacheability Hint Instructions dealing specifically with weakly ordered instructions:




The degree to which a consumer of data knows that the data is weakly
ordered can vary for these cases. As a result, the SFENCE or MFENCE
instruction should be used to ensure ordering between routines that
produce weakly-ordered data and routines that consume the data. SFENCE
and MFENCE provide a performance-efficient way to ensure ordering by
guaranteeing that every store instruction that precedes SFENCE/MFENCE
in program order is globally visible before a store instruction that
follows the fence.




Again, here the fence instructions are specifically mentioned to be appropriate for fencing weakly ordered instructions.



We also find support for the idea that locked instruction might not provide a barrier between weakly ordered accesses from the last sentence already quoted above:




Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




Here is basically implies that the FENCE instructions essentially replace a functionality previously offered by the serializing cpuid in terms of memory ordering. However, if lock-prefixed instructions provided the same barrier capability as cpuid, that would likely have been the previously suggested way, since these are in general much faster than cpuid which often takes 200 or more cycles. The implication being that there were scenarios (probably weakly ordered scenarios) that lock-prefixed instructions didn't handle, and where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.



However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient barriers for.



Indeed, sfence at a few cycles is much faster than serializing instructions like cpuid or lock-prefixed instructions which are generally 20 cycles or more. On the other hand mfence isn't generally faster than locked instructions4, at least on modern hardware. Still, it could have been faster when introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.



So I can't make a certain assessment based on these sections of the manual: I think you can make a reasonable argument that it could be interpreted either way.



We can further look at documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote:




Because the WC protocol uses a weakly-ordered memory consistency
model, a fencing operation implemented with the SFENCE or MFENCE
instruction should be used in conjunction with MOVNTI instructions if
multiple processors might use different memory types to read/write the
destination memory locations.




The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would expect this rather to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints" or something like that. Indeed, the actual memory type (e.g., as defined by the MTTR) probably doesn't even come into play here: the ordering issues can arise solely in WB-memory when using weakly ordered instructions.



Performance



The mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs based on Agner fog's instruction timing, but a more complex locked instructon like lock cmpxchg is reported to take only 18 cycles.



If mfence provided barrier semantics no stronger than lock cmpxchg, the latter is doing strictly more work and there is no apparent reason for mfence to take significantly longer. Of course you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. This argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the lock instructions, mfence would simply use the same one as that's the simplest and easiest to validation.



So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing some extra.




0.5 This isn't a water-tight argument. Some things may appear in errata that are apparently "by design" and not a bug, such as popcnt false dependency on destination register - so some errata can be considered a form of documentation to update expectations rather than always implying a hardware bug.



1 Evidently, the lock-prefixed instruction also perform an atomic operation which isn't possible to achieve solely with mfence, so the lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios, or to perform better.



2 It is also entirely possible that he was reading a different version of the manual where the prose was different.



3SFENCE in SSE, lfence and mfence in SSE2.



4 And often it's slower: Agner has it listed at 33 cycles latency on recent hardware, while locked instructions are usually about 20 cycles.






share|improve this answer




















  • 1





    On Skylake, xchg [shared], eax is a barrier for NT stores. Tested with this code that fills a buffer and stores current output position every cache line to a shared variable with (mfence+)mov or xchg: godbolt.org/g/7Q9xgz (some timing results in comments, from ocperf.py on the whole thing, so the timing includes the time for mmap(MAP_POPULATE)). With just mov but not mfence, we get reordering. But mfence+mov is ok, and so is xchg. The speed of the consumer loop is much different for the two producers, so there's some major difference.

    – Peter Cordes
    May 11 '18 at 0:27







  • 1





    That doesn't rule out locked instructions not fencing movntdqa loads from WC memory; I think I've seen a claim that mfence (not just lfence) is necessary there. The difference when interacting with a consumer thread that spins on reading is interesting and bears further investigation (perhaps with something that profiles producer and consumer separately, and doesn't count the time to mmap(MAP_POPULATE) ~4GiB of RAM. Also, testing on AMD CPUs would be interesting; the x86 docs on paper seem ambiguous, so the fact that xchg is a barrier on Intel doesn't tell us what they mean.

    – Peter Cordes
    May 11 '18 at 0:36






  • 1





    BTW, I compiled with t=nt-produce+consume.xchg; g++ -Wall -std=gnu++17 -march=native -pthread -O2 nt-fence-lock-buffer.cpp -o $t && taskset -c 3,4 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r3 ./"$t" (using gcc7.3.0 on Arch Linux on i7-6700k with DDR4-2666, with the CPU governor running it at ~3.8GHz for most of the test).

    – Peter Cordes
    May 11 '18 at 0:37






  • 1





    Thanks @PeterCordes, I had it on my to-do list for a while to run your tests, but now that this errata information has come to light, I think we can say that is highly likely that locked instructions are intended to, and do actually fence NT stores in the usual way, since we have the NT load errata and NT stores to WB-memory are an order of magnitude or two more common and spread across all kinds of code, so a divergence there would likely have been noted (and the fact that the load behavior deserved an errata means we can understand that Intel likely intended lock to fence).

    – BeeOnRope
    Jul 12 '18 at 21:23















5














I believe your question is the same as asking if mfence has the same barrier semantics as the lock-prefixed instructions on x86, or if it provides fewer1 or additional guarantees in some cases.



My current best answer is that it was Intel's intent and that the ISA documentation guarantees that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.



We know this because Intel tells us this in processor errata such as HSD162 (Haswell) and SKL155 (Skylake) which tell us that locked instructions don't fence a subsequent non-temporal read from WC-memory:




MOVNTDQA From WC Memory May Pass Earlier Locked Instructions



Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an
earlier locked instruction that accesses a different cache line.



Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.



Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA
should insert an MFENCE instruction between the locked instruction
and subsequent (V)MOVNTDQA instruction.




From this, we can determine that (1) Intel probably intended that locked instructions fence NT loads from WC-type memory, or else this wouldn't be an errata0.5 and (2) that locked instructions don't actually do that, and Intel wasn't able to or chose not to fix this with a microcode update, and mfence is recommended instead.



In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions - this has pretty much the same text as the lock-instruction errata, but applies to mfence. However, the status of this errata is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".



This sequence of errata can perhaps be explained by timing: the Haswell errata only appears in early 2016, years after the the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with apparently a less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions works all the way back to Haswell was probably either impossible or expensive based on their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.



It doesn't really explain why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so much after the identical Haswell errata, however.



One might speculate on what Intel will do in the future. Since they weren't able/willing to change the lock instruction for Haswell through Skylake, representing hundreds of million (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions, so they do fence such reads, but as a practical matter you can't rely on this probably for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.



Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.



Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.



The remaining part of this answer is my original answer, before I knew about the errata, and which is left mostly for historical interest.



Old Answer



My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).



That said, this is just an informed guess and you'll find details of my investigation below.



Details



Documentation



It isn't exactly clear to what extent the memory consistency effects of mfence differ from those provided by a lock-prefixed instruction (including xchg with a memory operand, which is implicitly locked).



I think it is safe to say that, solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.
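This equivalence on WB memory is visible in compiler output. A small sketch (the codegen comments reflect typical GCC/Clang behavior, not anything mandated by the standard):

```cpp
#include <atomic>

std::atomic<int> g{0};

// On WB memory both lowerings of a sequentially consistent store provide
// the same ordering guarantees, and compilers have used both:
//   mov  DWORD PTR g[rip], 1
//   mfence                        ; older GCC: plain store + mfence
// versus
//   mov  eax, 1
//   xchg DWORD PTR g[rip], eax    ; modern GCC/Clang: implicitly locked
void seq_cst_store() {
    g.store(1, std::memory_order_seq_cst);
}
```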



What is open for debate is whether mfence differs at all from lock-prefixed instructions when it comes to scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.



For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.



For example, quoting Dr. McCalpin in this thread (emphasis added):




The fence instruction is only needed to be absolutely sure that all of
the non-temporal stores are visible before a subsequent "ordinary"
store. The most obvious case where this matters is in a parallel
code, where the "barrier" at the end of a parallel region may include
an "ordinary" store. Without a fence, the processor might still have
modified data in the Write-Combining buffers, but pass through the
barrier and allow other processors to read "stale" copies of the
write-combined data. This scenario might also apply to a single
thread that is migrated by the OS from one core to another core (not
sure about this case).



I can't remember the detailed reasoning (not enough coffee yet this
morning), but the instruction you want to use after the non-temporal
stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the
SWDM, the MFENCE is the only fence instruction that prevents both
subsequent loads and subsequent stores from being executed ahead of
the completion of the fence.
I am surprised that this is not
mentioned in Section 11.3.1, which tells you how important it is to
manually ensure coherence when using write-combining, but does not
tell you how to do it!




Let's check out the referenced section 8.2.5 of the Intel SDM:




Strengthening or Weakening the Memory-Ordering Model



The Intel 64 and
IA-32 architectures provide several mechanisms for strengthening or
weakening the memory- ordering model to handle special programming
situations. These mechanisms include:



• The I/O instructions, locking
instructions, the LOCK prefix, and serializing instructions force
stronger ordering on the processor.



• The SFENCE instruction
(introduced to the IA-32 architecture in the Pentium III processor)
and the LFENCE and MFENCE instructions (introduced in the Pentium 4
processor) provide memory-ordering and serialization capabilities for
specific types of memory operations.



These mechanisms can be used as follows:



Memory mapped devices and
other I/O devices on the bus are often sensitive to the order of
writes to their I/O buffers. I/O instructions can be used to (the IN
and OUT instructions) impose strong write ordering on such accesses as
follows. Prior to executing an I/O instruction, the processor waits
for all previous instructions in the program to complete and for all
buffered writes to drain to memory. Only instruction fetch and page
tables walks can pass I/O instructions. Execution of subsequent
instructions do not begin until the processor determines that the I/O
instruction has been completed.



Synchronization mechanisms in multiple-processor systems may depend
upon a strong memory-ordering model. Here, a program can use a locking
instruction such as the XCHG instruction or the LOCK prefix to ensure
that a read-modify-write operation on memory is carried out
atomically. Locking operations typically operate like I/O operations
in that they wait for all previous instructions to complete and for
all buffered writes to drain to memory (see Section 8.1.2, “Bus
Locking”).



Program synchronization can also be carried out with
serializing instructions (see Section 8.3). These instructions are
typically used at critical procedure or task boundaries to force
completion of all previous instructions before a jump to a new section
of code or a context switch occurs. Like the I/O and locking
instructions, the processor waits until all previous instructions have
been completed and all buffered writes have been drained to memory
before executing the serializing instruction.



The SFENCE, LFENCE, and
MFENCE instructions provide a performance-efficient way of ensuring
load and store memory ordering between routines that produce
weakly-ordered results and routines that consume that data
. The
functions of these instructions are as follows:



• SFENCE — Serializes
all store (write) operations that occurred prior to the SFENCE
instruction in the program instruction stream, but does not affect
load operations.



• LFENCE — Serializes all load (read) operations that
occurred prior to the LFENCE instruction in the program instruction
stream, but does not affect store operations.



• MFENCE — Serializes
all store and load operations that occurred prior to the MFENCE
instruction in the program instruction stream.



Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




Contrary to Dr. McCalpin's interpretation2, I see this section as somewhat ambiguous as to whether mfence does something extra. The three sections referring to I/O, locked instructions and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They make no exception for weakly ordered memory, and in the case of the I/O instructions one would also assume they need to work consistently with weakly ordered memory regions, since such regions are often used for I/O.



Then, in the section for the FENCE instructions, weakly ordered memory is explicitly mentioned: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."



Do we read between the lines and take this to mean that these are the only instructions that accomplish this and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that fence instructions were introduced3 at the same time as weakly-ordered non-temporal store instructions, and by text like that found in 11.6.13 Cacheability Hint Instructions dealing specifically with weakly ordered instructions:




The degree to which a consumer of data knows that the data is weakly
ordered can vary for these cases. As a result, the SFENCE or MFENCE
instruction should be used to ensure ordering between routines that
produce weakly-ordered data and routines that consume the data. SFENCE
and MFENCE provide a performance-efficient way to ensure ordering by
guaranteeing that every store instruction that precedes SFENCE/MFENCE
in program order is globally visible before a store instruction that
follows the fence.




Again, here the fence instructions are specifically mentioned to be appropriate for fencing weakly ordered instructions.
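The producer side of the pattern described in 11.6.13 can be sketched as follows (a minimal illustration: the sfence drains the write-combining buffers before the flag that the consumer spins on becomes visible):

```cpp
#include <atomic>
#include <immintrin.h>

// Publish weakly-ordered (NT) data to a consumer: the sfence guarantees
// every streamed store is globally visible before the release store of
// the flag, per the 11.6.13 guidance quoted above.
static int payload[16];
static std::atomic<bool> ready{false};

void produce() {
    for (int i = 0; i < 16; ++i)
        _mm_stream_si32(&payload[i], i * i);  // NT store (movnti)
    _mm_sfence();                             // drain WC buffers first
    ready.store(true, std::memory_order_release);
}
```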



We also find support for the idea that locked instructions might not provide a barrier between weakly ordered accesses in the last sentence already quoted above:




Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




This basically implies that the FENCE instructions essentially replace functionality previously offered, in terms of memory ordering, by the serializing cpuid instruction. However, if lock-prefixed instructions provided the same barrier capability as cpuid, that would likely have been the previously suggested way, since they are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly ordered ones) that lock-prefixed instructions didn't handle, where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.



However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are a performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient implementations of barriers that already existed.



Indeed, sfence at a few cycles is much faster than serializing instructions like cpuid, or than lock-prefixed instructions, which are generally 20 cycles or more. On the other hand, mfence isn't generally faster than locked instructions4, at least on modern hardware. Still, it could have been faster when introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.



So I can't make a certain assessment based on these sections of the manual: I think you can make a reasonable argument that it could be interpreted either way.



We can further look at documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote:




Because the WC protocol uses a weakly-ordered memory consistency
model, a fencing operation implemented with the SFENCE or MFENCE
instruction should be used in conjunction with MOVNTI instructions if
multiple processors might use different memory types to read/write the
destination memory locations.




The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would expect this rather to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints" or something like that. Indeed, the actual memory type (e.g., as defined by the MTRR) probably doesn't even come into play here: the ordering issues can arise solely in WB memory when using weakly ordered instructions.



Performance



The mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs based on Agner Fog's instruction timings, but a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.



If mfence provided barrier semantics no stronger than lock cmpxchg, the latter is doing strictly more work and there is no apparent reason for mfence to take significantly longer. Of course you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization. That argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the locked instructions, mfence would simply use the same one, as that's the simplest and easiest to validate.



So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.
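A rough way to check these timings yourself is a back-to-back loop over each instruction, in the spirit of Agner Fog's tables. A sketch (rdtsc granularity and frequency scaling make the absolute numbers noisy; only the relative gap between the two is interesting, and it varies by microarchitecture):

```cpp
#include <atomic>
#include <cstdint>
#include <x86intrin.h>

// Average back-to-back cost, in reference cycles, of mfence vs. a locked
// RMW. On many recent Intel parts mfence comes out noticeably slower,
// consistent with the numbers quoted above.
static std::atomic<int> x{0};

uint64_t time_mfence(int iters) {
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; ++i) _mm_mfence();
    return (__rdtsc() - t0) / iters;
}

uint64_t time_locked(int iters) {
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; ++i) x.fetch_add(1, std::memory_order_seq_cst);
    return (__rdtsc() - t0) / iters;
}
```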




0.5 This isn't a watertight argument. Some things may appear in errata that are apparently "by design" and not a bug, such as the popcnt false dependency on the destination register - so some errata can be considered a form of documentation to update expectations rather than always implying a hardware bug.



1 Evidently, lock-prefixed instructions also perform an atomic operation, which isn't possible to achieve solely with mfence, so lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios, or to perform better.



2 It is also entirely possible that he was reading a different version of the manual where the prose was different.



3 sfence in SSE, lfence and mfence in SSE2.



4 And often it's slower: Agner has it listed at 33 cycles latency on recent hardware, while locked instructions are usually about 20 cycles.































    On Skylake, xchg [shared], eax is a barrier for NT stores. Tested with this code that fills a buffer and stores current output position every cache line to a shared variable with (mfence+)mov or xchg: godbolt.org/g/7Q9xgz (some timing results in comments, from ocperf.py on the whole thing, so the timing includes the time for mmap(MAP_POPULATE)). With just mov but not mfence, we get reordering. But mfence+mov is ok, and so is xchg. The speed of the consumer loop is much different for the two producers, so there's some major difference.

    – Peter Cordes
    May 11 '18 at 0:27












    That doesn't rule out locked instructions not fencing movntdqa loads from WC memory; I think I've seen a claim that mfence (not just lfence) is necessary there. The difference when interacting with a consumer thread that spins on reading is interesting and bears further investigation (perhaps with something that profiles producer and consumer separately, and doesn't count the time to mmap(MAP_POPULATE) ~4GiB of RAM. Also, testing on AMD CPUs would be interesting; the x86 docs on paper seem ambiguous, so the fact that xchg is a barrier on Intel doesn't tell us what they mean.

    – Peter Cordes
    May 11 '18 at 0:36











    BTW, I compiled with t=nt-produce+consume.xchg; g++ -Wall -std=gnu++17 -march=native -pthread -O2 nt-fence-lock-buffer.cpp -o $t && taskset -c 3,4 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r3 ./"$t" (using gcc7.3.0 on Arch Linux on i7-6700k with DDR4-2666, with the CPU governor running it at ~3.8GHz for most of the test).

    – Peter Cordes
    May 11 '18 at 0:37











    Thanks @PeterCordes, I had it on my to-do list for a while to run your tests, but now that this errata information has come to light, I think we can say that is highly likely that locked instructions are intended to, and do actually fence NT stores in the usual way, since we have the NT load errata and NT stores to WB-memory are an order of magnitude or two more common and spread across all kinds of code, so a divergence there would likely have been noted (and the fact that the load behavior deserved an errata means we can understand that Intel likely intended lock to fence).

    – BeeOnRope
    Jul 12 '18 at 21:23













5












5








5







I believe your question is the same as asking if mfence has the same barrier semantics as the lock-prefixed instructions on x86, or if it provides fewer1 or additional guarantees in some cases.



My current best answer is that it was Intel's intent and that the ISA documentation guarantees that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.



We know this because Intel tells us this in processor errata such as HSD162 (Haswell) and SKL155 (Skylake) which tell us that locked instructions don't fence a subsequent non-temporal read from WC-memory:




MOVNTDQA From WC Memory May Pass Earlier Locked Instructions



Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an
earlier locked instruction that accesses a different cache line.



Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.



Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA
should insert an MFENCE instruction between the locked instruction
and subsequent (V)MOVNTDQA instruction.




From this, we can determine that (1) Intel probably intended that locked instructions fence NT loads from WC-type memory, or else this wouldn't be an errata0.5 and (2) that locked instructions don't actually do that, and Intel wasn't able to or chose not to fix this with a microcode update, and mfence is recommended instead.



In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions - this has pretty much the same text as the lock-instruction errata, but applies to mfence. However, the status of this errata is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".



This sequence of errata can perhaps be explained by timing: the Haswell errata only appears in early 2016, years after the the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with apparently a less conservative mfence implementation which also didn't fence NT loads on WC-type memory regions. Fixing the way locked instructions works all the way back to Haswell was probably either impossible or expensive based on their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.



It doesn't really explain why SKL079 (the mfence one) appeared in January 2016, nearly two years before SKL155 (the locked one) appeared in late 2017, or why the latter appeared so much after the identical Haswell errata, however.



One might speculate on what Intel will do in the future. Since they weren't able/willing to change the lock instruction for Haswell through Skylake, representing hundreds of million (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions, so they do fence such reads, but as a practical matter you can't rely on this probably for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.



Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.



Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.



The remaining part of this answer is my original answer, before I knew about the errata, and which is left mostly for historical interest.



Old Answer



My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses weakly-ordered regions (e.g., WC-type memory).



That said, this is just an informed guess and you'll find details of my investigation below.



Details



Documentation



It isn't exactly clear the extent that the memory consistency effects of mfence differs that provided by lock-prefixed instruction (including xchg with a memory operand, which is implicitly locked).



I think it is safe to say that solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as lock-prefixed operation.



What is open for debate is whether mfence differs at all from lock-prefixed instructions when it comes to scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.



For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.



For example, quoting Dr. McCalpin in this thread (emphasis added):




The fence instruction is only needed to be absolutely sure that all of
the non-temporal stores are visible before a subsequent "ordinary"
store. The most obvious case where this matters is in a parallel
code, where the "barrier" at the end of a parallel region may include
an "ordinary" store. Without a fence, the processor might still have
modified data in the Write-Combining buffers, but pass through the
barrier and allow other processors to read "stale" copies of the
write-combined data. This scenario might also apply to a single
thread that is migrated by the OS from one core to another core (not
sure about this case).



I can't remember the detailed reasoning (not enough coffee yet this
morning), but the instruction you want to use after the non-temporal
stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the
SWDM, the MFENCE is the only fence instruction that prevents both
subsequent loads and subsequent stores from being executed ahead of
the completion of the fence.
I am surprised that this is not
mentioned in Section 11.3.1, which tells you how important it is to
manually ensure coherence when using write-combining, but does not
tell you how to do it!




Let's check out the referenced section 8.2.5 of the Intel SDM:




Strengthening or Weakening the Memory-Ordering Model



The Intel 64 and
IA-32 architectures provide several mechanisms for strengthening or
weakening the memory- ordering model to handle special programming
situations. These mechanisms include:



• The I/O instructions, locking
instructions, the LOCK prefix, and serializing instructions force
stronger ordering on the processor.



• The SFENCE instruction
(introduced to the IA-32 architecture in the Pentium III processor)
and the LFENCE and MFENCE instructions (introduced in the Pentium 4
processor) provide memory-ordering and serialization capabilities for
specific types of memory operations.



These mechanisms can be used as follows:



Memory mapped devices and
other I/O devices on the bus are often sensitive to the order of
writes to their I/O buffers. I/O instructions can be used to (the IN
and OUT instructions) impose strong write ordering on such accesses as
follows. Prior to executing an I/O instruction, the processor waits
for all previous instructions in the program to complete and for all
buffered writes to drain to memory. Only instruction fetch and page
tables walks can pass I/O instructions. Execution of subsequent
instructions do not begin until the processor determines that the I/O
instruction has been completed.



Synchronization mechanisms in multiple-processor systems may depend
upon a strong memory-ordering model. Here, a program can use a locking
instruction such as the XCHG instruction or the LOCK prefix to ensure
that a read-modify-write operation on memory is carried out
atomically. Locking operations typically operate like I/O operations
in that they wait for all previous instructions to complete and for
all buffered writes to drain to memory (see Section 8.1.2, “Bus
Locking”).



Program synchronization can also be carried out with
serializing instructions (see Section 8.3). These instructions are
typically used at critical procedure or task boundaries to force
completion of all previous instructions before a jump to a new section
of code or a context switch occurs. Like the I/O and locking
instructions, the processor waits until all previous instructions have
been completed and all buffered writes have been drained to memory
before executing the serializing instruction.



The SFENCE, LFENCE, and
MFENCE instructions provide a performance-efficient way of ensuring
load and store memory ordering between routines that produce
weakly-ordered results and routines that consume that data
. The
functions of these instructions are as follows:



• SFENCE — Serializes
all store (write) operations that occurred prior to the SFENCE
instruction in the program instruction stream, but does not affect
load operations.



• LFENCE — Serializes all load (read) operations that
occurred prior to the LFENCE instruction in the program instruction
stream, but does not affect store operations.



• MFENCE — Serializes
all store and load operations that occurred prior to the MFENCE
instruction in the program instruction stream.



Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




Contrary to Dr. McCalpin's interpretation2, I see this section as somewhat ambiguous as to whether mfence does something extra. The three sections referring to IO, locked instructions and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They don't make any exception for weakly ordered memory and in the case of the IO instructions, one would also assume they need to work in a consistent way with weakly ordered memory regions since such are often used for IO.



Then the section for the FENCE instructions, it explicitly mentions weak memory regions: "The SFENCE, LFENCE, and MFENCE instructions **provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."



Do we read between the lines and take this to mean that these are the only instructions that accomplish this and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that fence instructions were introduced3 at the same time as weakly-ordered non-temporal store instructions, and by text like that found in 11.6.13 Cacheability Hint Instructions dealing specifically with weakly ordered instructions:




The degree to which a consumer of data knows that the data is weakly
ordered can vary for these cases. As a result, the SFENCE or MFENCE
instruction should be used to ensure ordering between routines that
produce weakly-ordered data and routines that consume the data. SFENCE
and MFENCE provide a performance-efficient way to ensure ordering by
guaranteeing that every store instruction that precedes SFENCE/MFENCE
in program order is globally visible before a store instruction that
follows the fence.




Again, here the fence instructions are specifically mentioned to be appropriate for fencing weakly ordered instructions.



We also find support for the idea that locked instruction might not provide a barrier between weakly ordered accesses from the last sentence already quoted above:




Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




Here is basically implies that the FENCE instructions essentially replace a functionality previously offered by the serializing cpuid in terms of memory ordering. However, if lock-prefixed instructions provided the same barrier capability as cpuid, that would likely have been the previously suggested way, since these are in general much faster than cpuid which often takes 200 or more cycles. The implication being that there were scenarios (probably weakly ordered scenarios) that lock-prefixed instructions didn't handle, and where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.



However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient barriers for.



Indeed, sfence at a few cycles is much faster than serializing instructions like cpuid or lock-prefixed instructions which are generally 20 cycles or more. On the other hand mfence isn't generally faster than locked instructions4, at least on modern hardware. Still, it could have been faster when introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.



So I can't make a certain assessment based on these sections of the manual: I think you can make a reasonable argument that it could be interpreted either way.



We can further look at documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote:




Because the WC protocol uses a weakly-ordered memory consistency
model, a fencing operation implemented with the SFENCE or MFENCE
instruction should be used in conjunction with MOVNTI instructions if
multiple processors might use different memory types to read/write the
destination memory locations.




I believe your question is the same as asking if mfence has the same barrier semantics as the lock-prefixed instructions on x86, or if it provides fewer1 or additional guarantees in some cases.



My current best answer is that it was Intel's intent and that the ISA documentation guarantees that mfence and locked instructions provide the same fencing semantics, but that due to implementation oversights, mfence actually provides stronger fencing semantics on recent hardware (since at least Haswell). In particular, mfence can fence a subsequent non-temporal load from a WC-type memory region, while locked instructions do not.



We know this because Intel tells us so in processor errata such as HSD162 (Haswell) and SKL155 (Skylake), which state that locked instructions don't fence a subsequent non-temporal read from WC memory:




MOVNTDQA From WC Memory May Pass Earlier Locked Instructions



Problem: An execution of (V)MOVNTDQA (streaming load instruction) that loads from WC (write combining) memory may appear to pass an
earlier locked instruction that accesses a different cache line.



Implication: Software that expects a lock to fence subsequent (V)MOVNTDQA instructions may not operate properly.



Workaround: None identified. Software that relies on a locked instruction to fence subsequent executions of (V)MOVNTDQA
should insert an MFENCE instruction between the locked instruction
and subsequent (V)MOVNTDQA instruction.




From this, we can determine that (1) Intel probably intended that locked instructions fence NT loads from WC-type memory, or else this wouldn't be an erratum0.5, and (2) that locked instructions don't actually do that, and Intel wasn't able to or chose not to fix this with a microcode update; mfence is recommended instead.
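In assembly, the erratum's workaround amounts to something like the following sketch (illustrative only; the label and register choices are hypothetical, not from the erratum text):

```nasm
; Sketch of the HSD162/SKL155 workaround. On affected parts, the locked
; RMW alone may not fence the subsequent NT load from WC memory.
lock add dword [sync_var], 1    ; locked instruction: atomic RMW, intended full fence
mfence                          ; erratum workaround: insert MFENCE between the two
movntdqa xmm0, [wc_buffer]      ; streaming load from a WC-mapped region
```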



In Skylake, mfence actually lost its additional fencing capability with respect to NT loads, as per SKL079: MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions - this has pretty much the same text as the lock-instruction errata, but applies to mfence. However, the status of this erratum is "It is possible for the BIOS to contain a workaround for this erratum.", which is generally Intel-speak for "a microcode update addresses this".



This sequence of errata can perhaps be explained by timing: the Haswell erratum only appeared in early 2016, years after the release of that processor, so we can assume the issue came to Intel's attention some moderate amount of time before that. At this point Skylake was almost certainly already out in the wild, with an apparently less conservative mfence implementation which also didn't fence NT loads from WC-type memory regions. Fixing the way locked instructions work all the way back to Haswell was probably either impossible or expensive given their wide use, but some way was needed to fence NT loads. mfence apparently already did the job on Haswell, and Skylake would be fixed so that mfence worked there too.



It doesn't really explain why SKL079 (the mfence erratum) appeared in January 2016, nearly two years before SKL155 (the locked-instruction one) appeared in late 2017, or why the latter appeared so long after the identical Haswell erratum, however.



One might speculate on what Intel will do in the future. Since they weren't able/willing to change the lock instruction behavior for Haswell through Skylake, representing hundreds of millions (billions?) of deployed chips, they'll never be able to guarantee that locked instructions fence NT loads, so they might consider making this the documented, architected behavior in the future. Or they might update the locked instructions so they do fence such reads, but as a practical matter you probably can't rely on this for a decade or more, until chips with the current non-fencing behavior are almost out of circulation.



Similar to Haswell, according to BV116 and BJ138, NT loads may pass earlier locked instructions on Sandy Bridge and Ivy Bridge, respectively. It's possible that earlier microarchitectures also suffer from this issue. This "bug" does not seem to exist in Broadwell and microarchitectures after Skylake.



Peter Cordes has written a bit about the Skylake mfence change at the end of this answer.



The remaining part of this answer is my original answer, written before I knew about the errata, and left mostly for historical interest.



Old Answer



My informed guess at the answer is that mfence provides additional barrier functionality: between accesses using weakly-ordered instructions (e.g., NT stores) and perhaps between accesses to weakly-ordered regions (e.g., WC-type memory).



That said, this is just an informed guess and you'll find details of my investigation below.
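To make the first scenario concrete, here is a minimal sketch (mine, not from the question) using the SSE2 intrinsics `_mm_stream_si32` and `_mm_sfence`; x86-64 is assumed, and the `publish` function and its parameters are hypothetical names for illustration:

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si32, _mm_sfence */

/* Publish a value with a non-temporal store, then a flag with an ordinary
 * store. Without the sfence, the NT store sits in a write-combining buffer
 * and could become globally visible *after* the flag, despite program
 * order, letting another core read a stale value through the flag. */
void publish(int *data, int *flag, int value)
{
    _mm_stream_si32(data, value); /* weakly-ordered (WC-buffered) NT store */
    _mm_sfence();                 /* drain WC buffers: data visible before flag */
    *flag = 1;                    /* ordinary WB store, ordered after the fence */
}
```

Whether a lock-prefixed RMW in place of the `_mm_sfence` gives the same guarantee is exactly the question under discussion.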



Details



Documentation



It isn't exactly clear to what extent the memory-consistency effects of mfence differ from those provided by a lock-prefixed instruction (including xchg with a memory operand, which is implicitly locked).



I think it is safe to say that, solely with respect to write-back memory regions and not involving any non-temporal accesses, mfence provides the same ordering semantics as a lock-prefixed operation.
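For that WB-only case, the equivalence can be illustrated with C11 atomics, where `atomic_thread_fence(memory_order_seq_cst)` is typically compiled to either mfence or a lock-prefixed dummy RMW on x86, depending on the compiler. This is a sketch of the classic store-load (Dekker) pattern, not code from the question:

```c
#include <stdatomic.h>

atomic_int x, y;  /* both start at 0 */

/* Each thread stores its own flag, then reads the other's. Without a
 * full barrier, x86 may reorder the load before the store (StoreLoad
 * reordering), so both threads could read 0. With the seq_cst fence
 * (mfence, or an equivalent locked instruction), at least one thread
 * must observe the other's store. */
int thread_a(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* full StoreLoad barrier */
    return atomic_load_explicit(&y, memory_order_relaxed);
}

int thread_b(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&x, memory_order_relaxed);
}
```

Which instruction the compiler picks for the fence is an implementation detail; the point of this question is whether the two choices are actually interchangeable in all cases.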



What is open for debate is whether mfence differs at all from lock-prefixed instructions when it comes to scenarios outside the above, in particular when accesses involve regions other than WB regions or when non-temporal (streaming) operations are involved.



For example, you can find some suggestions (such as here or here) that mfence implies strong barrier semantics when WC-type operations (e.g., NT stores) are involved.



For example, quoting Dr. McCalpin in this thread (emphasis added):




The fence instruction is only needed to be absolutely sure that all of
the non-temporal stores are visible before a subsequent "ordinary"
store. The most obvious case where this matters is in a parallel
code, where the "barrier" at the end of a parallel region may include
an "ordinary" store. Without a fence, the processor might still have
modified data in the Write-Combining buffers, but pass through the
barrier and allow other processors to read "stale" copies of the
write-combined data. This scenario might also apply to a single
thread that is migrated by the OS from one core to another core (not
sure about this case).



I can't remember the detailed reasoning (not enough coffee yet this
morning), but the instruction you want to use after the non-temporal
stores is an MFENCE. According to Section 8.2.5 of Volume 3 of the
SWDM, the MFENCE is the only fence instruction that prevents both
subsequent loads and subsequent stores from being executed ahead of
the completion of the fence.
I am surprised that this is not
mentioned in Section 11.3.1, which tells you how important it is to
manually ensure coherence when using write-combining, but does not
tell you how to do it!




Let's check out the referenced section 8.2.5 of the Intel SDM:




Strengthening or Weakening the Memory-Ordering Model



The Intel 64 and
IA-32 architectures provide several mechanisms for strengthening or
weakening the memory- ordering model to handle special programming
situations. These mechanisms include:



• The I/O instructions, locking
instructions, the LOCK prefix, and serializing instructions force
stronger ordering on the processor.



• The SFENCE instruction
(introduced to the IA-32 architecture in the Pentium III processor)
and the LFENCE and MFENCE instructions (introduced in the Pentium 4
processor) provide memory-ordering and serialization capabilities for
specific types of memory operations.



These mechanisms can be used as follows:



Memory mapped devices and
other I/O devices on the bus are often sensitive to the order of
writes to their I/O buffers. The I/O instructions (the IN and OUT
instructions) can be used to impose strong write ordering on such accesses as
follows. Prior to executing an I/O instruction, the processor waits
for all previous instructions in the program to complete and for all
buffered writes to drain to memory. Only instruction fetch and page
table walks can pass I/O instructions. Execution of subsequent
instructions does not begin until the processor determines that the I/O
instruction has been completed.



Synchronization mechanisms in multiple-processor systems may depend
upon a strong memory-ordering model. Here, a program can use a locking
instruction such as the XCHG instruction or the LOCK prefix to ensure
that a read-modify-write operation on memory is carried out
atomically. Locking operations typically operate like I/O operations
in that they wait for all previous instructions to complete and for
all buffered writes to drain to memory (see Section 8.1.2, “Bus
Locking”).



Program synchronization can also be carried out with
serializing instructions (see Section 8.3). These instructions are
typically used at critical procedure or task boundaries to force
completion of all previous instructions before a jump to a new section
of code or a context switch occurs. Like the I/O and locking
instructions, the processor waits until all previous instructions have
been completed and all buffered writes have been drained to memory
before executing the serializing instruction.



The SFENCE, LFENCE, and
MFENCE instructions provide a performance-efficient way of ensuring
load and store memory ordering between routines that produce
weakly-ordered results and routines that consume that data
. The
functions of these instructions are as follows:



• SFENCE — Serializes
all store (write) operations that occurred prior to the SFENCE
instruction in the program instruction stream, but does not affect
load operations.



• LFENCE — Serializes all load (read) operations that
occurred prior to the LFENCE instruction in the program instruction
stream, but does not affect store operations.



• MFENCE — Serializes
all store and load operations that occurred prior to the MFENCE
instruction in the program instruction stream.



Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




Contrary to Dr. McCalpin's interpretation2, I see this section as somewhat ambiguous as to whether mfence does something extra. The three sections referring to IO, locked instructions and serializing instructions do imply that they provide a full barrier between memory operations before and after the operation. They don't make any exception for weakly ordered memory, and in the case of the IO instructions one would also assume they need to work in a consistent way with weakly ordered memory regions, since such regions are often used for IO.



Then, in the section on the FENCE instructions, weakly ordered memory is explicitly mentioned: "The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data."



Do we read between the lines and take this to mean that these are the only instructions that accomplish this and that the previously mentioned techniques (including locked instructions) don't help for weak memory regions? We can find some support for this idea by noting that fence instructions were introduced3 at the same time as weakly-ordered non-temporal store instructions, and by text like that found in 11.6.13 Cacheability Hint Instructions dealing specifically with weakly ordered instructions:




The degree to which a consumer of data knows that the data is weakly
ordered can vary for these cases. As a result, the SFENCE or MFENCE
instruction should be used to ensure ordering between routines that
produce weakly-ordered data and routines that consume the data. SFENCE
and MFENCE provide a performance-efficient way to ensure ordering by
guaranteeing that every store instruction that precedes SFENCE/MFENCE
in program order is globally visible before a store instruction that
follows the fence.




Again, here the fence instructions are specifically mentioned to be appropriate for fencing weakly ordered instructions.



We also find support for the idea that locked instructions might not provide a barrier between weakly ordered accesses in the last sentence already quoted above:




Note that the SFENCE,
LFENCE, and MFENCE instructions provide a more efficient method of
controlling memory ordering than the CPUID instruction.




This basically implies that the FENCE instructions replace, in terms of memory ordering, a functionality previously offered by the serializing cpuid instruction. However, if lock-prefixed instructions provided the same barrier capability as cpuid, they would likely have been the previously suggested way, since they are in general much faster than cpuid, which often takes 200 or more cycles. The implication is that there were scenarios (probably weakly ordered ones) that lock-prefixed instructions didn't handle, where cpuid was being used, and where mfence is now suggested as a replacement, implying stronger barrier semantics than lock-prefixed instructions.



However, we could interpret some of the above in a different way: note that in the context of the fence instructions it is often mentioned that they are a performance-efficient way to ensure ordering. So it could be that these instructions are not intended to provide additional barriers, but simply more efficient ones.



Indeed, sfence at a few cycles is much faster than serializing instructions like cpuid or lock-prefixed instructions which are generally 20 cycles or more. On the other hand mfence isn't generally faster than locked instructions4, at least on modern hardware. Still, it could have been faster when introduced, or on some future design, or perhaps it was expected to be faster but that didn't pan out.



So I can't make a certain assessment based on these sections of the manual: I think you can make a reasonable argument that it could be interpreted either way.



We can further look at documentation for various non-temporal store instructions in the Intel ISA guide. For example, in the documentation for the non-temporal store movnti you find the following quote:




Because the WC protocol uses a weakly-ordered memory consistency
model, a fencing operation implemented with the SFENCE or MFENCE
instruction should be used in conjunction with MOVNTI instructions if
multiple processors might use different memory types to read/write the
destination memory locations.




The part about "if multiple processors might use different memory types to read/write the destination memory locations" is a bit confusing to me. I would expect this rather to say something like "to enforce ordering in the globally visible write order between instructions using weakly ordered hints" or something like that. Indeed, the actual memory type (e.g., as defined by the MTRR) probably doesn't even come into play here: the ordering issues can arise solely in WB memory when using weakly ordered instructions.



Performance



The mfence instruction is reported to take 33 cycles (back-to-back latency) on modern CPUs based on Agner Fog's instruction timings, but a more complex locked instruction like lock cmpxchg is reported to take only 18 cycles.



If mfence provided barrier semantics no stronger than lock cmpxchg, the latter would be doing strictly more work and there is no apparent reason for mfence to take significantly longer. Of course, you could argue that lock cmpxchg is simply more important than mfence and hence gets more optimization effort. This argument is weakened by the fact that all of the locked instructions are considerably faster than mfence, even infrequently used ones. Also, you would imagine that if there were a single barrier implementation shared by all the locked instructions, mfence would simply use the same one, as that's the simplest and easiest to validate.



So the slower performance of mfence is, in my opinion, significant evidence that mfence is doing something extra.
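Timings in this spirit can be collected with a crude rdtsc loop. This is an illustrative sketch of mine (assuming x86-64 with GCC/Clang inline asm), not a rigorous benchmark: there is no serialization around rdtsc, frequency scaling is ignored, and the exact cycle counts vary widely by microarchitecture:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc (x86-64, GCC/Clang assumed) */

/* Approximate back-to-back cost of mfence, in reference cycles. */
static uint64_t time_mfence(int iters)
{
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++)
        __asm__ volatile("mfence" ::: "memory");
    return (__rdtsc() - t0) / iters;
}

/* Approximate back-to-back cost of a locked RMW (lock addl $0, mem). */
static uint64_t time_lock_add(int iters)
{
    static volatile int dummy;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++)
        __asm__ volatile("lock addl $0, %0" : "+m"(dummy) :: "memory");
    return (__rdtsc() - t0) / iters;
}
```

On hardware matching Agner Fog's tables, one would expect `time_mfence` to report noticeably more cycles per iteration than `time_lock_add`, but the specific numbers are machine-dependent.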




0.5 This isn't a watertight argument. Some things may appear in errata that are apparently "by design" and not a bug, such as the popcnt false dependency on its destination register - so some errata can be considered a form of documentation to update expectations rather than always implying a hardware bug.



1 Evidently, the lock-prefixed instructions also perform an atomic operation, which isn't possible to achieve solely with mfence, so the lock-prefixed instructions definitely have additional functionality. Therefore, for mfence to be useful, we would expect it either to have additional barrier semantics in some scenarios, or to perform better.



2 It is also entirely possible that he was reading a different version of the manual where the prose was different.



3 SFENCE in SSE; lfence and mfence in SSE2.



4 And often it's slower: Agner has it listed at 33 cycles latency on recent hardware, while locked instructions are usually about 20 cycles.







edited Mar 9 at 2:00 by Hadi Brais

answered May 10 '18 at 18:58 by BeeOnRope
  • 1





    On Skylake, xchg [shared], eax is a barrier for NT stores. Tested with this code that fills a buffer and stores current output position every cache line to a shared variable with (mfence+)mov or xchg: godbolt.org/g/7Q9xgz (some timing results in comments, from ocperf.py on the whole thing, so the timing includes the time for mmap(MAP_POPULATE)). With just mov but not mfence, we get reordering. But mfence+mov is ok, and so is xchg. The speed of the consumer loop is much different for the two producers, so there's some major difference.

    – Peter Cordes
    May 11 '18 at 0:27







  • 1





    That doesn't rule out locked instructions not fencing movntdqa loads from WC memory; I think I've seen a claim that mfence (not just lfence) is necessary there. The difference when interacting with a consumer thread that spins on reading is interesting and bears further investigation (perhaps with something that profiles producer and consumer separately, and doesn't count the time to mmap(MAP_POPULATE) ~4GiB of RAM. Also, testing on AMD CPUs would be interesting; the x86 docs on paper seem ambiguous, so the fact that xchg is a barrier on Intel doesn't tell us what they mean.

    – Peter Cordes
    May 11 '18 at 0:36






  • 1





    BTW, I compiled with t=nt-produce+consume.xchg; g++ -Wall -std=gnu++17 -march=native -pthread -O2 nt-fence-lock-buffer.cpp -o $t && taskset -c 3,4 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r3 ./"$t" (using gcc7.3.0 on Arch Linux on i7-6700k with DDR4-2666, with the CPU governor running it at ~3.8GHz for most of the test).

    – Peter Cordes
    May 11 '18 at 0:37






  • 1





    Thanks @PeterCordes, I had it on my to-do list for a while to run your tests, but now that this errata information has come to light, I think we can say that is highly likely that locked instructions are intended to, and do actually fence NT stores in the usual way, since we have the NT load errata and NT stores to WB-memory are an order of magnitude or two more common and spread across all kinds of code, so a divergence there would likely have been noted (and the fact that the load behavior deserved an errata means we can understand that Intel likely intended lock to fence).

    – BeeOnRope
    Jul 12 '18 at 21:23





























