Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

matrix4x4createfromaxisangletest failing on NativeAOT #72149

Closed
runfoapp bot opened this issue Jul 14, 2022 · 38 comments · Fixed by #73870 or #74932
Closed

matrix4x4createfromaxisangletest failing on NativeAOT #72149

runfoapp bot opened this issue Jul 14, 2022 · 38 comments · Fixed by #73870 or #74932
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@runfoapp
Copy link

runfoapp bot commented Jul 14, 2022

Frequency - last 30 days as of 8/27:

Note: main branch failures in last 30 days

Runfo report

Build Kind Start Time
1871493 PR 71819 2022-11-07
1871687 Rolling 2022-11-07
1871847 PR 71819 2022-11-07
1872236 PR 71948 2022-11-07
1872339 PR 71943 2022-11-07
1873007 Rolling 2022-11-07
1873836 PR 71988 2022-12-07
1873835 PR 71175 2022-12-07
1874128 PR 71385 2022-12-07
1874181 PR 71187 2022-12-07
1874234 Rolling 2022-12-07
1874881 PR 71971 2022-12-07
1875008 PR 71385 2022-12-07
1877271 PR 72078 2022-13-07
1877160 Rolling 2022-13-07
1877428 Rolling 2022-13-07
1877456 PR 71948 2022-13-07
1878318 PR 72108 2022-13-07
1878499 Rolling 2022-13-07
1878950 PR 71385 2022-13-07
1878990 Rolling 2022-13-07
1880070 Rolling 2022-14-07
1880254 Rolling 2022-14-07
1880316 PR 72167 2022-14-07
1880554 PR 72091 2022-14-07
1880908 PR 72184 2022-14-07
1881247 Rolling 2022-14-07
1881678 Rolling 2022-14-07
1882597 Rolling 2022-15-07
1882765 PR 72145 2022-15-07
1882819 Rolling 2022-15-07
1882810 PR 72252 2022-15-07
1882817 PR 72236 2022-15-07
1884136 Rolling 2022-15-07
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 14, 2022
@agocke agocke changed the title system.numerics.tests.matrix4x4tests.matrix4x4createfromaxisangletest matrix4x4createfromaxisangletest failing on NativeAOT Jul 14, 2022
@agocke agocke added the blocking-clean-ci-optional Blocking optional rolling runs label Jul 14, 2022
@jkotas
Copy link
Member

jkotas commented Jul 14, 2022

cc @LakshanF

@runfoapp runfoapp bot closed this as completed Jul 14, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 14, 2022
@jtschuster
Copy link
Member

Deleted duplicates in runfo, didn't mean to close this

@jtschuster jtschuster reopened this Jul 14, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 14, 2022
@LakshanF
Copy link
Member

#72108 has the RR passing (including the Linux-arm64 platform). Waiting for some checks to complete in the runtime pipeline that seems to be due to service outage

@LakshanF LakshanF self-assigned this Jul 14, 2022
@LakshanF
Copy link
Member

The NativeAOT rolling runs are passing with #72108

@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 15, 2022
@vitek-karas
Copy link
Member

Failed again in #71485

@vitek-karas vitek-karas reopened this Jul 19, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 19, 2022
@MichalStrehovsky
Copy link
Member

Blocking the test in #72437.

The failure looks non-determinstic. We've seen it fail on Windows and Linux x64 (I didn't go over all the above failures, but it might not be a problem on ARM64).

The test is using Vectors. I wonder if there's a problem with preservation of XMM registers or something like that.

Cc @VSadov as this might be in your area of interest. (This has been happening for a while so it's not from the Linux suspension PR that just merged.)

@MichalStrehovsky MichalStrehovsky added this to the 7.0.0 milestone Jul 19, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 19, 2022
@VSadov
Copy link
Member

VSadov commented Jul 19, 2022

Have we seen this failure on win-x64?

I would suspect preserving of the xmm result in 2-register return case. Win-x64 would not be affected by that.
Otherwise it is something else, but likely still related to preserving xmm results.

@LakshanF LakshanF removed their assignment Jul 19, 2022
@MichalStrehovsky MichalStrehovsky removed the blocking-clean-ci-optional Blocking optional rolling runs label Jul 19, 2022
@VSadov
Copy link
Member

VSadov commented Jul 20, 2022

I've looked through a few failures and it looks like the failures are x64-specific. They were seen on Linux and Windows, but I have not seen a failure on ARM64

This is interesting because Windows and Linux have different specifics in terms of preserving xmm registers (i.e. in Linux all xmm are volatile, but not on Windows, on the other hand Linux can return two xmm registers from a call)

It is hard to see commonality here, that would also be specific to NativeAOT.
The only thing that I can think of is that, I believe, NativeAOT disables use of AVX by default, so vector math may be emitted differently.

@VSadov
Copy link
Member

VSadov commented Aug 28, 2022

now I can build the same commit locally and see what the diffs are in the bits or when executing/debugging

@VSadov
Copy link
Member

VSadov commented Aug 28, 2022

The difference start when the following line is executed:

            expected = Matrix4x4.CreateFromQuaternion(Quaternion.CreateFromAxisAngle(Vector3.Normalize(Vector3.One), radians));

LOADS DIFFERENT DATA ==> 00007FF763CB7E5E  movups      xmm0,xmmword ptr [__readonlydata_System_Numerics_Vectors_Tests_System_Numerics_Tests_Matrix4x4Tests__Matrix4x4CreateFromAxisAngleTest+40h (07FF7647DF070h)]  
00007FF763CB7E65  movups      xmm1,xmmword ptr [__readonlydata_System_Numerics_Vectors_Tests_System_Numerics_Tests_Matrix4x4Tests__Matrix4x4CreateFromAxisAngleTest+50h (07FF7647DF080h)]  
00007FF763CB7E6C  andps       xmm1,xmm0  
00007FF763CB7E6F  mulps       xmm0,xmm1  
00007FF763CB7E72  movaps      xmm1,xmm0  
00007FF763CB7E75  shufps      xmm1,xmm0,0B1h  

The first movups loads some static value (I suspect it is Vector.One ?)

My locally built exe, that passes the test, loads

XMM0 = DDDDDDDD3F800000-3F8000003F800000 

The failing exe from the lab build loads:

XMM0 = FFFFFFFF3F800000-3F8000003F800000

3f800000 is float32 1.0 , so we have three 1.0 floats. Not sure what the rest is. Garbage? it seems it makes a difference.

@MichalStrehovsky
I think someone else should take a look. I have only very vague idea where these __readonlydata_ things come from.

@jkotas
Copy link
Member

jkotas commented Aug 28, 2022

Vector3 is 12 bytes, but the readonly data that Vector3.One is loaded from is 16 bytes. The JIT leaves the top 4 bytes of the readonly data uninitialized that leads to failures depending on what the uninitialized data happened to be. This looks like a JIT bug. @dotnet/jit-contrib Could you please take a look?

The questions to ask:

  • Are the top 4 bytes of xmm register that holds Vector3 data required to be initialized to 0?
  • Is the JIT expected to initialize the top 4 bytes of the Vector3 readonly data to 0?

@jkotas jkotas modified the milestones: 8.0.0, 7.0.0 Aug 28, 2022
@jkotas
Copy link
Member

jkotas commented Aug 28, 2022

This is silent bad codegen. We should fix it for .NET 7.

@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-NativeAOT-coreclr labels Aug 28, 2022
@ghost
Copy link

ghost commented Aug 28, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Frequency - last 30 days as of 8/27:

Note: main branch failures in last 30 days

Runfo report

Build Kind Start Time
1871493 PR 71819 2022-11-07
1871687 Rolling 2022-11-07
1871847 PR 71819 2022-11-07
1872236 PR 71948 2022-11-07
1872339 PR 71943 2022-11-07
1873007 Rolling 2022-11-07
1873836 PR 71988 2022-12-07
1873835 PR 71175 2022-12-07
1874128 PR 71385 2022-12-07
1874181 PR 71187 2022-12-07
1874234 Rolling 2022-12-07
1874881 PR 71971 2022-12-07
1875008 PR 71385 2022-12-07
1877271 PR 72078 2022-13-07
1877160 Rolling 2022-13-07
1877428 Rolling 2022-13-07
1877456 PR 71948 2022-13-07
1878318 PR 72108 2022-13-07
1878499 Rolling 2022-13-07
1878950 PR 71385 2022-13-07
1878990 Rolling 2022-13-07
1880070 Rolling 2022-14-07
1880254 Rolling 2022-14-07
1880316 PR 72167 2022-14-07
1880554 PR 72091 2022-14-07
1880908 PR 72184 2022-14-07
1881247 Rolling 2022-14-07
1881678 Rolling 2022-14-07
1882597 Rolling 2022-15-07
1882765 PR 72145 2022-15-07
1882819 Rolling 2022-15-07
1882810 PR 72252 2022-15-07
1882817 PR 72236 2022-15-07
1884136 Rolling 2022-15-07
Author: runfoapp[bot]
Assignees: VSadov
Labels:

area-CodeGen-coreclr, area-NativeAOT-coreclr

Milestone: 7.0.0

@BruceForstall
Copy link
Member

cc @tannergooding

@EgorBo
Copy link
Member

EgorBo commented Aug 28, 2022

We try to promote 12b simd to 16b everywhere where possible so I guess the data section's value has to be 16 bytes with zeroed last 4 bytes indeed

@tannergooding
Copy link
Member

Are the top 4 bytes of xmm register that holds Vector3 data required to be initialized to 0?
Is the JIT expected to initialize the top 4 bytes of the Vector3 readonly data to 0?

We don't have any strict requirement here. For most operations (add, and, subtract, etc) the actual functionality is "element-wise" and so the value of the top 4-bytes doesn't matter.

The few cases where it crosses lanes (such as DotProduct or Sum) we're supposed to ensure the upper bits are zeroed or ignored. In the case of DotProduct we emit the encoding that says to only use the lowest 3 elements.

@tannergooding
Copy link
Member

Given this is NAOT only and NAOT must target a lower baseline, its possible there is a bug in the dpps emulation emitted for SSE2 only hardware that isn't correctly accounting for this.

I'll take a look tomorrow.

@MichalStrehovsky
Copy link
Member

Given this is NAOT only and NAOT must target a lower baseline

If we don't have any other test coverage for the lower baselines, it might be a good quality week topic. With NAOT one at least has a deterministic repro once a bad binary has been produced. Running into this in a JIT configuration would be a lot more difficult to root cause.

This is not the first issue caused by this test hole - there were #72081 and #72158 just in the past month. NAOT covers the lowest denominator but there's a whole spectrum in between the lowest denominator and whatever machines we run the tests on that can happen in the real world and it looks like we don't test.

@tannergooding
Copy link
Member

We do test, but only in the outer loop and only for the runtime tests. There has been a longstanding issue to also run the library tests under the various JIT Stress options for ISA enablement

@JulieLeeMSFT
Copy link
Member

Given this is NAOT only and NAOT must target a lower baseline, its possible there is a bug in the dpps emulation emitted for SSE2 only hardware that isn't correctly accounting for this.

I'll take a look tomorrow.

Assigning to Tanner.

@JulieLeeMSFT JulieLeeMSFT assigned tannergooding and unassigned VSadov Aug 29, 2022
@jkotas
Copy link
Member

jkotas commented Aug 30, 2022

@tannergooding Did you have a chance to take a look?

@tannergooding
Copy link
Member

Yes. but it doesn't look to be in the Vector3.Dot emulation.

The relevant logic is here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L3350

We pick Sse.Multiply, Sse3.HorizontalAdd -or- Sse.Shuffle, and Sse.Add and then the very first thing we do is mask things so that the unused component is zero: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L3418

  • Notably we could do this "more efficiently" now that we have GT_CNS_VEC, but that's an unrelated JIT throughput optimization

@tannergooding
Copy link
Member

The called Matrix and Quaternion methods aren't accelerated, just the Vector3.Normalize call, the various field accesses, the UnitX/Y/Z property accesses, and the Vector3 constructor call.

The test doesn't really use any special tricks or code paths either:

public void Matrix4x4CreateFromAxisAngleTest()

@jkotas
Copy link
Member

jkotas commented Aug 31, 2022

Here is codegen for Vector3.Length that I believe is the root cause of the problem:

[MethodImpl(MethodImplOptions.NoInlining)]
static float f(Vector3 v)
  => v.Length();
movss   xmm0,dword ptr [rcx+8] ds:00000052`9e6ff828=3f800000
movsd   xmm1,mmword ptr [rcx]
shufps  xmm1,xmm0,44h
                 
// xmm1 has Vector3. The top 4 bytes of xmm1 have undefined value. They happen to be well defined in this specific case, but the JIT assumes the general case where they are undefined.

movups  xmm0,xmmword ptr [xxx!_readonlydata_xxx_Program____Main___g__f_0_0 (00007ff6`59060ee0)] // 0x0000 FFFF FFFF FFFF - mask for the top 4 bytes
andps   xmm0,xmm1 // xmm0 has properly masked Vector3
mulps   xmm0,xmm1 // We are multiplying properly masked Vector3 with unmasked Vector3!

movaps  xmm1,xmm0
shufps  xmm1,xmm0,0B1h
addps   xmm1,xmm0
movaps  xmm0,xmm1
shufps  xmm0,xmm1,4Eh
addps   xmm0,xmm1
sqrtss  xmm0,xmm0
ret

I think mulps xmm0,xmm1 is wrong. It should be mulps xmm0,xmm0 so that we multiply the masked values. The bad code works most of the time (0 multiplied by undefined value is zero) except when the undefined value happens to be NaN that will turn the whole thing into NaN.

Does this analysis make sense? Can you take it from here?

@tannergooding
Copy link
Member

Yes, that makes sense. It should definitely be multiplying tmp * tmp, rather than tmp * original.

I'll see about getting a fix up first thing tomorrow.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 1, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 2, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project