ISAAC plugin exits with segmentation fault #131

Open
benjha opened this issue Feb 1, 2021 · 12 comments

@benjha

benjha commented Feb 1, 2021

Hi @PrometheusPi @psychocoderHPC,

After several unsuccessful attempts to get traces out with TAU, I ran PIConGPU & ISAAC in a default configuration (profiling off, dumping viz. frames to Alpine, 1000 steps with checkpoint.restart.loop=3, using the /etc/picongpu/8_isaac.cfg file) and noticed that the simulation crashes with the following errors at the end of its execution, which is why TAU can't generate the traces:

[h09n09:151879] *** Process received signal ***
[h09n09:151879] Signal: Segmentation fault (11)
[h09n09:151879] Signal code: Address not mapped (1)
[h09n09:151879] Failing at address: 0x3be700000008
[h09n09:151879] [ 0] [d22n15:170622] *** Process received signal ***
[d22n15:170622] Signal: Segmentation fault (11)
[d22n15:170622] Signal code: Address not mapped (1)
[d22n15:170622] Failing at address: 0x19f800000008
[h09n09:151880] *** Process received signal ***
[h09n09:151880] Signal: Segmentation fault (11)
[h09n09:151880] Signal code: Address not mapped (1)
[h09n09:151880] Failing at address: 0x19fe00000008
[h09n09:151880] [ 0] [0x2000000504d8]
[h09n09:151880] [ 1] [0x2000000504d8]
[h09n09:151879] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[h09n09:151879] [ 2] [d22n15:170623] *** Process received signal ***
[d22n15:170623] Signal: Segmentation fault (11)
[d22n15:170623] Signal code: Address not mapped (1)
[d22n15:170623] Failing at address: 0x3be800000008
/gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[h09n09:151880] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[h09n09:151880] [ 3] [d22n15:170622] [ 0] [0x2000000504d8]
[d22n15:170622] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[h09n09:151879] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[h09n09:151879] [ 4] [d22n15:170623] [ 0] [0x2000000504d8]
[d22n15:170623] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[d22n15:170623] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[d22n15:170622] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[d22n15:170622] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[d22n15:170622] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[h09n09:151880] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[h09n09:151880] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[h09n09:151880] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[d22n15:170623] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[d22n15:170623] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[d22n15:170623] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[d22n15:170623] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[h09n09:151879] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[h09n09:151879] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[h09n09:151879] *** End of error message ***
/gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[d22n15:170622] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[d22n15:170622] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[d22n15:170622] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[h09n09:151880] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[d22n15:170623] *** End of error message ***
ERROR:  One or more process (first noticed rank 6) terminated with signal 11 (core dumped)

It looks like the issue is in the pluginUnload() method in IsaacPlugin.hpp, which in turn calls the IsaacVisualization destructor.

Can you reproduce this error?
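
The mangled frame names in the trace can be decoded to check which function is involved. The following is a minimal sketch, assuming a GCC or Clang toolchain: `abi::__cxa_demangle` comes from the Itanium C++ ABI runtime and is not standard C++. The helper name `demangle` is made up for illustration and is not part of PIConGPU or ISAAC.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang runtime, not standard C++)
#include <cstdlib>   // std::free
#include <memory>
#include <string>

// Hypothetical helper: demangle an Itanium-ABI symbol name.
// Returns the input unchanged if the runtime cannot demangle it.
std::string demangle(const std::string& mangled)
{
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer that the caller must free.
    std::unique_ptr<char, decltype(&std::free)> text(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        &std::free);
    return (status == 0 && text) ? std::string(text.get()) : mangled;
}
```

Calling `demangle("_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv")` yields `picongpu::isaacP::IsaacPlugin::pluginUnload()`, which matches the frame reported just above the crash.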

@PrometheusPi
Member

PrometheusPi commented Feb 2, 2021

@benjha Thanks for reporting the error.
Since we are currently pushing out new versions of our software, could you please specify which version you are using that creates the error:

  • PIConGPU
  • ISAAC
  • alpaka

Then we can quickly check whether we are able to reproduce the error on hemera as well.

@benjha
Author

benjha commented Feb 2, 2021

PIConGPU comes from the dev branch as of Nov. 2020, with its own bundled Alpaka:

commit 84e03980f2a56c7aea24d88bc3be9eb43f1a3197
Merge: aa86f2d c5208f4
Author: Sergei Bastrakov <sergey.bastrakov@gmail.com>
Date:   Wed Nov 25 10:50:46 2020 +0100

ISAAC:

commit 47c475ddd3fcd732964f5ce22edfe2fbcfae2b14
Merge: 3186666 74ab372
Author: René Widera <r.widera@hzdr.de>
Date:   Fri Nov 6 13:30:40 2020 +0100

    Merge pull request #118 from ComputationalRadiationPhysics/dev
    
    Merge json-rodarae file to latetest release cadidate

@PrometheusPi
Member

@benjha Thanks for providing the details. I will see whether I can reproduce this bug.

@benjha
Author

benjha commented Feb 8, 2021

Hi @PrometheusPi

I am installing the current PIConGPU dev branch with ISAAC 1.5.2 to verify whether they work properly for this case.

I am getting a list of errors like these:

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #135: namespace "alpaka" has no member "Dev"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #65: expected a ";"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #135: namespace "alpaka" has no member "DimInt"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #65: expected a ";"

which is likely an Alpaka version mismatch between the version PIConGPU dev uses and the one ISAAC uses.

Were there any changes in the way compilation works?

@psychocoderHPC
Member

> Hi @PrometheusPi
>
> I am installing current PIConGPU dev branch with ISAAC 1.5.2 to verify if they work properly from this case.
>
> I am having a list of these errors:
>
> /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #135: namespace "alpaka" has no member "Dev"
>
> /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #65: expected a ";"
>
> /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #135: namespace "alpaka" has no member "DimInt"
>
> /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #65: expected a ";"
>
> which likely is an Alpaka version mismatch between the one PIConGPU dev uses and ISAAC uses.
>
> Were there any changes on the way compilation works?

Are you sure you used release 1.5.2 and not the current dev branch? The dev branch of ISAAC is currently incompatible with the PIConGPU dev branch. There is a PR, ComputationalRadiationPhysics/picongpu#3498, in PIConGPU to fix it, but we first need to switch our PIConGPU CI to the ISAAC dev branch.

The release 1.5.2 is currently checked together with PIConGPU dev.

@psychocoderHPC
Member

@FelixTUD Could you please test the current dev of PIConGPU together with the release 1.5.2?

@benjha
Author

benjha commented Feb 9, 2021

I've rechecked dependencies and fixed the Alpaka mismatch issue.

With the current PIConGPU dev branch and ISAAC 1.5.2, using the following configuration:

#################################
## Section: Required Variables ##
#################################

TBG_wallTime="0:30:00"

TBG_devices_x=2
TBG_devices_y=2
TBG_devices_z=2

TBG_gridSize="192 2048 160"
TBG_steps="4000"

TBG_restartLoop="--checkpoint.restart.loop 1"


#################################
## Section: Optional Variables ##
#################################

TBG_isaac="--isaac.width 1280 --isaac.height 720 --isaac.period 1  --isaac.name !TBG_jobName  --isaac.url apps.marble.ccs.ornl.gov  --isaac.port 30167"


TBG_plugins="!TBG_isaac"

#################################
## Section: Program Parameters ##
#################################

TBG_deviceDist="!TBG_devices_x !TBG_devices_y !TBG_devices_z"

TBG_programParams="-d !TBG_deviceDist \
                   -g !TBG_gridSize   \
                   -s !TBG_steps      \
                   !TBG_restartLoop  \
                   !TBG_plugins      \
                   --versionOnce"

# TOTAL number of devices
TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))"

"$TBG_cfgPath"/submitAction.sh

PIConGPU throws the following errors:

$ cat stderr.725795
[a02n05:79941] *** Process received signal ***
[a02n05:79941] Signal: Segmentation fault (11)
[a02n05:79941] Signal code: Address not mapped (1)
[a02n05:79941] Failing at address: 0x12a000000008
[a18n18:153800] *** Process received signal ***
[a18n18:153800] Signal: Segmentation fault (11)
[a18n18:153800] Signal code: Address not mapped (1)
[a18n18:153800] Failing at address: 0x25a900000008
[a18n18:153800] [ 0] [0x2000000504d8]
[a18n18:153800] [ 1] [a02n05:79944] *** Process received signal ***
[a02n05:79944] Signal: Segmentation fault (11)
[a02n05:79944] Signal code: Address not mapped (1)
[a02n05:79944] Failing at address: 0x4bea00000008
/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a18n18:153800] [ 2] [a02n05:79943] *** Process received signal ***
[a02n05:79943] Signal: Segmentation fault (11)
[a02n05:79943] Signal code: Address not mapped (1)
[a02n05:79943] Failing at address: 0x38e400000008
[a02n05:79943] [ 0] [0x2000000504d8]
[a02n05:79943] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79943] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a18n18:153800] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a18n18:153800] [ 4] [a02n05:79941] [ 0] [0x2000000504d8]
[a02n05:79941] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79941] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79941] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a18n18:153800] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a18n18:153800] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a18n18:153800] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a18n18:153800] *** End of error message ***
[a02n05:79944] [ 0] [0x2000000504d8]
[a02n05:79944] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79944] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79944] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79941] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79941] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79941] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79944] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79944] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79944] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79944] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79944] *** End of error message ***
/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79943] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79943] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79943] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79943] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79943] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79943] *** End of error message ***
/lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79941] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79941] *** End of error message ***
ERROR:  One or more process (first noticed rank 7) terminated with signal 11 (core dumped)

This is the output from the ISAAC server:

$  isaac --dump /gpfs/alpine/proj-shared/csc434/PIConGPU_ISAAC_SLATE_output &
[1] 15
sh-4.2$ Using web_port=2459, tcp_port=2458 and sim_port=2460

Running ISAAC Master
Starting insitu plugin listener
Launching WebSocketDataConnector
Launching TCPDataConnector
Launching SaveFileImageConnector
Launching JPEG_URI_Stream
New connection, giving id 0 (control)
Group complete, sending to connected interfaces
sh-4.2$ Connection 0 closed.
Removed group 0

For now, I will be dumping the ISAAC timers into files, but it would be great to get more insight by using a profiler.

@FelixTUD
Contributor

FelixTUD commented Feb 9, 2021

@psychocoderHPC I'm looking into it. An LWFA setup compiles without a problem on hemera with PIConGPU dev and ISAAC 1.5.2.

@FelixTUD
Contributor

FelixTUD commented Feb 9, 2021

I can reproduce an identical error with an MPI execution of the example; this should help me track down the problem.

@FelixTUD
Contributor

FelixTUD commented Feb 9, 2021

@benjha I might have found the error. As a hotfix, you can try removing the line

json_decref( json_init_root );

I need to take a more detailed look at it later, but it seems that json_init_root is only initialized on the master node, which is why destruction throws a segmentation fault on all other nodes. Let me know if this fixes it for now.
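
The failure mode described here can be illustrated in isolation. The following is a minimal sketch under stated assumptions, not ISAAC's code and not necessarily the change made in the later fix: `Node`, `node_new`, `node_decref`, `Visualization`, and `released_nodes` are hypothetical stand-ins for a reference-counted JSON root that only rank 0 creates. Giving the pointer a `nullptr` default and having the release function ignore null makes the destructor safe on ranks that never created the root.

```cpp
#include <cstddef>

// Hypothetical stand-in for a reference-counted JSON node.
struct Node { std::size_t refcount; };

// Counts released nodes; for demonstration only.
static int released_nodes = 0;

Node* node_new() { return new Node{1}; }

// Drop one reference; a null pointer is ignored.
void node_decref(Node* n)
{
    if (n != nullptr && --n->refcount == 0) {
        delete n;
        ++released_nodes;
    }
}

// Hypothetical visualization object: only rank 0 (the master) creates the root.
class Visualization
{
public:
    explicit Visualization(int rank)
    {
        if (rank == 0) root_ = node_new();
    }
    // Safe on every rank because root_ is either valid or nullptr.
    ~Visualization() { node_decref(root_); }

    Visualization(const Visualization&) = delete;
    Visualization& operator=(const Visualization&) = delete;

    bool has_root() const { return root_ != nullptr; }

private:
    // Without this default, non-master ranks would hand an
    // indeterminate pointer to node_decref in the destructor.
    Node* root_ = nullptr;
};
```

Constructing `Visualization master(0)` and `Visualization worker(3)` in the same scope and letting both go out of scope releases exactly one node; the worker's destructor does nothing.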

@benjha
Author

benjha commented Feb 10, 2021

Thanks @FelixTUD, it worked.

I am testing further...

@FelixTUD
Contributor

FelixTUD commented Mar 9, 2021

This should be fixed with #132
