The following diagram shows the how checkpoiting works and different variables which can be user defined
A program can be run in functional simulation mode upto some point and then GPU states are stored in files so that program can be resumed from same point in performance simulation mode
Following details are stored in "checkpoint_files" folder in you run area
-
Global memory per kernel
-
Local memory per thread
-
Shared memory per CTA
-
Register file per thread
-
SIMT stack per warp
The varibales shown in the diagram can be set in gpgpusim.config file.
Whether checkpoint should be executed or not
-checkpoint_option 0
At which kernel checkpoint should be executed . (x from the figure)
-checkpoint_kernel 1
How many CTA are executed 100% before checkpoint. (M from the figure)
-checkpoint_CTA 50
Whether resume should be executed or not
-resume_option 0
From which CTA to resume (M from the figure)
-resume_CTA 50
From which kernel to resume (x from the figure)
-resume_kernel 1
From M to t, CTA are executed partially (t from the figure)
-checkpoint_CTA_t 100
How many instruction are executed before checkpoint in partial CTA
-checkpoint_insn_Y 104
Suppose, each thread is executing 26 instructions and there are 256 threads in a block. You want to checkpoint after 13 instructions in each thread, then Y should be set to = 13*256/warp_size = 104, if 32 is the warp size
For an example, in check samples/0_Simple/vectorAdd folder
vectorAdd.cu launches 2 kernels with 256 CTA each and 256 threads per CTA and 26 instructions per thread.
Checkpoint in 1st kernel after 50 full CTA and and 50 partial CTA (13*256 instructions per CTA)
Then following options should be added to gpgpusim.config
-gpgpu_ptx_sim_mode 1
-checkpoint_option 1
-checkpoint_kernel 1
-checkpoint_CTA 50
-resume_option 0
-checkpoint_CTA_t 100
-checkpoint_insn_Y 104
This will simulate 4,99,200 (50*256*26 + 50*256*13) instruction and only block 0 to 49 will pass in kernel 1
And, after that for resuming from samw point in performance simulation
-gpgpu_ptx_sim_mode 0
-checkpoint_option 0
-resume_option 1
-resume_CTA 50
-resume_kernel 1
-checkpoint_CTA_t 100
This will simulate 12,04,736 instructions in kernel 1 (50*256*0 + 50*256*13 + 156*256*26 ) and 17,03,936 (256*256*26) instructions in kernel 2 and block 0 to 255 will pass in both the kernels