Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for extended data types in DART #462

Merged
merged 34 commits into from
Nov 23, 2017
Merged

Conversation

devreal
Copy link
Member

@devreal devreal commented Nov 20, 2017

This PR adds support for strided and indexed data types / data access patterns in DART. Extended data types are mapped to MPI types and handled by MPI.

New Functions

The DART interface is extended with the following functions:

dart_ret_t
dart_type_create_strided(
  dart_datatype_t   basetype,
  size_t            stride,
  size_t            blocklen,
  dart_datatype_t * newtype);

Create a strided datatype. To keep the type flexible, the underlying MPI type is created on-the-fly based on the number of elements the caller wants to copy. The number of elements in this type is determined by a single block length.

dart_ret_t
dart_type_create_indexed(
  dart_datatype_t   basetype,
  size_t            count,
  const size_t      blocklen[],
  const size_t      offset[],
  dart_datatype_t * newtype);

Creates an indexed data type containing count blocks defined by individual offsets and block lengths. The basetype has to be a basic type, hence no nesting of extended types is allowed atm. This type is mapped directly to MPI_Type_indexed. The number of elements in this type is determined by the sum of all block lengths.

dart_ret_t
dart_type_destroy(dart_datatype_t *dart_type);

Destroys a type. This may be done before the completion of pending operations.

Extended Function Interfaces

All variants of dart_put and dart_get have been extended by a second type parameter. The types now describe the data type from which to copy and to which to copy (similar to MPI but with slightly different naming and parameter ordering). DART does not perform conversion between base types and hence errors out if the base types do not match. In contrast to MPI, we keep a single parameter for the number of elements. It is left to the caller to ensure that no truncation happens with both source and destination type. The user may mix both strided and indexed data types as long as no truncation occurs.

Example

dart_ret_t dart_get(
  void            * dest,
  dart_gptr_t       gptr,
  size_t            nelem,
  dart_datatype_t   src_type,
  dart_datatype_t   dst_type);
  auto num_elem = 100;
  dart_gptr_t gptr;
  dart_team_memalloc_aligned(
    DART_TEAM_ALL, num_elem, DART_TYPE_INT, &gptr);

  dart_datatype_t new_type;
  dart_type_create_strided(DART_TYPE_INT, 
    2, // stride of 2
    1, // blocksize of 1 
    &new_type);

  int *buf = new int[num_elem / 2];
  dart_get_blocking(buf, gptr, num_elem / stride,
                      DART_TYPE_INT, new_type);

The following would result in an error:

  auto num_elem = 100;
  dart_gptr_t gptr;
  dart_team_memalloc_aligned(
    DART_TEAM_ALL, 3*num_elem, DART_TYPE_INT, &gptr);

  dart_datatype_t new_type;
  dart_type_create_strided(DART_TYPE_INT, 
    4, // stride of 4
    3, // blocksize of 3
    &new_type);

  int *buf = new int[num_elem];
  // ERROR: cannot copy 100 elements in blocks of 3!
  dart_get_blocking(buf, gptr, num_elem,
                      DART_TYPE_INT, new_type);

Misc

In the process, I tried to clean up the DART code surrounding the put/get functions to improve readability and reduce code duplication.
The Halo code has been adapted to the new interface. The strided and indexed get functions introduced with #452 have been removed.

Strided dash::copy has not been implemented as the copy code is under development in #410.

Closes #436.

Note: CI fails due to a bug in NastyMPI (dash-project/nasty-MPI#3)

dhinf
dhinf previously approved these changes Nov 21, 2017
Copy link
Member

@dhinf dhinf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. No performance changes for my examples. So everything fine. Nice work btw.

Copy link
Member

@dhinf dhinf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you changed the testing parameters?

@dhinf dhinf dismissed their stale review November 23, 2017 07:14

Test paramter changes

@devreal
Copy link
Member Author

devreal commented Nov 23, 2017

Because according to CI the longest running test is HaloMatrixWrapperCyclic3D (>90s IIRC, now down to <30s). Is that a problem? It looked like your tests are designed to be flexible ;)

@dhinf
Copy link
Member

dhinf commented Nov 23, 2017

it's flexible :) - With 150 elems per dimension it's possible to run tests with more ranks. But 100 is fine too.

Copy link
Member

@dhinf dhinf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test parameter changes - solved

@devreal
Copy link
Member Author

devreal commented Nov 23, 2017

it's flexible :) - With 150 elems per dimension it's possible to run tests with more ranks. But 100 is fine too.

That's actually a good point. Maybe we should set it to N elements per unit and do weak scaling? That way we can be sure that even on 128 cores the tests run (I want to do that one day ;))

@dhinf
Copy link
Member

dhinf commented Nov 23, 2017

I had this idea before, but you will get a problem with the reference matrix which is done on rank 0 sequentially. The bigger the NArray, the more memory is used and at some point its not enough memory left on one node.

@devreal
Copy link
Member Author

devreal commented Nov 23, 2017

OK I see, but to make it resilient we should add a cut-off to avoid unpleasant crashes in the tests when scaling the number of nodes (I will curse any author of tests that require debugging at scale... ;))

@dhinf
Copy link
Member

dhinf commented Nov 23, 2017

It's on my Todo list. I will make it resilient, so you don't need to start playing with your voodoo dolls :) .

@devreal devreal merged commit db6cbb6 into development Nov 23, 2017
@fmoessbauer fmoessbauer deleted the dart-feat-strides branch November 23, 2017 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants