
Type checking and coercion #662

Merged
merged 142 commits into from
Aug 4, 2023

Conversation

tclose
Contributor

@tclose commented Jun 7, 2023

Types of changes

This PR contains a number of breaking changes:

  • significantly stricter type-checking
  • lists of inputs to be split over, including lazy inputs, now need to be set in the TaskBase.split() method (i.e. instead of in Task.__init__ or by setting the inputs attribute).
  • Splits and combines need to be applied to typed tasks/workflows before their lzout interface is accessed (so the type of the LazyField can be set properly)
  • ShellCommandTask inputs that use output_file_template must also be defined in the output spec with a fileformats.generic.File (or subclass) type, in addition to an input field of str, Path, ty.Union[str, bool] or ty.Union[Path, bool] type (@ghisvail I forgot about this when you asked whether there were any implications for task interface design).
    • NB: the automatic insertion would work if it went in the opposite direction to the current one, i.e. a File-type field defined in the output spec auto-inserting a Path-type input field
  • MultiInput(Obj|File) types changed to be proper classes instead of just lists
  • MultiOutput(Obj|File) types changed to be a Union between a single T, list[T], and a dummy MultiOutputType type used to signify that a converter needs to be inserted into the field
  • set_input_validators() has been removed
  • File and Directory in pydra.engine.specs now refer to the classes in fileformats.generic. As such, existence (and format) checking of non-lazy inputs is performed in FileSet.__init__, not at runtime as was the case previously
  • Default values assigned to fields in input/output specs are now set to attrs.NOTHING not None (not sure if this actually changes the overall behaviour though as I believe they are always overwritten)
  • Associated files (e.g. JSON side-cars) won't be copied unless an appropriate file-format class is specified (i.e. File is not sufficient; you would need to use fileformats.medimage.NiftiX, for example). However, on the plus side:
    • Non-neuroimaging associated files will be copied if an appropriate format is used (currently only selected neuroimaging formats with associated files are supported, which are hard-coded in helpers_file.py)
    • Associated files will be included in hashes
  • No longer compatible with some older micro versions of Python (e.g. 3.8.0) as they don't support typing in quite the same way

Summary

Type checking

Adds detailed type checking at both run and construction time (i.e. lazy fields carry the type of the field they point to). This is done by pydra.utils.TypeParser, which works as an attrs converter and is set as the converter for every field in the generated input/output specs by default.

The type parser unwraps arbitrary nested generic types (e.g. List, Tuple and Dict in typing package) and tests that they either match or are a subclass of the declared types.
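As a rough illustration of what such unwrapping involves, a recursive check against nested generics might look like the sketch below. This is hypothetical code, not pydra's actual TypeParser (which also handles unions, coercion and lazy fields); the function name `matches` is made up for this example.

```python
from typing import get_args, get_origin


def matches(value, declared) -> bool:
    """Recursively check `value` against a possibly nested generic type.

    Sketch only: variadic tuples (tuple[X, ...]), unions and typing.Any
    are omitted for brevity.
    """
    origin = get_origin(declared)
    if origin is None:  # a plain class, e.g. int or File
        return isinstance(value, declared)
    if not isinstance(value, origin):
        return False
    args = get_args(declared)
    if origin is dict:
        key_type, val_type = args
        return all(
            matches(k, key_type) and matches(v, val_type)
            for k, v in value.items()
        )
    if origin is tuple:  # fixed-length tuple: one type argument per position
        return len(value) == len(args) and all(
            matches(v, t) for v, t in zip(value, args)
        )
    (item_type,) = args  # list/set/frozenset: one argument for all items
    return all(matches(v, item_type) for v in value)
```

Subclass acceptance falls out of the `isinstance` calls: a value of a subclass of the declared type passes at every nesting level.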

Type coercion

In addition to the type-checking, selected types will be automatically coerced to their declared type:

  • typing.Sequence -> typing.Sequence (but not str -> typing.Sequence or typing.Sequence -> str)
  • typing.Mapping -> typing.Mapping
  • Path -> os.PathLike
  • str -> os.PathLike
  • os.PathLike -> Path
  • os.PathLike -> str
  • typing.Any -> MultiInputObj
  • typing.Union[os.PathLike, str] -> MultiInputFile
  • typing.Sequence -> MultiOutputObj
  • int -> float
  • numpy.array -> typing.Sequence
  • numpy. ->

Like the type-checking, the coercion is performed between types within arbitrarily nested containers, so it can handle coercions such as

dict[str, list[str]] -> dict[str, tuple[File]]
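A container-aware coercion pass could be sketched as follows. This is hypothetical code, not the actual TypeParser, which additionally consults an allow-list of source/target pairs (e.g. permitting int -> float but never str -> Sequence):

```python
from typing import get_args, get_origin


def coerce(value, declared):
    """Coerce `value` into a possibly nested declared type.

    Sketch only: no allow-list, no unions, no lazy fields.
    """
    origin = get_origin(declared)
    if origin is None:  # plain class: coerce by calling the constructor
        return value if isinstance(value, declared) else declared(value)
    args = get_args(declared)
    if origin is dict:
        key_type, val_type = args
        return {
            coerce(k, key_type): coerce(v, val_type) for k, v in value.items()
        }
    if origin is tuple:
        if len(args) == 2 and args[1] is Ellipsis:  # variadic tuple[X, ...]
            return tuple(coerce(v, args[0]) for v in value)
        return tuple(coerce(v, t) for v, t in zip(value, args))
    (item_type,) = args  # list/set/frozenset: one argument for all items
    return origin(coerce(v, item_type) for v in value)
```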

Hashing

Hashing is implemented (by @effigies) in a dispatch method, which handles hashes for non-ordered objects such as sets in a process-independent way.

Hashing of files is now handled outside of Pydra within the fileformats package. This enables the hashes to be aware of changes in associated files.
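The dispatch approach can be sketched roughly like this (illustrative only: the function name `hash_obj` is made up, and the real implementation serialises objects far more carefully). The key point is that set members are hashed individually and the digests sorted, so the result is independent of iteration order, which can vary between processes:

```python
import hashlib
from functools import singledispatch


@singledispatch
def hash_obj(obj) -> bytes:
    # fallback: hash the repr (sketch only)
    return hashlib.blake2b(repr(obj).encode()).digest()


@hash_obj.register
def _(obj: set) -> bytes:
    # order-independent: hash each member, then sort the digests so the
    # result does not depend on set iteration order
    digests = sorted(hash_obj(item) for item in obj)
    return hashlib.blake2b(b"".join(digests)).digest()


@hash_obj.register
def _(obj: list) -> bytes:
    # order-dependent: digests concatenated in sequence order
    return hashlib.blake2b(b"".join(hash_obj(item) for item in obj)).digest()
```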

Copying files

Copying files (e.g. into working directories) is also handled in fileformats, and includes graceful fallback between leaving the files in-place, symlinking, hardlinking and copying files (in order of preference), taking into account whether the files need to be in the same folder/have the same file-stem.

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

@satra
Contributor

satra commented Jun 7, 2023

@tclose - thanks for all the effort in bringing these different PRs and ideas together. the one thing that struck me was the introduction of gathered. this particular thing seems to be a requirement of type checking rather than readability of code. python isn't really designed to be that kind of language and it seems we are forcing some elements to users here to satisfy a developer constraint. i'm not for or against this at this stage. just wanted to bring it up so that the user perspective is considered.

a key goal of pydra was to keep it simple for users to not have to do things they don't fully understand. one element of this was the idea that pydra can be used in a jupyter notebook or a script without the need to construct a workflow or define an interface. simply used to parallelize elements of a script.

@tclose
Contributor Author

tclose commented Jun 8, 2023

@tclose - thanks for all the effort in bringing these different PRs and ideas together. the one thing that struck me was the introduction of gathered. this particular thing seems to be a requirement of type checking rather than readability of code.

Yeah, I tried to avoid the need to add the "gathered" syntax as it seems a bit superfluous, but having to check whether a value is actually a list of that value, or a list of list of that value, etc... just makes the type-checking impractical (unless anyone has any good ideas). I personally think it makes the code a bit easier to understand if values that are to be split are explicitly marked as such, but definitely something it would be good to discuss.

Note that the "type-checking" process is actually as much about coercing strings/paths into File objects so we know how to hash them properly as it is about validation. So I think it will be hard to get away from some form of type-parsing (the existing code this PR replaces already had quite a bit of special-case handling to deal with File objects).

python isn't really designed to be that kind of language and it seems we are forcing some elements to users here to satisfy a developer constraint. i'm not for or against this at this stage. just wanted to bring it up so that the user perspective is considered.

I would perhaps argue that the direction of travel with Python is to move towards a stricter, type-checked, language. Certainly editors like VSCode are pushing you to write code that way, with type-checked linting switched on by default.

On the topic of linting, I have been thinking about what it would take to facilitate static checking and hinting of pydra workflows. I will try to put together an issue summarising what would be required so we can also debate its merits.

a key goal of pydra was to keep it simple for users to not have to do things they don't fully understand. one element of this was the idea that pydra can be used in a jupyter notebook or a script without the need to construct a workflow or define an interface. simply used to parallelize elements of a script.

Perhaps if it had a better name than gathered it would be more intuitive. I played around with a few but couldn't come up with one that I really liked. Personally, I reckon by the time that the users are thinking about using split, they would already be looking up the docs and in that case having to use gathered wouldn't be too much of an extra burden. But I agree it isn't quite as clean.

@codecov

codecov bot commented Jun 8, 2023

Codecov Report

Patch coverage: 93.65% and project coverage change: +1.07% 🎉

Comparison is base (426564e) 81.77% compared to head (15666a5) 82.84%.
Report is 4 commits behind head on master.

❗ Current head 15666a5 differs from pull request most recent head 292fd3f. Consider uploading reports for the commit 292fd3f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #662      +/-   ##
==========================================
+ Coverage   81.77%   82.84%   +1.07%     
==========================================
  Files          20       22       +2     
  Lines        4400     4845     +445     
  Branches     1264        0    -1264     
==========================================
+ Hits         3598     4014     +416     
- Misses        798      831      +33     
+ Partials        4        0       -4     
Flag Coverage Δ
unittests 82.84% <93.65%> (+1.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
pydra/__init__.py 89.47% <ø> (-1.44%) ⬇️
pydra/engine/helpers_file.py 92.34% <91.25%> (+6.24%) ⬆️
pydra/engine/core.py 92.72% <91.37%> (-0.40%) ⬇️
pydra/utils/hash.py 92.85% <92.85%> (ø)
pydra/utils/typing.py 92.97% <92.97%> (ø)
pydra/engine/task.py 93.65% <93.75%> (+0.07%) ⬆️
pydra/engine/helpers.py 86.59% <95.40%> (+0.54%) ⬆️
pydra/engine/specs.py 94.72% <96.15%> (-0.08%) ⬇️
pydra/engine/audit.py 96.55% <100.00%> (+0.12%) ⬆️
pydra/engine/helpers_state.py 95.67% <100.00%> (+0.06%) ⬆️
... and 1 more

... and 1 file with indirect coverage changes


@djarecka
Collaborator

djarecka commented Jun 8, 2023

on one hand I don't like introducing something like gathered (not sure what would be the best name, we can think), on the other hand I'm not sure whether it decreases readability, or actually improves it...
It does explicitly point to the argument that should have been modified when moving the task to the splitting option, so it might help with debugging.

@effigies
Contributor

effigies commented Jun 19, 2023

Thinking a bit about gathered and trying to find a way out. I'm worried that a special type will add confusion when we do something like A -> B -> C where A is mapped over input, B accepts a list, and C again maps over inputs.

I believe the issue here is in:

a = TaskA(x=[1, 2, 3, 4]).split("x")

When x is passed to TaskA(), we don't know whether it is a state-mapping list or a regular list, so we don't know whether to check compatibility with int or list[int]. Is this correct?

What if we made the syntax instead:

a = TaskA().split("x", x=[1, 2, 3, 4]).combine("x")

In this formulation, x does not get set on a, but on a.state. We can put a runtime check that the inner type of the sequence matches the type of a.inputs.x, but it does not need to be explicitly called as a gathered.

Lazy inputs/outputs are not going to be reasonably type-checkable, but we could still build-time check them, and should be able to tell whether inputs are mapped or not. We could use a StateList NewType or subclass to ensure that we unwrap things correctly. .split() would add a layer, while .combine() would remove one.
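That StateList idea could be sketched like so (purely a proposal sketch, not existing pydra API; the names StateList, split_value and combine_value are made up for this example):

```python
class StateList(list):
    """Marker subclass for state-mapped sequences: each .split() would add
    one layer of StateList and each .combine() would remove one, so the
    nesting depth tracks how many mapped dimensions remain."""


def split_value(seq) -> "StateList":
    # wrap the sequence in one state-mapping layer
    return StateList(seq)


def combine_value(mapped: StateList) -> list:
    # strip one state-mapping layer, back to a plain list
    return list(mapped)
```

Because StateList is distinguishable from a plain list at build time, the checker could tell a state-mapping sequence apart from an ordinary list input without the explicit gathered annotation.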

@tclose
Contributor Author

tclose commented Aug 2, 2023

This is amazing work, Tom! I hope to get through the rest before we talk, but in general I'm really impressed how this has come together in a way that keeps the basic type-free functionality intact.

Thanks 🙏 I'm really chuffed you think so 😊

On the subject of keeping the basic type-free functionality intact, one alteration I have been considering that would be good to discuss in the meeting is that of relaxing the type-checking at construction time (i.e. for lazy fields), so that base classes can be passed to sub-classed type fields, avoiding the need to add cast statements between loosely-typed and tightly-typed tasks. Stricter type-checking/coercing can still be performed at runtime.

For example, if B is a subclass of A, then currently you can connect a B-typed output to an A-typed input, but if you wanted to connect A-typed to B-typed you would need to cast it to B first. While this aligns with how traditional type-checking is done, we might want to be a bit more flexible, so you could connect generic File-type fields to Nifti fields, for example. If upstream and downstream nodes both use specific formats that don't match, e.g. MrtrixImage -> Nifti, then we would still raise an error.

It would somewhat reduce the effectiveness of the type-checking, but would probably avoid any (false-positive) cases where the type-checking could be annoying.
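The relaxed rule being proposed could be sketched as follows (File, Nifti and MrtrixImage stand in for fileformats classes; the function name is made up):

```python
class File:
    """Stand-in for fileformats.generic.File."""


class Nifti(File):
    """Stand-in for one specific format."""


class MrtrixImage(File):
    """Stand-in for another, incompatible specific format."""


def construction_compatible(source: type, target: type) -> bool:
    # relaxed construction-time rule: allow the connection if either type
    # is a subclass of the other (strict checking is deferred to runtime);
    # reject only disjoint types, e.g. MrtrixImage -> Nifti
    return issubclass(source, target) or issubclass(target, source)
```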

Collaborator

@ghisvail ghisvail left a comment


I went through this review "diagonally" and provided minor comments along the way.

Comment on lines 597 to 600
if self._lzout:
    raise Exception(
        f"Cannot split {self} as its output interface has already been accessed"
    )
Collaborator

How can this use case be triggered?

Contributor Author

If you now try to split the node after you have accessed the lzout of the node, it will raise an error, e.g.

wf.add(A(x=1, name="a"))
wf.add(B(w=wf.a.lzout.z))
wf.a.split(y=[1, 2, 3])  # <--- Will raise an error, because it will change z from a single value to a list/state-array

return SpecInfo(
-    name="Inputs",
+    name=spec_name,
Collaborator

spec_name used to be set to "Inputs". Now that it has become a parameter, what happens if spec_name is set to None or the empty string?

Options include coalescing with spec_name = spec_name or "Inputs", defining a default parameter value, or raising an error if spec_name is unset.

Contributor Author

Shouldn't spec_name always be a string?

Collaborator

It's Python at the end of the day, so spec_name can be anything including None.

Contributor Author

I think we can probably just raise the error in this case
