
Type checking and coercion #662

Merged
merged 142 commits into from
Aug 4, 2023

Conversation

tclose
Contributor

@tclose commented Jun 7, 2023

Types of changes

This PR contains a number of breaking changes:

  • significantly stricter type-checking
  • lists of inputs to be split over, including lazy inputs, now need to be set in the TaskBase.split() method (i.e. instead of in Task.__init__ or by setting the inputs attribute).
  • Splits and combines need to be applied to typed tasks/workflows before their lzout interface is accessed (so the type of the LazyField can be set properly)
  • ShellCommandTask inputs that use output_file_template must also be defined in the output spec with a fileformats.generic.File (or subclass) type, in addition to an input field of str, Path, ty.Union[str, bool] or ty.Union[Path, bool] type (@ghisvail I forgot about this when you asked whether there were any implications for task interface design).
    • NB: the automatic insertion would work if it went in the opposite direction to the current one, i.e. a File-type field defined in the output spec auto-inserting a Path-type input field
  • MultiInput(Obj|File) types changed to be proper classes instead of just lists
  • MultiOutput(Obj|File) types changed to be a Union between a single T, list[T], and a dummy MultiOutputType type used to signify that a converter needs to be inserted into the field
  • set_input_validators() has been removed
  • File and Directory in pydra.engine.specs now refer to the classes in fileformats.generic. As such, existence (and format) checking of non-lazy inputs is performed in FileSet.__init__, not at runtime as was the case previously
  • Default values assigned to fields in input/output specs are now set to attrs.NOTHING not None (not sure if this actually changes the overall behaviour though as I believe they are always overwritten)
  • Associated files (e.g. JSON side-cars) won't be copied unless an appropriate file-format class is specified (i.e. File is not sufficient; you would need to use fileformats.medimage.NiftiX, for example). However, on the plus side:
    • Non-neuroimaging associated files will be copied if an appropriate format is used (currently only selected neuroimaging formats with associated files are supported, which are hard-coded in helpers_file.py)
    • Associated files will be included in hashes
  • No longer compatible with some older micro versions of Python (e.g. 3.8.0) as they don't support typing in quite the same way

Summary

Type checking

Adds detailed type checking at both run and construction time (i.e. lazy fields carry the type of the field they point to). This is done by pydra.utils.TypeParser, which works as an attrs converter and is set as the converter for every field in the generated input/output specs by default.

The type parser unwraps arbitrary nested generic types (e.g. List, Tuple and Dict in typing package) and tests that they either match or are a subclass of the declared types.
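As a rough illustration of what such unwrapping involves, a recursive check against nested generics might look like the sketch below. This is hypothetical code, not pydra's actual TypeParser (which also handles unions, coercion and lazy fields); the function name `matches` is made up for this example.

```python
from typing import get_args, get_origin


def matches(value, declared) -> bool:
    """Recursively check `value` against a possibly nested generic type.

    Sketch only: variadic tuples (tuple[X, ...]), unions and typing.Any
    are omitted for brevity.
    """
    origin = get_origin(declared)
    if origin is None:  # a plain class, e.g. int or File
        return isinstance(value, declared)
    if not isinstance(value, origin):
        return False
    args = get_args(declared)
    if origin is dict:
        key_type, val_type = args
        return all(
            matches(k, key_type) and matches(v, val_type)
            for k, v in value.items()
        )
    if origin is tuple:  # fixed-length tuple: one type argument per position
        return len(value) == len(args) and all(
            matches(v, t) for v, t in zip(value, args)
        )
    (item_type,) = args  # list/set/frozenset: one argument for all items
    return all(matches(v, item_type) for v in value)
```

Subclass acceptance falls out of the `isinstance` calls: a value of a subclass of the declared type passes at every nesting level.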

Type coercion

In addition to the type-checking, selected types will be automatically coerced to their declared type:

  • typing.Sequence -> typing.Sequence (but not str -> typing.Sequence or typing.Sequence -> str)
  • typing.Mapping -> typing.Mapping
  • Path -> os.PathLike
  • str -> os.PathLike
  • os.PathLike -> Path
  • os.PathLike -> str
  • typing.Any -> MultiInputObj
  • typing.Union[os.PathLike, str] -> MultiInputFile
  • typing.Sequence -> MultiOutputObj
  • int -> float
  • numpy.array -> typing.Sequence
  • numpy. ->

Like the type-checking, the coercion is performed between types within arbitrarily nested containers, so it can handle coercions such as

dict[str, list[str]] -> dict[str, tuple[File]]
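A container-aware coercion pass could be sketched as follows. This is hypothetical code, not the actual TypeParser, which additionally consults an allow-list of source/target pairs (e.g. permitting int -> float but never str -> Sequence):

```python
from typing import get_args, get_origin


def coerce(value, declared):
    """Coerce `value` into a possibly nested declared type.

    Sketch only: no allow-list, no unions, no lazy fields.
    """
    origin = get_origin(declared)
    if origin is None:  # plain class: coerce by calling the constructor
        return value if isinstance(value, declared) else declared(value)
    args = get_args(declared)
    if origin is dict:
        key_type, val_type = args
        return {
            coerce(k, key_type): coerce(v, val_type) for k, v in value.items()
        }
    if origin is tuple:
        if len(args) == 2 and args[1] is Ellipsis:  # variadic tuple[X, ...]
            return tuple(coerce(v, args[0]) for v in value)
        return tuple(coerce(v, t) for v, t in zip(value, args))
    (item_type,) = args  # list/set/frozenset: one argument for all items
    return origin(coerce(v, item_type) for v in value)
```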

Hashing

Hashing is implemented (by @effigies) in a dispatch method, which handles hashes for non-ordered objects such as sets in a process-independent way.

Hashing of files is now handled outside of Pydra within the fileformats package. This enables the hashes to be aware of changes in associated files.
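The dispatch approach can be sketched roughly like this (illustrative only: the function name `hash_obj` is made up, and the real implementation serialises objects far more carefully). The key point is that set members are hashed individually and the digests sorted, so the result is independent of iteration order, which can vary between processes:

```python
import hashlib
from functools import singledispatch


@singledispatch
def hash_obj(obj) -> bytes:
    # fallback: hash the repr (sketch only)
    return hashlib.blake2b(repr(obj).encode()).digest()


@hash_obj.register
def _(obj: set) -> bytes:
    # order-independent: hash each member, then sort the digests so the
    # result does not depend on set iteration order
    digests = sorted(hash_obj(item) for item in obj)
    return hashlib.blake2b(b"".join(digests)).digest()


@hash_obj.register
def _(obj: list) -> bytes:
    # order-dependent: digests concatenated in sequence order
    return hashlib.blake2b(b"".join(hash_obj(item) for item in obj)).digest()
```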

Copying files

Copying files (e.g. into working directories) is also handled in fileformats, and includes graceful fallback between leaving the files in-place, symlinking, hardlinking and copying files (in order of preference), taking into account whether the files need to be in the same folder/have the same file-stem.

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

@satra
Contributor

satra commented Jun 7, 2023

@tclose - thanks for all the effort in bringing these different PRs and ideas together. the one thing that struck me was the introduction of gathered. this particular thing seems to be a requirement of type checking rather than readability of code. python isn't really designed to be that kind of language and it seems we are forcing some elements to users here to satisfy a developer constraint. i'm not for or against this at this stage. just wanted to bring it up so that the user perspective is considered.

a key goal of pydra was to keep it simple for users to not have to do things they don't fully understand. one element of this was the idea that pydra can be used in a jupyter notebook or a script without the need to construct a workflow or define an interface. simply used to parallelize elements of a script.

@tclose
Contributor Author

tclose commented Jun 8, 2023

@tclose - thanks for all the effort in bringing these different PRs and ideas together. the one thing that struck me was the introduction of gathered. this particular thing seems to be a requirement of type checking rather than readability of code.

Yeah, I tried to avoid the need to add the "gathered" syntax as it seems a bit superfluous, but having to check whether a value is actually a list of that value, or a list of list of that value, etc... just makes the type-checking impractical (unless anyone has any good ideas). I personally think it makes the code a bit easier to understand if values that are to be split are explicitly marked as such, but definitely something it would be good to discuss.

Note that the "type-checking" process is actually as much about coercing strings/paths into File objects so we know how to hash them properly as it is about validation. So I think it will be hard to get away from some form of type-parsing (the existing code this PR replaces already had quite a bit of special-case handling to deal with File objects).

python isn't really designed to be that kind of language and it seems we are forcing some elements to users here to satisfy a developer constraint. i'm not for or against this at this stage. just wanted to bring it up so that the user perspective is considered.

I would perhaps argue that the direction of travel with Python is to move towards a stricter, type-checked, language. Certainly editors like VSCode are pushing you to write code that way, with type-checked linting switched on by default.

On the topic of linting, I have been thinking about what it would take to facilitate static checking and hinting of pydra workflows. I will try to put together an issue summarising what would be required so we can also debate its merits.

a key goal of pydra was to keep it simple for users to not have to do things they don't fully understand. one element of this was the idea that pydra can be used in a jupyter notebook or a script without the need to construct a workflow or define an interface. simply used to parallelize elements of a script.

Perhaps if it had a better name than gathered it would be more intuitive. I played around with a few but couldn't come up with one that I really liked. Personally, I reckon by the time that the users are thinking about using split, they would already be looking up the docs and in that case having to use gathered wouldn't be too much of an extra burden. But I agree it isn't quite as clean.

@codecov

codecov bot commented Jun 8, 2023

Codecov Report

Patch coverage: 93.65% and project coverage change: +1.07% 🎉

Comparison is base (426564e) 81.77% compared to head (15666a5) 82.84%.
Report is 4 commits behind head on master.

❗ Current head 15666a5 differs from pull request most recent head 292fd3f. Consider uploading reports for the commit 292fd3f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #662      +/-   ##
==========================================
+ Coverage   81.77%   82.84%   +1.07%     
==========================================
  Files          20       22       +2     
  Lines        4400     4845     +445     
  Branches     1264        0    -1264     
==========================================
+ Hits         3598     4014     +416     
- Misses        798      831      +33     
+ Partials        4        0       -4     
Flag Coverage Δ
unittests 82.84% <93.65%> (+1.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
pydra/__init__.py 89.47% <ø> (-1.44%) ⬇️
pydra/engine/helpers_file.py 92.34% <91.25%> (+6.24%) ⬆️
pydra/engine/core.py 92.72% <91.37%> (-0.40%) ⬇️
pydra/utils/hash.py 92.85% <92.85%> (ø)
pydra/utils/typing.py 92.97% <92.97%> (ø)
pydra/engine/task.py 93.65% <93.75%> (+0.07%) ⬆️
pydra/engine/helpers.py 86.59% <95.40%> (+0.54%) ⬆️
pydra/engine/specs.py 94.72% <96.15%> (-0.08%) ⬇️
pydra/engine/audit.py 96.55% <100.00%> (+0.12%) ⬆️
pydra/engine/helpers_state.py 95.67% <100.00%> (+0.06%) ⬆️
... and 1 more

... and 1 file with indirect coverage changes


@djarecka
Collaborator

djarecka commented Jun 8, 2023

on one hand I don't like introducing something like gathered (not sure what would be the best name, we can think), on the other hand I'm not sure whether it decreases readability, or actually improves it...
It does explicitly point to the argument that should have been modified when moving the task to the splitting option, so it might help with debugging.

@effigies
Contributor

effigies commented Jun 19, 2023

Thinking a bit about gathered and trying to find a way out. I'm worried that a special type will add confusion when we do something like A -> B -> C where A is mapped over input, B accepts a list, and C again maps over inputs.

I believe the issue here is in:

a = TaskA(x=[1, 2, 3, 4]).split("x")

When x is passed to TaskA(), we don't know whether it is a state-mapping list or a regular list, so we don't know whether to check compatibility with int or list[int]. Is this correct?

What if we made the syntax instead:

a = TaskA().split("x", x=[1, 2, 3, 4]).combine("x")

In this formulation, x does not get set on a, but on a.state. We can put a runtime check that the inner type of the sequence matches the type of a.inputs.x, but it does not need to be explicitly called as a gathered.

Lazy inputs/outputs are not going to be reasonably type-checkable, but we could still build-time check them, and should be able to tell whether inputs are mapped or not. We could use a StateList NewType or subclass to ensure that we unwrap things correctly. .split() would add a layer, while .combine() would remove one.
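That StateList idea could be sketched like so (purely a proposal sketch, not existing pydra API; the names StateList, split_value and combine_value are made up for this example):

```python
class StateList(list):
    """Marker subclass for state-mapped sequences: each .split() would add
    one layer of StateList and each .combine() would remove one, so the
    nesting depth tracks how many mapped dimensions remain."""


def split_value(seq) -> "StateList":
    # wrap the sequence in one state-mapping layer
    return StateList(seq)


def combine_value(mapped: StateList) -> list:
    # strip one state-mapping layer, back to a plain list
    return list(mapped)
```

Because StateList is distinguishable from a plain list at build time, the checker could tell a state-mapping sequence apart from an ordinary list input without the explicit gathered annotation.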

@tclose
Contributor Author

tclose commented Aug 2, 2023

This is amazing work, Tom! I hope to get through the rest before we talk, but in general I'm really impressed how this has come together in a way that keeps the basic type-free functionality intact.

Thanks 🙏 I'm really chuffed you think so 😊

On the subject of keeping the basic type-free functionality intact, one alteration I have been considering that would be good to discuss in the meeting is that of relaxing the type-checking at construction time (i.e. for lazy fields), so that base classes can be passed to sub-classed type fields, avoiding the need to add cast statements between loosely-typed and tightly-typed tasks. Stricter type-checking/coercing can still be performed at runtime.

For example, if B is a subclass of A, then currently you can connect a B-typed output to an A-typed input, but if you wanted to connect A-typed to B-typed you would need to cast it to B first. While this aligns with how traditional type-checking is done, we might want to be a bit more flexible, so you could connect generic File-type fields to Nifti fields, for example. If upstream and downstream nodes both use specific formats that don't match, e.g. MrtrixImage -> Nifti, then we would still raise an error.

It would somewhat reduce the effectiveness of the type-checking, but would probably avoid any (false-positive) cases where the type-checking could be annoying.
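The relaxed rule being proposed could be sketched as follows (File, Nifti and MrtrixImage stand in for fileformats classes; the function name is made up):

```python
class File:
    """Stand-in for fileformats.generic.File."""


class Nifti(File):
    """Stand-in for one specific format."""


class MrtrixImage(File):
    """Stand-in for another, incompatible specific format."""


def construction_compatible(source: type, target: type) -> bool:
    # relaxed construction-time rule: allow the connection if either type
    # is a subclass of the other (strict checking is deferred to runtime);
    # reject only disjoint types, e.g. MrtrixImage -> Nifti
    return issubclass(source, target) or issubclass(target, source)
```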

Collaborator

@ghisvail ghisvail left a comment


I went through this review "diagonally" and provided minor comments along the way.

Comment on lines 597 to 600
if self._lzout:
    raise Exception(
        f"Cannot split {self} as its output interface has already been accessed"
    )
Collaborator

How can this use case be triggered?

Contributor Author

If you now try to split the node after you have accessed the lzout of the node, it will raise an error, e.g.

wf.add(A(x=1, name="a"))
wf.add(B(w=wf.a.lzout.z))
wf.a.split(y=[1, 2, 3])  # <--- Will raise an error, because it will change z from a single value to a list/state-array

return SpecInfo(
-    name="Inputs",
+    name=spec_name,
Collaborator

spec_name used to be set to "Inputs". Now that it has become a parameter, what happens if spec_name is set to None or the empty string?

Options include coalescing with spec_name = spec_name or "Inputs", defining a default parameter value, or raising an error if spec_name is unset.

Contributor Author

Shouldn't spec_name always be a string?

Collaborator

It's Python at the end of the day, so spec_name can be anything including None.

Contributor Author

I think we can probably just raise the error in this case
