RFC into RFC #5653

Merged
merged 648 commits into from
Aug 13, 2024
Changes from 1 commit
37255a1
Stop admin launcher copying shard key from parent workflow (#5174)
Tom-Newton Apr 4, 2024
bcdbf5f
fix id bigint conversation for not created table (#5157)
ongkong Apr 4, 2024
f1c2231
Add tracking for active node and task execution counts in propeller (…
sshardool Apr 4, 2024
f8d4992
[House keeping] include container statuses for all container exit err…
pvditt Apr 4, 2024
e8a44b4
docs: add missing key in auth guide (#5169)
Jeinhaus Apr 4, 2024
452538a
shallow copying EnvironmentVariables map before modification (#5182)
hamersaw Apr 4, 2024
24a6e4e
Feature/array node workflow parallelism (#5062)
pvditt Apr 4, 2024
14eaa16
Fix streak length metric reporting (#5172)
Tom-Newton Apr 5, 2024
44c701e
Fix path to AuthMetadataService in flyte-binary chart (#5185)
eapolinario Apr 5, 2024
5f9abb1
Change phase to queue on job submit for webapi plugins (#5188)
pingsutw Apr 5, 2024
2eede89
[Docs] Testing agents in the development environment (#5106)
Future-Outlier Apr 5, 2024
c8be3e4
Use ratelimiter config in webapi plugins (#5190)
kumare3 Apr 5, 2024
934f5c1
docs(ray): Update kuberay documentation (#5179)
MortalHappiness Apr 7, 2024
6d58c73
Change phase to WaitingForResources when quota exceeded (#5195)
pingsutw Apr 7, 2024
8380f84
Fix: Update spark operator helm repository (#5198)
fg91 Apr 8, 2024
6568582
docs(troubleshoot): Add docker error troubleshooting guide (#4972)
MortalHappiness Apr 8, 2024
2528de7
add cache client read and write otel tracing (#5184)
pvditt Apr 8, 2024
6a39af7
fix link (#5199)
neverett Apr 9, 2024
674367f
add SyncTask's timeout setting (#5209)
Future-Outlier Apr 10, 2024
1ac8bbe
[easy] [flyteagent] Add `agent-service` endpoint settings for `flyte…
Future-Outlier Apr 10, 2024
8ef5ea9
Update Monitoring documentation (#5206)
davidmirror-ops Apr 10, 2024
dd67ff0
chore: remove obsolete flyte config files (#5196)
pingsutw Apr 10, 2024
6be49e8
Generate rust grpc using tonic (#5187)
eapolinario Apr 11, 2024
ab95f7e
enable parallelism to be set to nil for array node (#5214)
pvditt Apr 11, 2024
b03e86d
Fix get task resource attribute comment (#469)
wild-endeavor Apr 11, 2024
9955256
Fix mounting secrets (#5063)
yini7777 Apr 11, 2024
734d6f3
Update "Creating a Flyte project" with link to new Dockerfile project…
neverett Apr 11, 2024
b36556e
Re-apply changes to dataclass docs from flytesnacks#1553 (#5211)
neverett Apr 12, 2024
c7d1463
feat(ray): Remove initContainers (#5178)
MortalHappiness Apr 12, 2024
6159c27
fix(databricks): Check the response body before unmarshal (#5226)
pingsutw Apr 12, 2024
80ccda3
perf(cache): Use AddRateLimited for batch enqueue (#5228)
pingsutw Apr 12, 2024
4a90440
update scroll behavior so that sidebar maintains location (#5229)
cosmicBboy Apr 13, 2024
7287470
propagate dark/light theme to algolia search bar (#5231)
cosmicBboy Apr 14, 2024
2ba277f
Added additional port configuration for flyte services (#5233)
hamersaw Apr 15, 2024
30b1675
[BUG] fix(doc): Wrong configuration in spark plugin with binary chart…
lowc1012 Apr 15, 2024
ed23620
Fix grammatical error in workflow lifecycle docs (#5227)
Sovietaced Apr 15, 2024
3004af4
add plugins support for k8s imagepullpolicy (#5167)
novahow Apr 15, 2024
d0ed6c4
Added section on overriding lp config in loop (#5223)
pryce-turner Apr 15, 2024
1757750
Added unmarshal attribute for expires_in for device flow auth token (…
eapolinario Apr 15, 2024
c899586
Update template to link issue for closing (#5239)
thomasjpfan Apr 17, 2024
0095165
[Feature] add retries and backoffs for propeller sending events to ad…
pvditt Apr 18, 2024
35797ec
Explain how to enable/disable local caching (#5242)
eapolinario Apr 19, 2024
2ca3111
[House keeping] remove setting max size bytes in node context (#5092)
pvditt Apr 19, 2024
e8588f3
Fix support for limit-namespace in FlytePropeller (#5238)
hamersaw Apr 19, 2024
7e99ba4
[WAIT TO MERGE] Use remoteliteralinclude for code in user guide docs …
neverett Apr 19, 2024
6f0c274
refactor(python): Replace os.path with pathlib (#5243)
MortalHappiness Apr 20, 2024
a521dd0
Containerize documentation build environment and add sphinx-autobuild…
MortalHappiness Apr 21, 2024
de61bd0
use sphinx-design directives instead of sphinx-panels (#471)
cosmicBboy Apr 21, 2024
c64b5e3
Finalize flyteidl Rust crate (#5219)
austin362667 Apr 21, 2024
6b74673
Use sphinx design instead of sphinx panels (#5254)
cosmicBboy Apr 21, 2024
8f0cd02
add pip install flytekitplugins-envd (#5257)
Future-Outlier Apr 21, 2024
575a2a1
[Docs] add colon rule in version (#5258)
Future-Outlier Apr 21, 2024
e3e614a
Add Raw Container Local Execution Doc (#5262)
Future-Outlier Apr 22, 2024
fb52e31
[Docs] add validation file tpye (#5259)
jasonlai1218 Apr 22, 2024
4ff2707
WebAPI plugins optimization (#5237)
pingsutw Apr 22, 2024
0ffab4a
fix(webapi): Ensure cache deletion on abort workflow (#5235)
pingsutw Apr 22, 2024
8c5aac6
RunLLM Widget Configuration (#5266)
agiron123 Apr 23, 2024
8eb7b0d
fix dropdown formatting (#5276)
cosmicBboy Apr 23, 2024
6b03731
added configuration for arraynode default parallelism behavior (#5268)
hamersaw Apr 23, 2024
f0d645b
fix(storage): Improve error msg for limit exceeded (#5275)
pingsutw Apr 24, 2024
5c77765
test sphinx-reredirects (#5281)
cosmicBboy Apr 24, 2024
876999c
enabling parallelism controls on arraynode (#5284)
hamersaw Apr 24, 2024
b79cd51
Move deprecated integrations docs to flyte (#5283)
neverett Apr 25, 2024
c77e2c3
update rli number ranges (#5287)
neverett Apr 25, 2024
8a8bbc0
Integrate Bubbletea Pagination into get execution (#473)
zychen5186 Apr 26, 2024
9853abe
Fix broken gpu resource override when using pod templates (#4925)
fg91 Apr 26, 2024
95e9ac8
Fix order of arguments in copilot log (#5292)
eapolinario Apr 26, 2024
f32132b
fix(databricks): Handle FAILED state as retryable error (#5277)
pingsutw Apr 29, 2024
6ebcb86
Fix uploading of code coverage data (#5298)
eapolinario Apr 29, 2024
1f0a4f2
Update Flyte components (#5300)
flyte-bot Apr 29, 2024
ce0fd45
Do not upload codecoverage data from boilerplate (#478)
eapolinario Apr 29, 2024
5470f3f
Updates runllm configuration to use stable channel (#5289)
agiron123 Apr 30, 2024
40dccab
Remove upperbound on flyteidl's protobuf dependency (#5285)
mark-thm Apr 30, 2024
ec3f38b
fix: doc-requirements.txt to reduce vulnerabilities (#479)
EngHabu Apr 30, 2024
defacc1
Merge remote-tracking branch 'flytectl/prepare-for-monorepo' into pre…
eapolinario Apr 30, 2024
8cad4ed
Fix typos
eapolinario Apr 30, 2024
79a1992
Merge pull request #5301 from flyteorg/prepare-monorepo--flytectl
eapolinario Apr 30, 2024
00cad51
Bump golang.org/x/net from 0.22.0 to 0.23.0 in /flytecopilot (#5251)
dependabot[bot] Apr 30, 2024
a6b32f9
Add GoLand single-binary configuration (#5308)
eapolinario Apr 30, 2024
88a2f50
Upgrade actions/checkout (#5294)
pingsutw May 1, 2024
5350c42
added execution environment protos (#5311)
hamersaw May 1, 2024
b55026e
Bump github.com/containerd/containerd from 1.5.10 to 1.6.26 in /flyte…
dependabot[bot] May 2, 2024
f59e5ae
Fixup flytectl (#5309)
eapolinario May 2, 2024
8cc9617
correctly setting execution environments in flyteidl (#5323)
hamersaw May 3, 2024
8dd8b05
Fix flytectlt tests (#5325)
eapolinario May 3, 2024
0a1c82c
Upates RunLLM chat widget to use Mod + j as shortcut (#5324)
agiron123 May 3, 2024
ccf9473
perf(webapi): Increase maxWorkers limit to 10000 (#5326)
pingsutw May 6, 2024
b3c746f
Upgrade setup-go (#5328)
pingsutw May 6, 2024
dd022b5
ci: Remove duplicate docker-build job (#5327)
pingsutw May 6, 2024
f5c228f
Update Flyte components (#5330)
flyte-bot May 6, 2024
bd3ed0d
chore: handle empty or invalid secrets nicely (#4801)
trutx May 6, 2024
41ee314
Bump golang.org/x/net from 0.22.0 to 0.23.0 (#5253)
dependabot[bot] May 6, 2024
5cfd004
Use published in flyteidl-release gh workflow (#5332)
eapolinario May 7, 2024
de9a5c8
Bring back image build job and be explicit about flyteadmin int tests…
eapolinario May 7, 2024
8db9901
Bump github.com/docker/distribution in /flytectl (#5313)
dependabot[bot] May 7, 2024
f05acc7
Bump idna from 3.6 to 3.7 in /flytectl (#5315)
dependabot[bot] May 7, 2024
46db95b
Bump golang.org/x/net in /boilerplate/flyte/golang_support_tools (#5244)
dependabot[bot] May 7, 2024
b36a1f3
Bump jinja2 from 3.1.3 to 3.1.4 in /flytectl (#5329)
dependabot[bot] May 7, 2024
cb57beb
Bump golang.org/x/net to v0.23.0 (#5333)
eapolinario May 7, 2024
43c0f1a
add tracking pixel (#5336)
neverett May 8, 2024
41b1b84
docs: add `nested_type` example in `data_types_and_io/structured_data…
austin362667 May 9, 2024
aa1f211
Add prefetch functionality for paginator (#5310)
zychen5186 May 9, 2024
c0f5b10
add maptask cache clarification (#5340)
dansola May 9, 2024
ee6037b
Add Google Tag Manager (#5341)
EngHabu May 9, 2024
2f38d65
Grafana dashboard updates (#5255)
Tom-Newton May 12, 2024
7b82397
Bump github.com/docker/docker in /flytectl (#5363)
ddl-ebrown May 15, 2024
4dd5f3c
[monorepo] Move flytectl gh workflows to monorepo (#5354)
eapolinario May 15, 2024
11fbeca
Fix flytectl demo start (#5370)
eapolinario May 15, 2024
add971e
add tracking pixel (#5371)
neverett May 15, 2024
e506d19
Update flytekit tag (#5360)
pingsutw May 16, 2024
9e41df2
Bump version of unionai/flytectl-setup to 0.0.3 (#5377)
eapolinario May 16, 2024
16d54de
Clarify how networking between data plane propeller and control plane…
fg91 May 16, 2024
31802c7
Build flytectl monodocs from monorepo (#5374)
eapolinario May 16, 2024
519080b
Update Optimization Performance docs (#5278)
davidmirror-ops May 16, 2024
96acc5c
Crate updates (#5382)
wild-endeavor May 17, 2024
2f1f813
Skip flytectl upgrade tests (#5381)
eapolinario May 17, 2024
1384b32
Ensure token is refreshed on Unauthenticated (#5388)
pmahindrakar-oss May 20, 2024
16d2b14
remove source code renderer (#5397)
neverett May 20, 2024
458da5c
openai batch agent backend setup documentation (#5291)
samhita-alla May 21, 2024
7d2f0d0
Revert "Ensure token is refreshed on Unauthenticated (#5388)" (#5404)
eapolinario May 21, 2024
cfaedce
Fix link to test on local cluster (#5398)
Sovietaced May 21, 2024
2f7bedf
Replace Azure AD OIDC URL with correct one (#4075)
EraYaN May 22, 2024
317ad30
Update the example Dockerfile to run on k8s (#5412)
Sovietaced May 22, 2024
c1eddad
docs(kubeflow): Fix kubeflow webhook error (#5410)
MortalHappiness May 22, 2024
8a61252
update flytekit version to 1.12.1b2 in monodocs requirements (#5411)
samhita-alla May 22, 2024
2143948
Add supported task types to agent service config and rename (#5402)
Sovietaced May 22, 2024
95196af
update lock file (#5416)
samhita-alla May 23, 2024
0583f77
[monorepo] Fix flytectl install script (#5405)
eapolinario May 23, 2024
470621e
Move to upstream mockery (#4937)
eapolinario May 24, 2024
5f9abaf
Use a different git command to match the flyteidl tags (#5419)
eapolinario May 24, 2024
ba3647f
Fix typos using codespell CI job (#5418)
eapolinario May 24, 2024
d04cf66
[BUG] Handle auto refresh cache race condition (#5406)
pvditt May 24, 2024
1078e09
Add executionClusterLabel (#5394)
RRap0so May 28, 2024
75b33f8
[monodocs] Fix build failure (#5425)
eapolinario May 29, 2024
22d81c6
Update flytefile.md (#5428)
wild-endeavor May 29, 2024
715496a
use SHA instead of master in rli links (#5434)
neverett May 30, 2024
1e61f4e
Update Flyte components (#5441)
flyte-bot May 31, 2024
f08eb47
Upstream revert revert auth token fix (#5407)
pmahindrakar-oss May 31, 2024
9ffc54f
Indicate that jq is now required to install flytectl via the script (…
eapolinario Jun 3, 2024
5a81e76
Indicate that jq is needed to install flytectl (#5446)
eapolinario Jun 4, 2024
fceb78f
Fix flaky auto_refresh_test (#5438)
pingsutw Jun 4, 2024
25c3596
Feat: Allow using in-cluster creds in control plane cluster in a mult…
fg91 Jun 5, 2024
38883c7
Key-value execution tags (#5453)
pingsutw Jun 8, 2024
cd37d1b
Watch agent metadata service (#5017)
pingsutw Jun 8, 2024
15e321b
Make BaseURL insensitive to trailing slashes for metadata endpoint re…
Dlougach Jun 11, 2024
bd2f9a8
remove mmcloud plugin (#5468)
neverett Jun 11, 2024
266beda
[Feature] Refactor distributed job using common ReplicaSpec (#5355)
MortalHappiness Jun 12, 2024
3f4c2c8
charts: honor redoc.enabled=false (#5452)
flixr Jun 12, 2024
9484e36
Minor cleanup for Web API plugins (#5472)
Sovietaced Jun 12, 2024
653ca85
Bump k3s version to 1.29.0 (#5475)
eapolinario Jun 13, 2024
9b78125
fix: Modify the callback URL string in auth flow, to support custom b…
ddl-rliu Jun 13, 2024
bba8c11
[flyteagent] Remove redundant code in Agent (#5454)
Future-Outlier Jun 14, 2024
7d788cb
Add flyteconsole url to FlyteWorkflow CRD (#5449)
eapolinario Jun 14, 2024
02bf85f
deep copying arraynode tasktemplate interface (#5479)
hamersaw Jun 14, 2024
59e18d1
Remove end2end.yml (#5034)
pingsutw Jun 16, 2024
1abdd94
fix(migrations): Correct NULL to empty string in SQL insert (#5482)
pingsutw Jun 17, 2024
791471c
Fix: Make 'flytectl compile' consider launchplans used within workflo…
fellhorn Jun 17, 2024
de415af
include group in apiVersion in plugin_collector (#5457)
trevormcguire Jun 17, 2024
8c8e4f8
Feat: Allow controlling in which task phases log links are shown (#4726)
fg91 Jun 17, 2024
ba88e82
Add lint-fix make target, add gci to flytectl, remove goimports make …
Sovietaced Jun 17, 2024
297cdf8
install image builder (#5487)
pingsutw Jun 18, 2024
16e7780
Inherit execution cluster label from source execution (#5431)
va6996 Jun 20, 2024
f87a049
Add API to get domain (#5443)
zychen5186 Jun 20, 2024
c10346d
keep EnvFrom from pod template (#5423)
flixr Jun 21, 2024
4cb1473
fix broken mermaid diagrams (#5498)
cosmicBboy Jun 21, 2024
f54b74e
Fix: replace with in OSS docs (#5501)
fg91 Jun 21, 2024
87db0a3
Remove references to Office Hours (#5496)
davidmirror-ops Jun 22, 2024
2334d3f
Remove boilerplate golang_dockerfile (#5495)
eapolinario Jun 22, 2024
ef6d491
Use uv in single-binary CI (#5493)
eapolinario Jun 23, 2024
ed40a94
Remove doc building from flytectl (#5494)
eapolinario Jun 23, 2024
8b71469
fix(ray): Use default svc account if not set in task metadata (#5499)
pingsutw Jun 24, 2024
8805613
Added version to ExecutionEnv proto message (#5506)
hamersaw Jun 24, 2024
242303b
Event placeholder (#5464)
wild-endeavor Jun 24, 2024
f1e511d
Don't log auth tokens in debug mode (#5497)
Sovietaced Jun 25, 2024
d5744da
Remove ImageSpec experiment warning (#5510)
ppiegaze Jun 25, 2024
e1d9c5c
Add OTLP and sampling to otelutils (#5504)
andrewwdye Jun 25, 2024
3ee7120
Helm chart updates related to Prometheus, Webhook HPA, and Flyteconso…
mhotan Jun 25, 2024
f715341
Doc: Explain how lifetime of logging links is configured (#5503)
fg91 Jun 25, 2024
5ab247f
Flyte core webhook pod settings should be separate (#5490)
ddl-ebrown Jun 25, 2024
37fe0a3
Retag flyteagent image upon release (#5509)
eapolinario Jun 25, 2024
12bd353
Fix some nits in workflow lifecycle docs. (#5389)
dansola Jun 25, 2024
dc6060d
Update token_source.go (#5396)
eltociear Jun 25, 2024
5c70453
Adds appProtocol values of tcp on services (#5240)
noahjax Jun 25, 2024
5cc7f58
Add name to ExistsDifferentStructureError message (#5507)
pingsutw Jun 25, 2024
83daf56
updated execution environment id field to name (#5514)
hamersaw Jun 26, 2024
ce5eb03
Update Flyte components (#5516)
flyte-bot Jun 26, 2024
b592177
Fix flyteagent version extraction from helm chart in create_release.y…
eapolinario Jun 26, 2024
c97a0e9
Remove github.ref from group (#5518)
pingsutw Jun 27, 2024
4643e2a
Fix link to Kubeflow docs (#5524)
ppiegaze Jun 27, 2024
5b0d787
[flytepropeller] Watch agent metadata service dynamically (#5460)
Future-Outlier Jun 28, 2024
9b2a04b
Push flyteidl to pypi and npm on `flyteidl/v*.*.*` tags instead of on…
eapolinario Jul 1, 2024
c512571
reimplement functions in go-mouff-update, use ghreposervice (#5470)
novahow Jul 2, 2024
2407e90
Fix Dataclass Doc about (#5533)
Future-Outlier Jul 2, 2024
fc1c92c
Upstream/node event update (#5528)
wild-endeavor Jul 2, 2024
2b0e8ef
Document how to release flytectl (#5537)
eapolinario Jul 3, 2024
f075b34
Update Flyte components (#5536)
flyte-bot Jul 3, 2024
5c33464
chore: update runllm widget configuration (#5530)
agiron123 Jul 4, 2024
e13bfe3
Use jsonpb AllowUnknownFields everywhere (#5521)
andrewwdye Jul 4, 2024
b63ce0e
Use WithContext in all DB calls (#5538)
Sovietaced Jul 4, 2024
b7e6959
system archived (#5544)
troychiu Jul 8, 2024
b171b51
Pryce/doc 434 clarify how code is pushed into a given image during py…
pryce-turner Jul 10, 2024
81afb76
Increase more memory limits in flyteagent (#5550)
Future-Outlier Jul 10, 2024
c1563d7
Updated map task information to indicate array node is now the defaul…
pryce-turner Jul 16, 2024
220d594
reverted mockery-v2 on ExecutionContext (#5562)
hamersaw Jul 16, 2024
94af58f
Fix issues with helm chart release process (#5560)
Sovietaced Jul 16, 2024
0cddc15
Add FlyteDIrectory to file_types.rst template (#5564)
ppiegaze Jul 16, 2024
03a1c9e
Fully populate Abort task event fields (#5551)
va6996 Jul 17, 2024
9638db0
Add blob typechecker (#5519)
ddl-rliu Jul 17, 2024
0b39839
Refactor echo plugin (#5565)
pingsutw Jul 17, 2024
fe879f2
[Bug] fix ArrayNode state's TaskPhase reset (#5451)
pvditt Jul 19, 2024
337088e
Cleanup helm chart options for flytepropeller and prometheus (#5549)
Sovietaced Jul 22, 2024
02a0d2e
[Housekeeping] Bump Go version to 1.22 (#5032)
lowc1012 Jul 22, 2024
71779b3
Fix typos (#5571)
omahs Jul 23, 2024
3082cd3
Respect original task definition retry strategy for single task execu…
katrogan Jul 23, 2024
63f8307
Clarify the support for the Java/Scala SDK in the docs (#5582)
eapolinario Jul 23, 2024
a07753e
Fix spelling issues (#5580)
nnsW3 Jul 23, 2024
4514860
Update clusterrole.yaml (#5579)
shreyas44 Jul 23, 2024
1f26055
Simplify single task retry strategy check (#5584)
eapolinario Jul 24, 2024
1e54d21
Run make helm (#5587)
eapolinario Jul 24, 2024
07d3cfc
Update GPU docs (#5515)
davidmirror-ops Jul 25, 2024
6292dec
Update azblob 1.1.0 -> 1.4.0 / azcore 1.7.2 -> 1.13.0 (#5590)
ddl-ebrown Jul 25, 2024
ef27b29
Bump github.com/go-jose/go-jose/v3 from 3.0.0 to 3.0.3 in /flyteadmin…
dependabot[bot] Jul 25, 2024
30d3314
add execution mode to ArrayNode proto (#5512)
pvditt Jul 26, 2024
9d0d67a
Fix incorrect YAML for unp GPU (#5595)
davidmirror-ops Jul 26, 2024
d6da838
Another YAML fix (#5596)
davidmirror-ops Jul 26, 2024
025296a
DOC-431 Document pyflyte option --overwrite-cache (#5567)
ppiegaze Jul 29, 2024
01ecd0a
Upgrade docker dependency to address vulnerability (#5614)
katrogan Aug 1, 2024
92dca29
Support offloading workflow CRD inputs (#5609)
katrogan Aug 1, 2024
45e287a
[flyteadmin] Refactor panic recovery into middleware (#5546)
Sovietaced Aug 1, 2024
b18449e
Snowflake agent Doc (#5620)
Future-Outlier Aug 2, 2024
17719e2
[flytepropeller][compiler] Error Handling when Type is not found (#5612)
Future-Outlier Aug 2, 2024
89fd084
Fix nil pointer when task plugin load returns error (#5622)
Sovietaced Aug 2, 2024
89a70fb
Log stack trace when refresh cache sync recovers from panic (#5623)
Sovietaced Aug 2, 2024
7e16ff4
use private-key (#5626)
Future-Outlier Aug 2, 2024
b6bc902
Explain how Agent Secret Works (#5625)
Future-Outlier Aug 2, 2024
1124ea9
Fix typo in execution manager (#5619)
ddl-rliu Aug 2, 2024
0a441a9
Amend Admin to use grpc message size (#5628)
wild-endeavor Aug 2, 2024
4014bbd
document the process of setting ttl for a ray cluster (#5636)
pingsutw Aug 6, 2024
8156b1c
Add CustomHeaderMatcher to pass additional headers (#5563)
andrewwdye Aug 7, 2024
43c9d94
Turn flyteidl and flytectl releases into manual gh workflows (#5635)
eapolinario Aug 8, 2024
91d14b7
docs: fix typo (#5643)
cratiu222 Aug 8, 2024
1de1b50
Use enable_deck=True in docs (#5645)
thomasjpfan Aug 8, 2024
52322d0
Fix flyteidl release checkout all tags (#5646)
eapolinario Aug 8, 2024
96bbf7e
Install pyarrow in sandbox functional tests (#5647)
eapolinario Aug 8, 2024
5be4545
docs: add documentation for configuring notifications in GCP (#5545)
desihsu Aug 9, 2024
a13a63d
Correct "sucessfile" to "successfile" (#5652)
shengyu7697 Aug 12, 2024
736e338
RFC into RFC
katrogan Aug 12, 2024
9cfb4e3
Fix ordering for custom template values in cluster resource controlle…
katrogan Aug 12, 2024
37b4e13
Don't error when attempting to trigger schedules for inactive project…
katrogan Aug 12, 2024
e89f434
Merge branch 'master' into rfc/katrina-offloaded-literal
katrogan Aug 13, 2024
71 changes: 60 additions & 11 deletions rfc/system/5103-offloaded-literal.md
@@ -4,32 +4,77 @@

- @wild-endeavor
- @EngHabu
- @katrogan

## 1 Executive Summary

Flyte depends on a series of `inputs.pb` and `outputs.pb` files to do communication between nodes. This has typically served us well, except for the occasional map task that produces a large Literal output. We sometimes also run into this issue for large dataclasses. This RFC proposes a mechanism that allows the offloading of any Literal, which would be done only of course for now for size reasons.
Flyte relies on a series of `inputs.pb` and `outputs.pb` files to pass data between nodes. This has typically served us well, except for the occasional map task that produces a large Literal collection output, which typically exceeds the default gRPC and configured storage limits. We sometimes also run into this issue with large dataclasses. This RFC proposes a mechanism for offloading any Literal, to get around the size limits on passing large Literal protobuf messages through the system.

## 2 Motivation
A [cursory search](https://discuss.flyte.org/?threads%5Bquery%5D=LIMIT_EXCEEDED) of Slack history shows a few times that this has come up before (and I remember some other instances, I think that search term just wasn't included). This is something that we've historically addressed by just increasing the size of the grpc message that's allowed, but this is an unsustainable solution.
A [cursory search](https://discuss.flyte.org/?threads%5Bquery%5D=LIMIT_EXCEEDED) of Slack history shows that this has come up a few times before (and I remember some other instances; I think that search term just wasn't included). We've historically addressed this by increasing the allowed size of the gRPC message, but that is an unsustainable solution and severely reduces the utility of large-fan-out map tasks.

## 3 Proposed Implementation
We propose configuring propeller to offload large literal collections, using the following config:

### 3.1 Offloaded Literal IDL
To the `Literal` [message](https://github.com/flyteorg/flyte/blob/cb6384ac6ea60f8b9421a71cfda4279f3579d3cb/flyteidl/protos/flyteidl/core/literals.proto#L95), add a new field called `starp` that will point to a location in the "metadata" bucket of the Flyte backend. The offloaded bytes should be deserialzable into a `Literal` object.
```go
type LiteralOffloadingConfig struct {
    Enabled bool
    // Maps flytekit SDK to minimum supported version that can handle reading offloaded literals.
    SupportedSDKVersions map[string]string
    // Default: 10 MB. Determines the size of a literal at which to trigger offloading.
    MinSizeInMBForOffloading uint64
    // Fail-fast threshold.
    MaxSizeInMBForOffloading uint64
}

```
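
As a rough illustration of how the two thresholds might partition literal sizes, here is a minimal Python sketch. It is written in Python only for brevity; the real configuration lives in propeller's Go code. The function `classify_literal_size`, the `MB` constant (2^20 bytes), the strict `>` comparisons, and the default of 1000 for the maximum are all assumptions made for this example and are not taken from the RFC.

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumption: the RFC does not say whether "MB" means 10^6 or 2^20 bytes.
MB = 1024 * 1024


@dataclass(frozen=True)
class LiteralOffloadingConfig:
    """Illustrative Python mirror of the Go struct above."""

    enabled: bool = False
    # SDK identifier -> minimum version able to read offloaded literals.
    supported_sdk_versions: Dict[str, str] = field(default_factory=dict)
    # 10 MB default, as stated in the struct comment.
    min_size_in_mb_for_offloading: int = 10
    # Hypothetical default; the RFC does not give one.
    max_size_in_mb_for_offloading: int = 1000

    def __post_init__(self) -> None:
        if self.min_size_in_mb_for_offloading > self.max_size_in_mb_for_offloading:
            raise ValueError("min offloading size must not exceed the fail-fast max")


def classify_literal_size(cfg: LiteralOffloadingConfig, size_bytes: int) -> str:
    """Return 'inline', 'offload', or 'fail' for a literal of the given size."""
    if not cfg.enabled:
        # Offloading disabled: the literal is written inline as before.
        return "inline"
    if size_bytes > cfg.max_size_in_mb_for_offloading * MB:
        # Above the fail-fast threshold.
        return "fail"
    if size_bytes > cfg.min_size_in_mb_for_offloading * MB:
        # Large enough to trigger offloading.
        return "offload"
    return "inline"
```

With offloading enabled and the 10 MB default, a literal of exactly 10 MB stays inline under these assumptions, and one byte more triggers offloading.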

Questions: How will things like metadata be handled? Should they be merged? What should be in the `value` field of the main parent Literal?
### 3.1 Offloaded Literal IDL
Update the `Literal` [message](https://github.com/flyteorg/flyte/blob/4a7c3c0040b1995a43939407b99ca3e87b1eb752/flyteidl/protos/flyteidl/core/literals.proto#L94-L114) as follows:

```protobuf
message Literal {
    oneof value {
        // A simple value.
        Scalar scalar = 1;
        // A collection of literals to allow nesting.
        LiteralCollection collection = 2;
        // A map of strings to literals.
        LiteralMap map = 3;
    }
    ...
    // ** new below this line **
    // If this literal is offloaded, this field will contain metadata including the offload location.
    string uri = 6;
    // Includes information about the size of the literal.
    uint64 size_bytes = 7;
Contributor

why is size important again? I mean there's other metadata as well (etag information). The assumption here is that size is super important so we want to be able to show that without making a head call?

Contributor Author

general consensus seemed to be this is useful, I guess it's nice for clients who want to decide whether to pull massive datasets?

}

```

### 3.2 Flyte Propeller
* When writing map task outputs, depending on the size, Propeller will need to offload the LiteralCollection after constructing it, and create a new Literal for downstream tasks to use, with the
* Also Propeller will need to check the flytekit version of the map task. If it's an older version (i.e. before the change proposed in this RFC), and it's large enough to need to be offloaded, it should fail the task. The assumption here is that if the map task is of the older version then downstream tasks will probably also be of those older versions which won't know how to resolved these offloaded literals.
Once offloading is enabled in the deployment config, flytepropeller can read from the [RuntimeMetadata](https://github.com/flyteorg/flyte/blob/f448a0358d8706a09b65b96543134f629327d755/flyteidl/protos/flyteidl/core/tasks.proto#L71-L87) in the task config to determine the SDK version.

When writing outputs in the [remote_file_output_writer](https://github.com/flyteorg/flyte/blob/2ca31119d6b9258661a71f38e450f93b6692402c/flyteplugins/go/tasks/pluginmachinery/ioutils/remote_file_output_writer.go#L56-L84), the writer should detect whether the literal size exceeds the configured minimum and:
- if the task is using a newer SDK version that supports reading offloaded literals, offload the literal to the configured storage backend and update the literal with the offload URI and size.
- if the task is using an older SDK version that doesn't support offloaded literals, fail the task with an error message indicating that the task output is too large and the user should update their SDK version.
Contributor

let's add a line to say downstream tasks also have to be upgraded? that is, if you have a reference/remote task downstream that consumes the map task output, but it hasn't been updated, then it'll fail.

Contributor Author

done, thanks
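
The offload-or-fail decision described in section 3.2 can be sketched roughly as follows. This is a Python illustration under stated assumptions, not propeller's actual Go implementation: the helper names, the `"FLYTE_SDK"` map key, the example version `1.13.3`, and the simplified version parsing (which drops pre-release suffixes) are all hypothetical.

```python
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse '1.13.3' into (1, 13, 3).

    Simplification: a pre-release suffix such as the 'b2' in '1.12.1b2'
    is dropped, and short versions are padded with zeros to 3 components.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def sdk_supports_offloading(
    supported: Dict[str, str], sdk_type: str, sdk_version: str
) -> bool:
    """True if the task's SDK is at or above the configured minimum version."""
    minimum = supported.get(sdk_type)
    if minimum is None:
        # Unknown SDKs are assumed unable to read offloaded literals.
        return False
    return parse_version(sdk_version) >= parse_version(minimum)


def decide_output_handling(
    supported: Dict[str, str],
    sdk_type: str,
    sdk_version: str,
    size_bytes: int,
    min_size_bytes: int,
) -> str:
    """Return 'write_inline', 'offload', or 'fail_task' for a task output."""
    if size_bytes <= min_size_bytes:
        return "write_inline"
    if sdk_supports_offloading(supported, sdk_type, sdk_version):
        return "offload"
    # Older SDK: fail with a message asking the user to upgrade.
    return "fail_task"
```

In this sketch the SDK type and version would come from the task's `RuntimeMetadata`, and `supported` corresponds to `SupportedSDKVersions` in the config above.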


### 3.3 Flytekit & Copilot
Flytekit and Copilot will both need to detect that a Literal has been offloaded and know to download it.
- in Flytekit this can be done by checking the `uri` field in the Literal message when converting a literal [to_python_value](https://github.com/flyteorg/flytekit/blob/e394af0be9f904fbf8be675eaa8b8cdc24311ced/flytekit/core/type_engine.py#L1134)
- in Copilot, the data downloader [literal handling](https://github.com/flyteorg/flyte/blob/5f4199899922ca63f7690c82dfca42a783db64c3/flytecopilot/data/download.go#L219-L248) should fetch the value
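
On the reading side, the check on the `uri` field could look something like the sketch below. The `Literal` dataclass is a stand-in for the protobuf message with only the fields relevant here, and `resolve_literal`, `fetch`, and `deserialize` are hypothetical names; real Flytekit code would use its own data-persistence layer and protobuf parsing instead.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Literal:
    """Stand-in for the protobuf Literal (illustrative fields only)."""

    value: Optional[bytes] = None  # stands in for the scalar/collection/map oneof
    uri: str = ""                  # set when the literal has been offloaded
    size_bytes: int = 0


def resolve_literal(
    lit: Literal,
    fetch: Callable[[str], bytes],
    deserialize: Callable[[bytes], Literal],
) -> Literal:
    """Return the literal itself, or the downloaded one if it was offloaded."""
    if not lit.uri:
        # Not offloaded: the value is carried inline.
        return lit
    # Offloaded: download the bytes and deserialize them into a Literal.
    raw = fetch(lit.uri)
    return deserialize(raw)
```

A caller would run something like `resolve_literal` before the usual conversion to a Python value, so that downstream code never sees the difference between inline and offloaded literals.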

For large outputs (like large maps of large dataclasses), Flytekit should also know how to offload the data. This should be done transparently to the user. How will propeller know to fail though if propeller hasn't been updated?
As a follow-up, we can also implement literal offloading in the SDK for conventional Python tasks, so that Flytekit itself knows how to offload the data. This should be done transparently to the user.
Contributor

should we cover that here? or is this only for map tasks for now? for the general case, we were going to go with the solution of propeller setting an environment variable that turns on offloading on the flytekit side?

for erroring if things get too big, i don't know that there's a solution. We should just add a size limit asap in flytekit right @eapolinario? Some env var based setting with a 10MB default. If a literal is more than 10MBs then error. considering we don't know when we'll get to the general case, by the time we do, most users might've already upgraded.

Contributor Author

@katrogan katrogan Aug 13, 2024

I think we're leaning towards not tackling the implementation bits for this proposal but I think it's okay to cover future work here?

Updated to include the bits about failing fast here for too-large literals, thank you!


**Open Question:** How will flytekit know to fail though if propeller hasn't been updated?

### 3.4 Other Implications
Does console need to change at all?
#### Flytekit Remote
Flytekit Remote will need to be updated to handle offloaded literals. In order to fetch offloaded literals by URI, users must now authenticate with their cloud provider on their machines using a role which has read access to the _metadata bucket_.

#### Console
Console code should show the offloaded literal URI and gracefully handle nil Literal [values](https://github.com/flyteorg/flyte/blob/4a7c3c0040b1995a43939407b99ca3e87b1eb752/flyteidl/protos/flyteidl/core/literals.proto#L96-L105).

## 4 Metrics & Dashboards

@@ -45,15 +90,17 @@ Alternate suggestions that were proposed include

* For map tasks, change the type of the output to a Union of the current user defined List and a new Offloaded type. We felt this would be a bit awkward since it changes the user-facing type itself (like if you were to pull up the map task definition in the API endpoint). It's also not extensible to other types of literals (maps of large dataclasses for example).

* Build off of the input wrapper construct that's still in PR. The idea was to have the wrapper contain in large cases, a reference to the data, and in small cases, the data itself. We didn't fully like this idea because the entire input set or output set needs to be offloaded.
* Build off of the input wrapper construct that's still in [PR](https://github.com/flyteorg/flyte/pull/4298). The idea was to have the wrapper contain in large cases, a reference to the data, and in small cases, the data itself. We didn't fully like this idea because the entire input set or output set needs to be offloaded.
  * If the task downstream of a map task takes both the output list and some other input, then after creating and uploading the large pb file for the map task's output, Propeller would need to re-upload the entire large list or map (once for each downstream task). If the offloading is done per literal, Propeller can upload once and reuse the result.
* Modify the workflow CRD to include the offloading bits so that they're respected at execution time, and serialized at registration time. This is a bit more heavy-handed than just respecting the SDK version.

## 7 Potential Impact and Dependencies

There are a couple of edge cases that will just not work.

* If the map task is of an older flytekit version but for some reason the downstream task is of a newer version, Propeller will fail unnecessarily.
* If the map task is a newer version, but the downstream task is an older version, the downstream task will fail correctly.
* If a workflow is using an older SDK version and launches a child workflow with a newer SDK version, the parent workflow will fail to resolve the child workflow outputs.

Are there concerns about the fact that if we're offloading data once, and then sharing the pointer, we're no longer copying-by-value? Does this break any of the guarantees of Flyte and will we need to be more careful in the future around other changes to avoid issues?

@@ -65,5 +112,7 @@ Is there anything around sampled data, or automatically computed actual metadata

## 9 Conclusion

*Here, we briefly outline why this is the right decision to make at this time and move forward!*
Moving to literal offloading fixes a common and frustrating pain point around map tasks. It's a relatively simple change that should have a big impact on the usability of Flyte.

```
