[fix](load) Fix the issue of high-concurrency single-replica load getting stuck #42297

Merged
liaoxin01 merged 1 commit into apache:master from the single_replica_load branch on Oct 23, 2024

Contributor

@liaoxin01 liaoxin01 commented Oct 22, 2024

In high-concurrency single-replica load, tablet_writer_add_block RPCs can occupy every thread in the _heavy_work_pool, leaving no thread available to process the response_slave_tablet_pull_rowset RPC. Each tablet_writer_add_block call then keeps waiting for a slave-tablet response that cannot be handled, so the load hangs until it times out.

response_slave_tablet_pull_rowset is relatively lightweight, so it can be handled by the _light_work_pool.
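
The starvation described above can be reproduced in miniature. The sketch below is not the Doris code or the actual patch: it uses Python executors as stand-ins for the two work pools, and `simulate_load`, `add_block`, `respond_on_light_pool`, and `timeout_s` are hypothetical names chosen for illustration. It assumes only what the description states: the request handler blocks on a worker thread until a separate response RPC has been processed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def simulate_load(heavy_threads, respond_on_light_pool, timeout_s=0.5):
    """Return True if every simulated add_block call got its response in time.

    Hypothetical model: one add_block request per heavy worker, so the
    heavy pool is fully occupied while the responses arrive.
    """
    heavy_pool = ThreadPoolExecutor(max_workers=heavy_threads)
    light_pool = ThreadPoolExecutor(max_workers=1)
    response_pool = light_pool if respond_on_light_pool else heavy_pool
    # No request proceeds until every heavy worker is busy with add_block.
    all_busy = threading.Barrier(heavy_threads)

    def add_block():
        all_busy.wait()
        got_response = threading.Event()
        # The slave's reply is a separate RPC that needs its own worker thread.
        response_pool.submit(got_response.set)
        # Block this heavy worker until the reply is processed or we time out.
        return got_response.wait(timeout_s)

    futures = [heavy_pool.submit(add_block) for _ in range(heavy_threads)]
    results = [f.result() for f in futures]
    heavy_pool.shutdown()
    light_pool.shutdown()
    return all(results)


if __name__ == "__main__":
    print("responses on heavy pool:", simulate_load(4, respond_on_light_pool=False))
    print("responses on light pool:", simulate_load(4, respond_on_light_pool=True))
```

When the responses share the saturated pool, they sit in its queue behind the blocked handlers, so at least one handler can only exit by timing out. Routing the responses to a separate pool removes that dependency, which is the idea behind handling response_slave_tablet_pull_rowset in the _light_work_pool.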

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@liaoxin01
Contributor Author

run buildall

Contributor

@dataroaring dataroaring left a comment

LGTM

Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 23, 2024
@dataroaring dataroaring added dev/2.1.x and removed approved Indicates a PR has been approved by one committer. labels Oct 23, 2024
Contributor

PR approved by anyone and no changes requested.

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.47% (9708/25910)
Line Coverage: 28.72% (80500/280271)
Region Coverage: 28.16% (41642/147884)
Branch Coverage: 24.71% (21146/85562)
Coverage Report: http://coverage.selectdb-in.cc/coverage/19451b63d4587efaa0581b00afb4a254b6cf1ea9_19451b63d4587efaa0581b00afb4a254b6cf1ea9/report/index.html

@doris-robot

TPC-H: Total hot run time: 41560 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 19451b63d4587efaa0581b00afb4a254b6cf1ea9, data reload: false

------ Round 1 ----------------------------------
q1	17564	7418	7336	7336
q2	2030	294	289	289
q3	11741	1073	1160	1073
q4	10573	875	1010	875
q5	7744	3140	3070	3070
q6	243	154	150	150
q7	1029	601	596	596
q8	9360	1981	1929	1929
q9	6608	6492	6395	6395
q10	7034	2445	2450	2445
q11	446	238	244	238
q12	424	228	223	223
q13	17766	3019	3016	3016
q14	239	214	218	214
q15	563	529	513	513
q16	657	580	603	580
q17	973	500	563	500
q18	7310	6681	6850	6681
q19	1345	1055	1021	1021
q20	473	184	187	184
q21	3999	3228	3272	3228
q22	1092	1012	1004	1004
Total cold run time: 109213 ms
Total hot run time: 41560 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7353	7307	7290	7290
q2	327	239	239	239
q3	3065	3012	2921	2921
q4	2100	1902	1785	1785
q5	5784	5797	5782	5782
q6	238	145	147	145
q7	2268	1840	1814	1814
q8	3437	3488	3518	3488
q9	8982	8927	8880	8880
q10	3601	3589	3584	3584
q11	598	480	490	480
q12	873	694	641	641
q13	9840	3212	3162	3162
q14	300	303	278	278
q15	579	520	524	520
q16	675	654	640	640
q17	1867	1640	1617	1617
q18	8301	7893	7641	7641
q19	1740	1447	1477	1447
q20	2131	1868	1922	1868
q21	5663	5523	5531	5523
q22	1125	1098	1066	1066
Total cold run time: 70847 ms
Total hot run time: 60811 ms

@doris-robot

TPC-DS: Total hot run time: 192797 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 19451b63d4587efaa0581b00afb4a254b6cf1ea9, data reload: false

query1	928	383	396	383
query2	6235	2159	2103	2103
query3	8686	199	211	199
query4	34279	23677	23597	23597
query5	3431	468	451	451
query6	264	168	162	162
query7	4208	289	287	287
query8	291	241	228	228
query9	9301	2610	2629	2610
query10	466	274	287	274
query11	17931	15225	15343	15225
query12	144	102	101	101
query13	1604	429	397	397
query14	9123	7416	7258	7258
query15	248	168	171	168
query16	7364	455	432	432
query17	1619	609	590	590
query18	2090	322	315	315
query19	355	154	159	154
query20	135	121	116	116
query21	216	116	113	113
query22	4685	4754	4464	4464
query23	35155	34693	34498	34498
query24	11157	2731	2736	2731
query25	626	408	398	398
query26	1181	157	163	157
query27	2430	282	288	282
query28	7235	2382	2401	2382
query29	785	427	439	427
query30	278	153	156	153
query31	1063	829	806	806
query32	104	56	55	55
query33	765	298	297	297
query34	930	511	524	511
query35	884	747	744	744
query36	1082	931	958	931
query37	149	93	89	89
query38	4013	3951	3966	3951
query39	1476	1427	1397	1397
query40	200	99	99	99
query41	51	47	46	46
query42	126	103	102	102
query43	538	499	500	499
query44	1261	807	798	798
query45	197	166	168	166
query46	1139	691	727	691
query47	1959	1813	1832	1813
query48	427	310	332	310
query49	956	424	429	424
query50	818	381	382	381
query51	7175	7013	6947	6947
query52	103	91	84	84
query53	255	179	179	179
query54	1182	427	428	427
query55	83	77	79	77
query56	279	281	290	281
query57	1286	1157	1145	1145
query58	233	245	231	231
query59	3254	3226	3128	3128
query60	285	256	275	256
query61	109	100	101	100
query62	851	682	656	656
query63	234	189	182	182
query64	4093	632	629	629
query65	3291	3194	3209	3194
query66	727	300	307	300
query67	15866	15757	15956	15757
query68	4868	552	554	552
query69	537	299	293	293
query70	1209	1172	1130	1130
query71	373	276	271	271
query72	7345	3980	3983	3980
query73	773	352	350	350
query74	10276	9078	9089	9078
query75	3454	2684	2722	2684
query76	2947	899	952	899
query77	612	294	288	288
query78	10550	9691	9637	9637
query79	1799	587	586	586
query80	1046	466	445	445
query81	583	249	238	238
query82	723	143	134	134
query83	285	134	134	134
query84	273	68	69	68
query85	1330	300	277	277
query86	426	301	304	301
query87	4450	4361	4377	4361
query88	3463	2178	2161	2161
query89	412	290	284	284
query90	1836	185	183	183
query91	138	104	99	99
query92	63	48	49	48
query93	2015	533	537	533
query94	766	289	290	289
query95	340	247	247	247
query96	617	284	290	284
query97	3355	3130	3194	3130
query98	222	201	195	195
query99	1682	1317	1300	1300
Total cold run time: 298662 ms
Total hot run time: 192797 ms

@doris-robot

ClickBench: Total hot run time: 32.52 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 19451b63d4587efaa0581b00afb4a254b6cf1ea9, data reload: false

query1	0.04	0.03	0.03
query2	0.06	0.03	0.02
query3	0.22	0.06	0.06
query4	1.64	0.10	0.10
query5	0.52	0.52	0.52
query6	1.15	0.72	0.72
query7	0.02	0.02	0.02
query8	0.04	0.03	0.04
query9	0.56	0.52	0.50
query10	0.55	0.55	0.56
query11	0.15	0.10	0.11
query12	0.14	0.11	0.10
query13	0.60	0.61	0.60
query14	2.70	2.76	2.90
query15	0.90	0.83	0.84
query16	0.39	0.39	0.38
query17	1.06	1.04	1.00
query18	0.20	0.19	0.20
query19	1.91	1.86	2.04
query20	0.01	0.01	0.01
query21	15.38	0.59	0.57
query22	2.88	1.61	2.27
query23	17.25	0.86	0.76
query24	3.28	1.47	1.15
query25	0.25	0.05	0.05
query26	0.50	0.14	0.14
query27	0.08	0.04	0.05
query28	10.04	1.11	1.07
query29	12.55	3.21	3.15
query30	0.24	0.06	0.06
query31	2.87	0.37	0.37
query32	3.31	0.46	0.46
query33	2.97	3.04	3.02
query34	16.90	4.46	4.42
query35	4.51	4.49	4.52
query36	0.65	0.51	0.49
query37	0.08	0.06	0.05
query38	0.04	0.03	0.03
query39	0.03	0.02	0.02
query40	0.15	0.13	0.12
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.97 s
Total hot run time: 32.52 s

Contributor

@sollhui sollhui left a comment

LGTM

Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 23, 2024
@liaoxin01 liaoxin01 merged commit e338e1d into apache:master Oct 23, 2024
26 of 29 checks passed
@liaoxin01 liaoxin01 deleted the single_replica_load branch October 23, 2024 06:29
liaoxin01 added a commit to liaoxin01/doris that referenced this pull request Oct 23, 2024
…ting stuck (apache#42297)

In high-concurrency single-replica load, the tablet_writer_add_block RPC
may occupy the _heavy_work_pool completely, causing the
response_slave_tablet_pull_rowset RPC to have no available threads for
processing. As a result, tablet_writer_add_block waits indefinitely for
a response from the slave tablet, leading to the import getting stuck
until it times out.

response_slave_tablet_pull_rowset is relatively lightweight, so it can
be handled by the _light_work_pool.
liaoxin01 added a commit that referenced this pull request Oct 23, 2024
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.7-merged dev/3.0.3-merged reviewed
4 participants