agent: enable TCP keepalive on database connections #1676

psFried · 2024-10-03T15:16:43Z

We've seen an issue where the database is waiting on the agent to send prepared statement parameters, which never arrive. While it's not clear what exactly caused this, enabling TCP keepalive might at least reduce the possibility of that sort of thing happening due to a broken connection.

Unfortunately, sqlx doesn't currently have a way to enable keepalive, so we'll need to wait until launchbadge/sqlx#3540 is resolved before we can address this.

abonander · 2024-10-03T17:31:04Z

> We've seen an issue where the database is waiting on the agent to send prepared statement parameters, which never arrive.

You didn't mention this in the linked issue.

This may be due to cancellation. The async methods on PgConnection are generally not cancel-safe, so if a call is canceled in the middle of preparing a statement, it won't know to resume the process.

It has some ability to recover, but how well depends a lot on where in the process the call is canceled.
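To make the failure mode concrete, here is a minimal std-only sketch of a call being canceled partway through a two-step exchange. Everything in it (`FakeConn`, `WireState`, `YieldOnce`, `cancel_mid_execute`) is hypothetical and is not sqlx's actual implementation; it only assumes that "cancellation" means the future is dropped before it finishes.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a connection's wire state (not sqlx's real type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum WireState {
    Idle,
    /// Statement sent; the server is now waiting for its parameters.
    AwaitingParams,
}

struct FakeConn {
    state: WireState,
}

/// Future that yields once, standing in for a socket write that is not ready.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

impl FakeConn {
    /// Two-step exchange: send the statement, then send the parameters.
    async fn execute(&mut self) {
        self.state = WireState::AwaitingParams; // step 1 done
        YieldOnce(false).await; // cancellation can land here
        self.state = WireState::Idle; // step 2 done
    }
}

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `execute` once, then drops it, which is how async cancellation works.
fn cancel_mid_execute(conn: &mut FakeConn) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(conn.execute());
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    // `fut` is dropped here without ever running step 2.
}
```

After `cancel_mid_execute` returns, the connection is still in `AwaitingParams`: nothing in the dropped future gets a chance to finish or roll back the exchange, which is the kind of state the server side would see as "waiting on the client".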

I can glance through your code and see if there are any obvious sources of cancellation issues.

abonander · 2024-10-03T18:37:58Z

I see you're using Axum. It's great, but one thing to note is that it does cancel the task if the client disconnects before the response is returned. This is the expected behavior, and actually happens inside Hyper, not Axum itself.

If this is the cause, and the connection is left in a state where the database is stuck waiting for a message that will never come, the connection will stall at this Connection::ping() call, in the task that checks the connection before returning it to the pool: https://github.com/launchbadge/sqlx/blob/19f40d87a669e41081a629a469359d341c9223d6/sqlx-core/src/pool/connection.rs#L314

This would result in connections being effectively leaked from the pool, and future calls returning Error::PoolTimedOut. Come to think of it, this is probably the major source of that error for a lot of people.

The thing is, TCP keepalive wouldn't save you here anyway. That's handled entirely at the transport layer, completely hidden from the application. If the database is stalled on a read but the server's kernel is still properly managing the socket, it'll answer the keepalive messages and the connection won't time out.

What we need to do is wrap that ping() call, or the whole return_to_pool() call, in a timeout() call that prevents a stall from permanently leaking the connection.

In the meantime, you could probably mitigate this by setting an after_release callback that just does connection.ping() wrapped in tokio::time::timeout(). If the connection is stalled, it'll catch it. Returning Err (or Ok(false) if you don't want the error to be logged) from the callback causes the connection to be immediately closed.
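The decision logic of that mitigation can be sketched with only the standard library. This is not sqlx or tokio code: a helper thread plus `recv_timeout` stands in for `tokio::time::timeout`, the `ping` closure stands in for `connection.ping()`, and `healthy_within` is a hypothetical name. The returned boolean mirrors the callback's `Ok(true)` (keep) versus `Ok(false)` (close).

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `ping` on a helper thread and waits at most `limit` for its result.
/// Returns `true` if the connection may be kept, `false` if it should be closed.
fn healthy_within<F>(ping: F, limit: Duration) -> bool
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller has already given up, the send fails and is ignored.
        let _ = tx.send(ping());
    });
    match rx.recv_timeout(limit) {
        Ok(Ok(())) => true,
        // A failed ping and an expired deadline both mean "close it".
        Ok(Err(_)) | Err(_) => false,
    }
}
```

A ping that answers promptly keeps the connection, while one that stalls past the deadline or returns an error marks it for closing, so a stuck connection is dropped instead of being held indefinitely. In an async codebase the helper thread would not be needed, since the timeout wraps the ping future directly.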

abonander · 2024-10-03T18:44:05Z

My ultimate plan is to fix cancellation issues for good by just having the connection state machine run on a background task. This is what a ton of other crates do.

We originally didn't want it to work that way because we thought it might be more efficient for the connection logic to execute directly on the calling task, but it didn't really turn out that way.
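As a rough illustration of the background-task design, the sketch below uses a std thread and channels in place of an async task. It is an assumption-laden toy, not the planned sqlx design: `Request`, `ConnHandle`, `spawn_connection`, and `submit` are all hypothetical names, and the "connection" just formats a string.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical request: a query plus a channel for sending the reply back.
struct Request {
    query: String,
    reply: mpsc::Sender<String>,
}

/// Handle that callers hold; the connection itself lives on the worker thread.
#[derive(Clone)]
struct ConnHandle {
    tx: mpsc::Sender<Request>,
}

/// Spawns the worker that owns the (simulated) connection.
fn spawn_connection() -> ConnHandle {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        let mut completed = 0u32;
        // Each request runs to completion before the next one starts.
        for req in rx {
            completed += 1;
            let response = format!("#{}: ran `{}`", completed, req.query);
            // The caller may have gone away; that must not disturb the worker.
            let _ = req.reply.send(response);
        }
    });
    ConnHandle { tx }
}

impl ConnHandle {
    /// Queues a query and returns the receiver for its reply.
    fn submit(&self, query: &str) -> mpsc::Receiver<String> {
        let (reply, rx) = mpsc::channel();
        self.tx
            .send(Request { query: query.to_string(), reply })
            .expect("worker thread is running");
        rx
    }
}
```

Dropping the receiver returned by `submit` plays the role of a canceled caller: the worker still finishes that exchange, discards the reply, and serves the next request from a clean state.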

psFried · 2024-10-21T20:58:28Z

@abonander thanks for your comments and help here. We had some confusion about the query in pg_stat_activity, which made us think that the database was waiting on the client. But in fact the query column just shows the last query that was run, even after it's completed. So the specific issue we were seeing turned out to have nothing to do with the database or sqlx at all.

So I'm closing this issue for now, since I don't see a specific need for TCP keepalive. We'd definitely enable it if keepalive ends up being supported in sqlx, but it no longer seems at all urgent.

psFried added the chore (Internal cleanups, refactoring, or improvements) and control-plane labels Oct 3, 2024

abonander mentioned this issue Oct 15, 2024

feat: add TCP keepalive for MySQL and PostgresSQL. launchbadge/sqlx#3559

Open

psFried closed this as completed Oct 21, 2024
