
[Incident] Kernel restarts and internal gateway errors for U. Texas hub #1021

Closed
choldgraf opened this issue Feb 22, 2022 · 37 comments

@choldgraf
Member

Summary

We recently deployed the UT Austin hub, which runs a database sidecar for each student (#968). Since the hub came into use, we've gotten reports of intermittent kernel crashes affecting students. These have been difficult to reproduce, but they have been happening consistently.

There was originally a FreshDesk ticket about this here: https://2i2c.freshdesk.com/a/tickets/84?lang=en

We thought we had resolved it by increasing the memory limits for each user session, but these issues have persisted. Here are the two major things that we've found:

  1. Users experiencing "kernel died" errors sporadically
  2. Users getting "internal gateway errors" when using pgadmin ([New Hub] University of Texas #968 (comment))

Here's an issue where we describe some of the problems people have, and how to reproduce them:

#1008 (comment)

Hub information

Timeline (if relevant)

All times in US/Pacific.

2022-02-17

  • @jameshowison reports via FreshDesk that student kernels are dying

2022-02-18

We conclude that this may be a problem with memory limits students are hitting during queries, and bump their guarantee here: #1008

2022-02-21

We get reports that many students are still running into this problem. It is not reliably reproducible, but people are consistently hitting these errors.

ref: #1008 (comment)

TODO: add more as new developments happen


After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

What went wrong

Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items.

Where we got lucky

These are good things that happened to us but not because we had planned for them.

Action items

These are only sample subheadings. Every action item should have a GitHub issue
(even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in infrastructure/, they can be in other repositories.

Process improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Documentation improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Technical improvements

  1. {{ summary }} [link to github issue]
  2. {{ summary }} [link to github issue]

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report are cleaned up
  • All actionable items above have linked GitHub Issues
@choldgraf
Member Author

@jameshowison - if you like, feel free to add context here about the problems you're seeing. I opened this issue so that we can track this as an ongoing incident, because it sounds more complex than expected.

@choldgraf
Member Author

Here are some potentially relevant graphs from our JupyterHub dashboard in Grafana. The following plots are just for the utexas hub... it seems like at some moments the hub response latency spikes, and I wonder whether that's related to these problems as well.

[screenshot]

@jameshowison

I think those peak times correspond with my class (which is 11-12:30 Tues/Thur, Central time).

The only approach to replicating it is to do what I'm asking the class to do:

  1. Open up a new server at https://utexas.pilot.2i2c.cloud/
  2. Import the class_music_festival github folder via: http://utexas.pilot.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fhowisonlab%2Fdatawrangling-exercises&branch=main&urlpath=lab%2Ftree%2Fdatawrangling-exercises%2Fclass_music_festival
  3. Run create_class_music_festival.ipynb
  4. Run load_class_music_festival.ipynb
  5. Run queries against the class_music_festival database, e.g.:
SELECT bands.name
FROM bands
WHERE bands.id IN (11, 16, 22, 23, 30, 48, 11, 55, 22, 16, 55, 23, 30, 100, 108, 48)

@choldgraf
Member Author

Just noting that I tried this and was able to run all of those commands without an issue. So maybe this is an error that only pops up when a lot of users are doing it at once?

@jameshowison

Yes, the issue is definitely transient.

The thing is that I don't know what to do once faced with it. If I were doing this on a server directly I'd be looking at logs, but I don't know if/how to access them. Is there some way to centralize logs for review, or could I note the username and time for later review?

I'm trying to do a screenshare with a student who hits it frequently (including at non-peak times and with a totally new instance from a new github username). That makes me wonder whether it is a client browser issue; is that possible?

I'm also going to try to give people instructions for doing queries via IPython SQL magics, in case this is a xeus-sql issue (I think it has only happened in the xeus-sql kernels, but we haven't used IPython kernels yet).
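
For anyone following along, here is a minimal sketch of what those IPython SQL magic instructions could look like, assuming the ipython-sql extension is available in the hub image; the connection URL below is a placeholder, not the hub's actual configuration:

# In a notebook running the regular Python (IPython) kernel, not xeus-sql.
# Load ipython-sql and cap result sizes so a stray SELECT * can't flood the browser.
# The connection URL is a placeholder; adjust to however the sidecar is exposed.
%load_ext sql
%config SqlMagic.autolimit = 1000      # stop fetching after this many rows
%config SqlMagic.displaylimit = 100    # only render this many rows in the output
%sql postgresql://localhost/class_music_festival

# Then, in a separate cell, run queries with the %%sql cell magic:
%%sql
SELECT bands.name FROM bands WHERE bands.id IN (11, 16, 22, 23, 30, 48);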

@jameshowison

I just did a screen share with the user on the hub with the github username lkiil. She is getting the error consistently when running commands in the homework.ipynb file. Perhaps someone can get into that server and take a look?

I tried using the built-in terminal to find logs, but no luck. Googling suggested sudo journalctl -u jupyter would help, but no dice (sudo is not installed), so I'm guessing one would need to ssh into the container for that to be useful.

@jameshowison

I was able to use the admin interface to connect to the server for lkiil and I can confirm that I'm seeing the error when I execute cells in my browser. Can't see any logs via the notebook terminal, though.

@yuvipanda
Member

yuvipanda commented Feb 23, 2022

I'm sorry you're running into this, @jameshowison - definitely frustrating. Thank you for helping us debug as we go along!

Notebook logs are written to the ~/.jupyter-server-log.txt file - you can check that out. I looked and found that xeus-sql was crashing with the error message terminate called after throwing an instance of 'char const*'. I used our Google Cloud logging (query) to search for that particular error message in all notebook logs, and found that it has been happening for a while (just looking at the last 7 days):

[screenshot]

I think if we could actually get the value of the char const* we might be able to get a better idea of what the error is. I've opened jupyter-xeus/xeus-sql#60 to let the xeus-sql developers know.
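
If it helps anyone else poking at this, here is a rough sketch of scanning that log for the crash message from a notebook or its terminal; the path and error string come from the comment above, everything else is illustrative:

# Rough sketch: scan the single-user server log for the xeus-sql crash message.
# Log path and error string are taken from the comment above.
from pathlib import Path

log_path = Path.home() / ".jupyter-server-log.txt"
needle = "terminate called after throwing an instance of 'char const*'"

for lineno, line in enumerate(log_path.read_text(errors="replace").splitlines(), start=1):
    if needle in line:
        print(f"{lineno}: {line}")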

@yuvipanda
Member

On the postgresql container side, I can see that there's an error message saying 'FATAL: terminating connection due to administrator command' (query), and its frequency looks like this:

[screenshot]

Looking around, it looks like this might be caused by an external process sending SIGTERM to the database (https://www.postgresql.org/message-id/4564.1284559661@sss.pgh.pa.us). In our case, that might be because postgres is hitting a memory limit. I'm going to bump the limit up and see if that helps - earlier we had only bumped up the limit for the data the kernel can handle, but now I am going to do the same for the database itself.

@yuvipanda
Member

But looking at memory usage for the postgres container itself,

[screenshot]

The current memory limit is 512Mi, and we are nowhere near touching that.
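
For cross-checking, one could also read the cgroup accounting from inside the container itself. A minimal sketch, assuming cgroup v1 (under cgroup v2 the files are memory.max and memory.current instead):

# Sketch: compare this container's cgroup memory limit with its current usage.
# Assumes cgroup v1 paths; adjust for cgroup v2 (memory.max / memory.current).
def read_cgroup_bytes(path):
    with open(path) as f:
        return int(f.read().strip())

limit = read_cgroup_bytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
usage = read_cgroup_bytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
print(f"{usage / 2**20:.1f} MiB used of {limit / 2**20:.1f} MiB limit "
      f"({100 * usage / limit:.1f}%)")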

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Feb 23, 2022
Increase postgresql limits by a bit - it is possible that
the kernel deaths are caused by the postgres container hitting
its memory limit and dying. The graph in
2i2c-org#1021 (comment)
makes it look like it isn't hitting the limit, but perhaps node
memory pressure is making it get killed just past the request
(which was 64MB) - so the bump might still help.
@yuvipanda
Member

I've also installed jupyterlab-limit-output - while it seems to work in Python, it seems to have no effect on xeus-sql kernel notebooks :(

[screenshot]

@yuvipanda
Member

@jameshowison I've brought in a newer version of xeus-sql with jupyter-xeus/xeus-sql#60 merged, so hopefully the error logs are more helpful next time.

@jameshowison
Copy link

Great. I can just ask everyone to stop and start their server, right? And that will update everything?

@yuvipanda
Member

@jameshowison nobody else is on, so I think everyone (including you!) has the newest version already.

I'm investigating the pgadmin issue. Not sure why that came back...

@yuvipanda
Member

yuvipanda commented Feb 23, 2022

@jameshowison ok, for the large outputs freezing the browser, I have a partial solution - use the classic notebook. I've enabled https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/limit_output, which seems to work well.

Switch to the classic notebook interface using the interface switcher at the top right of JupyterLab:

[screenshot]

[screenshot]

If I then run SELECT * FROM tickets, I don't crash the browser! Instead, I see at the end of my scroll:

[screenshot]

You can also modify your nbgitpuller links to point to the classic notebook instead of JupyterLab if you would like, although then you lose a lot of the benefits you get with JupyterLab.

@yuvipanda
Member

There are basically three issues:

  1. Kernel dying
  2. Browser freezing
  3. pgadmin4 not working

Today, I've put in place fixes for (1) and (2), but not (3). Intermittent issues are always a bit of a mess to debug... I'm hopeful I can put in some more time for (3) tomorrow.

@yuvipanda
Member

I've also opened deshaw/jupyterlab-limit-output#2 to add support for xeus-sql in jupyterlab-limit-output.

@choldgraf
Member Author

This is really excellent investigative engineering @yuvipanda :-)

Would it be helpful for us to document the jupyterlab-limit-output extension as well? I'm happy to submit a PR if we think it'll be common-enough to put in docs/

(as an aside, what a strange world we live in where we can just submit a patch to a random JupyterLab extension developed by DEShaw)

@jameshowison

Thanks all. Status going into class tomorrow is that we have bumped memory on postgres and improved logging with xeus-sql. I intend to use xeus-sql, with %%sql magics as a fallback. If we encounter issues I'll take a look at the jupyter logs and flag the user accounts here (so we have a chance of seeing them before the container shuts down after class).

I don't read that we have a definite diagnosis of what was causing it; the best assessment is some resource spike on the postgres container, right? My only input there is that I think I was always able to query the postgres sidecar, but I'll check for certain from the terminal if issues show up again. Perhaps the outage was just transitory: xeus-sql had already terminated, and by the time it came back up postgres was responsive again?
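
A minimal sketch of the kind of check that could be run from a notebook or terminal the next time the error appears (the connection parameters are placeholders for however the sidecar is actually exposed to the user container):

# Sketch: check whether the postgres sidecar is reachable, and when it last started
# (a very recent start time would suggest the database was killed and restarted).
# Connection parameters are placeholders; adjust to the hub's actual settings.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="class_music_festival", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_postmaster_start_time(), now()")
    started, now = cur.fetchone()
    print(f"postgres reachable; running since {started} (up {now - started})")
conn.close()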

One student who was having issues does say that she hasn't seen them in the second half of the day today, which is hopeful!

@yuvipanda
Member

@jameshowison yeah, current working diagnosis is that it's a resource limit issue on the postgres database. However, the hope is that the additional debug logging in xeus-sql can give us a better answer for why the kernel is crashing - with the previous version, that information was being lost. I'm hopeful that if the kernels crash again, we'll be investigating less blindly.

@jameshowison

This morning, while prepping for class, I saw a new error: "Invalid Response: 504".

[screenshot]

The log shows:

[E 2022-02-08 16:56:10.202 SingleUserNotebookApp mixins:538] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 536, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2022-02-08 16:56:10.975 SingleUserNotebookApp mixins:538] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 536, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2022-02-08 16:56:11.132 SingleUserNotebookApp mixins:538] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/srv/conda/envs/notebook/lib/python3.7/site-packages/jupyterhub/singleuser/mixins.py", line 536, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden

@choldgraf
Member Author

Hmm - does that error persist? Do you know if others have run into it?

@jameshowison

It's only me working with it at present, but I'm working in ipython rather than xeus-sql, so I have a backup if it does.

@jameshowison

No sign of kernel issues during class today. No repeat of the "Invalid Response" error. I will report back as students do their assignments over the remainder of the week and the weekend.

@yuvipanda
Member

@jameshowison GREAT TO HEAR! Is that with xeus-sql or with the ipython magic?

@jameshowison

xeus-sql, so that's good!

@jameshowison

I've run into another issue. A student's xeus-sql ipynb file is making the server unresponsive. The file is 33.9 MB, so I suspected that it contains a number of queries that returned very large results. I downloaded the file and, yes, it has some very large output cells.

https://utexas.pilot.2i2c.cloud/user/andreakxn/lab/workspaces/auto-S/tree/datawrangling-exercises/class_music_festival/Untitled.ipynb

I'm just not sure how to handle this. I have asked that they include LIMIT 100 or something as they work, but we need some guard rails here.

There is a discussion of options for limiting results on the xeus-sql side: jupyter-xeus/xeus-sql#59
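
Until something lands upstream, a crude guard rail on the notebook side could look something like the sketch below (run_limited is a hypothetical helper, the connection URL is a placeholder, and the string check is deliberately naive):

# Crude guard rail: wrap queries so unbounded SELECTs get a LIMIT appended.
# run_limited() is a hypothetical helper; the "limit" check is intentionally naive.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://localhost/class_music_festival")  # placeholder

def run_limited(query, max_rows=100):
    if "limit" not in query.lower():
        query = f"{query.rstrip().rstrip(';')} LIMIT {max_rows}"
    return pd.read_sql(query, engine)

run_limited("SELECT * FROM tickets")   # returns at most 100 rows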

@jameshowison

btw, I removed that large file, but I have a copy if it is ever needed for debugging.

@jameshowison

The "Invalid Response: 504" error occurred again this morning, immediately after logging in and creating a new xsql file. Choosing "Restart" and the new xsql kernel seemed to work.

@choldgraf
Member Author

hmmm - maybe it's a race condition of some kind? This only ever happens right after logging in and running for the first time?

@jameshowison

Yes, so far I've only seen it on startup. Will report if I see it at other times.

@jameshowison

jameshowison commented Mar 1, 2022

I still have this issue of the Bad Gateway popping up. Sometimes it goes away when I tell it to restart xsql, but just now I had to select "No Kernel" and then start an xsql kernel.

Scroll down in http://howisonlab.github.io/datawrangling/class_modules/sql_queries/01_intro/ to see what I mean

@sgibson91
Member

@2i2c-org/tech-team What do we need to move this forward? Do we need the ~/.jupyter-server-log.txt from the next time a Bad Gateway error appears, or is there another path?

@choldgraf
Member Author

Maybe @yuvipanda has some ideas here, since he set this one up. I seem to remember he said this suggests there is a websockets issue, but I'm not sure how to debug that. I did find a Stack Overflow post with the same error message, and they mention websockets in there as well: https://stackoverflow.com/questions/57297987/jupyterhub-kernel-connection-returns-http504-gateway-timeout

@sgibson91
Member

I thought the websockets issue was related to this ticket? https://discourse.pangeo.io/t/google-deployment-stuck/2280

@yuvipanda
Member

I think the current state is:

  1. xeus-sql is producing some weird 504 gateway errors
  2. ipython-sql works alright and doesn't cause any issues
  3. xeus-sql sometimes produces results that are too large and freeze user browsers
  4. pgadmin4 is still broken

Is this an accurate summary of the current status, @jameshowison? If ipython-sql works fine, then perhaps the longer-term approach is to move towards making sure ipython-sql works well, and to fix pgadmin4...

@choldgraf
Member Author

Hey all - I am going to close out this issue, since I think the hub is in a good-enough steady state and doesn't require active changes. We can spot-check improvements via support channels and new issues.
