FreeBSD exhibits similar behavior to OSX for socket timeouts #14

Open
notsonic opened this issue Nov 13, 2022 · 8 comments

@notsonic

I just set up the tools as a web server in a jail on my TrueNAS server, which runs FreeBSD.

My cabinets were rapidly rebooting when trying to send games. I found this PR (#10) and swapped out "darwin" for "freebsd12" and all is well. I see that there's an env var to trigger the behavior as well.

Maybe raising an issue isn't entirely necessary since I don't really have a problem, but I figured it might be worthwhile to have this in the repo's history in case someone else comes across it.
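For later readers, the kind of platform check described above can be sketched as follows. This is only an illustration, not the code in the repository: the function name and the `EXAMPLE_SOCKET_WORKAROUND` variable are hypothetical stand-ins (the thread does not name the real env var). The one grounded detail is that `sys.platform` carries a version suffix on FreeBSD (`freebsd12` here), so a prefix match is safer than an equality test.

```python
import os
import sys


def use_bsd_socket_workaround(platform=sys.platform, environ=os.environ):
    """Decide whether to take the alternate socket code path.

    Hypothetical helper: the env var name below is made up for this sketch.
    """
    # Manual override, mirroring the idea of an env var that forces the behavior.
    if environ.get("EXAMPLE_SOCKET_WORKAROUND"):
        return True
    # sys.platform is "darwin" on macOS and "freebsd<major>" on FreeBSD,
    # so match on the prefix rather than comparing the whole string.
    return platform.startswith(("darwin", "freebsd"))
```

With a prefix match, `freebsd12`, `freebsd13`, and later releases all take the same path without further edits.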

@DragonMinded
Owner

Maybe we should just make that check for either? It's kind of a pain, because I want it to quickly figure out when a game has been turned off remotely, which means you need a timeout, but BSD/Darwin breaks that. Could also move to a dedicated thread with no timeout and a monitor that nukes it when there isn't motion for a while? Dunno. The original script I upgraded worked "better" because it didn't try to be fully in control of the process, but then you lose the ability to treat the game like a kiosk and ensure it is running.
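The "dedicated thread with no timeout plus a monitor" idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `ActivityWatchdog` class does not exist in the project, and it relies on `shutdown()` waking a `recv()` that is blocked in another thread, which holds on Linux but has not been verified on BSD or Darwin in this thread.

```python
import socket
import threading
import time


class ActivityWatchdog:
    """Hypothetical monitor: shuts a socket down after idle_limit seconds without touch()."""

    def __init__(self, sock, idle_limit):
        self._sock = sock
        self._idle_limit = idle_limit
        self._last = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.fired = False
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def touch(self):
        # Call this whenever the transfer makes progress.
        with self._lock:
            self._last = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Poll a few times per idle period until stopped.
        while not self._stop.wait(self._idle_limit / 4):
            with self._lock:
                idle = time.monotonic() - self._last
            if idle > self._idle_limit:
                self.fired = True
                try:
                    # shutdown() (unlike close()) unblocks a pending recv().
                    self._sock.shutdown(socket.SHUT_RDWR)
                except OSError:
                    pass  # Socket already closed or never connected.
                return
```

The worker would use a plain blocking socket and call `touch()` after each successful chunk; when the watchdog fires, the blocked call returns (an empty read, or an error on send) and the state machine can go back to waiting.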

@notsonic
Author

I don't fully understand the implications of the different timeout code paths, to be honest.
The behavior for my cabinets seems fine. If I turn them on, the game boots. If I change games while one is running, they receive the new game. The cabinet status is accurately reflected the whole time (maybe with a bit of delay). Is there some behavior lost without the fast timeout?

@DragonMinded
Owner

The idea behind setting a timeout was so that a stalled connection due to a device going offline mid-send could be detected. On some systems, sockets hang forever in that state, and that means you never return control out of the send or recv call. I could experiment with killing the timeout altogether (like the old system had) and seeing whether it still behaves correctly, at least on Linux. I think that might fix things across the board, but it might also have the side effect of getting the state machine stuck.
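As a minimal sketch of the stall detection described above (not the project's actual code; `recv_or_none` is a hypothetical helper), a Python-level timeout turns a hung `recv` into an exception the caller can act on:

```python
import socket


def recv_or_none(sock, nbytes, timeout):
    """Return received bytes, or None if nothing arrived within timeout seconds."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The peer went quiet; the caller can reset its state machine here.
        return None
```

Without the timeout, the same `recv` call on a blocking socket simply never returns while the peer is powered off, which is the "stuck at the hung percentage" case.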

@notsonic
Author

I turned one of my cabinets off while it was loading (again, I'm using the server setup) and I could see that the status hung at the same percentage in the web interface. After turning it back on again, it restarted from 0% and seems to have transferred the game successfully. I don't know if this means there's a dangling thread from the previous boot.

Is the 1 or 10 second timeout maybe just too aggressive? I'll try out some different values and report back.

@DragonMinded
Owner

That's exactly the issue that the timeout was attempting to fix. I didn't want it hung forever (basically until the next time the cab was powered) sitting at the hung percentage. I wanted the state machine to be able to go back to "waiting for cabinet power".

A 1 second timeout is FAR FAR too aggressive. Are you netbooting a Chihiro/Triforce? Try a larger timeout. 10 seconds seems fine for Naomi.

@notsonic
Author

Hey, sorry, it's a Naomi. The 1 second timeout I was referring to is this one here: https://github.com/DragonMinded/netboot/blob/trunk/netdimm/netdimm.py#L341

I've been messing with these 3 lines of code but I haven't really used the sockets lib before. Would you be able to explain them (341-343)?

Changing the timeout values doesn't seem to do anything. It really seems like the major difference is setting it to blocking.

I noticed in the docs that settimeout changed with 3.7, I'm on 3.9. Is this relevant?

Changed in version 3.7: The method no longer toggles [SOCK_NONBLOCK](https://docs.python.org/3/library/socket.html#socket.SOCK_NONBLOCK) flag on [socket.type](https://docs.python.org/3/library/socket.html#socket.socket.type).
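For reference, the documented relationship between `settimeout` and `setblocking` in Python 3.7+ can be checked directly. The sketch below is generic and does not reproduce lines 341-343 of `netdimm.py`; `timeout_modes` is a hypothetical helper that only reports what the standard library returns.

```python
import socket


def timeout_modes():
    """Report (gettimeout(), getblocking()) after each mode-setting call."""
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(None)   # same as setblocking(True)
        results["settimeout(None)"] = (sock.gettimeout(), sock.getblocking())
        sock.settimeout(0)      # same as setblocking(False)
        results["settimeout(0)"] = (sock.gettimeout(), sock.getblocking())
        sock.settimeout(10)     # "timeout mode": still reported as blocking
        results["settimeout(10)"] = (sock.gettimeout(), sock.getblocking())
        sock.setblocking(True)  # clears any timeout that was set before
        results["setblocking(True)"] = (sock.gettimeout(), sock.getblocking())
    return results
```

The practical consequence is that the order of the calls matters: a `setblocking(True)` issued after `settimeout(10)` silently removes the timeout, and a later `settimeout` replaces whatever `setblocking` chose.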

@DragonMinded
Owner

Oh, good catch, that would definitely screw things up. Setting the timeout used to also go along with blocking implications. Hmmm, ugh. I really don't know. It's basically impossible to try to test all permutations of Linux/OSX/BSD with Naomi/Triforce/Chihiro, especially given I don't have any Chihiros, Triforces or native BSD devices.

@notsonic
Author

If it were me, I just wouldn't support BSD lol. I'm only using it because I already had the TrueNAS server running. I could just run this in a Linux VM instead of a jail.

I assume the difference comes down to native socket implementation differences between Linux and BSD. They must have different defaults or something. I tried using socket.setsockopt to set the timeouts SO_RCVTIMEO and SO_SNDTIMEO (socket.settimeout is something that's in the Python layer only, apparently) and that didn't seem to do anything.
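For anyone repeating this experiment, here is a hedged sketch of setting the kernel-level timeouts. The helper names are hypothetical, and the `struct` format `"ll"` assumes a 64-bit Unix where `struct timeval` occupies two native longs; other ABIs may need a different layout. The kernel options only take effect when the socket is in plain blocking mode (`settimeout(None)`), because a Python-level timeout puts the descriptor in non-blocking mode and waits with `poll`/`select` instead.

```python
import errno
import socket
import struct


def set_kernel_timeouts(sock, seconds):
    """Set SO_RCVTIMEO and SO_SNDTIMEO (assumes a 64-bit Unix timeval layout)."""
    whole = int(seconds)
    micros = int((seconds - whole) * 1_000_000)
    timeval = struct.pack("ll", whole, micros)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)


def recv_with_kernel_timeout(sock, nbytes):
    """Blocking recv that returns None when the kernel timeout expires."""
    try:
        return sock.recv(nbytes)
    except OSError as exc:
        # An expired SO_RCVTIMEO surfaces as EAGAIN/EWOULDBLOCK, not socket.timeout.
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise
```

Note that the expiry is reported as `BlockingIOError` rather than `socket.timeout`, so code that only catches the latter would not notice it.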

I did notice one bad behavior using the blocking sockets: the server won't come up if one of the cabinets is already on.

I wonder if there's some magic in socket.create_connection that socket.connect might be missing.
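On the `socket.create_connection` question, the standard library function is roughly a convenience wrapper: it resolves the host with `getaddrinfo`, tries each returned address in turn, and applies the timeout before calling `connect`. The sketch below is a simplified approximation written for illustration (`connect_like_create_connection` is a hypothetical name, and the real function also handles `source_address` and other options).

```python
import socket


def connect_like_create_connection(host, port, timeout):
    """Approximate what socket.create_connection does for a TCP client."""
    last_error = None
    for family, socktype, proto, _canon, addr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            # The timeout is applied before connect(), so it bounds the
            # connection attempt and every later send/recv on this socket.
            sock.settimeout(timeout)
            sock.connect(addr)
            return sock
        except OSError as exc:
            last_error = exc
            if sock is not None:
                sock.close()
    if last_error is None:
        raise OSError("getaddrinfo returned no addresses")
    raise last_error
```

So a plain `socket.connect` on a socket with no timeout set can block for as long as the OS allows, whereas the wrapper bounds the attempt; whether that explains the startup hang with a cabinet already powered on is not established in this thread.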
