
nsqd: REQ without altering attempts #380

Open

tj opened this issue Jun 20, 2014 · 20 comments

@tj
Contributor

tj commented Jun 20, 2014

we have some cases where we have to wait around for distributed locks, so I just keep requeueing the messages to allow messages of other types (that won't collide with those locks) to flow through. The problem is we also need a pretty low maxAttempts.
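
A minimal go-nsq sketch of the pattern being described, for illustration only (the topic/channel names and lockHeld are made up); note that every redelivery still bumps the message's attempts count, which is exactly the friction this issue is about:

```go
package main

import (
	"log"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

// lockHeld is a stand-in for a distributed-lock check.
func lockHeld(body []byte) bool { return false }

// process is a stand-in for the real work.
func process(body []byte) error { return nil }

func main() {
	cfg := nsq.NewConfig()
	cfg.MaxAttempts = 5 // intentionally low for "real" failures

	consumer, err := nsq.NewConsumer("events", "etl", cfg)
	if err != nil {
		log.Fatal(err)
	}

	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		if lockHeld(m.Body) {
			// Give the message back with a delay so messages of other
			// types keep flowing; nsqd still counts this redelivery as
			// an attempt, which eats into the low MaxAttempts above.
			m.Requeue(30 * time.Second)
			return nil
		}
		return process(m.Body) // nil => FIN, error => REQ with backoff
	}))

	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	select {}
}
```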

@mreiferson
Member

@visionmedia are you proposing that the REQ command would get a parameter to not increment attempts?

A few thoughts...

One problem is that (according to nsqd) you really have attempted the message however many times it's been sent to a consumer. It's hard to argue that the count isn't accurate...

And implementation-wise they're disconnected: attempts is incremented on send, not on REQ, so it would be tricky to keep that state around.

What do you think, @jehiah?

@tj
Contributor Author

tj commented Jun 21, 2014

yup, and I agree, it's weird, but the ability to give the message back to NSQ side-effect free is definitely something we'll use a lot. Since the client handles discards anyway we could have yet another client-side layer in redis that helps keep track of what is/isn't a real attempt, but all these layers are getting a little crazy haha. The other problem is that this is at a large scale (5+ million messages in flight at any given time), so it eventually gets non-trivial to introduce tooling for the weird little edge cases.

I definitely feel like a lot of these are pretty specific to us, and might warrant a fork but I just like bringing them up in case someone else has had similar issues.

@tj
Contributor Author

tj commented Jun 22, 2014

Another valid use-case:

When we put Redshift in maintenance mode or resize a cluster we need to requeue those messages with a delay, but this also shouldn't count towards their number of attempts, otherwise we'll lose very large copies containing potentially millions of messages. Under normal circumstances one or two attempts is just fine, so they're definitely separate cases IMO.

@mreiferson
Member

pause the channel while it's in maintenance mode 😄 - don't have the consumers pound it into the ground while performing an operational procedure on the cluster, right?

@tj
Contributor Author

tj commented Jun 22, 2014

it's a shared topic/channel ATM :(

@mreiferson
Member

@jehiah care to weigh in with your thoughts here?

@tj
Contributor Author

tj commented Jun 29, 2014

FWIW I'm rewriting the entire thing in Go over the weekend haha, changing how we're handling things now that I understand the edge-cases better. My first case isn't relevant anymore but the second use-case of clusters being under maintenance etc is still relevant

@jehiah
Member

jehiah commented Jun 29, 2014

The case you are talking about is one where you consume a channel whose messages fan out to N independent clusters, you are putting one of those clusters into maintenance, and you want a way to avoid burning your possible attempts against the cluster in maintenance while handling attempts normally for the other clusters. Correct?

The combination of consumer backoff and per-message retry/backoff is entirely meant to deal with this state (individual messages get retried at increasing delays so they last beyond your maintenance window, and you process more slowly, burning fewer retries even when messages are ready to be retried). If this is a special maintenance state, it sounds like you might be able to 'finish' these messages when they hit a cluster in maintenance and push them to a second topic/channel where you apply a different (higher) max retry attempts and probably a different requeueing and backoff strategy.

I think it's hard for nsq to give good primitives for finer-grained control in this situation without an ability to tag messages with additional metadata that gets passed through. We've actively avoided that metadata because it's often more properly associated with the consumer (i.e. which cluster a message maps to) rather than the producer.
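
For concreteness, a rough sketch of the "finish and hand off to a second topic" idea above, assuming a secondary topic exists for deferred work; clusterInMaintenance, loadIntoRedshift, and all the names are placeholders, not anything prescribed in the thread:

```go
package main

import (
	"log"

	nsq "github.com/nsqio/go-nsq"
)

// clusterInMaintenance is a stand-in for "is the target cluster down for maintenance?".
func clusterInMaintenance(body []byte) bool { return false }

// loadIntoRedshift is a stand-in for the real work.
func loadIntoRedshift(body []byte) error { return nil }

func main() {
	cfg := nsq.NewConfig()

	producer, err := nsq.NewProducer("127.0.0.1:4150", cfg)
	if err != nil {
		log.Fatal(err)
	}

	consumer, err := nsq.NewConsumer("copies", "loader", cfg)
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		if clusterInMaintenance(m.Body) {
			// Hand off to a topic consumed with a higher MaxAttempts and
			// longer requeue delays, then FIN the original so it doesn't
			// burn attempts while the cluster is down.
			if err := producer.Publish("copies_deferred", m.Body); err != nil {
				return err // hand-off failed; let go-nsq requeue normally
			}
			return nil // FIN the original
		}
		return loadIntoRedshift(m.Body)
	}))

	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	select {}
}
```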

@tj
Contributor Author

tj commented Jun 29, 2014

cluster == Redshift cluster in this case; they have mandatory weekly scheduled downtime. If the backoff logic were tailored to user logic that might work OK: if cluster A is under maintenance it backs off and B trickles through fine. The second queue thing could work; it's more stuff to manage, but it would work I guess.

@mreiferson
Member

I realize this point might be moot since you're moving things to go-nsq, which does all of this for you, but implementing backoff (both in slowing down the rate of consumption and in deferring requeues) would be useful in nsq.js for this exact reason (to @jehiah's points)
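
To make "deferring requeues" concrete, an illustrative sketch (in Go rather than nsq.js, with arbitrary constants and a placeholder handle function) of the per-message half of backoff: requeue with a delay that grows with the attempt count so retries can outlast a maintenance window:

```go
package backoffexample

import (
	"time"

	nsq "github.com/nsqio/go-nsq"
)

// handle is a stand-in for the real per-message work.
func handle(body []byte) error { return nil }

// backoffHandler requeues failures with a delay proportional to the
// number of attempts so far, capped at 15 minutes.
var backoffHandler = nsq.HandlerFunc(func(m *nsq.Message) error {
	if err := handle(m.Body); err != nil {
		delay := time.Duration(m.Attempts) * 30 * time.Second
		if delay > 15*time.Minute {
			delay = 15 * time.Minute
		}
		m.Requeue(delay)
	}
	return nil
})
```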

@tj
Contributor Author

tj commented Jul 11, 2014

Another nice thing you could use this for is to analyze what's in the queue without having any real effect on it. I guess you could FIN and PUB but that seems a little weird

@tj
Contributor Author

tj commented Jul 16, 2014

hmm I keep coming across more and more use-cases for this. Even if I pushed them to another nsqd or topic I need to process those per-client as well, and we have too many clients to have separate topics, so I have pretty much no choice but to REQ with a reasonable delay. It can take anywhere from 3 hours to 3 days to ETL this data though so I can't rely on a large REQ being good enough.

Since lots of nsq relies on pushing logic to the client, I think it's reasonable to have this behaviour. Whether or not the client makes an actual attempt to process the data is up to the client I'd think.

@mreiferson
Member

I need to think about this more.

I still have implementation concerns and "does this belong in the core" concerns so I need to sift through those feelings and come up with a reasonable rebuttal or blessing.

Anyone lurking who watches the repo and has any feelings on this: now would be the time to weigh in 👍 or 👎

@dudleycarr
Contributor

Possible hack: finish the message and re-publish to the same topic. Since messages will be broadcast to all subscribing channels, you would unfortunately have to include something in the payload so that other channels ignore the re-published message unless it originated from their own channel. A possible downside is that the messages-processed counts in nsqadmin could be inflated. The other possible issue is if you care about order.

Another solution: making some very broad assumptions about the problem you're trying to solve regarding acquiring locks, one possibility would be to have a first topic/channel pair where the max retry attempts is fairly high. That channel would, on a per-message basis, attempt to acquire the lock and REQ if the lock is unavailable. Once the lock is available, publish the message to another topic/channel with the lock held and have it processed there, this time with the small number of REQ attempts.
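
A sketch of that two-stage pipeline, under the same broad assumptions; acquireLock and the topic names are placeholders:

```go
package main

import (
	"log"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

// acquireLock is a stand-in for the distributed-lock acquisition.
func acquireLock(body []byte) bool { return false }

func main() {
	// Stage one: only tries to grab the lock, so attempts are cheap here.
	lockCfg := nsq.NewConfig()
	lockCfg.MaxAttempts = 10000 // effectively "keep trying"

	producer, err := nsq.NewProducer("127.0.0.1:4150", nsq.NewConfig())
	if err != nil {
		log.Fatal(err)
	}

	lockConsumer, err := nsq.NewConsumer("work_pending", "lock_acquirer", lockCfg)
	if err != nil {
		log.Fatal(err)
	}
	lockConsumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		if !acquireLock(m.Body) {
			m.Requeue(10 * time.Second) // lock busy; attempts here don't matter
			return nil
		}
		// Lock held: move the message to stage two, which runs with a
		// small MaxAttempts and does the real work.
		return producer.Publish("work_ready", m.Body)
	}))

	if err := lockConsumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}

	// A second consumer on "work_ready" (not shown) would use MaxAttempts = 5 or so.
	select {}
}
```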

@tj
Contributor Author

tj commented Jul 17, 2014

yeah, I was thinking about FIN/PUB. I guess the downsides I can think of would be:

  • it's not atomic
  • publishing would introduce another localized NSQD (unless we could check the origin but that's a little weird)
  • skews the metrics

Reducing the REQ attempts would definitely help, but I guess for me there's a conflict with the idea of what makes an attempt. Does receiving it count as an attempt, or does actually processing it make it one? Then it also makes you alter max_attempts to allow for these cases vs. the "real" max_attempts that you'd want.

Might be able to rework things with a non-nsq solution but I thinkkkk this is still a legit thing, core or not is tricky though

@mreiferson
Member

@visionmedia I haven't forgotten about this, I just wanted to get the stable release out the door to pave the way for focusing on new things...

@tj
Contributor Author

tj commented Jul 25, 2014

no worries! It's nothing too urgent on our end

@twmb
Contributor

twmb commented Mar 23, 2015

How about a NoAttempt method on messages and extending the protocol internally to include NOA or something similar?

On consumer shutdown, all messages in flight could be re-marked as not attempted, and nsqd wouldn't penalize them with an attempt.

This would also be nice for nsqio/go-nsq#96.
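
For concreteness, a purely hypothetical sketch of what that might look like on the wire; only the first line is today's actual REQ command (from the NSQ protocol spec), the rest does not exist:

```
REQ <message_id> <timeout>\n              # today: requeue; attempts was already incremented on send
REQ <message_id> <timeout> noattempt\n    # hypothetical: requeue without counting the attempt
NOA <message_id> <timeout>\n              # hypothetical alternative: a dedicated command
```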

@mreiferson
Member

@twmb I don't think the specific implementation was ever a question (and your suggestion makes a lot of sense).

I think the question has always been does it fundamentally make sense to allow this?

@judwhite
Contributor

Lurking and chiming in here. At least as stated, 👎. MaxAttempts is a client-side implementation; you're free to set it to 0 (infinite) and have your own logic for when to FIN a message. I don't think giving nsqd the ability to lie about the Attempts number is a good solution. If we knew a client could cause this number to be inaccurate, we would have less confidence in our logs, where we store the Attempts number for both successes and failures.

@tj I realize this issue is > 1 year old and things may have changed. I'm not sure what having a shared topic/channel means in your context: do you have different message types which come through the same channel, and do some internal routing to a handler inside your code? I personally would change to a single-purpose topic/channel; it's extremely useful to have operational control over a well-defined type of message.

If you're already using some custom routing, why not have custom MaxAttempts handling also? This way nsqd doesn't change behavior others may depend on and you can handle messages as you wish.

Regarding NoAttempt, if you're not changing the semantics of Attempts, I'm imagining a property that contains "this is the number of attempts which definitely failed" (whatever that means in your context) versus the number of overall attempts. Adding a field would be a break in the protocol spec, unfortunately. Also, I don't see any end to the metadata you may want to store about a message.

A general solution may be: if you don't want the out-of-the-box way of counting Attempts, set MaxAttempts = 0 and store the topic/channel/message-id combination in a data store to track any arbitrary data you need about the message.
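
A sketch of that suggestion, assuming MaxAttempts = 0 disables the client-side cap as described above; the in-memory map stands in for whatever shared data store you'd actually use, and process/realAttemptLimit are placeholders:

```go
package main

import (
	"log"
	"sync"
	"time"

	nsq "github.com/nsqio/go-nsq"
)

const realAttemptLimit = 5 // our own notion of max "real" attempts

var (
	mu           sync.Mutex
	realAttempts = map[nsq.MessageID]int{} // swap for a shared store in practice
)

// process is a stand-in for the real work; attempted reports whether this
// counts as a "real" attempt (e.g. false if we only hit a maintenance window).
func process(body []byte) (attempted bool, err error) { return true, nil }

func main() {
	cfg := nsq.NewConfig()
	cfg.MaxAttempts = 0 // let our own bookkeeping decide when to give up

	consumer, err := nsq.NewConsumer("copies", "loader", cfg)
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		attempted, err := process(m.Body)
		if err == nil {
			mu.Lock()
			delete(realAttempts, m.ID)
			mu.Unlock()
			return nil // FIN
		}
		if attempted {
			mu.Lock()
			realAttempts[m.ID]++
			n := realAttempts[m.ID]
			mu.Unlock()
			if n >= realAttemptLimit {
				log.Printf("giving up on %s after %d real attempts", string(m.ID[:]), n)
				mu.Lock()
				delete(realAttempts, m.ID)
				mu.Unlock()
				return nil // FIN: our own max-attempts logic, not nsqd's
			}
		}
		m.Requeue(time.Minute) // under the limit, or not a "real" attempt
		return nil
	}))

	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	select {}
}
```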
