Invalid UTF-8 error handling policy #257

Open
wants to merge 11 commits into
base: main

jmacd
Contributor

@jmacd jmacd commented May 10, 2024

Member

@XSAM XSAM left a comment

@jmacd Thanks for inviting me to share my thoughts.

Overall, I like this proposal. We might need to mention how to deal with the current mapping rules: if we convert invalid UTF-8 strings into �, we probably don't need to convert invalid UTF-8 into bytes.

Also, I would prefer converting invalid UTF-8 into � over silently changing the type if we expose byte-slice-valued attributes on our API.

@jmacd jmacd marked this pull request as ready for review May 13, 2024 21:58
@jmacd jmacd requested a review from a team May 13, 2024 21:58
@jmacd
Contributor Author

jmacd commented May 14, 2024

@XSAM This was discussed in the Spec SIG today. There does not appear to be much support for binary-valued attributes. I think that's bad for users, but it's not so bad if we automatically correct invalid UTF-8. Therefore, I will move forward with only half of this proposal.

Member

@joaopgrassi joaopgrassi left a comment

I think it's great that this is handled. For example, if a receiver uses the bindings we offer here https://github.com/open-telemetry/opentelemetry-proto-java they will drop the entire batch if anything contains invalid UTF-8.

@jmacd
Contributor Author

jmacd commented Oct 10, 2024

@open-telemetry/specs-logs-approvers @open-telemetry/specs-metrics-approvers @open-telemetry/specs-trace-approvers

Please consider this updated OTEP.

The changes I have applied:

  • The OTel group has already decided not to support byte-valued attributes: document this. (Tough!)
  • Specific wording for SDK requirements: SHOULD be opt-out, SHOULD replace invalid sequences w/ �, etc.
  • Specific wording for Collector "behavior": SHOULD be opt-out, SHOULD follow each receiver for validation of external data, not recommended for processor manipulations.
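
As a rough illustration of the "replace invalid sequences w/ �" wording above, here is a minimal Go sketch. It is not taken from the OTEP text: `sanitizeUTF8` is a hypothetical helper name, and the sketch assumes the standard library's `strings.ToValidUTF8` is an acceptable way to do the replacement. Note that this function collapses each run of invalid bytes into a single replacement, which may or may not match what the final wording requires.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeUTF8 returns s unchanged when it is already valid UTF-8.
// Otherwise each run of invalid bytes is replaced with one U+FFFD (�).
// The name is hypothetical and not part of any OpenTelemetry API.
func sanitizeUTF8(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// Valid input passes through untouched.
	fmt.Printf("%q\n", sanitizeUTF8("caf\u00e9"))
	// Two consecutive invalid bytes become a single replacement character.
	fmt.Printf("%q\n", sanitizeUTF8("a\xff\xfeb"))
}
```

An SDK that makes this behavior configurable would presumably skip the call entirely when validation is turned off.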

simple and preserves what valid content can be recovered from the
data.

#### Dropping data
Contributor

What's the cost of performing this check on valid strings in the collector?

I see this as a performance tradeoff for where to do the enforcement of UTF-8, and my preference would be to push as much as possible to the generation side.

I'll read your alternatives considered, as you probably call this out.
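
One way to get a first feel for the cost question above is to time a plain validity scan over a batch of strings. The sketch below is only a starting point, not a proper benchmark: the workload (100,000 values of 64 bytes) is an arbitrary assumption, `countInvalid` is a hypothetical helper, and real Collector payloads will behave differently. It relies on `utf8.ValidString`, which scans the bytes once without allocating.

```go
package main

import (
	"fmt"
	"strings"
	"time"
	"unicode/utf8"
)

// countInvalid scans every value once and reports how many are not valid
// UTF-8. The helper name is hypothetical.
func countInvalid(values []string) int {
	n := 0
	for _, v := range values {
		if !utf8.ValidString(v) {
			n++
		}
	}
	return n
}

func main() {
	// Assumed workload: 100,000 attribute values of 64 ASCII bytes each.
	values := make([]string, 100000)
	for i := range values {
		values[i] = strings.Repeat("a", 64)
	}
	values[42] = "bad\xff" // a single invalid value

	start := time.Now()
	invalid := countInvalid(values)
	elapsed := time.Since(start)

	fmt.Printf("checked %d values, %d invalid, took %v\n", len(values), invalid, elapsed)
}
```

For numbers worth quoting in the OTEP, a `go test -bench` benchmark against representative payloads would be more trustworthy than a single timed run.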

Member

"Generation side" - SDK/Exporter? Then the question becomes "should the collector trust the input" (I think the answer is "no").

`rejected_data_points`, or `rejected_log_records` count along with a
message indicating data loss.

### Survey of existing systems
Contributor

@jsuereth jsuereth Oct 11, 2024

You should add a few more.

E.g. Java only enforces UTF-8 when attempting to read the bytes into Java's String format. See https://github.com/protocolbuffers/protobuf/blob/0bfe41b27e3dd8a30ae383210d7af10c28a642ea/java/core/src/main/java/com/google/protobuf/Internal.java#L56 for the gory details.

send, simply resulting from invalid UTF-8.

Considering whether components could support permissive data
transport, it appears to be risky and for little reward. If
Contributor

I'd still like to understand the cost implications for validating on receivers.

I think the tradeoff in permissive is risky -

You require ANYONE who needs to interpret a string as UTF-8 to handle failure, at that moment.

However, in a risk/reward trade-off, for well-behaving systems, avoiding UTF-8 validation at every endpoint can add up.

I like having validation as opt-in/opt-out, I'm not sure which should be the default though.

  1. How likely do we think utf-8 issues are in practice?
  2. What is the cost of performing this check in collector components?

Personally, and related to your "consider invalid utf8 a bug in a processor", I think we should push responsibility for UTF-8 as close to generation as possible, so I'd rather see this as an opt-in feature of OTel than opt-out. BUT, I may be missing some context or use cases where this is highly problematic.

Member

> I like having validation as opt-in/opt-out, I'm not sure which should be the default though.

+1, I don't just "like" it, I think we SHOULD do this.

Here are my reasons:

  1. The collector is normally sending the data to some endpoints. Many backend services already perform such validation/correction, so folks might want to just do it once in the backend rather than duplicating the effort.
  2. There are cases where data can be consumed directly on the collector side (e.g. a collector running in a local data center might decide to trigger a rollback due to certain metrics KPI drop during a deployment), I think this is a general pattern for things that run on Edge.
  3. Depending on the ownership of different parts of the system, the parts could be designed to be trusting each other or not. Collector needs to provide flexibility.
  4. Things could break between Collector and backend (e.g. bit flips caused by high energy particles from the universe, hardware failures), certain software needs to handle these as part of the design.

Regarding which one should be the default, based on what I've seen in Microsoft across Windows/Office/Azure/etc. I think it should be off-by-default and allow folks to opt-in.

@jmacd
Contributor Author

jmacd commented Oct 11, 2024

@jsuereth and @reyang I appreciate the feedback. Both of you are, I think, suggesting that UTF-8 validation be an opt-in feature instead of an opt-out one. I support that motion. The most critical thing for me is that if the SDK is configured with a permissive stance (opt-out), the SDK "MUST" configure its underlying technologies to support that stance.

In other words, opting out does not mean doing nothing; it means explicitly configuring a pipeline to permit invalid UTF-8 unless a user opts in to UTF-8 validation.

When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.
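
To make the choices concrete, here is a small Go sketch of what a three-way setting could look like: a permissive default, plus the two opt-in behaviors (a) and (b). Everything here is an assumption for illustration: the `Policy` type, its constant names, `Apply`, and `ErrInvalidUTF8` are hypothetical and do not exist in any OpenTelemetry SDK or the Collector.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// Policy is a hypothetical setting; the names are illustrative only.
type Policy int

const (
	Permissive Policy = iota // default: pass strings through unchecked
	Reject                   // option (a): refuse items with invalid UTF-8
	Correct                  // option (b): replace invalid bytes with U+FFFD
)

// ErrInvalidUTF8 signals that the item should be rejected.
var ErrInvalidUTF8 = errors.New("invalid UTF-8")

// Apply returns the string to export, or an error when the item
// should be rejected under the given policy.
func Apply(p Policy, s string) (string, error) {
	switch p {
	case Reject:
		if !utf8.ValidString(s) {
			return "", ErrInvalidUTF8
		}
		return s, nil
	case Correct:
		return strings.ToValidUTF8(s, "\uFFFD"), nil
	default:
		return s, nil
	}
}

func main() {
	for _, p := range []Policy{Permissive, Reject, Correct} {
		out, err := Apply(p, "a\xffb")
		fmt.Printf("policy=%d out=%q err=%v\n", p, out, err)
	}
}
```

The sketch only covers the string handling. Per the point above, a permissive SDK would additionally have to configure its serialization layer so that invalid UTF-8 is actually allowed through.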

@reyang
Member

reyang commented Oct 11, 2024

> When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.

I think if we have very limited bandwidth, we should do (b). (a) can be added later if we see a huge demand.
One technical detail: for attribute values of string type, I think we should do some correction; for attribute names that contain invalid UTF-8, correction could be a very bad idea. I'm a bit on the fence here...
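
A sketch of the value/name distinction, under assumptions: attributes are modeled as a plain `map[string]string`, and `correctValues` is a hypothetical helper. It corrects values but only reports invalid keys, leaving those entries untouched. One possible reason to avoid rewriting keys, offered here as an inference rather than something stated in this thread, is that two distinct invalid keys such as `"k\xff"` and `"k\xfe"` would both become `"k�"` and collide.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"
)

// correctValues replaces invalid UTF-8 in attribute values with U+FFFD.
// Entries whose key is itself invalid are left as they are, and those
// keys are returned (sorted) so the caller can decide what to do.
func correctValues(attrs map[string]string) (invalidKeys []string) {
	for k, v := range attrs {
		if !utf8.ValidString(k) {
			invalidKeys = append(invalidKeys, k)
			continue
		}
		attrs[k] = strings.ToValidUTF8(v, "\uFFFD")
	}
	sort.Strings(invalidKeys)
	return invalidKeys
}

func main() {
	attrs := map[string]string{
		"http.route": "/caf\xe9", // invalid value: a lone Latin-1 byte
		"user\xff":   "alice",    // invalid key
	}
	bad := correctValues(attrs)
	fmt.Printf("%q\n", attrs["http.route"])
	fmt.Printf("invalid keys: %q\n", bad)
}
```

What to do with the reported keys (drop the attribute, reject the item, or correct anyway) is exactly the open question raised above.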


#### No byte-slice valued attribute API

As a caveat, the OpenTelemetry project has previously debated and
Member

Was this ever formally rejected?

I'd like this option because we technically already support it in Erlang. That follows from the fact that the main string type in Erlang/Elixir is binary, combined with the SDK safety mechanism mentioned elsewhere in this doc that stores invalid UTF-8 in bytes_value of the proto.

So because attribute values are already type binary the user can pass any binary data they want as an attribute value and it gets used.

I recall it being rejected informally in spec SIG meetings, but maybe a formal proposal would have better luck. Do you think there would be any chance of that?
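
For readers unfamiliar with the fallback described in this comment, the idea can be sketched roughly as follows. This is written in Go rather than Erlang and is not the Erlang SDK's code: `Value` is a simplified, hypothetical stand-in for the OTLP `AnyValue` message (which has `string_value` and `bytes_value` variants), and `encode` is an illustrative name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Value is a simplified, hypothetical stand-in for an OTLP AnyValue.
// Exactly one of the two fields is set.
type Value struct {
	StringValue *string
	BytesValue  []byte
}

// encode stores valid UTF-8 as a string and anything else as raw bytes,
// so the data survives serialization instead of failing validation.
func encode(b []byte) Value {
	if utf8.Valid(b) {
		s := string(b)
		return Value{StringValue: &s}
	}
	return Value{BytesValue: b}
}

func main() {
	for _, in := range [][]byte{[]byte("hello"), {0xff, 0xfe}} {
		v := encode(in)
		if v.StringValue != nil {
			fmt.Printf("string_value: %q\n", *v.StringValue)
		} else {
			fmt.Printf("bytes_value: %x\n", v.BytesValue)
		}
	}
}
```

The tradeoff raised earlier in the thread applies here: the attribute's type changes silently depending on its content.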

7 participants