Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable Object Reuse By Default #1382

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

rickysaltzer
Copy link
Contributor

Summary

  • Enabling object reuse by default has the potential to cause unforeseen bugs in user code.

  • When a user decides to pass a Stream<ClickHouseRecord> back as a List<ClickHouseRecord>, all objects by default will be pointing back to the same object. This can be extremely jarring for users who aren't aware objects are being reused.

  • Added a reuseObjects() method to quickly enable object reuse when appropriate. This allows the user to decide when memory efficiency is a goal.

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

- Enabling object reuse by default has the potential to cause unforeseen
  bugs in user code.

- When a user decides to pass a `Stream<ClickHouseRecord>` back as a
  `List<ClickHouseRecord>`, all objects by default will be pointing
  back to the same object. This can be extremely jarring for users
  who aren't aware objects are being reused.

- Added a `reuseObjects()` method to quickly enable object reuse when
  appropriate. This allows the user to decide when memory efficiency
  is a goal.
- Decided it was too hard to understand in tests
- Horray copy paste
@rickysaltzer
Copy link
Contributor Author

I use this library within Kotlin code, and so, Kotlin Collections makes it very easy to go from Iterable<ClickHouseRecord> to Sequence or List. This is what caused me to spend quite a long time figuring out why my tests were failing (expecting data from clickhouse to be correct).

@zhicwu
Copy link
Contributor

zhicwu commented Jun 15, 2023

Thanks for your contribution @rickysaltzer! Besides memory efficiency, object creation also slows things down - the CI failure was because all tests couldn't finish in 15 minutes.

I think the proposed change will greatly impact to all reads, so I'd not suggest to do that. Have you tried ClickHouseRecord.copy() if you don't need to process many records? Alternatively, we may add a new method in the same class to extract all values? For example:

public Object[] extractValues() {
  int size = size();
  Object[] arr = new Object[size];
  for (int i = 0; i < size; i++) {
      arr[i] = getValue(i).asObject();
  }
  return arr;
}

@rickysaltzer
Copy link
Contributor Author

I think the underlying question is, should we be silently corrupting data returned from ClickHouse? Because that is exactly what is happening if a user decides to pass the ClickHouseRecord back.

I think in general it's bad API design to rely on a user to read the code implementation (as I had to do) and call .copy() on an object because it's unknowingly backed by a mutable reference.

Can we not simply enable object reuse by default for tests? I think it might be presumptuous of us to assume it would slow down user's code significantly, because it depends entirely on what they're doing with the Java API. Are they streaming hundreds of millions of rows? Maybe should consider object reuse. Are they simply performing a large aggregation that returns a few hundred or thousand rows? Object reuse might not be so significant now.

Take Apache Flink's API for example, a streaming platform that is meant for extreme scale and load. They disable objectReuse by default with a big warning that enabling it could lead to user's bugs (as this did for me).

@mshustov
Copy link
Member

@zhicwu It would be worthwhile to brainstorm alternative changes to API with @rickysaltzer and @mzitnik. Data integrity is a top priority. Let's try to find a balance here.

@zhicwu
Copy link
Contributor

zhicwu commented Jun 16, 2023

Thanks again @rickysaltzer for the inputs! Your points are indeed valid and well-reasoned. However, I would like to emphasize that it's important to consider the differences in memory efficiency and performance between a small library optimized for a single JVM and a distributed middleware like Flink.

As you know, we have multiple APIs to choose from, each with its own characteristics. JDBC is a well-known and mature option, while R2DBC is asynchronous and gaining popularity, although the driver has not thoroughly tested yet. On the other hand, the Java client provides better performance and lower memory usage compared to others. If performance and memory usage are not a concern, why not stick with JDBC?

Anyway, I think what we're trying to resolve here is to improve Java client API to minimize unintended side effects. ClickHouseResponse.records(Class<T>) can be think of an attempt, which has no such issue and it will be faster than records() once ASM is integrated. Apart from that, I hope we can eventually drop ClickHouse*Value and potentially ClickHouseRecord. These additions were initially made to support a middleware called JDBC bridge, which is no longer a part of the goals. Instead, a potential approach could be to utilize Object[] along with precise APIs to retrieve each value, without relying on implicit type conversions, which can sometimes be problematic.

@mzitnik & @mshustov, anything to add?

@rickysaltzer
Copy link
Contributor Author

Thanks for your response, I do very much appreciate the efficiency we're trying to maintain, especially when it comes to higher-level APIs leveraging this one.

That being said, I think coming up with an elegant solution to this issue is warranted.

@mzitnik
Copy link
Contributor

mzitnik commented Jun 18, 2023

A few comments here
I think API should be self-descriptive. I would also not expect Data integrity issues @zhicwu I understand that currently, it blocks our CI from completing in 15 min. Do you have any expectations for when ASM will be integrated?
Thanks, @rickysaltzer for rising this issue.

@mshustov mshustov requested a review from mzitnik January 19, 2024 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants