
[jdbc] Enhance useragent to indicate Apache Spark #1890

Merged: 5 commits into ClickHouse:main on Oct 30, 2024

Conversation

BentsiLeviav (Contributor):

Many users reach ClickHouse through Apache Spark (or another framework) via this JDBC connector.
This PR adds framework-detection functionality: it inspects the current stack trace, infers the framework in use from a closed list of tracked frameworks, and adds an indication of it to the user agent string.

@@ -417,10 +410,23 @@ protected String getDefaultUserAgent() {
return config.getClientName();
}

protected String getAdditionalFrameworkUserAgent() {
Contributor:

Wouldn't this check be run every time someone makes a request?

BentsiLeviav (Contributor Author), Oct 28, 2024:

It would, but that is intentional.
A user can write a Java application that uses Spark with JDBC, but might also open plain JDBC connections (without any framework). We want to classify each request and ensure we capture the right use case.
Do you happen to have another way to implement this that gets the behavior I described?

for (StackTraceElement ste : Thread.currentThread().getStackTrace()) {
for (String framework : frameworks) {
if (ste.toString().contains(framework)) {
inferredFrameworks.add(String.format("(%s)", framework));
Contributor:

Is the idea to collect all frameworks? Can you provide an example?

Contributor:

Or should we break once we have detected one of them?

BentsiLeviav (Contributor Author):

IMHO we don't want to limit ourselves to a single-framework use case; a nested or co-shared use case is possible. From a complexity point of view it is roughly the same either way, so I would prefer that we not limit this.

Contributor:

I think we need to exit the loop once a framework is detected.
And maybe remember the result so we don't search again.

BentsiLeviav (Contributor Author):

@chernser
You must search every time. A request in an application doesn't necessarily go through the framework (you can write a Spark application, work with Spark over JDBC, and later open a regular JDBC connection and query ClickHouse).

In addition, could you elaborate on why we need to exit the loop once a framework is detected? What if a user combines several frameworks that we want to detect? As I said, from a complexity point of view there isn't much difference (the FRAMEWORKS_TO_INFER list is going to be small), so what is the rationale behind limiting this ability?

protected String getAdditionalFrameworkUserAgent() {
List<String> frameworks = List.of("apache.spark");
Set<String> inferredFrameworks = new LinkedHashSet<>();
for (StackTraceElement ste : Thread.currentThread().getStackTrace()) {
Contributor:

The current-thread stack trace approach would not work with asynchronous tasks (although we have moved away from that default behavior), because a task may run with a completely independent stack trace.

BentsiLeviav (Contributor Author):

Do you have any suggestions for that? If not, we'd better have something rather than nothing.
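For reference, the detection approach quoted above can be completed into a self-contained sketch. This is an illustration of the technique under discussion, not the merged implementation; the class name and the printed base string are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FrameworkDetection {
    // Closed list of tracked framework package prefixes (mirrors FRAMEWORKS_TO_INFER).
    private static final List<String> FRAMEWORKS = List.of("apache.spark");

    // Scans the calling thread's stack trace and returns a user-agent suffix
    // such as "(apache.spark)" for each tracked framework found, or "" if none.
    static String getAdditionalFrameworkUserAgent() {
        Set<String> inferred = new LinkedHashSet<>();
        for (StackTraceElement ste : Thread.currentThread().getStackTrace()) {
            for (String framework : FRAMEWORKS) {
                if (ste.toString().contains(framework)) {
                    inferred.add(String.format("(%s)", framework));
                }
            }
        }
        return String.join(" ", inferred);
    }

    public static void main(String[] args) {
        // Run outside Spark, so no tracked framework appears in the stack trace
        // and the suffix is empty.
        System.out.println("suffix=[" + getAdditionalFrameworkUserAgent() + "]");
    }
}
```

Because a LinkedHashSet is used, each framework is reported once, in the order first seen, even if it appears in many stack frames.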

Contributor:

We could look for framework-specific classes on startup; frameworks usually have some.

BentsiLeviav (Contributor Author):

I might have the Apache Spark jar on the classpath without actually using it, and with that approach it would still be classified as Apache Spark usage.

@Paultagoras and @mzitnik suggested moving this logic to the JDBC client initialization (where there are no async tasks), which solves this issue.
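The shape of that suggestion - run detection once, synchronously, while the client is being constructed - can be sketched roughly as follows. The class name, field layout, and base user-agent string are hypothetical, not the driver's actual internals.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ClientInitSketch {
    private static final List<String> FRAMEWORKS = List.of("apache.spark");

    // Computed once in the constructor, on the caller's thread, so later
    // asynchronous tasks never need to scan a (possibly unrelated) stack trace.
    private final String userAgent;

    public ClientInitSketch(String baseUserAgent) {
        String suffix = detectFrameworks();
        this.userAgent = suffix.isEmpty() ? baseUserAgent : baseUserAgent + " " + suffix;
    }

    private static String detectFrameworks() {
        Set<String> inferred = new LinkedHashSet<>();
        for (StackTraceElement ste : Thread.currentThread().getStackTrace()) {
            for (String f : FRAMEWORKS) {
                if (ste.toString().contains(f)) {
                    inferred.add("(" + f + ")");
                }
            }
        }
        return String.join(" ", inferred);
    }

    public String getUserAgent() {
        return userAgent;
    }

    public static void main(String[] args) {
        // Constructed outside Spark, so the base user agent is left unchanged.
        System.out.println(new ClientInitSketch("clickhouse-jdbc/0.7.0").getUserAgent());
    }
}
```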

chernser (Contributor):

Good day, @BentsiLeviav!
This feature is only useful for clients that use ClickHouse Cloud.
Other users should have an option to turn it off.

Would you please also tell the story behind this feature? What would we like to find out?

Thanks!

BentsiLeviav (Contributor Author):

@chernser
This feature is meant to help us better monitor and understand the real usage of the JDBC driver.
We want to know exactly which technology/framework our users use.
As of today, the JDBC driver is the underlying connection for many frameworks/technologies such as:

  • Apache Spark
  • DBeaver
  • Tableau

For some of these frameworks, we can't customize the product name. Therefore, I added this feature to infer which framework is used (out of a closed list we define) and to add it to the user agent metadata.

As discussed in this PR, I have moved the implementation to the JDBC layer (to avoid any async tasks) and minimized the overhead by making sure the logic runs only on client creation.

@@ -152,6 +172,8 @@ public ClickHouseConnection connect(String url, Properties info) throws SQLExcep
if (!acceptsURL(url)) {
return null;
}
if (!url.toLowerCase().contains("disable-frameworks-detection"))
Contributor:

Not every case passes properties through the URL; a Properties object is often used instead.
I suggest doing this check in com.clickhouse.jdbc.internal.ClickHouseConnectionImpl#ClickHouseConnectionImpl(com.clickhouse.jdbc.internal.ClickHouseJdbcUrlParser.ConnectionInfo) - that constructor is called right after the URL is parsed for properties, so ConnectionInfo contains all of them.

chernser (Contributor):

@BentsiLeviav thank you for the explanation!

chernser (Contributor) left a comment:

Please do

@chernser chernser changed the title Enhance useragent to indicate Apache Spark [jdbc] Enhance useragent to indicate Apache Spark Oct 30, 2024
@chernser chernser merged commit e04e44f into ClickHouse:main Oct 30, 2024
30 of 59 checks passed
4 participants