
feat(c/driver/postgresql): Enable basic connect/query workflow for Redshift #2219

Open
wants to merge 21 commits into main
Conversation

@paleolimbot (Member) commented Oct 5, 2024

Just following up on #1563 to see if the missing typarray column is the only issue. Getting the details right might be a large project, but we might be able to support a basic connection without too much effort. Parameter binding and non-COPY result fetching seem to work; the default query fetch method (COPY) is not supported, connection_get_info() fails, and at a glance, connection_get_objects() might be returning incorrect results (and fails at the column depth).

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = Sys.getenv("ADBC_REDSHIFT_TEST_URI"),
  adbc.postgresql.load_array_types = FALSE
)

con <- db |> 
  adbc_connection_init()

stmt <- con |> 
  adbc_statement_init(adbc.postgresql.use_copy = FALSE)

stream <- nanoarrow::nanoarrow_allocate_array_stream()
stmt |> 
  adbc_statement_bind(data.frame(45)) |> 
  adbc_statement_set_sql_query("SELECT 1 + $1 as foofy, 'string' as foofy_string") |> 
  adbc_statement_execute_query(stream)
#> [1] -1

tibble::as_tibble(stream)
#> # A tibble: 1 × 2
#>   foofy foofy_string
#>   <dbl> <chr>       
#> 1    46 string

Created on 2024-10-04 with reprex v2.1.1

@paleolimbot (Member, Author)

@lidavidm Any ideas on a good approach here? We could eliminate the database option and just fall back on the non-array version of the type query (or issue two queries). It would still need something like cursor.adbc_statement.set_options(use_copy = False) but at least it would let Redshift users connect (and give us their bug reports about other things that don't work).

@lidavidm (Member)

It seems Redshift is juuust different enough that it's not quite Postgres anymore but not different enough to warrant a separate codebase. Is there any way to tell from libpq that we're dealing with Redshift and just automatically disable COPY and change the type query?

@WillAyd (Contributor) commented Oct 30, 2024

Would it make sense to try and do a text based COPY instead of binary? Or does Redshift disable that altogether?

I believe tools like the AWS SDK for pandas use a COPY from Parquet to achieve high throughput to Redshift, so there may be some precedent for still going that route.

@paleolimbot (Member, Author)

It seems Redshift is juuust different enough that it's not quite Postgres anymore but not different enough to warrant a separate codebase.

I agree that it's on the knife edge!

Is there any way to tell from libpq that we're dealing with Redshift and just automatically disable COPY and change the type query?

It looks like SELECT version() does the trick here, with some parsing. I implemented this and pushed it; I'm not sure if it's too much of a hack? (I see there are some failing tests, so maybe.)
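A minimal sketch of that detection idea, factored as a pure parsing step (the function name and exact version-string shape are assumptions, not the PR's actual code; a Redshift server's SELECT version() result contains a marker like "Redshift 1.0.77467" that stock PostgreSQL never reports):

```cpp
#include <array>
#include <cstdio>
#include <string>

// Parse the output of `SELECT version()`. Returns {0, 0, 0} when the
// "Redshift x.y.z" marker is absent, so callers can use version[0] > 0 as
// the "we are talking to Redshift" signal.
std::array<int, 3> ParseRedshiftVersion(const std::string& version) {
  std::array<int, 3> out{0, 0, 0};
  size_t pos = version.find("Redshift ");
  if (pos == std::string::npos) return out;
  std::sscanf(version.c_str() + pos, "Redshift %d.%d.%d", &out[0], &out[1],
              &out[2]);
  return out;
}
```

With a helper like this, the driver can flip the COPY and array-type query defaults at connect time instead of requiring users to set options manually.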

Would it make sense to try and do a text based COPY instead of binary? Or does Redshift disable that altogether?

If we did text we'd need a completely different parser (and, funnily enough, even the "use copy = false" branch uses the binary COPY format; it's just accessed through the PGresult instead of pulled with PQgetCopyData()).

I believe tools like the AWS SDK for pandas use a COPY from Parquet to achieve high throughput to Redshift, so there may be some precedent for still going that route.

We could probably exploit that if we used a Go or Rust based driver!

Quick demo:

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = Sys.getenv("ADBC_REDSHIFT_TEST_URI")
)

con <- db |> 
  adbc_connection_init()

con |> 
  read_adbc("SELECT 'foofy'") |> 
  tibble::as_tibble()
#> # A tibble: 1 × 1
#>   `?column?`
#>   <chr>     
#> 1 foofy

con |> 
  adbc_connection_get_info() |> 
  tibble::as_tibble()
#> # A tibble: 6 × 2
#>   info_name info_value$string_value $bool_value $int64_value $int32_bitmask
#>       <dbl> <chr>                   <lgl>              <dbl>          <int>
#> 1         0 Redshift                NA                    NA             NA
#> 2         1 1.0.77467               NA                    NA             NA
#> 3       100 ADBC PostgreSQL Driver  NA                    NA             NA
#> 4       101 (unknown)               NA                    NA             NA
#> 5       102 0.6.0                   NA                    NA             NA
#> 6       103 <NA>                    NA               1001000             NA
#> # ℹ 2 more variables: info_value$string_list <list<chr>>,
#> #   $int32_to_int32_list_map <list<df[,2]>>

Created on 2024-10-29 with reprex v2.1.1

@lidavidm (Member)

It seems it's just a missing header on a few platforms?

@WillAyd (Contributor) left a comment:

minor comments / suggestions

c/driver/postgresql/connection.cc (two outdated comments, resolved)

// While there are remaining version components and we haven't reached the end of the
// string
while (component_begin < version.size() && component < out.size()) {
@WillAyd (Contributor):
Maybe a good use case for std::string::find here?
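For illustration, a find-based version of that loop might look like this (a hypothetical free function, not the PR's actual code):

```cpp
#include <array>
#include <string>

// Split a dotted version string such as "1.0.77467" into up to three integer
// components using std::string::find instead of a hand-rolled character scan.
// Missing trailing components are left as zero.
std::array<int, 3> ParseVersionComponents(const std::string& version) {
  std::array<int, 3> out{0, 0, 0};
  size_t begin = 0;
  for (size_t component = 0; component < out.size() && begin < version.size();
       component++) {
    size_t end = version.find('.', begin);
    if (end == std::string::npos) end = version.size();
    out[component] = std::stoi(version.substr(begin, end - begin));
    begin = end + 1;
  }
  return out;
}
```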

paleolimbot and others added 3 commits October 30, 2024 09:48
Co-authored-by: William Ayd <william.ayd@icloud.com>
Co-authored-by: William Ayd <william.ayd@icloud.com>
@paleolimbot paleolimbot changed the title poc(c/driver/postgresql): Try to connect to redshift feat(c/driver/postgresql): Enable basic connect/query workflow for Redshift Oct 30, 2024
if (RedshiftVersion()[0] > 0) {
infos.emplace_back(info_codes[i], "Redshift");
} else {
infos.push_back({info_codes[i], "PostgreSQL"});
@WillAyd (Contributor):
I should have been clearer, but I think all of the push_back calls here are better as emplace_back.

@paleolimbot (Member, Author):
I don't mind either way, but most advice I read tends to suggest only using emplace_back in specific cases (e.g., https://abseil.io/tips/112).

@WillAyd (Contributor) commented Oct 30, 2024:
C++...what a language.

Well, in this case either is likely fine. I am under the impression that emplace_back would avoid a call to the move constructor of the list element. In this particular case it probably doesn't make a difference; maybe something to just look at when performance is more critical.
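As an aside, the distinction can be sketched like this (InfoValue is a stand-in struct for illustration, not the driver's actual type):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the driver's info struct.
struct InfoValue {
  uint32_t code;
  std::string value;
  InfoValue(uint32_t code_in, std::string value_in)
      : code(code_in), value(std::move(value_in)) {}
};

// Both calls append an equivalent element; they differ only in how the
// element is constructed.
std::vector<InfoValue> BuildInfos() {
  std::vector<InfoValue> infos;
  // push_back: constructs a temporary InfoValue, then moves it into the vector.
  infos.push_back(InfoValue{0, "Redshift"});
  // emplace_back: forwards the arguments and constructs the element in place,
  // skipping the temporary and its move.
  infos.emplace_back(100, "ADBC PostgreSQL Driver");
  return infos;
}
```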

@paleolimbot (Member, Author):
I couldn't find any existing emplace_back() usage so I changed these back. We can always reevaluate!

return ADBC_STATUS_INTERNAL;
}
const char* server_version_num = (*it)[0].data;
infos.push_back({info_codes[i], server_version_num});
@WillAyd (Contributor):
Here's another spot

Comment on lines 114 to 118
kUserDefined,
// This is not an actual type, but there are cases where all we have is an Oid
// that was not inserted into the type resolver. We can't use "unknown" or "opaque"
// or "void" because those names show up in actual pg_type tables.
kUnnamed
@paleolimbot (Member, Author):
This surfaced because apparently the "geometry" type is returned with an oid that doesn't exist (3999) despite actually existing (with oid 3000). There's really no reason we can't still return the binary data that was sent there with the appropriate arrow.opaque metadata, which is what this particular hack enables.
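For context, the arrow.opaque canonical extension type serializes its parameters as a small JSON object; a hypothetical helper to build that metadata (not the driver's actual code) might look like:

```cpp
#include <string>

// Build the serialized parameter metadata for the arrow.opaque canonical
// extension type: a JSON object naming the remote type and its vendor. A
// driver can attach this (plus the extension name "arrow.opaque") to a binary
// column whose OID is not in the type resolver, such as Redshift's geometry.
// Assumes type_name and vendor_name need no JSON escaping.
std::string OpaqueExtensionMetadata(const std::string& type_name,
                                    const std::string& vendor_name) {
  return "{\"type_name\": \"" + type_name + "\", \"vendor_name\": \"" +
         vendor_name + "\"}";
}
```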

@lidavidm (Member):
kArrowOpaque perhaps to be explicit?

Comment on lines +216 to +217
ArrowErrorCode SetSchema(ArrowSchema* schema,
const std::string& vendor_name = "PostgreSQL") const {
@paleolimbot (Member, Author):
This lets our "opaque" type have the appropriate vendor name (since it's not always "PostgreSQL" anymore).

Comment on lines +179 to +184
// Allow Redshift to execute this query without constraints
// TODO(paleolimbot): Investigate to see if we can simplify the constraints query so that
// it works on both!
void SetEnableConstraints(bool enable_constraints) {
enable_constraints_ = enable_constraints;
}
@paleolimbot (Member, Author):
I didn't dig too deeply here but I did check that we get column names! I am not sure that we're getting tables from schemas outside "public" (there are quite a few things that look like sample database schemas but I don't see any tables in them listed by our query).

@paleolimbot (Member, Author)

Would it make sense to try and do a text based COPY instead of binary? Or does Redshift disable that altogether?

If we did want to seriously support Redshift we would need to do this somehow (right now bulk insert doesn't work, and the workaround of INSERT INTO with a bind stream is ungodly slow). There are both Rust and Go SDKs!

@lidavidm (Member)

Ah...Redshift has entirely separate SDKs? In that case maybe a different driver would be better long term...

@paleolimbot (Member, Author)

Ah...Redshift has entirely separate SDKs? In that case maybe a different driver would be better long term...

It's definitely a better long-term plan since all of the performance optimizations have to be disabled for this to work (I only discovered the SDKs in the process of reading the documentation to implement the things here 🙂 ). I'm neutral on whether this PR is too big of a hack, although some of the things here are good ideas anyway (e.g., updating the type resolver population to use the helpers/status, not failing when an OID isn't recognized).

@paleolimbot paleolimbot marked this pull request as ready for review October 31, 2024 02:17
@github-actions github-actions bot added this to the ADBC Libraries 15 milestone Oct 31, 2024
3 participants