
Releases: vertica/spark-connector

Vertica Spark Connector V3.0.3 Release

22 Feb 17:33
5690ec9

Overview

This release adds support for aggregate pushdown (Spark 3.2 only), along with other fixes.

New Features:

  • Added support for aggregate pushdown on Spark 3.2. Supported aggregates are COUNT, MIN, MAX, and SUM (see the sketch after this list).
  • Added support for JDBC connection failover onto Vertica backup nodes.
  • Added the save_job_table option to control whether a job status table should be used.
  • Added the time_operations option to log the time taken for each operation.
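
For illustration, a minimal sketch of exercising the new options and the Spark 3.2 aggregate pushdown is shown below. The option names save_job_table and time_operations come from this release; the data source name, connection values, table, and column names are placeholders to adapt to your environment.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, max, min, sum}

val spark = SparkSession.builder().appName("vertica-connector-example").getOrCreate()

val opts = Map(
  "host" -> "vertica-host",            // placeholder connection values
  "db" -> "testdb",
  "user" -> "dbadmin",
  "password" -> "",
  "table" -> "sales",
  "staging_fs_url" -> "hdfs://hdfs-host:8020/tmp/staging",
  "save_job_table" -> "false",         // new option: skip the job status table
  "time_operations" -> "true"          // new option: log time taken per operation
)

val df = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(opts)
  .load()

// On Spark 3.2, COUNT, MIN, MAX, and SUM in an aggregate like this can be
// pushed down to Vertica rather than computed in Spark.
df.groupBy("region")
  .agg(count("price"), min("price"), max("price"), sum("price"))
  .show()
```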

Fixes:

  • Setting an HA HDFS configuration no longer fails.
  • The connector now uses only one JDBC connection for reads and writes and properly closes connections.

BETA -- Vertica Spark Connector V3.0.3 Beta Release 2

14 Feb 23:19
5fabd07

This is a beta release. It contains two fixes:

  • Hadoop HA configuration fix
  • Fix for JDBC sessions hanging around

BETA -- Vertica Spark Connector V3.0.3 Beta Release

04 Feb 17:59

This is a beta release. It contains a fix for using the Spark connector with Hadoop high-availability configurations.

Vertica Spark Connector V3.0.2 Release

03 Feb 22:52
7fb8ecd

Vertica Spark Connector Changelog

Vertica Spark Connector 3.0.2

Overview

This release adds some minor options to the connector, as well as additional debug information to help troubleshoot problem cases.

New Additions:

  • Update S3 examples to work locally (see the staging sketch after this list)
  • Add support for a failover in the JDBC connection
  • Remove creation of job status table for external tables
  • Add option to control whether the job status table is saved
  • Add debug logging for Hadoop configuration
  • Add timing information for operations performed by the Spark connector
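
A minimal sketch of staging intermediary data in an S3-compatible store, in the spirit of the local S3 examples mentioned above. staging_fs_url is the connector's staging-location option; the s3a credential keys are standard Hadoop settings, and the bucket, endpoint, and credentials below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vertica-s3-example").getOrCreate()

// Standard Hadoop s3a settings; a local S3-compatible service would also need
// fs.s3a.endpoint pointed at it (all values below are placeholders).
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "<access-key>")
hadoopConf.set("fs.s3a.secret.key", "<secret-key>")
hadoopConf.set("fs.s3a.endpoint", "http://localhost:9000")
hadoopConf.set("fs.s3a.path.style.access", "true")

val df = spark.read
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(Map(
    "host" -> "vertica-host",
    "db" -> "testdb",
    "user" -> "dbadmin",
    "password" -> "",
    "table" -> "sales",
    // Intermediary data is staged in S3 instead of HDFS.
    "staging_fs_url" -> "s3a://my-bucket/spark-staging"
  ))
  .load()
```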

Vertica Spark Connector V3.0.1 Release

17 Dec 20:19
1673151

Vertica Spark Connector Changelog

Vertica Spark Connector 3.0.1

Overview

This release of Vertica's Spark Connector adds a new connector option that prevents parquet cleanup, removes the unique ID from the file path of the external table the connector creates, logs rejected data, and adjusts varchar and varbinary column sizes to avoid data truncation when creating an external table from existing data.

Important Note: The Spark Connector does not source the log4j dependency directly; however, it uses Spark, which depends on an older version of log4j (1.2.17) that is not impacted by the most recent vulnerability (see https://logging.apache.org/log4j/2.x/security.html).

New Additions:

  • Prevent Parquet Cleanup: Prior to this change, parquet data was cleaned up after each Spark job. This change adds a new connector option that prevents the cleanup of parquet files in the intermediary file store during both reads and writes, so users can debug data that was written or exported as parquet. This prevent_cleanup connector option defaults to false but can be set to true in the options map.

  • Remove Unique ID from External Table File Path: Previously, when exporting data as an external table, the connector would append a unique identifier to the export path. This change removes that identifier from the path so the exported data is easier to query directly.

  • Log Rejected Rows Data: Prior to this change, the rejected data table never persisted in Vertica, as it is a temporary table. Consequently, rejected data was never accessible and was therefore difficult to address. Now, the connector logs a summary of up to 10 reasons for the rejected data along with an instance of the rejected data for each reason.

  • Add Compression Codec In Parquet File Name: Previously, the connector did not include the .snappy encoding when writing parquet files. The connector now includes the compression codec the same way Spark does when writing parquet data (/data/filename.snappy.parquet instead of /data/filename.parquet).

  • Avoid Truncation On VARCHAR Or VARBINARY Columns When Creating External Tables: Prior to this change, when creating an external table from data existing in HDFS, the connector produced a warning about data being truncated if it contained varchar or varbinary columns. This change allows the size of each column to be overridden individually using column metadata. This is optional, as the default sizes for varchar and varbinary columns have been changed to 1024 and 65000, respectively. Both methods are detailed in the new README. This change also removes the warning entirely, as varchar and varbinary data is no longer truncated to the smaller size of 80. In addition, the strlen connector option can now be used to override all varchar columns with a size different from the default (1024). See the sketch after this list.
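
A minimal write sketch combining the prevent_cleanup and strlen options described above, assuming the usual connector options map; connection values, paths, and sizes are placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("vertica-write-example").getOrCreate()
val df = spark.read.parquet("hdfs://hdfs-host:8020/data/input")   // placeholder source data

df.write
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(Map(
    "host" -> "vertica-host",
    "db" -> "testdb",
    "user" -> "dbadmin",
    "password" -> "",
    "table" -> "sales_copy",
    "staging_fs_url" -> "hdfs://hdfs-host:8020/tmp/staging",
    "prevent_cleanup" -> "true",   // keep intermediary parquet files for debugging
    "strlen" -> "2048"             // override the default varchar size (1024) for all varchar columns
  ))
  .mode(SaveMode.Overwrite)
  .save()
```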

Bug Fixes

  • Handle Error When Copy Is Invalid: Prior to this change, users would not get a meaningful error when attempting to perform an invalid copy. Instead, the connector would give a fault tolerance error where 0 rows were copied and 0 were rejected. This change now provides a more meaningful error before the copied data is checked against the error tolerance.

Vertica Spark Connector V3.0.0 Release

29 Nov 02:12

Vertica Spark Connector Changelog

Vertica Spark Connector 3.0.0

Important Note: The change from V2 to V3 does not include any major updates. The reason for this change is to provide clarity on which versions of Spark the connector supports.

This release of Vertica's Spark Connector includes support for Spark 3.2.0 and support for passing the HDFS URI without the file-system prefix in the connector options.

New Additions:

  • Spark 3.2.0 Support: This change ensures compatibility with projects that use the Vertica Spark Connector and Spark 3.2.0.

  • Intermediary URI Simplification: This change allows the 'staging_fs_url' connector option to be specified without the filesystem prefix (e.g. 'user/v1/parquet' rather than 'hdfs://hacluster/user/v1/parquet').
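
For illustration, both of the following option values now point at the same staging location when the cluster's default filesystem is hdfs://hacluster (a sketch using the path from the example above):

```scala
// Equivalent after this change, assuming hdfs://hacluster is the default filesystem.
val fullUri   = Map("staging_fs_url" -> "hdfs://hacluster/user/v1/parquet")
val shortPath = Map("staging_fs_url" -> "user/v1/parquet")
```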

Vertica Spark Connector V2.0.4 Release

29 Oct 22:46
78316be

Vertica Spark Connector Changelog

Vertica Spark Connector 2.0.4

This release of Vertica's Spark Connector includes support for HDFS High Availability, for using the dbschema connector option when creating external tables from existing data, and for replacing any column type when creating external tables from existing data.

New Additions:

  • HDFS High Availability Support: This change ensures the delegation token is applied to an HDFS nameservice if HA is enabled, instead of a single namenode address. If HA is not enabled, the token is applied to the namenode address as usual.

Bug Fixes

  • Fix issue using dbschema with external tables: This change fixes the issue when trying to create an external table (using data existing on disk) with a specific database schema. Prior to this change, the schema remained public and the dbschema was appended to the table name.

  • Fix issue replacing varchar/varbinary types in external tables: This change fixes the issue when trying to use a Spark schema to replace varchar and varbinary data types in the create external table statement inferred from parquet data. Prior to this change, only partitioned columns with the data type "UNKNOWN" would be replaced.

Vertica Spark Connector V2.0.3 Release

28 Sep 23:44
cb71501

Vertica Spark Connector Changelog

Vertica Spark Connector 2.0.3

This release of Vertica's Spark Connector includes support for creating an external table from existing data, creating a kerberized sandbox environment, and using the Docker environment on Windows.

New Additions:

  • External Table Existing Data Support: Added option to create an external table out of existing data in HDFS/S3. This feature also supports providing a partial schema to accommodate partitioned parquet data.
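
A minimal sketch of creating an external table from existing partitioned parquet data, supplying a partial schema for the partition column. The create_external_table option name and its 'existing-data' value are assumptions here and should be checked against the project README; connection values and the schema are placeholders.

```scala
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("vertica-external-table-example").getOrCreate()

// Partial schema covering only the partitioned column whose type cannot be
// inferred from the parquet files (column name and type are placeholders).
val partialSchema = StructType(Seq(StructField("region", StringType)))
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], partialSchema)

emptyDf.write
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(Map(
    "host" -> "vertica-host",
    "db" -> "testdb",
    "user" -> "dbadmin",
    "password" -> "",
    "table" -> "external_sales",
    // Points at parquet data that already exists in HDFS/S3.
    "staging_fs_url" -> "hdfs://hdfs-host:8020/data/existing",
    // Assumed option name and value for this feature; verify against the README.
    "create_external_table" -> "existing-data"
  ))
  .mode(SaveMode.Overwrite)
  .save()
```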

Bug Fixes

  • Fix issue fetching webhdfs and swebhdfs delegation tokens: This change fixes issues with fetching the webhdfs and swebhdfs delegation tokens. The connector also searches for either the dfs.namenode.http-address or dfs.namenode.https-address Hadoop configuration option when setting the Hadoop impersonation config.

Sandbox Environment Improvements:

  • Simplify parameter to run-example.sh: The parameter was originally the entire path to the assembled jar file of the example. Now it has been simplified to just the example name.
  • Ability to create a kerberized sandbox environment: The original sandbox-clientenv script now accepts a parameter called 'kerberos', which spins up a kerberized Docker environment to more easily run the kerberos-example.
  • Windows support for Docker setup: The sandbox-clientenv batch file has been updated, along with the project documentation on how to set up (kerberized and non-kerberized) Docker on a Windows machine.

Vertica Spark Connector V2.0.2 Release

30 Aug 23:39
cf8c1bd

Vertica Spark Connector Changelog

Vertica Spark Connector 2.0.2

This release of Vertica's Spark Connector includes a published thin-jar, additional log statements in the connector, and changes to the sandbox client environment for added stability.

New Additions:

  • Published Thin Jar: Published a thin JAR file to the Maven repository, allowing users to manage transitive dependencies with the facilities of their build tools (see the sketch after this list).
  • Additional Log Statements: Added logs to the connector, including SQL statements and high-level indicators that let users know which major event is taking place. The additional detail allows for easier debugging.
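
For projects pulling the thin JAR from Maven, an sbt dependency along these lines could be used; the group/artifact coordinates and the 'slim' classifier are assumptions, so confirm them against the published artifacts.

```scala
// build.sbt -- coordinates and classifier are assumptions; verify on Maven Central.
libraryDependencies += "com.vertica.spark" % "vertica-spark" % "2.0.2" classifier "slim"
```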

Sandbox Environment Improvements:

  • Create single source of truth for example setup: Before this change, Spark and Hadoop were downloaded on the client container when an example was run for the first time. Now, the setup takes place only when the containers are composed up for the first time, allowing for faster example runs.
  • Fix Potential Active Sessions Bug: Increased active sessions from 55 to 100 in the sandbox container so users are not limited when running their own applications.

Vertica Spark Connector V2.0.1 Release

05 Aug 00:20
2629ba8

Vertica Spark Connector Changelog

Vertica Spark Connector 2.0.1

This is the second official release of the Vertica Spark Connector. It adds support for YARN, merge statements, and creating external tables, and fixes issues around running the examples in the provided Docker sandbox container.

New Features:

  • External Table Support: Added an option to create an external table from data written by the connector. Creating external tables from existing data in HDFS/S3 is not currently supported.
  • Sparklyr Support: Added Sparklyr read and write examples
  • Merge Statements: Added support for merging data from Spark with data in Vertica using the new merge_key connector option (see the sketch after this list).
  • Reserved Keywords: Added support for reserved keywords and special characters in table columns
  • WebHDFS: Support for accessing Hadoop from multiple languages without installing Hadoop. The examples provided show how to use WebHDFS in our connector.
  • YARN: Added support for running Spark applications on Yarn in a kerberized or non-kerberized environment
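
A minimal sketch of a merge write using the merge_key option named above; connection values, the staging path, and the key column are placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("vertica-merge-example").getOrCreate()
val updates = spark.read.parquet("hdfs://hdfs-host:8020/data/updates")   // placeholder source data

// Rows from Spark are merged into the existing Vertica table on the column(s)
// listed in merge_key.
updates.write
  .format("com.vertica.spark.datasource.VerticaSource")
  .options(Map(
    "host" -> "vertica-host",
    "db" -> "testdb",
    "user" -> "dbadmin",
    "password" -> "",
    "table" -> "sales",
    "staging_fs_url" -> "hdfs://hdfs-host:8020/tmp/staging",
    "merge_key" -> "id"
  ))
  .mode(SaveMode.Append)
  .save()
```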

Bug fixes: