
I have a Hive ORC table with a definition similar to the following:

CREATE EXTERNAL TABLE `example.example_table`(
  ...
  )
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
WITH SERDEPROPERTIES ( 
  'path'='s3a://path/to/table') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3a://path/to/table'
TBLPROPERTIES (
  ...
)

I am attempting to use PySpark to append a DataFrame to this table using df.write.insertInto("example.example_table"). When running this, I get the following error:

org.apache.spark.sql.AnalysisException: Can only write data to relations with a single path.;
    at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:188)
    at org.apache.spark.sql.execution.datasources.DataSourceAnalysis$$anonfun$apply$1.applyOrElse(DataSourceStrategy.scala:134)
    ...

Looking at the underlying Scala code, the condition that throws this error checks whether the table relation has multiple "rootPaths". My table is clearly defined with a single location, so what else could cause this?
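
For reference, the write call looks roughly like the sketch below (the session setup, schema, and data are placeholders, not the actual job):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("orc-insert-example")
         .enableHiveSupport()   # needed so the Hive metastore table is visible
         .getOrCreate())

# Placeholder DataFrame whose schema matches the target table
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# This is the call that raises the AnalysisException above
df.write.insertInto("example.example_table")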

2 Answers


It is the path you are defining in SERDEPROPERTIES that causes the error. I just ran into this same problem myself. Hive generates a location based on the hive.metastore.warehouse.dir property, so the table ends up with that generated location plus the path you specified, which is what makes the linked check fail.

If you want to pick a specific path other than the default, then try using LOCATION.
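
In other words, drop the 'path' entry from SERDEPROPERTIES and rely on LOCATION alone, so that Spark only sees a single root path. A rough sketch of recreating the table that way from PySpark (the columns are placeholders; since the table is external, dropping the old definition does not touch the data in S3):

# Drop the old definition and recreate it without a 'path' serde property
spark.sql("DROP TABLE IF EXISTS example.example_table")
spark.sql("""
    CREATE EXTERNAL TABLE example.example_table (
      id INT,
      value STRING
    )
    STORED AS ORC
    LOCATION 's3a://path/to/table'
""")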

Try running a describe extended example.example_table query to see more detailed information on the table. One of the output rows will be a Detailed Table Information entry, which contains a bunch of useful information:

Table(
  tableName:
  dbName:
  owner:
  createTime:1548335003
  lastAccessTime:0
  retention:0
  sd:StorageDescriptor(cols:
    location:[*path_to_table*]
    inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
    compressed:false
    numBuckets:-1
    serdeInfo:SerDeInfo(
      name:null
      serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      parameters:{
        serialization.format=1
        path=[*path_to_table*]
      }
    )
    bucketCols:[]
    sortCols:[]
    parameters:{}
    skewedInfo:SkewedInfo(skewedColNames:[]
    skewedColValues:[]
    skewedColValueLocationMaps:{})
    storedAsSubDirectories:false
  )
  partitionKeys:[]
  parameters:{transient_lastDdlTime=1548335003}
  viewOriginalText:null
  viewExpandedText:null
  tableType:MANAGED_TABLE
  rewriteEnabled:false
)
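
If you are working from PySpark rather than the Hive CLI, something like this should surface the same details (a sketch; the exact layout of the output varies between Hive and Spark versions):

# Show the full extended description, including the location and serde properties
spark.sql("DESCRIBE EXTENDED example.example_table").show(100, truncate=False)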

5 Comments

In my case, the "location" and "path" values of the "describe extended" query match and are what I was expecting. I am using an external table as well.
Mine also matched, but I think that is the problem, that Spark is seeing both of them. Since the location is automatically generated, you don't need to specify the path manually.
Hi @conrosebraugh, any luck on this one? Did you overcome this problem?
@Pavan_Obj unfortunately I did not find a workaround. I think I ended up using Hive for the class of tables that had this issue so that I could move on to other projects. I should've opened a bug with the Apache Spark team.
I am glad that you responded, Conroe :)

We had the same problem in a project when migrating from Spark 1.x and HDFS to Spark 3.x and S3. We solved the issue by setting the following Spark property to false:

spark.sql.hive.convertMetastoreParquet

You can just run

spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")

Or, equivalently:

spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

Here spark is the SparkSession object. The explanation of this property is in the Spark documentation.
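
A minimal sketch of how this looks in practice (note that convertMetastoreParquet only affects Parquet-backed tables; for an ORC table like the one in the question, the analogous property is spark.sql.hive.convertMetastoreOrc):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-write-example")
         .enableHiveSupport()
         .getOrCreate())

# Have Spark hand Parquet tables to Hive's serde instead of its built-in datasource path
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

# For ORC-backed Hive tables the corresponding setting is:
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

# Subsequent inserts then go through the Hive serde path
df = spark.createDataFrame([(1, "a")], ["id", "value"])  # placeholder data
df.write.insertInto("example.example_table")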

