
I am trying to insert data into an external Hive table through Spark SQL. My Hive table is bucketed by a column. The query to create the external Hive table is this:

create external table tab1 ( col1 type,col2 type,col3 type) clustered by (col1,col2) sorted by (col1) into 8 buckets stored as parquet

Now I tried to store data from a Parquet file (stored in HDFS) into the table. This is my code:

    SparkSession session = SparkSession.builder().appName("ParquetReadWrite").
                    config("hive.exec.dynamic.partition", "true").
                    config("hive.exec.dynamic.partition.mode", "nonstrict").
                    config("hive.execution.engine","tez").
                    config("hive.exec.max.dynamic.partitions","400").
                    config("hive.exec.max.dynamic.partitions.pernode","400").
                    config("hive.enforce.bucketing","true").
                    config("optimize.sort.dynamic.partitionining","true").
                    config("hive.vectorized.execution.enabled","true").
                    config("hive.enforce.sorting","true").
                    enableHiveSupport()
                    .master(args[0]).getOrCreate();
    String insertSql = "insert into tab1 select * from"+"'"+parquetInput+"'";

    session.sql(insertSql);
  1. When I run the code, it throws the below error:

    mismatched input ''hdfs://url:port/user/clsadmin/somedata.parquet'' expecting (line 1, pos 50)

    == SQL ==
    insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
    --------------------------------------------------^^^

    at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
    at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
    at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
    
  2. What is the difference between using Tez and Spark as the Hive execution engine?

3 Comments
  • Have you tried the LOAD command? cwiki.apache.org/confluence/display/Hive/… Commented Oct 4, 2018 at 4:08
  • @Constantine, no I have not tried the LOAD command yet, but is there a reason why the INSERT INTO statement will not work? Commented Oct 4, 2018 at 4:15
  • I am not sure whether you can run an insert from a directory; I have not encountered this command before. Commented Oct 4, 2018 at 5:45

2 Answers


Have you tried

LOAD DATA LOCAL INPATH '/path/to/data'
OVERWRITE INTO TABLE tablename;
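
For reference, a rough sketch of issuing such a statement through the SparkSession from the question (the table name tab1 and the HDFS path are taken from the question; LOCAL is dropped here because the file sits in HDFS, not on the local filesystem):

    // Sketch only: run the LOAD DATA statement via Spark SQL.
    // Note that LOAD DATA moves the files as they are; it does not
    // re-bucket or re-sort the data for a bucketed table.
    String loadSql = "LOAD DATA INPATH 'hdfs://url:port/user/clsadmin/somedata.parquet' "
            + "OVERWRITE INTO TABLE tab1";
    session.sql(loadSql);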


2 Comments

Is there any reason why the INSERT INTO command will not work?
Because the parquetInput variable is an HDFS directory, not a table. You can't SELECT FROM an HDFS directory. You can only CREATE EXTERNAL TABLE and set its location to the HDFS directory.

When creating an external table in Hive, the HDFS location has to be specified:

create external table tab1 ( col1 type,col2 type,col3 type) 
clustered by (col1,col2) sorted by (col1) into 8 buckets 
stored as parquet 
LOCATION 'hdfs://url:port/user/clsadmin/tab1'

Hive does not have to populate the data itself; the same application or another application can ingest data into that location, and Hive will access the data by defining the schema on top of the location.
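
For example, here is a minimal sketch of a Spark job dropping Parquet data into that location (the input path is the one from the question; the target path matches the LOCATION above, and this simply copies the files, so it does not bucket the data):

    // Sketch only: write Parquet files straight into the external table's location.
    // Hive reads them because the table's schema is defined on top of that path.
    session.read().parquet("hdfs://url:port/user/clsadmin/somedata.parquet")
           .write().mode("append").parquet("hdfs://url:port/user/clsadmin/tab1");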

    == SQL ==
    insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
    --------------------------------------------------^^^

parquetInput is a Parquet HDFS file path, not a Hive table name; hence the error.

There are two ways you can solve this issue:

  1. Define an external table over "parquetInput" and use that table name in the query (see the sketch after this list).
  2. Use LOAD DATA INPATH 'hdfs://url:port/user/clsadmin/somedata.parquet' INTO TABLE tab1
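
A rough sketch of option 1, assuming the placeholder columns from the DDL in the question and a staging table name (tab1_staging) chosen here only for illustration; the LOCATION should point at the directory holding the Parquet files:

    // Sketch only: expose the Parquet path as an external table, then INSERT ... SELECT.
    // Depending on the Spark version, the insert may still not honour Hive bucketing
    // (see the comments below).
    session.sql("CREATE EXTERNAL TABLE IF NOT EXISTS tab1_staging "
            + "(col1 type, col2 type, col3 type) STORED AS PARQUET "
            + "LOCATION 'hdfs://url:port/user/clsadmin/somedata.parquet'");
    session.sql("INSERT INTO tab1 SELECT * FROM tab1_staging");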

3 Comments

But with this approach the Parquet file is just getting transferred to the location hdfs://url:port/user/clsadmin. It is not honouring the number of buckets that was defined, so when I look into hdfs://url:port/user/clsadmin I see that the number of partition files is not the same as the number of buckets I had defined in the table.
@AyanBiswas - That is true. Please use the DataFrame API and repartition the DataFrame into the number of buckets; that will solve your problem. Refer to stackoverflow.com/questions/48585744/…
Ok, but from this link stackoverflow.com/questions/48585744/… I felt that using Spark SQL to store the data into the Hive external table would solve the problem. The reason I am inclined towards bucketing is the performance improvement in join queries. Hopefully repartitioning the data and storing it in the Hive table will not degrade the join performance.
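
For what it's worth, a minimal sketch of the DataFrame approach suggested above (the output table name tab1_bucketed is hypothetical, and note that Spark's bucketBy uses Spark's own bucketing layout, which Hive does not recognise as Hive-bucketed):

    // Sketch only: read the Parquet input and write a bucketed, sorted table
    // through the DataFrame API. bucketBy/sortBy are only supported with saveAsTable.
    session.read().parquet("hdfs://url:port/user/clsadmin/somedata.parquet")
           .repartition(8)                  // match the 8 buckets of the target table
           .write()
           .bucketBy(8, "col1", "col2")
           .sortBy("col1")
           .format("parquet")
           .saveAsTable("tab1_bucketed");   // hypothetical table name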
