
I know that Spark SQL is almost the same as Hive.

Now I have created a table, and when I run a Spark SQL query to create an index on it, it always gives me this error:

Error in SQL statement: AnalysisException: mismatched input '' expecting AS near ')' in create index statement

[screenshot of the test table]

The Spark SQL query I am using is:

CREATE INDEX word_idx ON TABLE t (id)

The data type of id is bigint. Before this, I also tried to create an index on the "word" column of this table, and it gave me the same error.

So, is there any way to create an index through a Spark SQL query?

1 Answer


There's really no way to do this through a Spark SQL query: Spark SQL does not support CREATE INDEX statements. But there's an RDD method called zipWithIndex. You can convert the DataFrame to an RDD, call zipWithIndex, and convert the resulting RDD back to a DataFrame.

See this community Wiki article for a full-blown solution.

Another approach could be to use the Spark MLlib StringIndexer.


3 Comments

Yes, I am using zipWithIndex on some RDDs, but for this one I need to create an index on a specific column, and zipWithIndex is not very convenient for that: I would need to separate the data first, use zipWithIndex, then join. I am wondering whether there is a simpler way.
Maybe take a look at the MLlib StringIndexer?
If you don't need the ID to be sequential, you could look at monotonically_increasing_id().
