
I know that Spark SQL is almost the same as Hive.

Now I have created a table, and when I run a Spark SQL query to create an index on it, it always gives me this error:

Error in SQL statement: AnalysisException: mismatched input '' expecting AS near ')' in create index statement

[screenshot of the test table]

The Spark SQL query I am using is:

CREATE INDEX word_idx ON TABLE t (id)

The data type of id is bigint. Before this, I also tried to create an index on the "word" column of this table, and it gave me the same error.

So, is there any way to create an index through a Spark SQL query?

1 Answer


There's really no way to do this through a Spark SQL query: Spark SQL does not support CREATE INDEX statements. But there's an RDD method called zipWithIndex. You can convert the DataFrame to an RDD, call zipWithIndex, and convert the resulting RDD back to a DataFrame.

See this community Wiki article for a full-blown solution.

Another approach could be to use the Spark MLlib StringIndexer.


3 Comments

Yes, I am using zipWithIndex on some RDDs, but for this one I need to create an index on a specific column, and zipWithIndex is not very convenient for that: I would need to separate the data first, use zipWithIndex, then join. I am wondering whether there is a simpler way.
Maybe take a look at the MLlib StringIndexer?
If you don't need the ID to be sequential, you could look at monotonically_increasing_id().
