
I have a String like the one below: lines are separated by newlines and fields by spaces. The first row is the header.

col1 col2 col3 col4 col5 col6 col7 col8
val1 val2 val3 val4 val5 val6 val7 val8
val9 val10 val11 val12 val13 val14 val15 val16
val17 val18 val19 val20 val21 val22 val23 val24

How can I build a Spark DataFrame from this String in Java?

2 Answers


I believe @Shankar Koirala has already provided a Java solution by treating the text/string file as a CSV file (with the custom separator " " instead of ","). Below is a Scala equivalent of the same approach:

val spark = org.apache.spark.sql.SparkSession.builder.
  master("local").
  appName("Spark custom CSV").
  getOrCreate

val df = spark.read.
  format("csv").
  option("header", "true").
  option("delimiter", " ").
  load("/path/to/textfile")

df.show
+-----+-----+-----+-----+-----+-----+-----+-----+
| col1| col2| col3| col4| col5| col6| col7| col8|
+-----+-----+-----+-----+-----+-----+-----+-----+
| val1| val2| val3| val4| val5| val6| val7| val8|
| val9|val10|val11|val12|val13|val14|val15|val16|
|val17|val18|val19|val20|val21|val22|val23|val24|
+-----+-----+-----+-----+-----+-----+-----+-----+

[UPDATE] Create DataFrame from string content

val s: String = """col1 col2 col3 col4 col5 col6 col7 col8
                  |val1 val2 val3 val4 val5 val6 val7 val8
                  |val9 val10 val11 val12 val13 val14 val15 val16
                  |val17 val18 val19 val20 val21 val22 val23 val24
                  |""".stripMargin

// remove header line
val s2 = s.substring(s.indexOf('\n') + 1)

// create RDD (sc is the SparkContext, i.e. spark.sparkContext)
val rdd = sc.parallelize( s2.split("\n").map(_.split(" ")) )

// create DataFrame (toDF comes from the session implicits)
import spark.implicits._
val df = rdd.map{ case Array(c1, c2, c3, c4, c5, c6, c7, c8) => (c1, c2, c3, c4, c5, c6, c7, c8) }.
  toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")

df.show
+-----+-----+-----+-----+-----+-----+-----+-----+
| col1| col2| col3| col4| col5| col6| col7| col8|
+-----+-----+-----+-----+-----+-----+-----+-----+
| val1| val2| val3| val4| val5| val6| val7| val8|
| val9|val10|val11|val12|val13|val14|val15|val16|
|val17|val18|val19|val20|val21|val22|val23|val24|
+-----+-----+-----+-----+-----+-----+-----+-----+

4 Comments

Yes @Shankar, and you are giving a solution that reads from a csv file. What I have is a String extracted from a particular file. I don't want to write the String back into a csv file and read it again. How can I convert the String I have into a DataFrame?
Ahh, my oversight. Please see expanded answer (code is in Scala, though).
Could you help me with this piece of code in Java? sc.parallelize( s2.split("\n").map(_.split(" ")) )
One way to process the string in Java would be similar to this.
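For a small in-memory string you do not strictly need parallelize on the Java side; here is a rough sketch of the same idea (the class name and the inlined sample string are just for illustration, not taken from the linked code):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class StringToDataFrame {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("String to DataFrame")
        .getOrCreate();

    // The raw string: first line is the header, fields are space-separated.
    String s = "col1 col2 col3 col4 col5 col6 col7 col8\n"
             + "val1 val2 val3 val4 val5 val6 val7 val8\n"
             + "val9 val10 val11 val12 val13 val14 val15 val16\n"
             + "val17 val18 val19 val20 val21 val22 val23 val24";

    String[] lines = s.split("\n");

    // Build the schema from the header row.
    List<StructField> fields = Arrays.stream(lines[0].split(" "))
        .map(name -> DataTypes.createStructField(name, DataTypes.StringType, true))
        .collect(Collectors.toList());
    StructType schema = DataTypes.createStructType(fields);

    // Turn each remaining line into a Row of its space-separated fields.
    List<Row> rows = Arrays.stream(lines).skip(1)
        .map(line -> RowFactory.create((Object[]) line.split(" ")))
        .collect(Collectors.toList());

    Dataset<Row> df = spark.createDataFrame(rows, schema);
    df.show();
  }
}

If you want to mirror the Scala code more literally, you could instead wrap the rows in a JavaRDD via JavaSparkContext.parallelize and pass that to createDataFrame together with the schema.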

You can read a csv file with the Spark Java API as follows. Create the Spark session:

SparkSession spark = SparkSession.builder()
  .master("local[*]")
  .appName("Example")
  .getOrCreate();

//read file with header true and delimiter as " " (space)
Dataset<Row> df = spark.read()
    .option("delimiter", " ")
    .option("header", true)
    .csv("path to file");
df.show();

1 Comment

It's not a csv file, it's a string.
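If you want to keep this answer's CSV-reader approach but feed it the in-memory string instead of a file, Spark 2.2+ can parse a Dataset<String> as CSV. A minimal sketch under that assumption (the class name and sample string are illustrative):

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvFromString {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("Example")
        .getOrCreate();

    String s = "col1 col2 col3 col4 col5 col6 col7 col8\n"
             + "val1 val2 val3 val4 val5 val6 val7 val8";

    // Wrap the string as a Dataset<String>, one element per line,
    // and let the CSV reader parse it with a space delimiter.
    Dataset<String> lines = spark.createDataset(
        Arrays.asList(s.split("\n")), Encoders.STRING());

    Dataset<Row> df = spark.read()
        .option("delimiter", " ")
        .option("header", true)
        .csv(lines);

    df.show();
  }
}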
