
I am working in Spark, using Scala.

I have two CSV files: one contains the column names, and the other contains the data. How can I combine them into a single resultant file with both schema and data? I then have to apply operations like groupBy and count on that file, because I need to count the distinct values in those columns.

Any help here would be really appreciated.

I wrote the code below: after reading both files, I made a DataFrame from each, then combined the two DataFrames with a union. Now, how can I make the first row the schema? Or is there any other way to proceed with this? Any suggestions are welcome.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf().setMaster("local[4]").setAppName("hbase sql")
val sc = new SparkContext(sparkConf)
val spark1 = SparkSession.builder().config(sc.getConf).getOrCreate()
val sqlContext = spark1.sqlContext

val spark = SparkSession
  .builder
  .appName("SparkSQL")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Read both files as plain text and split each line on "|".
val lines = spark1.sparkContext
  .textFile("C:/Users/ayushgup/Downloads/home_data_usage_2018122723_1372672.csv")
  .map(line => line.split("""\|"""))
  .toDF()
val header = spark1.sparkContext
  .textFile("C:/Users/ayushgup/Downloads/Header.csv")
  .map(line => line.split("""\|"""))
  .toDF()

// Prepend the header row to the data rows.
val file = header.unionAll(lines).toDF()

1 Answer

spark.sparkContext.textFile() returns an RDD and will not infer a schema, even if you call .toDF() on top of that RDD.

sc.textFile() is meant for reading unstructured text files. You should use

spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("..path.to.csv")

to get the schema from the header row.

It is better to cat the files together, creating a new CSV, and then read it from HDFS:

cat header.csv home_data_usage_2018122723_1372672.csv >> new_home_data_usage.csv

and then

hadoop fs -copyFromLocal new_home_data_usage.csv <hdfs_path>

then use

spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("..path.to.csv")
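
For example, here is a runnable sketch of that final read, followed by the groupBy/count and distinct-count operations the question asks about. The HDFS path and the colA column name below are placeholders, and sep is set to "|" on the assumption that the merged file keeps the pipe delimiter the question's code splits on:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder()
  .appName("SparkSQL")
  .master("local[*]")
  .getOrCreate()

// Read the merged file, letting the header row supply the column names.
// "sep" defaults to ",", so it is set to "|" here to match the data.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", "|")
  .load("hdfs:///path/to/new_home_data_usage.csv") // placeholder path

df.printSchema()

// Count rows per value, and count distinct values, for one column.
// "colA" is a placeholder for a real column name.
df.groupBy("colA").count().show()
df.select(countDistinct("colA")).show()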

1 Comment

Is there any other way to combine the two files and read the data with the columns directly, using Spark and Scala, without using HDFS or its commands?
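
As a sketch of one HDFS-free alternative (assuming Header.csv holds a single pipe-delimited line of column names), you can read the header file on its own and apply its values as column names with toDF:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQL")
  .master("local[*]")
  .getOrCreate()

// Take the single line of Header.csv and split it into column names.
val headerCols = spark.read
  .textFile("C:/Users/ayushgup/Downloads/Header.csv")
  .first()
  .split("""\|""")

// Read the data file without a header, then replace the generated
// _c0, _c1, ... column names with the names from the header file.
val data = spark.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("sep", "|")
  .load("C:/Users/ayushgup/Downloads/home_data_usage_2018122723_1372672.csv")
  .toDF(headerCols: _*)

data.printSchema()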
