I am trying to remove everything from ROW FORMAT SERDE onward with the gsub function, but it does not work. Any suggestions?

x <- c("CREATE TABLE `cld_ml_bi_eng.iris`(", "  `sepal_length` double, ", 
  "  `sepal_width` double, ", "  `petal_length` double, ", "  `petal_width` double, ", 
  "  `species` string)", "ROW FORMAT SERDE ", "  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' ", 
  "STORED AS INPUTFORMAT ", "  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' ", 
  "OUTPUTFORMAT ", "  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'", 
  "LOCATION", "  'hdfs://haprod/warehouse/tablespace/managed/hive/cld_ml_bi_eng.db/iris'", 
  "TBLPROPERTIES (", "  'bucketing_version'='2', ", "  'transactional'='true', ", 
  "  'transactional_properties'='default', ", "  'transient_lastDdlTime'='1636686825')")

Here is my gsub attempt:

gsub(pattern = "(ROW FORMAT SERDE).*", replacement = "\\1", x = x)

My expected output:

c("CREATE TABLE `cld_ml_bi_eng.iris`(", "  `sepal_length` double, ", 
  "  `sepal_width` double, ", "  `petal_length` double, ", "  `petal_width` double, ", 
  "  `species` string)")
  • Each of your lines is a separate element of the vector, and gsub works on each element independently. You either need to paste them together first for gsub to work, or just select the chunk: head(x, grep("ROW FORMAT SERDE\\s+", x)-1) Commented Nov 12, 2021 at 3:35

1 Answer

One approach would be to use grep to find the index of the string in your input vector which starts with the text ROW FORMAT SERDE. Then, subset the input vector and paste into a single string:

paste0(x[1:(grep("^ROW FORMAT SERDE", x)-1)], collapse="")

[1] "CREATE TABLE `cld_ml_bi_eng.iris`(  `sepal_length` double,   `sepal_width` double,   `petal_length` double,   `petal_width` double,   `species` string)"
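
If you would rather keep the result as a character vector, as in the expected output, a minimal sketch along the lines of the head() suggestion in the comment could look like this. The vector below is a shortened stand-in for the question's x, and the names idx and kept are only illustrative. The guard for a missing marker is an extra assumption, not something the question asked for.

```r
# Shortened stand-in for the question's vector (the real x has more lines)
x <- c("CREATE TABLE `cld_ml_bi_eng.iris`(",
       "  `sepal_length` double, ",
       "  `species` string)",
       "ROW FORMAT SERDE ",
       "  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' ",
       "STORED AS INPUTFORMAT ")

# Index of the first element that starts with the marker text;
# indexing an empty grep() result with [1] yields NA
idx <- grep("^ROW FORMAT SERDE", x)[1]

# Keep the elements before the marker; if the marker is absent, keep x as is
kept <- if (is.na(idx)) x else head(x, idx - 1)
kept
```

Unlike the paste0 version, this leaves each line as its own element, so it can still be collapsed later with paste0(kept, collapse = "") if a single string is needed.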
