1

I have the following df where two columns ara labeled with the same name:

dput(df_test)
structure(list(X = c("Gen", "ABCB1", "ABCG2", "CES1"), X.1 = c("Prioridad del gen", 
"Candidato", "Candidato", "Candidato"), X.2 = c("Región codificante", 
"2110", "1526", "3533"), X.3 = c("Categoría Reg. Codif.", "intron", 
"intron", "intron"), X.4 = c("Alineamiento múltiple", "No", "No", 
"No"), X.5 = c("Cromosoma", "7", "4", "16"), X.6 = c("Posición inicial", 
"87153584", "89096060", "55855151"), X.7 = c("Posición final", 
"87153585", "89096061", "55855151"), X.8 = c("Tamaño (pb)", "2", 
"2", "1"), X.9 = c("Nº pb cob. ? 15X", "0", "1", "0"), X.10 = c("Nº pb cob. ? 15X", 
"2", "1", "1"), X.11 = c("% pb cob. ? 15X", "0%", "50%", "0%"
), X.12 = c("Cobertura media", "3", "14,50", "0"), X.13 = c("Nº pb sin cubrir", 
"0", "0", "1"), X.14 = c("Nº pb cob. [1-5]", "2", "0", "0"), 
    X.15 = c("Nº pb cob. [6-14]", "0", "1", "0"), X.16 = c("Nº pb cob. [15-29]", 
    "0", "1", "0"), X.17 = c("Nº pb cob. ? 30X", "0", "0", "0"
    )), class = "data.frame", row.names = c(NA, -4L))

Because the first raw is empty in the original file, the real header becomes part of the df instead of being used as the header. Hence, I use row_to_names to move up the raw containing the names:

df1 <- read.delim("file", header = T) %>% row_to_names(row_number = 1)

Now I need to rename the columns "Nº pb cob. ? 15X" as "Nº pb cob. ≥ 15X" and " Nº pb cob. ≤ 15X", respectively. I've tried with:

  • clean_ rename_at clean_names() after row_to_names() and didn't change anything.

  • rename_vars and rename_at didn't work out either.

    df1 <- rename_at(df1, 10, ~"Num pb cob. ≥ 15X") Error in combine_names(): ! Can't rename duplicate variables to {name}. Run rlang::last_error() to see where the error occurred.

Could any one give me some advice¿?

Thanks!!

5
  • 2
    I recommend fixing the initial file loading rather than fiddling with the headers afterwards. Can’t you read the table with e.g. readr::read_tsv? Commented Feb 22, 2022 at 13:05
  • Please also add the first line of the csv file to teh question and not just dput(df_test) so we can replicate the parsing and modification of column names Commented Feb 22, 2022 at 13:13
  • Thanks Konrad, but I did not consider this option because I have 50 files with the same format, therefore, it would take me a lot of time. Commented Feb 22, 2022 at 13:15
  • 1
    With 50 files to fix, invest a little time in automating the fix. I believe @KonradRudolph's advice is very sound. It shouldn't be too difficult. We really need to see a (few) sample(s) of your raw files rather than the df you have after import. Commented Feb 22, 2022 at 13:43
  • 1
    @MireiaBoluda To clarify, my suggestion wasn’t to fix the files but rather to fix the way you’re reading them. As mentioned by danlooo, it would be helpful to see the first few lines of your files to understand why the reading is going so wrong. It’s entirely possible that my suggestion isn’t practical but we’d need to see the files to judge that. Commented Feb 22, 2022 at 13:47

2 Answers 2

1

I did not find a way to place those especial characters in a data frame variable name, so I used a minor variation.

The idea was to create a function that cleans your data, that way you can apply this function to all your files.

library(stringr)
library(purrr)

test <- structure(
  list(
    X = c("Gen", "ABCB1", "ABCG2", "CES1"), 
    X.1 = c("Prioridad del gen","Candidato", "Candidato", "Candidato"),
    X.2 = c("Región codificante","2110", "1526", "3533"), 
    X.3 = c("Categoría Reg. Codif.", "intron", "intron", "intron"),
    X.4 = c("Alineamiento múltiple", "No", "No", "No"),
    X.5 = c("Cromosoma", "7", "4", "16"),
    X.6 = c("Posición inicial", "87153584", "89096060", "55855151"),
    X.7 = c("Posición final", "87153585", "89096061", "55855151"), 
    X.8 = c("Tamaño (pb)", "2", "2", "1"), 
    X.9 = c("Nº pb cob. ? 15X", "0", "1", "0"), 
    X.10 = c("Nº pb cob. ? 15X", "2", "1", "1"), 
    X.11 = c("% pb cob. ? 15X", "0%", "50%", "0%"),
    X.12 = c("Cobertura media", "3", "14,50", "0"), 
    X.13 = c("Nº pb sin cubrir", "0", "0", "1"), 
    X.14 = c("Nº pb cob. [1-5]", "2", "0", "0"), 
    X.15 = c("Nº pb cob. [6-14]", "0", "1", "0"), 
    X.16 = c("Nº pb cob. [15-29]", "0", "1", "0"),
    X.17 = c("Nº pb cob. ? 30X", "0", "0", "0")), 
  class = "data.frame", row.names = c(NA, -4L))

# Function to clean the names as you need
clean_df_names <- function(df) {
  df_names <- df[1, ] %>%
    unlist(use.names = FALSE)
  
  repeated_names <- which(df_names == 'Nº pb cob. ? 15X')
  
  
  #name_symbols <- c('\u2265', '\u2264') # these are the unicode symbols, but can not be used in df names
  name_symbols <- c('>=', '<=')
  
  new_names <- purrr::map2_chr(
    df_names[repeated_names], name_symbols,
    ~stringr::str_replace(.x, '\\?', .y)
  )
  
  df_names[repeated_names] <- new_names
  
  new_df <- df[-1, ]
  
  setNames(new_df, df_names)
}

test <- clean_df_names(test)

str(test)
#> 'data.frame':    3 obs. of  18 variables:
#>  $ Gen                  : chr  "ABCB1" "ABCG2" "CES1"
#>  $ Prioridad del gen    : chr  "Candidato" "Candidato" "Candidato"
#>  $ Región codificante   : chr  "2110" "1526" "3533"
#>  $ Categoría Reg. Codif.: chr  "intron" "intron" "intron"
#>  $ Alineamiento múltiple: chr  "No" "No" "No"
#>  $ Cromosoma            : chr  "7" "4" "16"
#>  $ Posición inicial     : chr  "87153584" "89096060" "55855151"
#>  $ Posición final       : chr  "87153585" "89096061" "55855151"
#>  $ Tamaño (pb)          : chr  "2" "2" "1"
#>  $ Nº pb cob. >= 15X    : chr  "0" "1" "0"
#>  $ Nº pb cob. <= 15X    : chr  "2" "1" "1"
#>  $ % pb cob. ? 15X      : chr  "0%" "50%" "0%"
#>  $ Cobertura media      : chr  "3" "14,50" "0"
#>  $ Nº pb sin cubrir     : chr  "0" "0" "1"
#>  $ Nº pb cob. [1-5]     : chr  "2" "0" "0"
#>  $ Nº pb cob. [6-14]    : chr  "0" "1" "0"
#>  $ Nº pb cob. [15-29]   : chr  "0" "1" "0"
#>  $ Nº pb cob. ? 30X     : chr  "0" "0" "0"

Created on 2022-02-22 by the reprex package (v2.0.1)

Sign up to request clarification or add additional context in comments.

1 Comment

This worked out just fine! Many thanks :)
0

You can do it manually using backticks:

library(tidyverse)
df <-tibble(`Nº pb cob. ? 15X` = seq(2))
df
#> # A tibble: 2 x 1
#>   `Nº pb cob. ? 15X`
#>                <int>
#> 1                  1
#> 2                  2
rename(df, `Nº pb cob. ≤ 15X` = `Nº pb cob. ? 15X`)
#> # A tibble: 2 x 1
#>   `Nº pb cob. ≤ 15X`
#>                <int>
#> 1                  1
#> 2                  2

Created on 2022-02-22 by the reprex package (v2.0.0)

2 Comments

Note that he have two columns with the same name.
exactly, that is why rename does not work

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.