39

How can I detect non-ASCII characters in a vector of strings in a grep-like fashion? For example, for the vector below I'd like to return c(1, 3) or c(TRUE, FALSE, TRUE, FALSE):

x <- c("façile test of showNonASCII(): details{", 
    "This is a good line", "This has an ümlaut in it.", "OK again. }")

Attempt:

y <- tools::showNonASCII(x)
str(y)
p <- capture.output(tools::showNonASCII(x))
  • 3
    Maybe stringi::stri_enc_mark(x)? Commented Jan 5, 2016 at 14:20
  • 1
    @David I think that will do it... can you throw down as an answer. Maybe others will see an issue with it or have different solutions. Commented Jan 5, 2016 at 14:23
  • 1
    Why not fix the code so it handles Unicode properly instead? Commented Jan 5, 2016 at 14:26
  • 1
    @PanagiotisKanavos I will, that's easy, but this is to validate strings so I first need to detect if there's a problem with the data so as to inform the client. Commented Jan 5, 2016 at 14:29
  • 3
    @PanagiotisKanavos Because it's data from a client. We want it in a particular format. A non-standard data format is a data scientist's enemy, particularly if you're trying to automate a task. It's far easier and cheaper to get clients to put data in the correct format than to try to clean up and address unforeseen errors later. Commented Jan 5, 2016 at 14:42

5 Answers

31

I came across this later. It uses a pure base R regex and is very simple:

grepl("[^ -~]", x)
## [1]  TRUE FALSE  TRUE FALSE

More here: http://www.catonmat.net/blog/my-favorite-regex/
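
One caveat with this pattern: [ -~] covers only the printable ASCII range, so tabs and other control characters are flagged as well, and R's ?regex help warns that character ranges can be locale-dependent in the default engine. Below is a minimal sketch, assuming perl = TRUE so that ranges are compared by code point, that contrasts it with a strictly non-ASCII pattern. The sample vector x2 is made up for illustration.

```r
# Illustrative input: one accented string, one plain string, one with a tab
x2 <- c("fa\u00e7ile", "plain text", "tab\there")

# Anything outside printable ASCII (space through tilde); the tab is flagged
not_printable <- grepl("[^ -~]", x2, perl = TRUE)

# Anything outside the 7-bit ASCII range; the tab is not flagged
not_ascii <- grepl("[^\\x01-\\x7F]", x2, perl = TRUE)

print(not_printable)
## [1]  TRUE FALSE  TRUE
print(not_ascii)
## [1]  TRUE FALSE FALSE
```

Which of the two is appropriate depends on whether control characters should count as a problem in the data being validated.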

2 Comments

From the link: "[ -~] matches all ASCII characters from the space to tilde. What are these characters? These are all printable characters!"
Short, simple, smart: simply beautiful!
22

Another possible way is to convert your strings to ASCII and then detect the non-printable control characters that stringi substitutes for everything that could not be converted:

grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1]  TRUE FALSE  TRUE FALSE

Though it seems stringi has a built-in function for this kind of thing too:

stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII"  "latin1" "ASCII" 
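
To get the grep-like result asked for in the question, the encoding marks can be compared against "ASCII". This is a rough sketch, assuming the stringi package is installed (hence the requireNamespace() guard); the input is written with \u escapes, so the non-ASCII element will be marked "UTF-8" rather than "latin1" as in the output above.

```r
# Illustrative input; \u escapes keep the source file itself pure ASCII
x <- c("fa\u00e7ile test", "This is a good line")

if (requireNamespace("stringi", quietly = TRUE)) {
  # stri_enc_mark() returns "ASCII" for strings with only ASCII characters
  non_ascii <- stringi::stri_enc_mark(x) != "ASCII"
  print(non_ascii)         # logical vector, like grepl()
  print(which(non_ascii))  # indices, like grep()
}
```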

1 Comment

Both solutions are terrific. This one is a bit more compact and may be more robust to other encodings, though, admittedly, I know very little about encodings.
13

Why don't you extract the relevant code from showNonASCII?

x <- c("façile test of showNonASCII(): details{", 
       "This is a good line", "This has an ümlaut in it.", "OK again. }")

grepNonASCII <- function(x) {
  asc <- iconv(x, "latin1", "ASCII")
  ind <- is.na(asc) | asc != x
  which(ind)
}

grepNonASCII(x)
#[1] 1 3
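
For comparison, here is a byte-level check that does not need a source encoding to be declared. It is only a sketch: has_non_ascii is a hypothetical helper name, and it relies on the assumption that the strings are in an ASCII-compatible encoding such as UTF-8 or latin1, where every non-ASCII character has at least one byte above 127.

```r
# Hypothetical helper: TRUE where any byte of the string is above 127
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) !is.na(s) && any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Illustrative input, including a missing value (reported as FALSE here)
x3 <- c("fa\u00e7ile", "This is a good line", "\u00fcmlaut", "OK again. }", NA)

flags <- has_non_ascii(x3)
print(flags)
## [1]  TRUE FALSE  TRUE FALSE FALSE
print(which(flags))
## [1] 1 3
```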

4 Comments

The iconv function appears to remove the variable label attributes randomly in a dataframe when applied. What could be reasons?
@Heatshock I have no idea what you are doing. iconv should preserve attributes.
I read in a SAS dataset in xpt format. Here is the code: dv <- read_xpt('adsl.xpt'), then dv1 <- dv |> mutate(across(everything(), ~iconv(., "latin2", "ascii"))), then str(dv1). Some of the variable labels get lost, and consistently the same ones, but not for all variables. Do you know why?
No, I don't. Might be due to your use of dplyr.
8

A bit late, I guess, but this could be useful for future readers.

You can find these functions:

  • showNonASCII(<character_vector>)
  • showNonASCIIfile(<file>)

in the tools R package (see https://stat.ethz.ch/R-manual/R-devel/library/tools/html/showNonASCII.html). They do exactly what is asked here: show non-ASCII characters in a character vector or in a text file.

3

A stringr regex solution:

library(stringr)
x <- c("façile test of showNonASCII(): details{", 
    "This is a good line",
    "This has an ümlaut in it.", "OK again. }")
str_detect(x, "[^[:ascii:]]")
# => [1]  TRUE FALSE  TRUE FALSE

The [^[:ascii:]] pattern matches any non-ASCII character.

The [[:ascii:]] pattern matches any ASCII character.

If you ever need to make sure the whole string consists of non-ASCII chars, use

str_detect(x, "^[^[:ascii:]]+\\z")

where ^ matches the start of string and \z matches the very end of string.
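
The same two patterns can also be tried in base R without stringr. This is a sketch that assumes perl = TRUE, since the PCRE engine recognises the [:ascii:] class; the sample vector x4 is made up for illustration.

```r
# Illustrative input: mixed, pure ASCII, and entirely non-ASCII strings
x4 <- c("fa\u00e7ile test", "plain", "\u00e7\u00fc")

# TRUE where the string contains at least one non-ASCII character
any_non_ascii <- grepl("[^[:ascii:]]", x4, perl = TRUE)

# TRUE where the whole string consists of non-ASCII characters
all_non_ascii <- grepl("^[^[:ascii:]]+\\z", x4, perl = TRUE)

print(any_non_ascii)
## [1]  TRUE FALSE  TRUE
print(all_non_ascii)
## [1] FALSE FALSE  TRUE
```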
