1

I am new to regular expressions in Java. I have a CSV file in which some of the fields contain newline characters, like below:

name,address,phone
tom,123 baker st,1234
jim,"234 baker st
some city",5678
james,"897 lowell st
some city, some state",78910

If a particular value contains commas or newlines, the whole value is enclosed in double quotes (" "). I need to replace the newline characters inside such fields with a single space, and I think using a regex would be the easiest way.

Hoping it would make things easier, I have read the whole file into a String using the line below:

String str = new String(Files.readAllBytes(Paths.get("file path")), StandardCharsets.UTF_8);

Now I have the whole file in str. All the fields are separated by commas, so any newline character between ," and ", in str should be replaced with " ". I am guessing I should write a regex that matches this pattern and then replaces the newlines ('\n') with " ".

My knowledge ends there and I have no clue how to implement it in my code.

After the transformation, the data should look like this:

name,address,phone
tom,123 baker st,1234
jim,"234 baker st some city",5678
james,"897 lowell st some city, some state",78910

Any help would be appreciated! Thank you.

6
  • Use CSVParser for parsing, with fields delimited by , and enclosed by " Commented Jan 5, 2018 at 4:26
  • I need the whole data as a new file without the newline characters in the fields, as I mentioned in my question. Can it be done using the parser? If yes, can you please link an example? Commented Jan 5, 2018 at 4:30
  • @Hemnath you can parse with CSVParser and replace \r\n with an empty string in the fields from which you want to remove newlines Commented Jan 5, 2018 at 4:33
  • There will be multiple fields which contain newlines, and which ones differs for every record. Is that still possible? I think using a regex to replace newlines in the string would be easier. @Saravana Commented Jan 5, 2018 at 4:36
  • 1
    It's probably possible using zero-width lookahead and lookbehind assertions in the regex, but they'll become very complex - you also need to take into account that a CSV cell can contain double quotes, and they'll be escaped as two double-quote characters in sequence ("") and these don't terminate the value. Saravana's suggestion is much better. Commented Jan 5, 2018 at 4:43
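The point about escaped quotes can be handled without lookarounds by scanning the text once and tracking whether the current position is inside a quoted value. The following is only a minimal sketch: the class name QuotedNewlineFlattener and the method flatten are hypothetical, and it assumes well-formed CSV in which a double quote appears only as a field delimiter or as the "" escape. Because an escaped quote toggles the state twice, it leaves the state unchanged.

```java
// Sketch only: QuotedNewlineFlattener is a hypothetical name, not a library class.
final class QuotedNewlineFlattener {

    /** Replaces each line break inside a double-quoted CSV value with one space. */
    static String flatten(String csv) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged after it.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && (c == '\n' || c == '\r')) {
                // Treat \r\n as a single line break.
                if (c == '\r' && i + 1 < csv.length() && csv.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(' ');
            } else {
                // Line breaks outside quotes are record separators and are kept.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = "name,address,phone\n"
                + "jim,\"234 baker st\nsome city\",5678\n";
        System.out.print(flatten(csv));
    }
}
```

The string read into str in the question could be passed to flatten and the result written back out with Files.write. A CSV parser remains the more robust choice if the input may be malformed.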

3 Answers

2

You can use CSVParser to parse the file and then strip the newlines from each field after reading it:

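import java.io.File;
import java.nio.charset.Charset;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Fields are delimited by ',' and may be enclosed in '"' (both are already the defaults)
CSVFormat format = CSVFormat.DEFAULT
        .withDelimiter(',')
        .withIgnoreEmptyLines()
        .withQuote('"');
CSVParser parser = CSVParser.parse(new File("/file/path/csv"), Charset.defaultCharset(), format);
List<CSVRecord> recordList = parser.getRecords();
for (CSVRecord record : recordList) {
    for (String field : record) {
        // Replace each embedded newline with a single space; "|" is only for readable output
        System.out.print(field.replace("\n", " ") + "|");
    }
    System.out.println();
}

output

name|address|phone|
tom|123 baker st|1234|
jim|234 baker st some city|5678|
james|897 lowell st some city, some state|78910|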

maven dependency

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-csv</artifactId>
        <version>1.1</version>
    </dependency>

2 Comments

I came here expecting an answer using a regex, but I have to say, it's a brilliant answer. Working perfectly! Thank you! Just one thing though: is there a way I can avoid putting | at the end of each line?
@Hemanth if you look at the sysout statement, I am concatenating | for readability; you can remove it
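One way to keep a separator between fields but not after the last one is to collect the cleaned fields and join them. This is a small sketch, not part of the answer above: PipeJoiner and joinFields are hypothetical names, and the method accepts any Iterable<String> (which a CSVRecord is), replacing each newline with a single space as the question asks.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: PipeJoiner is a hypothetical helper, not part of commons-csv.
final class PipeJoiner {

    /** Replaces newlines in each field with a space and joins the fields with "|". */
    static String joinFields(Iterable<String> fields) {
        List<String> cleaned = new ArrayList<>();
        for (String field : fields) {
            cleaned.add(field.replace("\n", " "));
        }
        // String.join puts the separator only between elements, never after the last one.
        return String.join("|", cleaned);
    }
}
```

Inside the loop over the records, System.out.println(PipeJoiner.joinFields(record)) would then print one line per record without the trailing separator.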
0

EDIT: Here is a solution

str.replaceAll("(,\".*)(\n+)(.*\",)", "$1 $3")

Here is a good tutorial explaining grouping and back references in Java regular expressions: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html#grouping-and-back-reference
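A caveat on the pattern above: by default . does not match line terminators, so it can only bridge one run of newlines per field, and it requires a comma on both sides of the quoted value, which the first and last columns do not have. A possible alternative, sketched here under the assumption that quotes appear only as field delimiters or as the "" escape, is to match each quoted segment and replace the line breaks inside the match. The class name QuotedFieldRegex and the method flatten are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: QuotedFieldRegex is a hypothetical name, not a library class.
final class QuotedFieldRegex {

    // A quoted segment: opening quote, any non-quote characters (newlines included), closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    /** Replaces every line break found inside a quoted segment with one space. */
    static String flatten(String csv) {
        Matcher m = QUOTED.matcher(csv);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group().replaceAll("\\r?\\n|\\r", " ");
            // quoteReplacement keeps any '$' or '\' in the data from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A value containing escaped quotes, such as "a ""b"" c", is matched as several adjacent quoted segments, so line breaks in any of them are still replaced while line breaks between records are left alone.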

6 Comments

Thanks for the answer! I strictly need to replace only the newlines between ," and ", . Does the above code still work, or do we have to do something like this: str.replaceAll("(,\".*\",)(\n+)", "$1 ")
Updated the proposed code in the answer. It replaces only newlines between ," and ",
It's working perfectly for the sample data in the question. But on my actual data, most of the newlines are still present. Any idea why this is happening? My actual data also contains special characters between ," and ", . Could that be causing an issue?
@AlexBinkovsky, just a thought. Hemanth wants to replace the text up to the number, so would this be a better expression: str.replaceAll("(,\".+)(\n+)(.+\",\\d+)", "$1 $3") ?
I only posted the above data as a sample. My actual data won't necessarily have a number after ", . @Abhishek
-1

Java Regex removing spaces and new line character

String str = " \n a b c \n 1 2 3 \n x y z ";
str = str.trim().replaceAll("\n ", "");

2 Comments

I won't be able to use this, as I don't want to replace all the newlines in the string, only the newlines between ," and ",
The OP doesn't want to remove all newline characters; in fact, he needs to keep most of them to separate the lines. The OP only wants to remove newline characters within the double quotes of a single value in the CSV.
