2

I have been breaking my head over the past few hours at trying to figure out what's wrong with my code. This piece of code was all working fine until i received a file which had japanese characters in it. Notepad++ and even some online utility tools say that the encoding of the file is UTF-8. Notepad says its UTF-8-BOM. I have read my data from the file and I have processed it and finally want to write it out to the database.

I get the error org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xee My database encoding is UTF8 only..

package citynet.dta.pump;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

import citynet.common.ServerException;

public class TestEncoding {

public static void main(String[] args) {
    byte[] bytes = null;
    try {
        //use the below sql to create table 'testtable'
        // create table testtable (text1 character varying, text2 character varying,text3 character varying)
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            DataOutputStream out = new DataOutputStream(baos);

            out.writeBytes("INR,字仮名交じり文,3255104BTK1");

            bytes = baos.toByteArray();
        }
        Class.forName("org.postgresql.Driver");
        Connection c = DriverManager.getConnection("jdbc:postgresql://server:5432/dbname", "username", "password");
        if (bytes != null) {
            try (ByteArrayInputStream input = new ByteArrayInputStream(bytes)) {
                String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
                BaseConnection pgcon = (BaseConnection) c;
                CopyManager mgr = new CopyManager(pgcon);
                try {
                    mgr.copyIn(sql, input);

                } catch (SQLException ex) {
                    throw new ServerException("Error while copying data in Postgres DB:" + ex);

                }
            }
        }
    } catch (Exception e) {
        System.out.println("Error:" + e);
    }
  }
}

1 Answer 1

3

The issue is DataOutputStream#writeBytes("INR,字仮名交じり文,3255104BTK1") is not doing what you expect.

  1. You should refrain from using BaseConnection as it is an internal class. Application code should use PGConnection

  2. Here's how you get CopyManager:

    Connection con = ...;
    PGConnection pgcon = con.unwrap(org.postgresql.PGConnection.class);
    CopyManager mgr = pgcon.getCopyAPI();
    
  3. The source of your data might be different so there are multiple ways to execute copyAPI.

    If you want to convert String to UTF-8 bytes via your own code, then you need getBytes.

    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
    byte[] bytes = "INR,字仮名交じり文,3255104BTK1".getBytes(StandardCharsets.UTF_8);
    mgr.copyIn(sql, new ByteArrayInputStream(bytes));
    

    Note: there's no need to close ByteArrayInputStream (see its Javadoc).

  4. If you need to stream a CSV file to the database, you might use FileInputStream:

    try (InputStream fis = new FileInputStream("file.csv")) {
      mgr.copyIn(sql, fis);
    }
    
  5. If you want to build the contents incrementally, then you might use ByteArrayOutputStream + OutputStreamWriter

    Note: all the rows would need to fit in the memory otherwise you get OutOfMemoryError.

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (OutputStreamWriter wr = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
      // Write 10 rows
      for (int i = 0; i < 10; i++) {
        wr.write("INR,字仮名交じり文,3255104BTK1\n");
      }
    }
    
    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8'";
    mgr.copyIn(sql, new ByteArrayInputStream(baos.toByteArray()));
    
  6. An alternative option is to use Reader

    Note: encoding is not specified, and it is using connection-default encoding (which is utf-8 in 99.42% of the cases since the driver defaults to utf-8 connection encoding).

    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null'";
    mgr.copyIn(sql, new StringReader("INR,字仮名交じり文,3255104BTK1"));
    
  7. Yet another alternative is to use copyIn(String sql, ByteStreamWriter from) API which might be more efficient for certain use-cases (e.g. all the data is in-memory, and you know the number of bytes you are going to write)

Sign up to request clarification or add additional context in comments.

5 Comments

Thanks a lot Valdimir. I see you have spent a lot of time writing all that up..Appreciate the help :) On few points that you have mentioned, yes we shouldn't be using BaseConnection and as a matter of fact it was only this piece of code where it was using this. All the rest of the application uses PGConnection..So not much code changes thankfully :) and regarding the point about writeBytes not doing what we expect, yes unfortunately I was just focusing on any overloaded method for writeBytes itself which might take encoding and i knew String object had getBytes method..Damn..
Just a followup question. Do we still have to mention ENCODING UTF8 in the copy command as you mentioned that 99.42% of the cases the driver defaults to utf-8 connection encoding
Well, when you use byte API, and you should specify the encoding to make it explicit and avoid unexpected data corruption
To just be clear. on your point 5, you are saying we should still specify the encoding in String sql = "COPY testtable FROM stdin delimiter ',' NULL AS 'null' ENCODING 'UTF8'";
Yes, item5 should have explicit enoding in sql otherwise the driver (and the database) does not have enough information. Item6 is different because it uses String, so the driver knows the way to convert Java String to a sequence of bytes the database would understand. However, item5 is sending bytes, and the database should know the way to interpret that.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.