
Loading multibyte CSV files into Oracle

Here is the problem:

  1. A bunch of CSV files are generated under Windows.
  2. Some of the content in those files was copied and pasted from M$ Word.
  3. Word uses “smart quotes” and other special characters that are not in the ISO 8859-1 character set.
  4. The Oracle schema defines maximum character lengths on the content columns from item 2, and the content can reach that maximum.

So when loading the CSV files into the database with sqlldr, I got the following error:
Record 108: Rejected - Error on table QUESTIONDIM, column QUESTIONABBREV.
ORA-12899: value too large for column QUESTIONABBREV (actual: 49, maximum: 50)

49 is larger than 50???

When you open the file in a plain-text editor such as Vim and place the cursor on one of those special characters, the column counter at the bottom shows a range of values (49-51) instead of the single value it shows for a normal character. That range tells you the character is a multibyte character.

By default, character columns in Oracle tables are created with byte semantics, which means a VARCHAR2(20) column can store only 20 bytes of data. That equals 20 ISO 8859-1 characters, since each of those characters takes exactly one byte. When multibyte characters appear in the content, however, the storage required can exceed 20 bytes, because each multibyte character may take 2, 3, or 4 bytes. That is why Vim shows a range (49-51) instead of a single value, and why Oracle appears to think 49 is larger than 50: the string is 49 characters long, but its byte length exceeds the 50-byte column limit.
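
You can see the two counts directly in SQL. Here is a minimal sketch, assuming the database character set is AL32UTF8: LENGTH counts characters, LENGTHB counts bytes, and UNISTR('\2019') produces Word's right single quotation mark (U+2019).

-- TO_CHAR converts the UNISTR result into the database character set,
-- where U+2019 occupies 3 bytes in AL32UTF8.
SELECT LENGTH(TO_CHAR(UNISTR('don\2019t')))  AS char_len,  -- 5 characters
       LENGTHB(TO_CHAR(UNISTR('don\2019t'))) AS byte_len   -- 7 bytes
FROM dual;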

To solve the problem:

First, make sure the Oracle database is set up to use a UTF-8 character set. Check with the following SQL:
SELECT * FROM NLS_DATABASE_PARAMETERS
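
To look at just the database character set (you want to see AL32UTF8 or UTF8 here), filter on the NLS_CHARACTERSET parameter:

SELECT value
FROM nls_database_parameters
WHERE parameter = 'NLS_CHARACTERSET';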

Create tables using character semantics, e.g.:
CREATE TABLE tab2 (
  id NUMBER(10),
  description VARCHAR2(20 CHAR)
);

Or change the default length semantics at the system or session level:

ALTER SYSTEM SET NLS_LENGTH_SEMANTICS=CHAR;
ALTER SESSION SET NLS_LENGTH_SEMANTICS=CHAR;

After that, all VARCHAR2 and CHAR columns are created with character semantics instead of byte semantics.
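
To verify which semantics an existing column uses, you can check the CHAR_USED flag in the data dictionary ('C' means character semantics, 'B' means byte); TAB2 here is the example table from above:

SELECT column_name, char_length, char_used
FROM user_tab_columns
WHERE table_name = 'TAB2';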

The CSV files generated under Windows are WINDOWS-1252 encoded. To let sqlldr correctly convert the text into the UTF-8 encoding used by the Oracle database, we need to declare the matching character set (WE8MSWIN1252) in the control file used by sqlldr:

OPTIONS (SKIP=1, DIRECT=true)
UNRECOVERABLE LOAD DATA
CHARACTERSET WE8MSWIN1252
INFILE 'Infile.CSV'
TRUNCATE INTO TABLE MY_TABLE
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(column1, column2....)

The NLS_LANG environment variable is set to AMERICAN_AMERICA.UTF8. I didn’t test other values, but the settings above work well for me.
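
For reference, a minimal sqlldr invocation with those settings might look like this (the connect string, control file name, and log file name are placeholders):

export NLS_LANG=AMERICAN_AMERICA.UTF8
sqlldr userid=scott/tiger@mydb control=load.ctl log=load.log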

Final notes:

  • Using character semantics with a UTF-8 character set will use more storage than Latin-1 with byte semantics.
  • Avoid M$ Word’s “smart” features, or don’t copy/paste from Word at all; it causes more trouble…