Tuesday, April 21, 2015

NLS_SORT = ESTONIAN; REGEXP_LIKE(); disturbing behaviour OR cool feature?


Test case


$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.6 (Santiago)
 

$ env | grep NLS_LANG
NLS_LANG=AMERICAN_AMERICA.UTF8

$ sqlplus test/test

SQL*Plus: Release 12.1.0.2.0 Production on Tue Apr 21 17:26:00 2015

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
 


SQL> ALTER SESSION SET NLS_SORT=ESTONIAN; 

SQL>  
SELECT 
  MY_CHAR, CASE WHEN REGEXP_LIKE(MY_CHAR,'^[A-Z][a-zA-Z0-9_]{0,}$') THEN 1 ELSE 0 END AS REGEXP_RESULT
FROM (
  SELECT CHR(ASCII('A')-1+ROWNUM) AS MY_CHAR FROM DUAL CONNECT BY LEVEL <= 26
);
 


Result


MY_CHAR REGEXP_RESULT
A 1
B 1
C 1
D 1
E 1
F 1
G 1
H 1
I 1
J 1
K 1
L 1
M 1
N 1
O 1
P 1
Q 1
R 1
S 1
T 0
U 0
V 0
W 0
X 0
Y 0
Z 1

Expected result would be "1" for all rows. Isn't it?

Need


Need was to restrict primary key values of parameter table. Values should follow java enum rules. CHECK constraint and REGEXP_LIKE() function seems good and simple way to implement it.
Example below demonstrates that unexpected error is reported if NLS_SORT=ESTONIAN:

SQL> CREATE TABLE PARAMETER_TABLE (
  CODE VARCHAR2(50 CHAR) CONSTRAINT CODE_CK CHECK(REGEXP_LIKE(CODE,'^[A-Z][A-Z0-9_]{0,}$'))
, PRIMARY KEY (CODE)
);

SQL> ALTER SESSION SET NLS_SORT=ESTONIAN;

SQL> INSERT INTO PARAMETER_TABLE VALUES ('S');

1 row created.

SQL> INSERT INTO PARAMETER_TABLE VALUES ('T');
INSERT INTO PARAMETER_TABLE VALUES ('T')
*
ERROR at line 1:
ORA-02290: check constraint (TEST.CODE_CK) violated


Explanation


Explanation turns out to be simple: REGEXP_LIKE() behavior depends on NLS_SORT value!
Check this example:

ALTER SESSION SET NLS_SORT=ESTONIAN; 

SELECT MY_CHAR, CASE WHEN REGEXP_LIKE(MY_CHAR,'^[A-Z][a-zA-Z0-9_]{0,}$') THEN 1 ELSE 0 END AS REGEXP_RESULT 
FROM ( 
  SELECT SUBSTR(MY_CHAR, LEVEL, 1) MY_CHAR 
  FROM ( 
    SELECT 'ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY' AS MY_CHAR FROM DUAL 
  )
  CONNECT BY LEVEL <= LENGTH( MY_CHAR) 
) 
ORDER BY 1;

MY_CHAR REGEXP_RESULT
A 1
B 1
C 1
D 1
E 1
F 1
G 1
H 1
I 1
J 1
K 1
L 1
M 1
N 1
O 1
P 1
Q 1
R 1
S 1
Š 1
Z 1
Ž 0
T 0
U 0
V 0
W 0
Õ 0
Ä 0
Ö 0
Ü 0
X 0
Y 0
 

Z is not last character in Estonian alphabet. Letter T is after Z. Therefore letter T is not covered by "regexp [A-Z]" expression.

Regexp_like() covers whole Estonian alphabet if you use "regexp [A-Y]" expression instead (only if NLS_SORT=ESTONIAN)!


PS


It seems that linux regexp (and sort) works also like regexp_like() in database. Check this example:


# LANG=en_US.UTF-8
# echo -e "S\nT\nX\nY\nZ" | grep -e "^[A-Z]\$"
S
T
X
Y

Z
# LANG=et_EE.UTF-8
# echo -e "S\nT\nX\nY\nZ" | grep -e "^[A-Z]\$"
S
Z



References


Bug 21025097 - REGEXP_LIKE FUNCTION DEPENDS ON SESSION VALUE FOR NLS_SORT 
Registered in May 2015

Bug 12429872 - REGEXP_LIKE FUNCTION TAKING WRONG NLS_SORT DURING CONSTRAINT CREATION
 
Registered in May 2011


-- end --