JavaCC and Unicode issue: why can't \u696d be matched in JavaCC although it belongs to the range "\u4e00"-"\u9fff"?

We're trying to use JavaCC to parse source code that is encoded in UTF-8 (the language is Japanese). In the JavaCC grammar, we have a declaration like:

< #LETTER:
  [
   "\u0024",
   "\u0041"-"\u005a",
   "\u005f",
   "\u0061"-"\u007a",
   "\u00c0"-"\u00d6",
   "\u00d8"-"\u00f6",
   "\u00f8"-"\u00ff",
   "\u0100"-"\u1fff",
   "\u3040"-"\u318f",
   "\u3300"-"\u337f",
   "\u3400"-"\u3d2d",
   "\u4e00"-"\u9fff",
   "\uf900"-"\ufaff"
  ]
>

If the parser meets a string like "日建フェンス工業", it fails because of the 業 character. If I remove that character, it works as expected. The code point of 業 is "\u696d", and as you can see in the declaration, it should fall within the range "\u4e00"-"\u9fff".
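As a sanity check, here is a minimal Java sketch that confirms 業 lies inside that range and shows what the characters look like when the same UTF-8 bytes are decoded with a different charset. The class and method names (`LetterRangeCheck`, `inCjkRange`, `allInCjkRange`) are made up for illustration, and ISO-8859-1 is only an assumed example of a mismatched encoding, not a confirmed cause of the failure.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JavaCC or of our grammar.
class LetterRangeCheck {
    // True if the code point falls in the "\u4e00"-"\u9fff" range of the LETTER declaration.
    static boolean inCjkRange(int codePoint) {
        return codePoint >= 0x4e00 && codePoint <= 0x9fff;
    }

    // True if every code point of the string falls in that range.
    static boolean allInCjkRange(String s) {
        return s.codePoints().allMatch(LetterRangeCheck::inCjkRange);
    }

    public static void main(String[] args) {
        String gyou = "\u696d"; // the character that fails, written as an escape
        byte[] utf8 = gyou.getBytes(StandardCharsets.UTF_8); // bytes E6 A5 AD

        // Decoding with the matching charset gives back one char in the range.
        String asUtf8 = new String(utf8, StandardCharsets.UTF_8);
        // Assumed mismatch for illustration: each byte becomes its own char.
        String asLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(allInCjkRange(asUtf8));   // prints true
        System.out.println(allInCjkRange(asLatin1)); // prints false
        System.out.println(asLatin1.length());       // prints 3
    }
}
```

So the range itself covers the character. Printing the code points that the token manager actually receives would show whether the input reaches the lexer as \u696d or as something else.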

Any suggestions on this?

PS: If we rewrote this grammar using ANTLR, what would it look like?

Thank you so much


