Character Encoding

Andrew_Stacey · December 13, 2011, 5:04am

Not sure if this is a bug or a “what do I do?”.

I want to convert characters to codes, so I tried string.byte("å"). However, it seems that the return value of string.byte() is 195 or 194 for most “non standard” characters. Unfortunately, “non standard” characters are quite standard in some parts of the world!

What’s the character encoding for Codea’s editor? I can certainly type things like øæå, and print them to the screen, but manhandling them as ordinary strings proves a little trickier due to the above.

sim · December 13, 2011, 5:30am

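I believe string.byte() in the default Lua implementation works on raw bytes and knows nothing about character encodings, so it only behaves as you would expect for ASCII.

Here is the implementation of string.byte from the Lua 5.1 source.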


/* macro to `unsign' a character */
#define uchar(c)        ((unsigned char)(c))

static ptrdiff_t posrelat (ptrdiff_t pos, size_t len) {
  /* relative string position: negative means back from end */
  if (pos < 0) pos += (ptrdiff_t)len + 1;
  return (pos >= 0) ? pos : 0;
}

static int str_byte (lua_State *L) {
  size_t l;
  const char *s = luaL_checklstring(L, 1, &l);
  ptrdiff_t posi = posrelat(luaL_optinteger(L, 2, 1), l);
  ptrdiff_t pose = posrelat(luaL_optinteger(L, 3, posi), l);
  int n, i;
  if (posi <= 0) posi = 1;
  if ((size_t)pose > l) pose = l;
  if (posi > pose) return 0;  /* empty interval; return no values */
  n = (int)(pose -  posi + 1);
  if (posi + n <= pose)  /* overflow? */
    luaL_error(L, "string slice too long");
  luaL_checkstack(L, n, "string slice too long");
  for (i=0; i<n; i++)
    lua_pushinteger(L, uchar(s[posi+i-1]));
  return n;
}

As you can see, all this does is take each byte in the requested range of the string and cast it directly to unsigned char. So I suspect calling string.byte("å") will return just the first byte of å. Perhaps it returns two values for that character; you can check like this.


first, second = string.byte("å")
print( first )
print( second )

Andrew_Stacey · December 13, 2011, 5:42am

I get 195 nil for that test so it looks as though it’s just returning the first byte.
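A small sketch of why only one value comes back: string.byte takes optional start and end positions, and both default to the first byte. The example writes å with decimal escapes, assuming it is stored as the UTF-8 bytes 195 and 165, so that it does not depend on how the editor saves the file.

```lua
-- "\195\165" is the UTF-8 encoding of "å" (U+00E5), spelled out byte by byte.
s = "\195\165"

print(#s)                     -- length in bytes, not characters: 2
print(string.byte(s))         -- default range is 1..1, so only the first byte: 195
print(string.byte(s, 1, -1))  -- explicit range covering the whole string: 195 165
```

Passing 1, -1 is the simplest way to see every byte of a short string at once.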

I’ve now discovered that string.gfind suffers from the same problem. This is annoying because I want to step through a string and take some action according to the characters. Any suggestions on how I might do that?

Ipad41001 · December 13, 2011, 6:57am



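function uchar(xchar)
    -- Each letter below is two bytes in UTF-8, so turn the byte position
    -- returned by string.find into a character index before adding the offset.
    local pos = string.find("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ", xchar, 1, true)
    return pos and math.floor((pos + 1) / 2) + 191
end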

Will that help at all?

Andrew_Stacey · December 13, 2011, 7:53am

Not really because then I have to supply it with a list of “special” characters and I’d quite like to do it automatically. At the moment, if the byte returned by string.gfind is 195 then I add 64 to the next one. I’ll probably have to put in some more options … once I can find an explanation of how to recognise multibyte strings on a byte-by-byte basis.
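The "add 64 when the lead byte is 195" rule is one case of the general two-byte UTF-8 formula. A rough sketch follows; decode2 is a hypothetical helper name, not something from Codea or Lua, and it uses plain arithmetic because Lua 5.1 has no bitwise operators.

```lua
-- Hypothetical helper: decode a two-byte UTF-8 sequence into a code point.
-- A two-byte lead byte looks like 110xxxxx (194..223 for valid text) and a
-- continuation byte looks like 10xxxxxx (128..191).
function decode2(lead, cont)
    assert(lead >= 194 and lead <= 223, "not a two-byte lead byte")
    assert(cont >= 128 and cont <= 191, "not a continuation byte")
    -- Strip the marker bits and shift the lead's payload up by six bits.
    return (lead - 192) * 64 + (cont - 128)
end

print(decode2(195, 165))  -- 229, i.e. U+00E5 (å); same as 165 + 64
print(decode2(194, 169))  -- 169, i.e. U+00A9 (©); lead byte 194 adds nothing
```

This also suggests why 194 and 195 turn up so often: those two lead bytes cover U+0080 to U+00FF, where most accented Latin letters live.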

Andrew_Stacey · December 13, 2011, 12:15pm

Okay, I found the description (isn’t Wikipedia wonderful?) and have written an iterator that steps through a UTF-8 string, returning the code point of each character. It seems to work, though I’ve only thrown Norwegian at it so far.
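The iterator itself was not posted in this thread, so what follows is only a sketch of how such an iterator might look, not the actual code. The name utf8codes is made up for illustration. It assumes reasonably well-formed UTF-8 (it checks lead and continuation bytes but does not reject overlong encodings) and sticks to arithmetic so that it runs on Lua 5.1.

```lua
-- Hypothetical iterator: yields (byte position, code point) for each
-- character of a UTF-8 string.
function utf8codes(s)
    local i, n = 1, #s
    return function()
        if i > n then return nil end
        local start = i
        local b = string.byte(s, i)
        local code, extra
        if b < 128 then                    -- 0xxxxxxx: plain ASCII
            code, extra = b, 0
        elseif b >= 192 and b < 224 then   -- 110xxxxx: one continuation byte
            code, extra = b - 192, 1
        elseif b >= 224 and b < 240 then   -- 1110xxxx: two continuation bytes
            code, extra = b - 224, 2
        elseif b >= 240 and b < 248 then   -- 11110xxx: three continuation bytes
            code, extra = b - 240, 3
        else
            error("invalid UTF-8 lead byte at position " .. i)
        end
        for k = 1, extra do
            local c = string.byte(s, i + k)
            if not c or c < 128 or c > 191 then
                error("invalid UTF-8 continuation byte at position " .. (i + k))
            end
            code = code * 64 + (c - 128)   -- append six payload bits
        end
        i = i + extra + 1
        return start, code
    end
end

-- "øæå" written as explicit UTF-8 bytes
for pos, code in utf8codes("\195\184\195\166\195\165") do
    print(pos, code)
end
```

Because the loop only inspects byte patterns, it should behave the same for any language's characters, including ones that need three or four bytes.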

CrazyEd · December 13, 2011, 12:18pm

Does this also work with ä, ö etc as used in Finnish?

Andrew_Stacey · December 13, 2011, 12:27pm

Should do as it works by examining bytes, not characters. I’ll test it when I can get my iPad back.

Andrew_Stacey · December 13, 2011, 1:26pm

Got it back and tried it with lots of different “o”s, and it worked fine with all of them.

sim · December 13, 2011, 5:07pm

Sorry about the lack of nice character support. It’s really down to the way Lua deals with strings. Basically, it doesn’t: a Lua string is just a sequence of bytes, and Lua lets the underlying OS interpret that sequence however it wants.

Andrew_Stacey · December 13, 2011, 5:10pm

No worries. It wasn’t that hard to do. (The iterator is buried in my YAFC code, by the way)