Lua - Reading a UTF-8 character from a file

2017-5-23 10:27:25

Is it possible to read a single UTF-8 character from a file?

file:read(1) returns strange characters when I print the result.

function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end

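This function returns one UTF-8 character from the string str. I need to read one UTF-8 character in the same way, but from an input file (I don't want to read the whole file into memory with file:read("*all")).

The question is very similar to this post: Extract the first letter of a UTF-8 string with Lua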
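
As background to the symptom described in the question: file:read(1) returns one byte, not one character, so a multi-byte UTF-8 character comes back in pieces that are not printable on their own. The sketch below illustrates this with a temporary file; the helper name read_bytes_one_at_a_time is made up for this example.

```lua
-- Hypothetical helper: writes `data` to a temporary file, then reads it
-- back one byte at a time with file:read(1).
function read_bytes_one_at_a_time(data)
  local f = assert(io.tmpfile())
  f:write(data)
  f:seek("set", 0)
  local pieces = {}
  while true do
    local b = f:read(1)
    if not b then break end      -- end of file
    pieces[#pieces + 1] = b
  end
  f:close()
  return pieces
end

-- U+00E9 (é) is encoded in UTF-8 as the two bytes 195 and 169.
local pieces = read_bytes_one_at_a_time("\195\169")
print(#pieces, pieces[1]:byte(), pieces[2]:byte())  --> 2  195  169
```

Neither piece is valid UTF-8 by itself, which is why printing a single file:read(1) result can show a replacement symbol.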

user1847592

The function read_utf8_char(file) reads one UTF-8 encoded character from the file and returns it as its sequence of bytes.

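function read_utf8_char(file)
  local c1 = file:read(1)
  if not c1 then return nil end  -- end of file
  -- Count the leading 1 bits of the first byte by shifting it left until
  -- the top bit is clear; ASCII bytes are treated as 128 so they give 0.
  local ctr, c = -1, math.max(c1:byte(), 128)
  repeat
    ctr = ctr + 1
    c = (c - 128)*2
  until c < 128
  -- ctr is the number of continuation bytes. file:read returns nil at
  -- end of file (even for a count of 0), hence the fallback to "".
  return c1..(file:read(ctr) or "")
end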
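
The arithmetic in the loop is compact, so here is the same counting step pulled out into a function of its own and checked against one leading byte of each length. This is only an illustrative sketch; the name count_continuation_bytes is not part of the answer.

```lua
-- Hypothetical helper: same loop as in read_utf8_char, but taking the
-- numeric value of the first byte and returning how many more bytes follow.
function count_continuation_bytes(first_byte)
  local ctr, c = -1, math.max(first_byte, 128)
  repeat
    ctr = ctr + 1
    c = (c - 128) * 2   -- drop the top bit and shift left by one
  until c < 128         -- stop once the top bit is 0
  return ctr
end

print(count_continuation_bytes(0x41))  --> 0  (0xxxxxxx, ASCII "A")
print(count_continuation_bytes(0xC3))  --> 1  (110xxxxx)
print(count_continuation_bytes(0xE2))  --> 2  (1110xxxx)
print(count_continuation_bytes(0xF0))  --> 3  (11110xxx)
```

Note that a stray continuation byte such as 0x80 also yields 0, so this approach does not validate its input.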

2015-04-24 20:46:28

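user1442917

You need to read from the file so that the string you match against always holds four or more bytes (this lets you apply the logic from the answer you referenced). If the length of the string is len after you match and remove one UTF-8 character, you then need to read 4-len more bytes from the file.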
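
A minimal sketch of this buffering approach follows. The iterator name utf8_chars is invented for the example, and the pattern is a variant of the one in the question that avoids %z; like that pattern, it does not validate the input.

```lua
-- Hypothetical helper (not from the original answer): returns an iterator
-- that yields one UTF-8 character per call from an open file handle.
function utf8_chars(file)
  local buf = ""
  return function()
    -- Top up the buffer to 4 bytes, the longest UTF-8 sequence,
    -- unless the file has run out.
    if #buf < 4 then
      buf = buf .. (file:read(4 - #buf) or "")
    end
    if buf == "" then return nil end
    -- Leading byte plus its continuation bytes. A NUL byte or an invalid
    -- leading byte does not match and is returned as a single raw byte.
    local ch = buf:match("^[\1-\127\194-\244][\128-\191]*") or buf:sub(1, 1)
    buf = buf:sub(#ch + 1)
    return ch
  end
end

local f = assert(io.tmpfile())
f:write("a\195\169\226\130\172")   -- "a", U+00E9, U+20AC
f:seek("set", 0)
for ch in utf8_chars(f) do
  print(#ch, ch)                   -- byte lengths 1, 2, 3
end
f:close()
```

Only a few bytes are held in memory at any time, which is the point of the 4-len refill rule.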

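When ZeroBrane Studio prints to the Output panel, it replaces invalid UTF-8 characters with the [SYN] character (as you can see in your screenshot). This blog post describes the logic for detecting invalid UTF-8 characters (in Lua) and how ZeroBrane Studio handles them.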

2015-04-24 20:48:53

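user90511

In UTF-8, the number of bytes a character occupies is determined by its first byte, according to the following table (taken from RFC 3629):

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If the most significant bit of the first byte is "0", the character is a single byte. If the leading bits are "110", the character is 2 bytes long, and so on.

You can then read one byte from the file and work out how many continuation bytes you need to read to get the complete UTF-8 character: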

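function get_one_utf8_character(file)

  local c1 = file:read(1)
  if not c1 then return nil end  -- end of file

  -- Classify the leading byte by value; this avoids a "\000" inside a
  -- pattern, which Lua 5.1 patterns cannot contain.
  local b1 = c1:byte()
  local ncont
  if     b1 <= 127               then ncont = 0
  elseif b1 >= 194 and b1 <= 223 then ncont = 1  -- 192, 193 would be overlong
  elseif b1 >= 224 and b1 <= 239 then ncont = 2
  elseif b1 >= 240 and b1 <= 244 then ncont = 3  -- 245-247 exceed U+10FFFF
  else
    return nil, "invalid leading byte"
  end

  -- Read the continuation bytes, each of which must look like 10xxxxxx.
  local bytes = { c1 }
  for i=1,ncont do
    local ci = file:read(1)
    if not (ci and ci:match("[\128-\191]")) then
      return nil, "expected continuation byte"
    end
    bytes[#bytes+1] = ci
  end

  return table.concat(bytes)
end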
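
The same RFC 3629 table also shows how to turn the bytes of one character back into its code point: the x bits of each byte are concatenated. The sketch below does that with plain arithmetic so it does not depend on a bit library; the function name utf8_codepoint is an assumption for this example, and it expects one complete, valid character as input.

```lua
-- Hypothetical helper: takes the bytes of ONE valid UTF-8 character
-- (as a string) and returns its code point as a number.
function utf8_codepoint(ch)
  local n = #ch
  local b1 = ch:byte(1)
  local cp
  -- Strip the length marker from the leading byte to keep only its x bits.
  if     n == 1 then cp = b1          -- 0xxxxxxx
  elseif n == 2 then cp = b1 - 192    -- 110xxxxx
  elseif n == 3 then cp = b1 - 224    -- 1110xxxx
  elseif n == 4 then cp = b1 - 240    -- 11110xxx
  else
    return nil, "unexpected length"
  end
  -- Each continuation byte (10xxxxxx) contributes 6 more bits.
  for i = 2, n do
    cp = cp * 64 + (ch:byte(i) - 128)
  end
  return cp
end

print(utf8_codepoint("a"))              --> 97    (U+0061)
print(utf8_codepoint("\195\169"))       --> 233   (U+00E9)
print(utf8_codepoint("\226\130\172"))   --> 8364  (U+20AC)
```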

2016-08-12 00:12:28

Author:

user4636982
