English 中文(简体)
UTF-8 中文字符宽度显示问题
原标题:UTF-8 Width Display Issue of Chinese Characters

当我使用 Perl 或 C 到 printf 一些数据时, 我尝试了它们的格式来控制每列的宽度, 比如

printf("%-30s", str);

But when str contains Chinese character, then the column doesn t align as expected. see the attachment picture.

我的 ubuntu s 字符串编码是zh_ CN.utf8, 据我所知, utf-8 编码有1~ 4个字节长度。 中文字符有 3 个字节。 在我测试中, 我发现 printf s 格式控件将中文字符计为 3 个, 但实际上它显示为 2 个 ascii 宽度 。

因此,真实显示宽度不是一个预期的常数,而是一个与中文字符数有关的变量,即:

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

w 是预期的宽度限制, x 是中文字符的数, Sw(x) 是真实显示宽度 。

所以中文字符串包含的越多, 显示的越短。

我怎样才能得到我想要的?

据我所知,所有中文甚至所有宽度字符, 我猜, 显示为两个宽度, 那么为什么打印页数算为 3? UTF-8 编码与显示长度无关 。

问题回答

是的,我知道所有版本的printf 都存在问题,我在 Unicode:: GCString 模块表 CPAN 来计算 Unicode 字符串将包含的打印栏数 。 这考虑到了 < a href=" http://unicode. org/reports/ tr11/ " rel=" nofolncode noreferr" > Unicode 标准附件 # 11: 东亚 Width

例如,有些代码点取一列,而另一些则取两列。甚至有些代码点也取不取任何列,例如将字符和无形控制字符结合起来。该类使用 columns 方法返回字符串取取的列数。

我有一个用于垂直对齐 Unicode 文本的示例 : < a href=" http://training.perl.com/ scripts/perlunicook.html#PROGRAM:- Demo-of- Unicode-corration-corration-and-printing" rel = “ nofollow noreferrerr" > here 。 它将排序一连串 Unicode 字符, 包括一些包含字符和“ 范围” 亚洲的“ idegraphics (CJK 字符) 的字符, 并允许您垂直对齐 。

"https://i.sstatic.net/ifmnX.png" alt="sample终端输出"/ >

以下列出了小 umenu 演示程序代码,该代码打印出与输出相近的精准输出。

您也可能感兴趣的是远为雄心勃勃的“http://search.cpan.org/perldoc?Unicode% 3a% 3aLineBreak” rel=“不跟随Noreferr”>Unicode:::LineBreak 模块,上述 Unicode::GCString 类只是一个较小的组成部分。这个模块更酷,并且考虑到 Unicode 标准附件#14:Unicode 线破解 Algorithm

这是在 Perl v5.14 中测试的小 < code> umenu 演示的代码 :

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won t freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = new Unicode::Collate::Locale locale => "ja";

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f
", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=pL)(S)     (S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

正如你所看到的,让它在这个特定方案中发挥作用的关键 就是这行代码, 它只是呼唤上面定义的其他功能, 并使用我正在讨论的模块:

print pad(entitle($item), $width, ".");

这将用点作为填充字符将项目插入给定宽度。

是的,printf 的方便程度要低得多,但至少是可能的。





相关问题
Fastest method for running a binary search on a file in C?

For example, let s say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste ...

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

Tips for debugging a made-for-linux application on windows?

I m trying to find the source of a bug I have found in an open-source application. I have managed to get a build up and running on my Windows machine, but I m having trouble finding the spot in the ...

Trying to split by two delimiters and it doesn t work - C

I wrote below code to readin line by line from stdin ex. city=Boston;city=New York;city=Chicago and then split each line by ; delimiter and print each record. Then in yet another loop I try to ...

Good, free, easy-to-use C graphics libraries? [closed]

I was wondering if there were any good free graphics libraries for C that are easy to use? It s for plotting 2d and 3d graphs and then saving to a file. It s on a Linux system and there s no gnuplot ...

Encoding, decoding an integer to a char array

Please note that this is not homework and i did search before starting this new thread. I got Store an int in a char array? I was looking for an answer but didn t get any satisfactory answer in the ...

热门标签