30歳からのプログラミング

This article explains the concepts you need when working with character encodings in JavaScript: Code Points, Code Units, and surrogate pairs.
It also covers how to write code that puts these concepts to use.

The code in this article was tested in the following environment.

Deno 1.26.0
TypeScript 4.8.3

Code Point

There are several ways to represent characters in a program; JavaScript uses Unicode.
Unicode aims to assign a unique value to every character, and that value is called a Code Point.

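A Code Point is a non-negative integer. In running text it is conventionally written in hexadecimal with the prefix U+.
For example, the Code Point of A is defined as U+0041, that of あ as U+3042, and that of 🐶 as U+1f436.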

The codePointAt method, added in ES2015, returns the Code Point at a given index of a string.
It returns a number (or undefined when the index is out of range), so use toString to get the hexadecimal notation.

const str = "Aあ🐶";
console.log(str.codePointAt(0)); // 65
console.log(str.codePointAt(0)?.toString(16)); // 41
console.log(str.codePointAt(1)?.toString(16)); // 3042
console.log(str.codePointAt(2)?.toString(16)); // 1f436

Writing `\u{CodePoint}` inside a string literal produces the character for that Code Point.

console.log(`\u{41}`); // A
console.log(`\u{3042}`); // あ
console.log(`\u{1f436}`); // 🐶

String.fromCodePoint, a static method added in ES2015, also converts a Code Point into a string. With this approach the Code Point can be held in a variable.

const codePoint = 0x41;
console.log(String.fromCodePoint(codePoint)); // A
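
Going the other way, it is often useful to list every Code Point in a string. The following is a minimal sketch that relies on Array.from iterating a string Code Point by Code Point; `toCodePoints` is a hypothetical helper name introduced here for illustration.

```typescript
// Hypothetical helper: lists the Code Points of a string in "U+xxxx" notation.
const toCodePoints = (str: string): string[] =>
  // Array.from splits the string by Code Point, so 🐶 stays in one piece.
  Array.from(str).map((char) => {
    // char is never empty here, so the fallback to 0 is only there for the type checker.
    const codePoint = char.codePointAt(0) ?? 0;
    return `U+${codePoint.toString(16).padStart(4, "0")}`;
  });

console.log(toCodePoints("Aあ🐶").join(" ")); // U+0041 U+3042 U+1f436
```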

Code Unit

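For a computer to actually handle characters, a Code Point has to be converted further into Code Units.
A Code Unit is the internal representation of a character in a program. Converting Code Units into a byte sequence, a run of 0s and 1s, is what lets the computer handle characters smoothly.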

Several methods are defined for converting Unicode Code Points into Code Units; JavaScript uses the one called UTF-16.
UTF-16 represents a Code Unit as an unsigned 16-bit integer. Inside JavaScript, therefore, a string is treated as a sequence of unsigned 16-bit integers.

Like Code Points, Code Units are usually written in hexadecimal.

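An unsigned 16-bit integer ranges from 0000 to FFFF.
Sixteen bits form a 16-digit binary number, which gives 2 ^ 16 = 65536 distinct values, and FFFF is 65535 in decimal, so the range runs from 0 to 65535.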
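
The arithmetic can be checked directly. This is only a quick sanity check, and the variable names are chosen for illustration.

```typescript
// Number of distinct values a 16-bit Code Unit can take.
const codeUnitCount = 2 ** 16;
// Largest value, which is written FFFF in hexadecimal.
const maxCodeUnit = codeUnitCount - 1;

console.log(codeUnitCount); // 65536
console.log(maxCodeUnit); // 65535
console.log(maxCodeUnit.toString(16)); // ffff
console.log(0xffff === maxCodeUnit); // true
```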

The charCodeAt method returns the Code Unit at a given index of a string.
Like codePointAt, it returns a number.

const str = "Aあ";
console.log(str.charCodeAt(0)); // 65
console.log(str.charCodeAt(0).toString(16)); // 41
console.log(str.charCodeAt(1).toString(16)); // 3042

There are also ways to convert Code Units into a string: `\uCodeUnit` and String.fromCharCode.

console.log(`\u0041`); // A
console.log(`\u3042`); // あ

const codeUnit = 0x41;
console.log(String.fromCharCode(codeUnit)); // A

Surrogate pairs

For A and あ the Code Point and the Code Unit were the same, but for 🐶 they differ.
To begin with, unlike A and あ, 🐶 has two Code Units.

const check = (str: string): void => {
  const length = str.length;
  for (let i = 0; i < length; i++) {
    console.log(i, str.charCodeAt(i).toString(16));
  }
};

// 0 41
check("A");

// 0 3042
check("あ");

// 0 d83d
// 1 dc36
check("🐶");

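Earlier I wrote that an unsigned 16-bit integer can represent 65536 values, but Unicode covers far more characters than that.
In other words, a single unsigned 16-bit integer cannot represent every character in Unicode.
UTF-16 therefore introduced a way to represent one character by combining two Code Units.
Such a pair of Code Units is called a surrogate pair.
🐶 is represented by a surrogate pair, which is why it had two Code Units.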

// U+1f436 (🐶) is represented by the combination of d83d and dc36

console.log(`\ud83d\udc36`); // 🐶
console.log(String.fromCharCode(0xd83d, 0xdc36)); // 🐶

A and あ, on the other hand, are each represented by a single Code Unit and are not surrogate pairs.
In UTF-16, then, characters represented by one Code Unit coexist with characters represented by two.
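
One practical consequence is that the length of a string counts Code Units, not characters. The sketch below contrasts the two ways of counting; `countCodeUnits` and `countCodePoints` are hypothetical helper names used only for this illustration.

```typescript
// Hypothetical helpers contrasting the two ways of counting.
const countCodeUnits = (str: string): number => str.length;
const countCodePoints = (str: string): number => Array.from(str).length;

const sample = "Aあ🐶";
console.log(countCodeUnits(sample)); // 4
console.log(countCodePoints(sample)); // 3

// Indexing works on Code Units, so it can land in the middle of a surrogate pair.
console.log(sample.charCodeAt(2).toString(16)); // d83d
console.log(sample.charCodeAt(3).toString(16)); // dc36
```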

UTF-16 conversion logic

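The conversion from Code Point to Code Unit follows a defined procedure.

First, Code Points from U+10000 to U+10FFFF become surrogate pairs; for any other Code Point, the Code Point itself is used as the Code Unit.

For a surrogate pair, the Code Point in binary is zero-padded to 24 digits.
It is then converted into two 16-bit sequences using the logic in the table below.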

	Code Point	UTF-16	Note
Logic	`000uuuuuyyyyyyxxxxxxxxxx`	`110110wwwwyyyyyy` `110111xxxxxxxxxx`	`wwww = uuuuu - 1`
U+1f436 (🐶)	`000000011111010000110110`	`1101100000111101` `1101110000110110`	`0000 = 00001 - 1`

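The table also includes U+1f436 (🐶) as an example.
In binary, 1f436 is 11111010000110110, so the conversion starts from its zero-padded form, 000000011111010000110110.

After the conversion, the Code Units of U+1f436 are the combination of 1101100000111101 (d83d) and 1101110000110110 (dc36).

A rough TypeScript implementation of this logic looks like the following.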

const encode = (codePoint: string): [string] | [string, string] => {
  const decimalCodePoint = parseInt(codePoint, 16);

  const isSurrogatePair =
    decimalCodePoint >= 0x10000 && decimalCodePoint <= 0x10ffff;

  if (!isSurrogatePair) {
    return [codePoint];
  }

  // Zero-pad to 24 bits: 000uuuuuyyyyyyxxxxxxxxxx
  const scalar = decimalCodePoint.toString(2).padStart(24, "0");
  const u = scalar.substring(3, 8);
  const y = scalar.substring(8, 14);
  const x = scalar.substring(scalar.length - 10);
  // wwww = uuuuu - 1
  const w = (parseInt(u, 2) - 1).toString(2).padStart(4, "0");
  return [
    parseInt(`110110${w}${y}`, 2).toString(16),
    parseInt(`110111${x}`, 2).toString(16),
  ];
};

console.log(encode("0041")); // [ "0041" ]
console.log(encode("3042")); // [ "3042" ]
console.log(encode("1f436")); // [ "d83d", "dc36" ]
console.log(
  String.fromCharCode(
    ...encode("1f436").map((codeUnit) => parseInt(codeUnit, 16))
  )
); // 🐶
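
The same conversion can also be written with integer arithmetic instead of string manipulation. The following is a sketch under the assumption that the input is a number in the range U+10000 to U+10FFFF; `toSurrogatePair` and `fromSurrogatePair` are hypothetical names, not part of any standard API.

```typescript
// Hypothetical helper: splits a Code Point in U+10000..U+10FFFF into a surrogate pair.
const toSurrogatePair = (codePoint: number): [number, number] => {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("Only U+10000 to U+10FFFF need a surrogate pair");
  }
  // Subtracting 0x10000 leaves a 20-bit value; this is the "wwww = uuuuu - 1" step.
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // 110110 followed by the upper 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // 110111 followed by the lower 10 bits
  return [high, low];
};

// Hypothetical helper: the inverse operation, from a surrogate pair back to a Code Point.
const fromSurrogatePair = (high: number, low: number): number =>
  0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);

const [hi, lo] = toSurrogatePair(0x1f436);
console.log(hi.toString(16), lo.toString(16)); // d83d dc36
console.log(fromSurrogatePair(hi, lo).toString(16)); // 1f436
console.log(String.fromCharCode(hi, lo)); // 🐶
```

Working on numbers avoids the round trip through binary strings, at the cost of being a little further from the bit layout shown in the table.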

Converting between strings and byte sequences

Web APIs make it possible to convert between strings and byte sequences.

TextEncoder converts a string into a byte sequence.
The encode method of a TextEncoder instance takes a string and returns a Uint8Array containing its UTF-8 encoding.

const encoder = new TextEncoder();
console.log(encoder.encode("A")); // Uint8Array(1) [ 65 ]
console.log(encoder.encode("あ")); // Uint8Array(3) [ 227, 129, 130 ]
console.log(encoder.encode("🐶")); // Uint8Array(4) [ 240, 159, 144, 182 ]

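UTF-8 represents a Code Unit as an unsigned 8-bit integer and represents one character with one to four Code Units.
As the cases of あ and 🐶 show, the result therefore does not match the UTF-16 representation, so take care.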
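
To see where the one to four bytes come from, here is a hand-written sketch of UTF-8 encoding for a single Code Point. `encodeUtf8` is a hypothetical helper, and it assumes the input is a valid Code Point outside the surrogate range; in real code TextEncoder is the better choice.

```typescript
// Hypothetical helper: encodes one Code Point into UTF-8 bytes by hand.
// Assumes a valid Code Point (no range or surrogate validation is done).
const encodeUtf8 = (codePoint: number): number[] => {
  if (codePoint < 0x80) {
    // 1 byte: 0xxxxxxx
    return [codePoint];
  }
  if (codePoint < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
};

console.log(encodeUtf8(0x41)); // [ 65 ]
console.log(encodeUtf8(0x3042)); // [ 227, 129, 130 ]
console.log(encodeUtf8(0x1f436)); // [ 240, 159, 144, 182 ]
```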

TextDecoder converts a byte sequence back into a string.
The constructor accepts an encoding label, which defaults to utf-8 when omitted.

Passing a UTF-8 encoded Uint8Array to the decode method of a TextDecoder instance created with utf-8 returns the decoded string.

const decoder = new TextDecoder("utf-8");
console.log(decoder.decode(new Uint8Array([65]))); // A
console.log(decoder.decode(new Uint8Array([227, 129, 130]))); // あ
console.log(decoder.decode(new Uint8Array([240, 159, 144, 182]))); // 🐶
console.log(
  decoder.decode(new Uint8Array([65, 227, 129, 130, 240, 159, 144, 182]))
); // Aあ🐶

References

UTF-16 - Wikipedia

This article gives an overview of continuation-passing style (CPS below) and walks through an example of putting CPS to use.

The code in this article was tested with TypeScript 4.7.4.

Passing the subsequent processing as an argument

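CPS is a style of programming in which the processing that should run after a function finishes is passed to that function as an argument.

For example, suppose we have the following code.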

const getLength = (str: string): number => str.length;

const n: number = getLength("hello");
console.log(n); // 5

The result of getLength("hello") is assigned to n, which is then passed to console.log.

Rewriting getLength in CPS gives the following.

const getLengthCps = <T>(cont: (x: number) => T, str: string): T =>
  cont(str.length);

getLength returned str.length, whereas getLengthCps passes str.length to cont, the "processing that runs after the function finishes".

Any function that accepts a number can be passed as cont.

const getLengthCps = <T>(cont: (x: number) => T, str: string): T =>
  cont(str.length);

getLengthCps(console.log, "hello"); // 5
console.log(getLengthCps((length) => length * 3, "foo")); // 9
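
Because every CPS function takes its continuation as an argument, several of them can be chained by nesting continuations. The following sketch assumes two additional functions, `doubleCps` and `describeCps`, which are hypothetical and written in the same shape as getLengthCps.

```typescript
const getLengthCps = <T>(cont: (x: number) => T, str: string): T =>
  cont(str.length);

// Hypothetical CPS functions in the same style: continuation first, input second.
const doubleCps = <T>(cont: (x: number) => T, n: number): T => cont(n * 2);
const describeCps = <T>(cont: (x: string) => T, n: number): T =>
  cont(`result: ${n}`);

// length -> double -> describe, written as nested continuations.
const describeDoubledLength = <T>(cont: (x: string) => T, str: string): T =>
  getLengthCps(
    (length) => doubleCps((doubled) => describeCps(cont, doubled), length),
    str
  );

describeDoubledLength(console.log, "hello"); // result: 10
```

Each step decides what to hand to the next one, and the final continuation supplied by the caller decides what happens to the end result.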

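Refactoring with CPS

As one use of CPS, let's write a program that runs the subsequent processing only when a certain condition is met.

Suppose we have the following code, which does not use CPS.
It is a little long, but it is enough to grasp the outline of getEmployeesPageProps and getOfficesPageProps.
The return values of these functions are meant to be passed to components to build the View.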

type Employees = string[];

type Offices = string[];

type GetPageProps<T> = (
  sessionId: string
) => { ok: false; message: string } | { ok: true; data: T };

const CORRECT_SESSION_ID = "123";

const auth = (
  sessionId: string
): { ok: true; companyId: string } | { ok: false } => {
  if (sessionId === CORRECT_SESSION_ID) {
    return {
      ok: true,
      companyId: "1",
    };
  }
  return {
    ok: false,
  };
};

const getEmployeesPageProps: GetPageProps<{ employees: Employees | null }> = (
  sessionId
) => {
  const authResult = auth(sessionId);
  if (authResult.ok === false) {
    return {
      ok: false,
      message: "Unauthorized",
    };
  }

  const dummyDb = new Map([["1", ["Alice", "Bob"]]]);

  return {
    ok: true,
    data: { employees: dummyDb.get(authResult.companyId) ?? null },
  };
};

const getOfficesPageProps: GetPageProps<{ offices: Offices | null }> = (
  sessionId
) => {
  const authResult = auth(sessionId);
  if (authResult.ok === false) {
    return {
      ok: false,
      message: "Unauthorized",
    };
  }

  const dummyDb = new Map([["1", ["London", "Paris"]]]);

  return {
    ok: true,
    data: { offices: dummyDb.get(authResult.companyId) ?? null },
  };
};

console.log(getEmployeesPageProps("xyz")); // { ok: false, message: 'Unauthorized' }
console.log(getEmployeesPageProps("123")); // { ok: true, data: { employees: [ 'Alice', 'Bob' ] } }
console.log(getOfficesPageProps("xyz")); // { ok: false, message: 'Unauthorized' }
console.log(getOfficesPageProps("123")); // { ok: true, data: { offices: [ 'London', 'Paris' ] } }

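getEmployeesPageProps and getOfficesPageProps both do the following.

1. Authenticate using the sessionId passed as an argument
2. If authentication fails, return a result saying so and stop
3. If authentication succeeds, use the obtained companyId to fetch Employees or Offices, and return a successful result containing it

Steps 1 and 2 are exactly the same code in both functions, so we want to share them.
Writing this in CPS, so that "the processing after authentication succeeds" is passed as an argument, achieves that.

The continueWithAuth function below receives "the processing after authentication succeeds" as cont and calls cont only when authentication succeeds.
This shares steps 1 and 2 and lets any processing be passed in as step 3.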

const continueWithAuth = <T>(
  cont: (companyId: string) => { ok: true; data: T },
  sessionId: string
): ReturnType<GetPageProps<T>> => {
  const authResult = auth(sessionId);
  if (authResult.ok === false) {
    return {
      ok: false,
      message: "Unauthorized",
    };
  }
  return cont(authResult.companyId);
};

const getEmployeesPageProps: GetPageProps<{ employees: Employees | null }> = (
  sessionId
) =>
  continueWithAuth((companyId) => {
    const dummyDb = new Map([["1", ["Alice", "Bob"]]]);
    return {
      ok: true,
      data: {
        employees: dummyDb.get(companyId) ?? null,
      },
    };
  }, sessionId);

const getOfficesPageProps: GetPageProps<{ offices: Offices | null }> = (
  sessionId
) =>
  continueWithAuth((companyId) => {
    const dummyDb = new Map([["1", ["London", "Paris"]]]);
    return {
      ok: true,
      data: {
        offices: dummyDb.get(companyId) ?? null,
      },
    };
  }, sessionId);
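
Once continueWithAuth exists, adding another authenticated page only requires writing step 3. The sketch below restates the shared pieces so that it runs on its own and adds a third page; `getCompanyNamePageProps` and its dummy data are hypothetical and only serve as an illustration.

```typescript
type GetPageProps<T> = (
  sessionId: string
) => { ok: false; message: string } | { ok: true; data: T };

const CORRECT_SESSION_ID = "123";

const auth = (
  sessionId: string
): { ok: true; companyId: string } | { ok: false } => {
  if (sessionId === CORRECT_SESSION_ID) {
    return { ok: true, companyId: "1" };
  }
  return { ok: false };
};

// Same shape as continueWithAuth above: cont runs only after authentication succeeds.
const continueWithAuth = <T>(
  cont: (companyId: string) => { ok: true; data: T },
  sessionId: string
): ReturnType<GetPageProps<T>> => {
  const authResult = auth(sessionId);
  if (authResult.ok === false) {
    return { ok: false, message: "Unauthorized" };
  }
  return cont(authResult.companyId);
};

// Hypothetical third page: only the post-authentication step is written here.
const getCompanyNamePageProps: GetPageProps<{ name: string | null }> = (
  sessionId
) =>
  continueWithAuth((companyId) => {
    const dummyDb = new Map([["1", "Example Inc."]]);
    return { ok: true, data: { name: dummyDb.get(companyId) ?? null } };
  }, sessionId);

console.log(JSON.stringify(getCompanyNamePageProps("xyz"))); // {"ok":false,"message":"Unauthorized"}
console.log(JSON.stringify(getCompanyNamePageProps("123"))); // {"ok":true,"data":{"name":"Example Inc."}}
```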