the space of all haiku
Nov 9, 2011

How many haiku can there possibly be? Due to their small, rigid form, we should be able to roughly determine the size of the haikuspace. We will use Japanese, as it is the only language suitable for proper haiku.*

* Of course words come to mean many things, but if you're used to reading and writing haiku in English with a 5-7-5 syllable pattern, I highly recommend investigating some Japanese haiku, and writing with something like 3-4-3 (syllables) or 2-3-2 (words) to get a feeling for the Japanese style.

Phonetic Attack

Japanese syllables are generally smaller than syllables in English. They consist of a consonant and a vowel, or a vowel by itself. Here are various estimates on the size of the Japanese sound inventory:

Source | Count | Notes
the fifty sounds (see also i ro ha) | 50 | only the basic sounds of Japanese, and so a lower bound on their total number
Wikipedia article on hiragana | 102 | the vowels a/i/u/e/o (5), Ya/Yu/Yo (3), Wa/Wo (2), Da/De/Do (3), K/S/T/N/H/M/R/G/Z/B/P (11) combined with a/i/u/e/o/ya/yu/yo (8), and N by itself (1), for a total of 5+3+2+3+11*8+1 = 102
Japanese pronunciation | 113 | 14 consonants * 8 vowels + syllabic n
The Range of Sounds in Japanese | 133 |
JMdict | 172 | from all kana entries, counting only syllable-characters, see below

We'll eliminate the 50, as it's clearly a low-boru. A haiku's 5-7-5 pattern is 17 syllables total, and so the upper bound is between 102^17 = 14002414191924244276669361796022272 ≈ 10^34.146 and 172^17 = 100921476901355254279645541839050637312 ≈ 10^38.004.
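The arithmetic behind these bounds is easy to reproduce. Here is a minimal sketch in Python; the function name phonetic_bound is mine, not part of the original calculation, and the inventory sizes are the ones from the table of estimates.

```python
import math

SYLLABLES_PER_HAIKU = 5 + 7 + 5  # 17 syllables in a 5-7-5 haiku


def phonetic_bound(inventory_size: int) -> int:
    """Number of 17-syllable strings over a sound inventory of the given size."""
    return inventory_size ** SYLLABLES_PER_HAIKU


# Inventory sizes taken from the estimates listed earlier (the 50 is dropped).
for size in (102, 113, 133, 172):
    total = phonetic_bound(size)
    print(f"{size}^17 = {total} ≈ 10^{math.log10(total):.3f}")
```

Python integers are arbitrary precision, so the full 35 to 39 digit values come out exactly.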

This is still a pretty wide range (about four orders of magnitude, or a factor of 10,000), and the numbers are pretty unfathomable. Here are a few others for comparison. A googol is 10^100. There are estimated to be about 10^80 atoms in the observable universe. The number of possible positions in chess is fewer than 10^46.7. There are about 10^26 molecules of water in a gallon of the stuff. But those don't really help, do they?

Dictionary Attack

From JMdict, a machine-readable Japanese dictionary containing nearly 160,000 entries, we extract the most common* kanji (ideographic) and kana (syllabic/reading) records from each entry. Syllables are counted by applying the regular expression substitution below, and then taking the length of the resulting string.

* Roughly, determined using JMdict's "priority" markers, otherwise using the first one. (Most entries (92%) have only one anyway.)

Thanks to memoization, these huge permutation counts take mere seconds to compute.
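For the curious, here is a minimal sketch of how such a memoized count might look. It is a reconstruction, not the original script: it assumes "fitting in n syllables" means ordered sequences of dictionary words whose syllable counts add up to exactly n, and the names count_lines and toy are hypothetical.

```python
from collections import Counter
from functools import lru_cache


def count_lines(word_lengths, target):
    """Count ordered word sequences whose syllable lengths sum to `target`.

    `word_lengths` holds one syllable count per dictionary entry.
    """
    by_length = Counter(word_lengths)  # syllable count -> number of words

    @lru_cache(maxsize=None)
    def ways(remaining):
        if remaining == 0:
            return 1  # exactly one way to fill nothing: the empty sequence
        # Choose the first word by its length, then fill the rest recursively.
        return sum(
            n * ways(remaining - k)
            for k, n in by_length.items()
            if 0 < k <= remaining
        )

    return ways(target)


# Hypothetical toy lexicon: three 1-syllable words, two 2-syllable, one 3-syllable.
toy = [1, 1, 1, 2, 2, 3]
five, seven = count_lines(toy, 5), count_lines(toy, 7)
print(five, seven, five * seven * five)  # last value: count of 5-7-5 poems
```

Because the three lines are filled independently, the 5-7-5 total is just the 5-syllable count squared times the 7-syllable count.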

Non-syllable-character removal regex:

s/([きしちにひみりぎじびぴ])[ゃゅょ]/\1/g

(Please let me know if there are other characters or cases which do not count as syllables.)
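As a sketch, the same substitution and length count can be written in Python like this. The names YOON and count_syllables are my own. Note that the character class, as written, only covers hiragana, so katakana combinations such as キャ are left at two characters.

```python
import re

# Same pattern as the substitution above: a small ゃ/ゅ/ょ following an
# i-column kana is dropped, so the pair counts as a single syllable.
YOON = re.compile(r"([きしちにひみりぎじびぴ])[ゃゅょ]")


def count_syllables(kana: str) -> int:
    """Length of the kana string once the small ya/yu/yo are removed."""
    return len(YOON.sub(r"\1", kana))


for word in ("さくら", "きょう", "きゃく"):
    print(word, count_syllables(word))
```

Everything else, including っ and ー, still counts as one syllable each under this rule.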

All characters used in JMdict's kana entries: (172 characters)

、〜ぁあぃいうぇえおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゆょよらりるれろゎわゐゑをんゝゞァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロワヰヱヲンヴヶ・ーヽヾ

Using All Kana Entries

Permutations fitting in 5 syllables = 13724842934828

Permutations fitting in 7 syllables = 2495396740987223584

Permutations of 5-7-5 lines = 470061162017233273469657393428518492432749056 ≈ 10^44.672154

Using Only Common* Kana Entries

Permutations fitting in 5 syllables = 94865603412

Permutations fitting in 7 syllables = 2411754014092300

Permutations of 5-7-5 lines = 21704538552340125271960104096068971200 ≈ 10^37.336551

* as denoted by JMdict's "priority" markers

Using Only Unique Kana Entries

Permutations fitting in 5 syllables = 21007905554

Permutations fitting in 7 syllables = 302428066343444

Permutations of 5-7-5 lines = 133471212337745718580643080665018704 ≈ 10^35.125388

Duplicate Kana Entries: 18784 out of 158685 entries.

The duplication is a bit of a wrinkle. It appears (by sifting randomly through duplicates) that the vast majority of duplicate readings are indeed for separate meanings/kanji, and so I am inclined to believe the "all entries" number. The truth is probably somewhere in the middle, but don't forget we've only used one dictionary.

Tangent: I would love to be able to get a number on the phonetic saturation of Japanese from this, perhaps after some input regarding syllable counting from those more fluent in Japanese. Until then, I'll just say this: if you map kana readings to kanji entries, 9377 readings (6.7%) have 2 or more kanji entries, 1161 (0.8%) have 5 or more, and 181 (0.1%) have 10 or more. Look at that beautiful power law action.

Summary

That was rather blustery, so here's the take-away: haikuspace is huge. Like 10^44 huge. On top of that, a phonetic approach doesn't reach a good upper bound, apparently because of homophones, which increase the haikuspace by almost seven(!) orders of magnitude. Some independent confirmation of that would be nice, though.

The next major step in finding a lower upper-bound would be to apply some sort of "sense-making" filter to the poems. This is beyond the scope of this writeup.

Some Random Haiku

A natural consequence of being able to permute all the words of a Japanese dictionary into haiku is being able to generate random haiku. And so here are a few of those that rose slightly above noise. Translations courtesy of mauler!

詰め込む間
ざあざあネオン
酸化物

While I cram
Whooshing neon
Oxide

狂暴戸
レッドテープ子
史籍ポロ

Enraged door
Red tape child
A history of polo

険悪絵
願掛け火食
公有気

Hostile pictures
Prayer cooked food
Public aspiration

孝道子
引ったくり急
穴居人

Michiko Takashi
Sudden snatching
Caveman

国花櫛
結論回目
圏外死

National flower comb
Conclusionth
Out of range death

代弁課
身の上西部
簾戸葉書

Department of spokesmen
Circumstances western
Bamboo blinds postcard

沿海二
心嚢浸す
教唆罪

Coast two
Soak pericardium
Criminal incitement

幼児予示
ボンレスハム荷
バラスト医

Infant foreshadowing
A load of boneless ham
Ballast medicine

横に頃
民利草規矩
横丁科

That horizontal time
The people's interests, grass rules
Department of alleys

投げ入れミ
拒絶滑りい
浸食シ

Throw mi
Rejection slippage i
Erosion shi

表立つ
夏枯れ無窮
真鶸説

Stand out
Summer slump eternal
Siskin theory
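For anyone who wants to roll their own, here is a rough sketch of one way to draw random 5-7-5 poems from a lexicon grouped by syllable count. It is not the generator used for the poems here: LEXICON is a hypothetical toy word list, and picking a length first and then a word does not sample uniformly over all possible lines.

```python
import random

# Hypothetical toy lexicon keyed by syllable count (not taken from JMdict).
LEXICON = {
    1: ["ひ", "き"],
    2: ["ねこ", "やま", "うみ", "つき"],
    3: ["さくら", "ほたる"],
}


def random_line(target, rng):
    """Pick words at random until exactly `target` syllables are filled."""
    words, remaining = [], target
    while remaining > 0:
        # Only consider word lengths that still fit in the line.
        length = rng.choice([k for k in LEXICON if k <= remaining])
        words.append(rng.choice(LEXICON[length]))
        remaining -= length
    return words


def random_haiku(seed=None):
    """Return three lines (lists of words) of 5, 7 and 5 syllables."""
    rng = random.Random(seed)
    return [random_line(n, rng) for n in (5, 7, 5)]


for line in random_haiku():
    print("".join(line))
```

The loop always terminates here because the toy lexicon contains 1-syllable words; a lexicon without them would need backtracking or a retry.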

Update 2013 March 22

Having just read this exploration of the size of Twitterspace, it occurred to me that I could use written language entropy as another estimate on the size of haikuspace:

number of haiku = 2^((5 + 7 + 5) * b)

where b is the number of bits per character for Japanese. I'm going to use 2.4 (= 452337 * 8 / 1519224) from this paper (html version via google). This gives 2^40.8 ≈ 10^12.3 haiku, a little more than a bit shy (as expected) of my previous estimate of 10^44.
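The entropy estimate is a two-line calculation. A minimal sketch, with entropy_estimate as a made-up name and each of the 17 syllables treated as one character, as in the formula:

```python
import math


def entropy_estimate(bits_per_char, characters=5 + 7 + 5):
    """Return (total bits, log10 of the haiku count) for 2^(characters * b)."""
    bits = characters * bits_per_char
    return bits, bits * math.log10(2)


ratio = 452337 * 8 / 1519224  # the ratio quoted from the paper, before rounding
print(f"b = {ratio:.4f}")

bits, log10_count = entropy_estimate(2.4)  # 2.4 is the rounded value used here
print(f"2^{bits:.1f} ≈ 10^{log10_count:.1f}")
```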


See Also

kigo—season word

senryu, tanka, renga, waka—other haiku-like forms

