Letter and polygram frequencies

An expanded version of this blog post.

Part 1

In order of decreasing frequency, here are the common English polygrams (single letters, bigrams, trigrams, etc., also called n-grams) gathered from a private corpus. We arbitrarily stop at the least frequent single letter, namely "z". There are 326 polygrams including the 26 single letters.

e t a o i n s r h l d u c th m g y p he w f in b the re er an on at ha v ou it en ng or k to es st ing te ar is ti al nd se nt ed as ve le me ll ea hi ne co hat of li de be tha no ri ca ic ot ro ly and ho ut om us that so io ra el et ma wh ce pe ta wa ch la fo ion ur si ent ec di il do ge ee pr her thi yo ac ns ow for ul ke all un ere we ss id ct em rs lo tio wi tion you rt ay ad mo ld po ai tr bu su mi wo hin oo pa pl ts ter gh av sh os nc ol ig ther ey ab am op ir na ni im ie if ate j thin not sa x ver fi bl are go bo one ev rea out ave tu ati oul was ry uld ould hey ome vi ba hav they here iv res ci tt ith ck ag con fe but ers have ia lly eve ap rd wit fr ally da pro sta mp his tin with ill ny ex hou ess ty cl com ht nk up sp ye ei can ear ght int est nce ki ff ep ted use ug ons od igh men ov ting som some pp ive atio our ore ation hing eo ect ak uc sti cou ple any ue ide ust ju ik pre gr cr der ef gi tho rr um ink ga ls there ike wha fa act han sc what ant thing ment q ell ua cu ight art ew bi whe oi oun ugh this nte per ica ble ist wou lik au like ru ove ob woul would rn z

Here are the corresponding log frequencies. The normalization is such that the sum of the frequencies of the single letters is unity.

-2.133 -2.314 -2.526 -2.549 -2.590 -2.685 -2.728 -2.866 -2.978 -3.165 -3.365 -3.495 -3.523 -3.571 -3.684 -3.817 -3.823 -3.879 -3.883 -3.917 -3.939 -3.947 -4.083 -4.099 -4.194 -4.246 -4.356 -4.364 -4.414 -4.539 -4.584 -4.609 -4.626 -4.644 -4.663 -4.668 -4.684 -4.719 -4.747 -4.769 -4.797 -4.802 -4.805 -4.827 -4.836 -4.852 -4.889 -4.970 -4.984 -4.987 -4.989 -5.002 -5.013 -5.055 -5.082 -5.097 -5.168 -5.198 -5.207 -5.226 -5.234 -5.266 -5.296 -5.308 -5.325 -5.344 -5.352 -5.371 -5.373 -5.381 -5.381 -5.386 -5.389 -5.397 -5.417 -5.432 -5.439 -5.461 -5.499 -5.500 -5.514 -5.543 -5.544 -5.548 -5.569 -5.579 -5.589 -5.606 -5.610 -5.612 -5.639 -5.657 -5.677 -5.679 -5.686 -5.692 -5.700 -5.706 -5.711 -5.735 -5.751 -5.751 -5.756 -5.765 -5.777 -5.781 -5.785 -5.788 -5.789 -5.804 -5.810 -5.811 -5.813 -5.820 -5.840 -5.850 -5.852 -5.858 -5.866 -5.884 -5.890 -5.899 -5.900 -5.904 -5.908 -5.910 -5.910 -5.929 -5.936 -5.957 -5.981 -5.982 -6.004 -6.004 -6.016 -6.018 -6.021 -6.034 -6.056 -6.056 -6.058 -6.063 -6.068 -6.073 -6.079 -6.093 -6.100 -6.103 -6.107 -6.109 -6.119 -6.125 -6.125 -6.134 -6.137 -6.151 -6.164 -6.165 -6.165 -6.169 -6.172 -6.172 -6.173 -6.184 -6.200 -6.200 -6.201 -6.206 -6.210 -6.213 -6.216 -6.217 -6.222 -6.225 -6.261 -6.266 -6.292 -6.305 -6.313 -6.331 -6.336 -6.338 -6.343 -6.346 -6.346 -6.347 -6.357 -6.362 -6.366 -6.369 -6.371 -6.381 -6.388 -6.399 -6.416 -6.423 -6.429 -6.440 -6.443 -6.449 -6.452 -6.458 -6.463 -6.465 -6.466 -6.468 -6.476 -6.492 -6.514 -6.523 -6.526 -6.542 -6.542 -6.544 -6.553 -6.557 -6.565 -6.566 -6.573 -6.574 -6.574 -6.584 -6.592 -6.594 -6.600 -6.603 -6.606 -6.608 -6.614 -6.615 -6.619 -6.627 -6.631 -6.634 -6.636 -6.639 -6.650 -6.652 -6.653 -6.660 -6.662 -6.667 -6.669 -6.673 -6.679 -6.679 -6.686 -6.688 -6.689 -6.690 -6.690 -6.702 -6.705 -6.712 -6.715 -6.715 -6.717 -6.717 -6.720 -6.724 -6.726 -6.733 -6.736 -6.737 -6.742 -6.743 -6.747 -6.756 -6.760 -6.781 -6.786 -6.792 -6.795 -6.800 -6.800 -6.807 -6.811 -6.813 -6.819 -6.819 -6.825 -6.830 -6.840 -6.840 -6.840 -6.845 -6.847 -6.853 -6.853 -6.853 -6.857 -6.857 -6.860 -6.863 -6.868 -6.871 -6.886 -6.887 -6.892 -6.894 -6.895 -6.903 -6.904 -6.905 -6.905 -6.905 -6.909 -6.911 -6.914 -6.918 -6.923 -6.926 -6.931 -6.931 -6.938 -6.941 -6.944 -6.948 -6.949 -6.949 -6.949 -6.950 -6.951 -6.951 -6.955 -6.957

Part 2

To avoid the redundancy between polygrams like "woul" and "would" in Part 1, we choose a subset of them by iteratively finding the length of the longest polygrams more frequent than the original frequency of "z", then greedily selecting the most frequent polygram among those with that length. We cut out that polygram from the corpus, then reanalyze and repeat. (This was a long computation.) The polygrams got eliminated in the following order. This list of 126 polygrams has far less redundancy.

5 letters: ation there thing would 4 letters: that they have tion ally with ting some ther what this ight like thin 3 letters: the ing and ent for you ter ver one are not ate out was con all but pro res ere ill com rea ear use can cou any nce ple ust ive hou ers ell ess ant ble 2 letters: in to it re is on ed er es of en ar or as al an ic be st se ly ch ow at id wh le ac ay me un ad ro su ge if th mo go pe so lo do de us po ab he ir im am ld tr ur ne ap gh il we ve ts ke no ma ex ul ta te co pl

The remaining single letters in decreasing order of frequency were s t i a m e d o p c h l u f y r b n w g k v j q z

The letter "x" dropped below the "z" frequency threshold after elimination of the "ex" bigram.

Part 3.1

Part 2 cut out occurrences of a polygram out of a word, so, for example, the word "others" became two words "o" and "s" in the corpus after elimination of "ther". In this part, we instead replace a common polygram with a single composite character, so the 5-letter word "other" becomes the 3-letter word "o(the)r", where (the) counts as only one letter. We choose composite characters greedily, maximizing at each step the scoring function N*(L-1), where N is the number of occurrences and L is the length of the polygram. This scoring function is equivalent to the total number of keystrokes that would be saved to type the entire corpus if the polygram could be typed with one keystroke. After substituting highest scoring composite polygram throughout the corpus, we reanalyze and repeat. (This was a long computation.) We arbitrarily stop when the best score drops below the frequency of "z", finding 126 polygrams:

the in re an that on er ou it st en (in)g or al at to is ar le as ed th es of ic (an)d ly have om be se wh no (ou)ld i(on) but ac ight peop(le) (en)t id ll w(it)h like f(or) ch y(ou) ay ad pro me ab(ou)t we su s(om)e ge he lo ve if a(re) do v(er) un (th)(in)k go (ou)gh de (no)t po w(as) (on)e mo (the)y ju(st) fr(om) am so ir us im te ne c(on) ce il tr t(er) ap how (al)(ly) (be)cau(se) ct c(om) ab (at)e (al)l i(ll) o(the)r ur (at)(i(on)) ak pe ig (an)y ts p(re) ol sh (the)(re) k(no)w la (th)((in)g) d(if)fe(re)n ex (ou)nd (th)(is) (wh)(at) (wh)(ic)h c(an) (ac)t li ri ro w((ou)ld) m((en)t)

Part 3.2

Here are the 218 composite polygrams selected in greedy order when using the scoring function N*L instead of N*(L-1). We now need to explicitly forbid creating useless one-letter composite characters.

th in (th)e re an on at er ou st en (in)g it or al to is ar le as ed es of ic (an)d ly om (th)(at) be se wh no ve i(on) ac (en)t ha id ll f(or) y(ou) ay ch (ou)ld me ut ro li we su gh ad ge lo he ab wi ke pe do a(re) v(er) if un go de (no)t ri s(om)e po (ou)t w(as) (on)e mo ((th)e)y so am ir us ne c(on) b(ut) ce ct (pe)op(le) (ha)(ve) te (al)(ly) t(er) (th)(in)k il (wi)(th) ho c(om) im p(ro) ju(st) (gh)t fr(om) (at)e (al)l (at)(i(on)) (an)y ((th)e)r ur fe p(re) la ma tr ((th)e)(re) (th)((in)g) ap ex ts co (th)(is) (wh)(at) c(an) (ac)t (er)s (ou)n w((ou)ld) m((en)t) ss ta (li)(ke) (no)w ye qu (be)cau(se) (ar)t (an)t up sh d(on) (ab)((ou)t) ul ig ag bo my (en)d (ge)t (ar)d pl sa i(st) (ou)r by (wh)(ic)h w(or) em (ho)w (wh)o (mo)(re) ol k((in)g) ti h(as) (ou)(gh) i(ll) um d(er) i(ve) p(er) (es)s w(ay) (in)d od (it)y (re)d v(en) ok (th)(an) gu (in)e ra (re)n o(((th)e)r) (do)(es) (ac)k sp ((th)e)m publ(ic) (or)t (id)e fo (wh)e(re) (ic)e d(id) ty s(ay) mu(ch) (wi)(ll) (we)(ll) (as)s k((no)w) pr t((in)g) op ud d(ay) cl (wh)(en) (ou)s (an)s ck a(st) ((ou)n)d b(le) ci gr wn (ab)(le) (v(er))y (of)f (in)t

Part 3.3

Here are the 62 composite polygrams selected in greedy order when using the scoring function N*(L-2).

the that ing tion and ould ent have for all people think with you about some ter though what ver ight this like not one are ate because was just from con but which pro res more ere ill com rea ear actu(all)y can than any does nce time ive well know (the)re ers when ess ound ant able th(ing) out part

Part 4.1

If we lower the frequency floor to half the frequency of "z" and repeat the analysis of Part 1, we find the following 294 additional polygrams and corresponding log frequencies. "z" is reprinted for convenience.

z hink think und hen don cal rom oug lin now ain ough fro kin tic ine abo pi qu mor ort par get bou oth ds oc rm gu eal pu cti jus ren just bout abou from eas about ok abl eg my end ste tl ame nti nde ake mu ard ure tra rat sho ose by wer whi see ven how min lea oin ast ead king ind iti sin eop rk opl ice va wor peo ople peop mm eopl peopl eople people ost ys ass ever sed oe ui tor red fu een lu who app has din nn rc more mb rg sn str age ime yea ett thou pla br rin ual era aw tim ms nl ous du ding othe tte tat vo ite den les over man ud ten ub other nal oes way lt ib spe inc rec ase ins eri ity enc tes rl ack nts ich ies did ran son hem kn ions ote lit lat hic real than sto pos pt ely cha eat lic dr rou doe ssi sur ah does ks dis them sen ong pen time ont af ber wil wel ree inte og ial uch day eme ft ery kno wn nder por unt say als tur lle will rti well houg know hough thoug nat though ona tan ans dn nter hea nin ral ip own car ctio ction ric rig eir aus tri ring ook eth omp off gs nu pri ood stat when har sid were nly tiv ound tly fin ele whic hich which comp ning mon ces hei ire cc able ph yi mak anc gre cau ical ues very pres tions side hr ctu tal new part hu nst lan fic

-6.957 -6.964 -6.965 -6.967 -6.971 -6.973 -6.974 -6.976 -6.977 -6.978 -6.981 -6.984 -6.984 -6.985 -6.986 -6.986 -6.988 -6.991 -6.991 -6.999 -7.001 -7.003 -7.006 -7.020 -7.021 -7.022 -7.032 -7.034 -7.040 -7.042 -7.047 -7.055 -7.069 -7.076 -7.076 -7.078 -7.086 -7.087 -7.088 -7.088 -7.089 -7.095 -7.096 -7.097 -7.100 -7.105 -7.105 -7.116 -7.116 -7.124 -7.133 -7.137 -7.138 -7.146 -7.147 -7.152 -7.153 -7.160 -7.161 -7.165 -7.168 -7.169 -7.169 -7.171 -7.177 -7.178 -7.180 -7.185 -7.187 -7.189 -7.191 -7.193 -7.193 -7.197 -7.198 -7.201 -7.204 -7.204 -7.207 -7.207 -7.208 -7.209 -7.211 -7.211 -7.212 -7.213 -7.213 -7.213 -7.216 -7.216 -7.217 -7.217 -7.217 -7.219 -7.220 -7.222 -7.230 -7.238 -7.238 -7.240 -7.241 -7.242 -7.243 -7.244 -7.247 -7.254 -7.255 -7.261 -7.262 -7.265 -7.265 -7.270 -7.273 -7.277 -7.280 -7.284 -7.286 -7.288 -7.298 -7.299 -7.300 -7.301 -7.303 -7.304 -7.307 -7.308 -7.310 -7.311 -7.313 -7.314 -7.314 -7.316 -7.316 -7.317 -7.318 -7.319 -7.332 -7.339 -7.342 -7.342 -7.342 -7.348 -7.353 -7.356 -7.370 -7.373 -7.379 -7.386 -7.386 -7.386 -7.387 -7.388 -7.389 -7.390 -7.390 -7.395 -7.395 -7.396 -7.399 -7.403 -7.404 -7.404 -7.405 -7.410 -7.411 -7.411 -7.416 -7.426 -7.427 -7.427 -7.428 -7.428 -7.432 -7.435 -7.437 -7.439 -7.440 -7.442 -7.443 -7.452 -7.452 -7.455 -7.457 -7.462 -7.463 -7.464 -7.466 -7.466 -7.474 -7.475 -7.478 -7.478 -7.479 -7.479 -7.480 -7.481 -7.482 -7.485 -7.490 -7.491 -7.499 -7.502 -7.507 -7.508 -7.509 -7.509 -7.517 -7.518 -7.520 -7.520 -7.521 -7.522 -7.522 -7.525 -7.525 -7.527 -7.528 -7.531 -7.534 -7.540 -7.540 -7.540 -7.540 -7.541 -7.541 -7.543 -7.544 -7.545 -7.551 -7.551 -7.553 -7.553 -7.553 -7.557 -7.558 -7.560 -7.562 -7.564 -7.567 -7.573 -7.573 -7.574 -7.577 -7.579 -7.581 -7.582 -7.584 -7.584 -7.587 -7.591 -7.592 -7.593 -7.596 -7.599 -7.599 -7.601 -7.602 -7.602 -7.605 -7.605 -7.606 -7.609 -7.609 -7.610 -7.610 -7.611 -7.613 -7.614 -7.614 -7.616 -7.618 -7.619 -7.621 -7.622 -7.623 -7.628 -7.629 -7.631 -7.636 -7.637 -7.637 -7.637 -7.638 -7.640 -7.641 -7.642 -7.646 -7.647 -7.647 -7.647 -7.647 -7.649 -7.649 -7.649 -7.649

Part 4.2

Rerunning greedy elimination as in Part 2 selected the following 233 polygrams:

6 letters: people though 5 letters: ation there thing would think about other ction which 4 letters: that they have ally with ting some what ment this ight like tion just from ould ever more king ding over here than does them time inte nder will well know when were comp able ther ning side ound 3 letters: the and ing for you ent ate one are not was ter con all but res pro rea ear use can ive any nce ess ers ant out art ect act ist get ill com ain don ame ake see cou how ine wor ose pre ass ice ion per has ind ost ure who eve age ast ard int way ity app est ite een end ort ies ack day uch did als say ase ide pla off tor ood ous ree ble cha new 2 letters: to it in is ed of on er al re be or an st ar le ic ly at en ad es as if he ol ur ch se un ir lo me de us no ri th do il go im ro am ex so tr ig up ul we id ma po el sh cl ac qu te su ve sp et my bo by li um ra ge ca ho la mo ag ok ne ab ce ug ts op wh gu wo si ck av ah ty ud af ci ut wn ap

Remaining single letters: s i t a m p c e d n b r y f h l o g w u v k j z x ("q" has dropped below the floor after elimination of the "qu" bigram.)

Part 4.3.1

Extending Part 3.1 (scoring function N*(L-1)) with the score threshold to be half the frequency of "z" finds the following 97 additional polygrams:

ss (er)s ag ye qu publ(ic) (an)t (ar)t up ti(me) d(on) (ou)t ul co em a(in) (en)d fe my (ge)t (ar)d bo pl i(st) c(ou) by sa w(or) (wh)e(re) ok mu(ch) (mo)(re) (wh)o p(or)t i(ve) ut h(as) (go)od um d(er) ev(en) p(er) (es)s w(ay) (in)d (pro)b(ab)(ly) ((ac)t)u((al)(ly)) (it)y (ac)k pr (re)d (p(re))tty ck gu (be)fo(re) i(ou)s (th)(an) (do)(es) (in)e (ye)ah op (the)m m(ak)e (id)e a(st) (in)(te)(re)(st) (ic)e d(id) i(es) ke s(ay) w(i(ll)) (pro)b(le)m (we)(ll) (th)((ou)gh) av (qu)e(st)(i(on)) iv (as)s cl gr (an)s t((in)g) (the)r ud d(ay) sp (wh)(en) af ie h(ap)p(en) (ab)(le) wn (we)(re) (v(er))y (of)f (th)o(se)

Part 4.3.2

Extending Part 3.2 (scoring function N*L) with the score threshold to be half the frequency of "z" finds the following 160 additional polygrams:

vi (ne)w ((ac)t)u((al)(ly)) ((th)e)(ir) ah pp (ha)d b(er) vo (on)g (we)(re) (it)e (y(ou))r s(pe) m(it) af a(ge) (ri)((gh)t) p((ar)t) el (p(ro))b(ab)(ly) (ti)(me) u(re) d(is) i((gh)t) i(es) (be)(en) m(ar) d(if)(fe)((re)n) u(se) c((ou)ld) (re)((al)(ly)) (re)e p(le) a(in) (sh)((ou)ld) (as)e (i(on))s (on)(ly) (re)(ad) ld (ye)(ar) ((th)e)n iz (in)(te)(re)(st) i(re) (th)o(se) (le)d v((in)g) ru m(in) o(v(er)) h(is) (be)t c((ou)r) (th)((ou)(gh)) (al)(so) e(st) s(er) (mo)(st) (ig)n pt m(ay) (go)(od) (se)e (at)(ed) c((ou)n) wo fa bu ps w((an)t) oh e(v(en)) (la)w ry (re)(as)(on) (le)(ss) (ic)k ob c(or) (se)(em) f(ir)(st) (be)(li)e(ve) (su)(re) (w(or))k ph sc (ac)h (c(om))p g(re) (ac)e ue (an)(ce) (be)((in)g) (p(re))t(ty) s(ch)o(ol) (le)g (ct)(i(on)) (be)(fo)(re) (en)(ce) p((er)s)(on) (go)((in)g) i(le) ca (s(om)e)((th)((in)g)) m(on) c(er)(ta)(in) (ye)(ah) (un)(d(er)) br (ic)(al) b(us) (to)o hi (ma)(ke) (is)h ef (th)(ou)((gh)t) (po)l s(id) (lo)t (go)t m(an) (st)(i(ll)) (sa)(id) w(on) (go)(v(er))n(m((en)t)) (lo)w (it)s (p(ro))(b(le))m (po)s c(le)(ar) iv g(en)(er) (wh)y cr ((do)(es))n (f(or))m (st)((at)e) b((ac)k) c(ar) (tr)y wr (ar)y (at)(es) c((en)t) s(in)(ce) (ap)p(a(re))nt(ly) (d(id))n s(he) (en)s sy(st)(em) i(al) (ic)t (in)(to) ia (or)y t(ur)n (he)(re)

Part 4.3.3

Extending Part 3.3 (scoring function N*(L-2)) with the score threshold to be half the frequency of "z" finds the following 88 additional polygrams:

ect ain ist use believe get point count (pro)bably don ide est been end re(all)y pretty port ough sion make see only how publi ine our differ(ent) ind wor ose ice ame ass per much article ost has even ure who also der good ast age happen ard o(the)r way first mean ity ite ack act ies school look into ious remember yeah ical change sta place day ree said did say ase go(ver)nm(ent) (pro)blem gener ques(tion) between tor speci off possibl take maybe appar(ent)ly system long new

Part 5

Here is the Haskell source code. The main optimization was to preprocess the corpus into a list of words with word counts with sort | uniq -c.

The Google Books n-grams corpus is another choice, but I don't like how it includes as common words words like "iii" which is probably an artifact of OCR scanning Roman numeral page numbers.

Previous vaguely similar attempt, in Perl.. Previous thoughts invoking Braille contractions. I have heard anecdotes that Braille contractions were derived from frequencies in the Bible.