Solution - Unimaginably Twisted Files

by Joseph DeVincentis

Answer: ZEROES

The first step in solving this puzzle is diagnosing just what is wrong with the file. Depending on your browser or text file viewer, you may be able to read certain characters, but for the most part you will just see blanks or replacement characters. The file begins with a UTF-8 byte order mark, but the content of the file is not valid UTF-8. The key is that the entire file is encoded using what the UTF-8 standard calls overlong sequences.

UTF-8 uses a variable number of bytes to encode each character. The number of bytes used depends on the numerical value of that character. Characters up to 127 are encoded in a single byte, making UTF-8 compatible with ASCII. Characters a little larger than that are encoded with two bytes, using bytes in the range 128-255, where the first byte also encodes how many bytes make up this character. The next range of characters is encoded in three bytes, and beyond that, four bytes are used.

Specifically, the way these multi-byte encodings work is as follows:

  • 110xxxxx 10xxxxxx encodes a two-byte character, where the x's represent 11 bits of the character value.
  • 1110xxxx 10xxxxxx 10xxxxxx encodes character values up to 16 bits in three bytes.
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx encodes character values up to 21 bits in four bytes.
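The bit patterns above can be sketched in a few lines of Python. This is an illustration only: the function name encode_utf8 and its length parameter are made up for this example, and the function deliberately does not check that the shortest possible form is used.

```python
def encode_utf8(code_point: int, length: int) -> bytes:
    """Encode code_point in exactly `length` bytes (1 to 4) using the
    bit patterns listed above. Hypothetical helper: it performs no
    shortest-form check, so it will happily emit overlong sequences."""
    if length == 1:
        if code_point > 0x7F:
            raise ValueError("value does not fit in one byte")
        return bytes([code_point])

    payload_bits = {2: 11, 3: 16, 4: 21}[length]   # number of x's in the pattern
    lead_prefix = {2: 0xC0, 3: 0xE0, 4: 0xF0}[length]
    if code_point >= 1 << payload_bits:
        raise ValueError("value does not fit in this many bytes")

    out = []
    remaining = code_point
    # Peel off six bits at a time for the 10xxxxxx continuation bytes.
    for _ in range(length - 1):
        out.append(0x80 | (remaining & 0x3F))
        remaining >>= 6
    # Whatever is left goes into the lead byte after its length prefix.
    out.append(lead_prefix | remaining)
    return bytes(reversed(out))


# Shortest-form encodings agree with Python's built-in encoder.
print(encode_utf8(0xE9, 2) == "é".encode("utf-8"))
print(encode_utf8(0x20AC, 3) == "€".encode("utf-8"))
```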

The original UTF-8 specification went on to define five-byte and six-byte encodings as well, but Unicode was subsequently limited to code points up to U+10FFFF, which makes those encodings never valid. However, it is possible to use these (or the other encodings above) to encode characters using more bytes than necessary. For instance, you could encode an ASCII space, 00100000, as 11000000 10100000. The standard declares such "overlong sequences" invalid.
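A quick way to see that a conforming decoder rejects the overlong space is to hand it to Python's built-in strict UTF-8 decoder. This is just a sketch; the helper name is_strict_utf8 is made up for this example.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The overlong two-byte space from the text: 11000000 10100000.
overlong_space = bytes([0b11000000, 0b10100000])

print(is_strict_utf8(b" "))             # the ordinary one-byte space
print(is_strict_utf8(overlong_space))   # the overlong form
```

The first call reports True and the second False, which is why most viewers show replacement characters for this file.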

The file in this puzzle is a UTF-8 text file in which many characters are encoded using overlong sequences which are just one byte too long. When you manage to convert it to a valid text file, you get the following:

Actually, you get several repetitions of that, each using different lookalike characters which (to varying degrees) resemble the letters they are supposed to be. When you manage to interpret this correctly, what you get is:

Some things are not as they seem to be. This holds true in many aspects of life, as it does here. Sometimes, things are not as they seem due to natural occurrences, but most often, somebody is trying to hide something. For instance, people wear disguises when they want to appear as other people, or simply not be recognized for who they are. You might say that this text is a disguise, but it is an unusually helpful one, as by reading closely, you may discover much about the nature of the true text you have yet to read. "Bad, bad text," you say. You had to fix all those errors. So many complaints about things being too long. Making a character be the right length is nothing. Making it be too long is really something! Don't go overboard, though; just one extra byte's enough. Each one of us contributes just a little bit to the complete disguise, but in the end it's all there. Finally, the disguise is too short, so it is repeated many times over. This may help you deal with parts of the disguise you can't figure out.

What this message is trying to tell you is that there is a hidden text underlying the one you are reading. The encoding works as follows: For each character encoded correctly, write a zero bit. For each character encoded in an overlong sequence, write a one bit.
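The extraction rule can be sketched in Python as follows. This is not the tool used to build or solve the puzzle. The names min_length, decode_lenient and hidden_bits are invented for this example, the byte order mark is assumed to have been stripped by the caller, and the sketch stops at the raw bit string because how the bits are grouped into the hidden text is not covered here.

```python
def min_length(code_point: int) -> int:
    """Number of bytes in the shortest valid UTF-8 form of code_point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def decode_lenient(data: bytes):
    """Decode 1- to 4-byte sequences WITHOUT rejecting overlong forms.
    Returns a list of (code_point, bytes_used) pairs."""
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"unexpected lead byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any(b >> 6 != 0b10 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        result.append((cp, length))
        i += length
    return result


def hidden_bits(data: bytes) -> str:
    """One bit per character: '1' if it was overlong, '0' otherwise."""
    return "".join(
        "1" if length > min_length(cp) else "0"
        for cp, length in decode_lenient(data)
    )


# Tiny made-up sample: correct 'A', overlong space, correct 'é', overlong 'é'.
sample = b"A" + b"\xc0\xa0" + "é".encode("utf-8") + b"\xe0\x83\xa9"
print("".join(chr(cp) for cp, _ in decode_lenient(sample)))
print(hidden_bits(sample))  # 0101
```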

The result is a list of Unicode character names, again written in lookalike characters. Once deciphered, the list reads:

  • ARMENIAN CAPITAL LETTER AYB
  • HANGZHOU NUMERAL TEN
  • MALAYALAM LETTER TTHA
  • DIGIT SEVEN
  • CANADIAN SYLLABICS TA
  • LATIN LETTER BILABIAL CLICK
  • COMMA
  • SPACE
  • GEORGIAN CAPITAL LETTER PAR
  • HANGZHOU NUMERAL TEN
  • LATIN CAPITAL LETTER O
  • CANADIAN SYLLABICS CARRIER GHO
  • CYRILLIC CAPITAL LETTER IE
  • DIGIT SIX
  • COMMA
  • SPACE
  • N-ARY UNION
  • HANGZHOU NUMERAL TEN
  • APL FUNCTIONAL SYMBOL ZILDE
  • CYRILLIC CAPITAL LETTER ES
  • Z NOTATION BAG MEMBERSHIP
  • DIGIT SIX
  • COMMA
  • SPACE
  • TAI LE LETTER O
  • HANGZHOU NUMERAL TEN
  • CYRILLIC CAPITAL LETTER FITA
  • REVERSED DOUBLE STROKE NOT SIGN
  • DIGIT TWO
  • TAI THAM LETTER WA
  • COMMA
  • SPACE
  • YI RADICAL DDUR
  • HANGZHOU NUMERAL TEN
  • DIVIDES
  • DIGIT SEVEN
  • CHEROKEE LETTER GV
  • NKO LETTER EE
  • COMMA
  • SPACE
  • ANGSTROM SIGN
  • LISU LETTER NA
  • CANADIAN SYLLABICS CARRIER THE
  • SPACE
  • CANADIAN SYLLABICS TE
  • HANGZHOU NUMERAL TEN
  • APL FUNCTIONAL SYMBOL UP CARET TILDE
  • CANADIAN SYLLABICS CARRIER KHE
  • GREEK LETTER DIGAMMA
  • YI SYLLABLE TU
  • COMMA
  • SPACE
  • ARMENIAN CAPITAL LETTER BEN
  • BUGINESE LETTER SA
  • LISU LETTER ZHA
  • SPACE
  • TIFINAGH LETTER YADD
  • VECTOR OR CROSS PRODUCT
  • TAI THAM LETTER HIGH KHA
  • RUNIC LETTER EHWAZ EH E
  • REVERSED PILCROW SIGN
  • CANADIAN SYLLABICS MA
  • TAI LE LETTER TONE-3
  • SPACE
  • LEFT PARENTHESIS
  • GEORGIAN CAPITAL LETTER CHAR
  • IDEOGRAPHIC ANNOTATION LINKING MARK
  • LISU LETTER SHA
  • SPACE
  • CHEROKEE LETTER TLE
  • EULER CONSTANT
  • CHEROKEE LETTER I
  • YI RADICAL ZZIET
  • VAI SYLLABLE TO
  • CANADIAN SYLLABICS TLHI
  • VAI SYLLABLE POO
  • RIGHT PARENTHESIS
  • FULL STOP
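Turning names like these back into characters can be done mechanically. Here is a minimal sketch, assuming Python's standard unicodedata module and that its character database knows every name involved. Only the first eight entries of the list are used, and the helper name names_to_text is made up for this example.

```python
import unicodedata

# First eight entries of the list above.
first_entries = [
    "ARMENIAN CAPITAL LETTER AYB",
    "HANGZHOU NUMERAL TEN",
    "MALAYALAM LETTER TTHA",
    "DIGIT SEVEN",
    "CANADIAN SYLLABICS TA",
    "LATIN LETTER BILABIAL CLICK",
    "COMMA",
    "SPACE",
]


def names_to_text(names):
    """Look up each Unicode character name and join the characters.
    unicodedata.lookup raises KeyError for a name it does not know."""
    return "".join(unicodedata.lookup(name) for name in names)


text = names_to_text(first_entries)
print(text)
# Show which code points were actually used for the lookalikes.
print(" ".join(f"U+{ord(ch):04X}" for ch in text))
```

How readable the first printed line is depends on the fonts available, which is part of the point of the lookalike characters.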

Writing out all of those characters gives:

or more readably:

U+07C0, U+0AE6, U+0CE6, U+0F20, U+17E0, and U+ABF0, for example (six letters).

The characters at the indicated code points are NKO DIGIT ZERO, GUJARATI DIGIT ZERO, KANNADA DIGIT ZERO, TIBETAN DIGIT ZERO, KHMER DIGIT ZERO, and MEETEI MAYEK DIGIT ZERO, so the answer that describes them in six letters is ZEROES (not zeros, which is also a valid plural but has only five letters).
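The six code points can be checked with a short script. This is a sketch that again assumes Python's standard unicodedata module; the helper name describe is made up for this example.

```python
import unicodedata

code_points = [0x07C0, 0x0AE6, 0x0CE6, 0x0F20, 0x17E0, 0xABF0]


def describe(cps):
    """Return (label, character name, digit value) for each code point."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)), unicodedata.digit(chr(cp)))
        for cp in cps
    ]


for label, name, value in describe(code_points):
    print(label, name, value)
```

Every line printed ends in DIGIT ZERO followed by the digit value 0.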