Monday, June 12, 2017

Twin Peaks s03e05 Spoiler Recap

The title of this episode is "Vegas, Baby."

Not the best episode so far.  The dangerous-Agent-Cooper-in-jail story line has moved along an inch or two, when Cooper uses his one phone call to somehow (with touch tones alone) hack into the jail's alarm system.  We see the incapacitated-Agent-Cooper-as-Dougie at work (he's in Life Insurance), but it all seems for comic effect; story line barely moved at all.

Most of the rest of the episode is devoted to short vignettes introducing new characters with new threads, which we know from Lynch's Twin Peaks work in general (especially season 2) might be relevant, or might be false leads.  For some reason Lynch likes producing that kind of fatigue in his audience.

Specifically:  we meet a young couple who is sponging off of Shelly, who still works at the Double R Diner ("NOW, Shelly"...that Shelly), a misogynist/drug dealer who hangs out at the Bang! Bang! bar, and a military career gal who's going to get the "Area 52" branch of the military involved in the ongoing murder investigation in South Dakota.

There is one bright spot in the Dougie story.  Although he has a long way to go to recovery, the incapacitated-Agent-Cooper-as-Dougie shows momentary sparks of recognition.  Every now and then someone says a word or phrase that strikes a chord in Dougie's FBI self, such as "case file":  he stops, repeats the word, and seems to think about it.  The best one is the effect the word "coffee" has on him; that really gets a rise!  He actually gets his hands on a cup of Starbucks and suckles it like it's a baby bottle containing life itself.

The brightest spot in the episode is our learning of what Dr. Lawrence Jacoby has been doing with himself.  We've seen him receiving a shipment of standard hardware store shovels via UPS.  Later, he painstakingly spraypainted them gold.  And now we see that he hosts, under the name "Dr. Amp", a periodic video/podcast show about conspiracy theories, the evils of government, and the poisoning of our environment by multinational corporations.  He's got a great audience:  Nadine, the eye-patched wife of Big Ed Hurley, watches.  And Jerry Horne, the ne'er-do-well younger brother of Great Northern Hotel owner Benjamin Horne, tokes up while listening.

And how does Dr. Amp fund his show?  He offers his audience a solution: they have to "dig themselves out of the shit."  He even shows himself shoveling himself out of a waist-deep pit of brown muck, with what else but his $29.99 Gold Shit-Digging Shovel, available by mail, Order Now!  By God, Lynch still has the comedy touch.

Sunday, June 11, 2017

Twin Peaks s03e04 Spoiler Recap

There's a Lynch humor piece in this episode.  You don't laugh at it.  It's only funny in the way some SNL skits are badly acted but hilarious in their premise.

Back in Twin Peaks, the sheriff's station still has klutzy, whiny-voiced Lucy working the front desk, and hapless (and useless as ever) deputy Andy Brennan, sporting an absurdly high cowlick, along with a middle-aged paunch.  Not being terribly good actors, and lacking their youthful charm of the '90s, their return hasn't been particularly delightful.  The new element, though, is that they have a son named Wally Brando.  In fact, in this episode they are just receiving the news that Wally Brando (always mentioned by full name) has just arrived in town.  You see, Wally Brando is a soul of the road.  He rides his motorcycle hither and yon following his heart on an ongoing discovery of the American spirit.  Lucy and Andy are proud of him for it the same way that parents would be of an Olympic medal or Nobel prize winner.

Joke 1 is that Wally Brando is decked out in a perfect replica of Marlon Brando's biker from The Wild One (1953), and joke 2 is that he's played by Michael Cera.  Joke 3, the cruelest of all, is that Lynch graces us with a five minute (feels like twenty) soliloquy by Wally Brando, motionless, with Andy and Lucy looking on admiringly from his side, a seriously-delivered but cliché- and pablum-laden discourse on the truth and goodness of the great American highway.

Like I say, no one's laughing yet it's funny as hell.  Michael Cera doesn't even ride the motorcycle, just sits on it.

The rest of the episode is similarly light in flavor, compared to the macabre dimension-travelling of the previous.  Agent Cooper's earthly persona is still mentally incapacitated, and having replaced the Cooper lookalike "Dougie", he is still assumed to be Dougie by those around him.  He pretty much just bumbles around like Peter Sellers' Chance the gardener from Being There.

The dangerous hood Cooper character is now in jail, having been found with firearms and drugs in his wrecked car.  This turns out to be the Cooper that the FBI locates, not Dougie.  They have an interview with him, in which the jailed Cooper does a very leaden impression of the real Agent Dale Cooper, with obviously rehearsed lines, which fools the FBI not one bit.  In the meantime, we're given a major reveal about who this Cooper is (although it's been pretty strongly hinted).  It's Bob.  

He is Bob!  Eager for fun.  He wears a smile.  Everybody run.   
- One-Armed Man, Season 1   

Somewhere back in Twin Peaks history there was a scene where Cooper stared in a mirror and suddenly thrust his forehead right into the glass.  In the cracked, bloodied mirror, his reflection is Bob.  We also see the two of them together in the red room, laughing maniacally together like satanic frat brothers.  So the conclusion is that Bob took over Cooper's body, trapped Cooper's real identity in the Red Room, and has been running amok for the last 25 years.  Something doesn't line up, though; Bob isn't inhabiting a being for evil (such as killing one's own Prom Queen daughter), he's more of a hit man for organized crime.

Andy Brennan, Wally Brando and Lucy Brennan

Twin Peaks s03e03 Spoiler Recap

In episode 3, there are actually three Kyle MacLachlan characters.  Number one, Dale Cooper, FBI agent wearing his familiar black suit, is trapped in the other-worldly red room.  Number two is the hard-boiled criminal character, also referred to as Cooper, and involved in FBI work.  Number three is a kind of duncey character named Dougie who lives in Las Vegas and sees prostitutes, but we've only seen him just lately, and briefly.

The entire episode (1 hour) is dedicated to showing the movement of characters in and out of their dimensions, and how they swap bodies.  We know from other films that Lynch loves to move the camera into other worlds.  The camera will follow the sound coming from a telephone earpiece, and go right through the holes into the electronics.  Or the camera will travel through walls to show the spaces in between rooms, with dust, drywall matter, and rodents.

So here Lynch seems to have put quite a lot of thought into the experience of being in a strange dimension.  The black-suited Cooper leaves the red room and enters into a creepy limbo.  There is a woman with missing eyes who can't speak, in a room with a strange steampunk apparatus in a wall panel, and a metal door somebody is heavily pounding on from the outside.  They climb up a ladder, open a hatch and climb out onto a platform suspended in outer space, very much like Le Petit Prince standing on one of those little planets, but more terrifying.  The woman falls off the platform and presumably falls for eternity.  Eventually Cooper climbs back down, and a woman in red by a fireplace advises that he must go NOW.  Cooper ends up passing through the steampunk apparatus, painfully...and leaving behind his shoes.

The other Cooper, and Dougie, back on earth, experience extreme vomiting.  Dougie shrinks to nothing and black-suited Cooper comes out of a wall socket in the form of a black gas, eventually appearing in solid form lying on the floor.  His faculties haven't recovered, and he is a Rain Man-style imbecile.  He ends up in a casino, and starting with a $5 bill, wins enormous payouts from every slot machine he tries.  He eventually gets picked up by the authorities, and he's reported to FBI headquarters where Gordon Cole, the hard-of-hearing senior FBI guy played by David Lynch himself, celebrates the news of Cooper's final return.

The tough-guy Cooper ends up in a car wreck, and we don't see what's become of him physically.

Twin Peaks returns! Season 3, episodes 1-2 spoiler recap

I just finished episode 1 of the new TP, which is two hours of David Lynch amazingness. It's real "Holy Fuck" stuff. However it is not a return to the goofy fun of the original TP. In fact, of the many story lines, few have any evident connection to the Laura Palmer story. The short bits that have original cast members are sometimes weak and perfunctory, as though DL is using a TP season three as a thinly veiled opportunity for doing new work that interests him more. But in other ways not so much...too soon to judge. In sum, this is very possibly one of DL's great works and deserves to be taken seriously. But it's also true that if you're not up for a lot of disturbing horror material (masterfully done), TP season 3 may not be for you.

I'll say this about the "Dale Cooper" in this episode. He's not just a linear extrapolation of the quirky, one dimensional FBI man of years ago. He's more the product of someone who has had dramatic life changes and choices over the course of 25 years, i.e. like real life. And in keeping with Kyle MacLachlan's acting weight, he is carrying a lot of the film.

Log Lady appears to have been filmed "in time" before the actor's demise, but her illness is evident in her scenes.

Some actresses with girlish charm in their twenties keep it on into their fifties, but Sheryl Lee (Laura Palmer) doesn't seem to be one of them. She and Dale Cooper are in dream sequences in the red room once again, but they don't work as well.

So the thing about Dale Cooper (or whoever Kyle M. is supposed to be...not clear) is he's now a real "heavy", a very dangerous guy, who has various shady thugs working for him and "gun moll" style girlfriends, all of whom are under constant threat and intimidation by him. Picture below.  He wears a leather jacket and long hair, speaks in a low-pitched gruff voice, has his scenes in seedy hotels and restaurants and rustic lodges, and has a bit of a southern air to him...someone joked that it's "Twin Peaks meets Duck Dynasty."   So he seems like a criminal, but at the end of episode one he gets on a computer and logs into an FBI portal, so he must actually be in some kind of deep cover.

Log Lady is sad to watch, as you can see from the pic below, she really was very ill when they shot her scenes.  She's holding the log in that pic.  She is instructing "Hawk" (Sheriff Truman's spiritual American Indian deputy) with clues (i.e. what her log is telling her) about where to find things, over the phone.

Anyone in our age group has gotta detest the age-defying ability of the actor who plays James Hurley!  23 yrs old then, 50 now, he looks like a hipster single guy in his 30's.

One more thing, there are lots of clips from the original season 1 series, and the very opening clip is original footage of Laura Palmer in the red room saying "I'll see you again in 25 years."  Yes, that really happened!

They also re-enact, in their current characters, a famous bit of Red Room dialog from season 1:
Cooper:  "Are you Laura Palmer?"
Laura:  "I feel like I know her, but... (in agony) sometimes my arms bend back."

Dale Cooper, the heavy

Log Lady, passing on clues from her Log to Deputy Hawk

Log Lady as we knew and loved

James Hurley, defying age

 Deputy Hawk

 Laura Palmer

Thursday, February 23, 2017

The Washing Machine

I recently started collecting Boogie Woogie tracks off of YouTube and learned some interesting stuff along the way.

I started out thinking about BW as being some early form of rock and got excited about finding songs from 1930's and earlier that had BW traits, thinking they were extraordinary gems.  Then I found out (see Wikipedia for this) that BW goes back to even before the twenties, and indeed, is as old as jazz itself.

Next I learned that BW became kind of a cultural virus around 1947, and virtually everyone was recording at least one BW song.  Almost every song from that period has 'Boogie' in the title, usually "XXX's Boogie".  All the hitmakers of the day have BW songs, Woody Herman, Count Basie, Tommy Dorsey, etc.  Check out YouTube and you'll find scores of BW playlists.

But it gets more interesting than that.  BW began in east Texas logging camps and developed organically with the expansion of the railroad.  Ground zero was (is) Marshall, TX, where they celebrate that heritage to this day.  BW style in Marshall remained unchanged, but at each train stop (logging camp) the style became more evolved.  You could estimate the distance from Marshall by how evolved the BW style sounded in the bar you were in.  BW pianists spent their years traveling from camp to camp on the RR.  And to state the obvious, sonically, BW reflected the sound of the engines and clatter of trains.

Pinning down the beginning of rock is nigh impossible.  For some odd reason there are people who want to say that Ike Turner's 'Rocket 88' is the first rock song, apparently due more to its popularity than its innovation.  Infectious dance beat and riotous swing?  Hell, the very earliest BW of the 20's has that.  Sexual suggestivity?  BW lyrics sometimes exhort the women to shake their asses.  Did BW walking basslines get replaced with something else?  Nope, early rock by Elvis still has standard BW basslines.  Electric guitars?  Now you're talking.  I would say that the earliest BW having an electric guitar is a good candidate for pioneer rock. That brings Western Swing (Bob Wills, Spade Cooley) into the fold.

The most insightful statement I've ever heard about the origins of Rock was by Robbie Robertson of The Band, in the film The Last Waltz.  It's Louisiana gumbo and it's blues...all of it.  Its genesis was in religious revival meetings and county fairs, where late into the night after the community leaders had gone home, you'd have musicians throwing these styles all together in unorthodox ways, and doing crazy stage antics (e.g. Chuck Berry's duck walk).

Something I love in the earliest rock is a particular sonic quality in the rhythm section that I call the Washing Machine.  The recipe seems to be double bass (acoustic), piano and drums with brush sticks.  When they're playing really tight, the three fuse into a single rhythm-machine timbre.  A wacky, anthropomorphized Washing Machine drawn by R. Crumb comes to mind.  You hear it on a lot of early Elvis tracks.

Thursday, July 8, 2010

Testing to watch this message go out to twitter, facebook, buzz, gtalk, linkedin, brightkite, and my two blogs!

Friday, May 15, 2009

Solving Character Set Problems: ASCII, ISO-8859, WinLatin1 and UTF-8

Character Sets and Feeds

I work for a company which receives huge incoming text feeds from our clients, which we filter, transform and massage to go out to the major search engines. Much of the work of cleaning incoming feeds consists of replacing undesirable characters (e.g. fancy foreign characters and symbols) that come from the client. For search engines we have to provide the lowest common denominator of character type, plain ASCII. There are many ways these "bad" characters creep into incoming feeds. Some are mainstream ISO 8859 or UTF-8 characters that are accepted by most display systems, but are still undesirable for search engines. Less likely but still possible are errors caused by the use of a legacy non-printing control character. By far the most common offender comes from violation of ISO 8859 conventions that have come into practice, known as "Extended Ascii" (of which Microsoft's WinLatin1 table is the most notorious offender). This page explains the major character set types, along with a little history, and the problems that they cause to feeds.

7-bit Ascii (ISO 646)

The original ASCII table, a 128-value (7-bit) character set, is formally known as ISO 646. In fact it remains the only real "Ascii table" (more on this later). This portion has survived unchanged as the beginning portion of all major character sets (even the ones tampered with by Microsoft).

Below is the printable portion of ISO 646, characters 32 through 127 (the end of the table).

Below is the non-visible characters section of ISO 646, the first 32 characters of the table. Note that only a handful are in modern use; the rest are historic relics.

In feeds from our clients, mistaken input of legacy control characters is rare, although recently one client reported seeing 0x1D ("Group separator GS Left Arrow") in a feed.
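A stray control character like that is easy to scan for. Here is a minimal sketch in Python; the function name `find_control_bytes` and the choice to tolerate tab, line feed and carriage return are assumptions made for illustration, not a description of any real clean script.

```python
# Hypothetical helper: report legacy control characters in a feed.
# Assumption: tab, line feed and carriage return are the only control
# characters (values 0-31) a text feed is expected to contain.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}

def find_control_bytes(data: bytes):
    """Return (offset, byte value) pairs for unexpected control characters."""
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if value < 0x20 and value not in ALLOWED_CONTROLS
    ]

sample = b"Blue widget\x1d large\tin stock\n"
for offset, value in find_control_bytes(sample):
    print(f"offset {offset}: 0x{value:02X}")
```

On this sample the only hit is the 0x1D group separator; the tab and newline pass through.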

8-bit Character Tables and ISO 8859

Ascii remains to this day, strictly speaking, a 7-bit table of 128 characters and no more. However, since in modern times characters are rendered with 8-bit bytes, people have wanted to take advantage of what is available on their machine in the area of that top bit. Folklore, common practice, and whatever happens to be on the machine one is using, have misled many into thinking some universal 8-bit ascii table exists. But in reality the 8th bit portion has been a playground for whatever people wanted it to be.

To answer this havoc, ISO 8859 defined a variety of 8-bit character tables to cover most of the world's needs. For every ISO 8859 table, the 7-bit portion is ISO 646 Ascii. The best-known of the ISO 8859 tables, at least among the English speaking and Western European world, is ISO 8859-1, also known as "Latin 1." To some people, this is "the" Ascii table. The eight bit portion is shown below.

Note that there is an unused portion (greyed out) of the table between characters 128 and 159. Take special note of this; as you'll see below this special area becomes a huge area of controversy (Extended Ascii).

Other ascii tables in the ISO 8859 collection include Baltic, Cyrillic, Arabic, Greek, Hebrew and Turkish character sets, allowing each country to adopt the appropriate table, and software manufacturers to write compliant products. (You can see the various ISO 8859 tables here.)

At my company, a feed using ISO 8859 8-bit characters will need to be cleaned before going out to search engines. If they are ISO 8859-1 (Latin1) characters, the mapping to a substitute character is well-known and easily fixed. We do this with Perl scripts and a simple mapping table. Another possible problem would be if we had a client that is using some ISO-8859 table other than Latin1 (for instance, a client from Scandinavia might use the Baltic table, ISO 8859-4). In that case we would have to create a new mapping table. Worse yet, files cannot identify themselves as being any particular ISO-8859 subversion, so if a client doesn't announce the table they are using, we would only find that out by seeing unusual errors. For example, seeing frequent use of a copyright symbol © in unexpected places would be a clue that some non Latin1 ISO-8859 table was in use.
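The mapping-table idea is simple enough to sketch. The example below is in Python for brevity rather than Perl, and everything in it (the name `clean_latin1`, the handful of table entries, the choice to drop unmapped characters) is an assumption made up for illustration, not the actual production table.

```python
# Hypothetical substitution table: Latin1 byte value -> plain ASCII replacement.
# The entries are illustrative only.
LATIN1_TO_ASCII = {
    0xA9: "(c)",   # copyright sign
    0xAE: "(R)",   # registered sign
    0xE9: "e",     # e with acute accent
    0xF1: "n",     # n with tilde
    0xFC: "u",     # u with diaeresis
}

def clean_latin1(data: bytes, default: str = "") -> str:
    """Keep 7-bit bytes as-is; replace each 8-bit byte with its ASCII
    substitute, or with `default` when the table has no entry."""
    out = []
    for value in data:
        if value < 0x80:
            out.append(chr(value))
        else:
            out.append(LATIN1_TO_ASCII.get(value, default))
    return "".join(out)

print(clean_latin1(b"Caf\xe9 Espa\xf1a \xa9 2009"))
```

A client using a different ISO 8859 table would need a different dictionary, which is exactly the maintenance problem described above.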

It's not clear why ISO 8859 prohibited the area between characters 128 and 159. In one table I saw, each of the codes was described as being a control character of some sort.

'Extended Ascii' and Windows 1252

Microsoft couldn't resist using that empty area of the ISO-8859 tables, so they went and created their own 'extended version' of the ISO 8859 table set; you can see them all here. Their version of ISO 8859's Latin-1 table goes under the name WinLatin1 or Windows 1252. The 8th-bit portion of this table is shown below, with the characters Microsoft inserted into ISO 8859's empty area marked in yellow.

Due to the popularity of the Microsoft Office family of software products, WinLatin1 has achieved an unfortunate hegemony in the computer world. In the competition for the elusive "8-bit Ascii" table, WinLatin1 is ISO 8859-1's chief competitor. However, a sizable portion of the computer world (Macs and Unix flavored systems) have settled on the real ISO 8859 standard, so problems occur when people paste, email or otherwise transmit text made from a Microsoft product to be displayed on a Unix or Mac box. When you see question marks or 'ascii garbage' in a file when you're not expecting it, you have probably been WinLatin1-rolled.

The term "Extended Ascii" has become the name to describe the format of any file which contains a mixture of ISO 8859 plus characters in the forbidden zone of bytes 128-159.

So what did Microsoft do with that precious block of 32 characters it stole from ISO 8859? Four characters comprise the notorious MS-Word 'smart quotes' (single and double: ‘, ’, “ and ”). On the other hand they were remarkably prescient in including the Euro currency symbol (€) years before the currency was adopted. There are a number of punctuation and editorial marks or symbols, such as daggers († and ‡), the ellipsis (…), the list bullet (•) and several others that perhaps are used in European languages (ˆ, ‹, ›, and „). Many choices are totally inscrutable, however. Why define the comma-lookalike "‚"? Of what use can the few foreign alphabetical characters of Œ, œ (used in old German and old Romance languages), Ÿ (used in Greek transcription and rarely in French), Š, š, Ž and ž (Estonian, Finnish and Czech characters) be?

Use of the "forbidden zone" of ISO 8859 accounts for most of the cleanup tasks in clean scripts at my company. The trademark sign (™) is a favorite of many of our clients, since it is unavailable in ISO 8859-1. The list bullet and daggers are also occasionally found in feeds. Most likely the source of these characters is from client use of Windows Office software.
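To see which forbidden-zone bytes a feed contains, and what Microsoft meant by them, one can lean on Python's built-in `cp1252` codec. This is only a sketch; the function name `forbidden_zone_report` and the sample bytes are invented for the example.

```python
def forbidden_zone_report(data: bytes):
    """List bytes in 0x80-0x9F along with the character that
    Windows-1252 (WinLatin1) assigns to each of them."""
    report = []
    for offset, value in enumerate(data):
        if 0x80 <= value <= 0x9F:
            # A few values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned in
            # Python's cp1252 codec; errors="replace" turns them into U+FFFD.
            char = bytes([value]).decode("cp1252", errors="replace")
            report.append((offset, value, char))
    return report

# Trademark, smart double quotes, list bullet and Euro sign as WinLatin1 bytes.
sample = b"Acme\x99 \x93Deluxe\x94 \x95 \x80 9"
for offset, value, char in forbidden_zone_report(sample):
    print(f"offset {offset}: 0x{value:02X} -> U+{ord(char):04X}")
```

Ordinary Latin1 bytes such as 0xE9 are ignored here, since they sit outside the 128-159 range.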

UCS/Unicode - ISO 10646

The next evolutionary step after ISO 8859 is UCS (Universal Character Set, ISO 10646). It dispenses with the approach of multiple table sets for individual languages in favor of a single giant table of all possible characters. Its table size of 2^31 (2,147,483,648) characters encompasses the character sets of virtually all the languages of the world. It goes by the more common name of Unicode, after the name of the consortium that merged with UCS.

The first 256 characters of Unicode are completely backward compatible to ISO-8859-1. Unicode's first 128 characters are the classic ISO 646 7-bit Ascii set, while Unicode characters 128-255 are the top half of ISO-8859-1 Latin1.
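This compatibility is easy to check with Python's `latin-1` codec, as in the short sketch below. One caveat: Python's codec also decodes bytes 128-159, mapping them to the control-character code points with the same numbers, so it does not enforce the prohibition described earlier.

```python
all_bytes = bytes(range(256))

# Decoding as Latin1 gives each byte the Unicode number equal to its value.
decoded = all_bytes.decode("latin-1")
assert [ord(ch) for ch in decoded] == list(range(256))

# The 7-bit half is at the same time plain ASCII.
assert all_bytes[:128].decode("ascii") == decoded[:128]

# Example: byte 0xE6 is the ae ligature, Unicode U+00E6.
print(f"U+{ord(bytes([0xE6]).decode('latin-1')):04X}")
```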

File Formats vs. Byte Sequences

With a character set of Unicode's size, single byte sequences with a maximum numeric range of 0-255 can obviously no longer be the only way in which a file stores text. UCS has spawned two different byte-sequence conventions, UCS-2 and UCS-4. UCS-2 files have two-byte characters, and UCS-4 files have 4-byte characters.


UTF-8 is a byte-based sequence convention for representing Unicode. It can have 1-, 2-, 3-, 4-, 5- or 6-byte long characters, changing when and where it needs to. It is an efficient solution for English and other Western European languages that spend much of their time in the first 128, one-byte wide unicode characters. In fact, English UTF-8 files will often consist of nothing but one-byte sequences, no different than an Ascii file. Because of this UTF-8 is nicely backwards compatible to ASCII, meaning a file can be UTF-8 but still work on older systems that support ISO-8859 or earlier standards. Even when a UTF-8 file does use larger width Unicode characters, it is still a "one octet encoding unit" (single byte) encoding standard, as opposed to UCS-2 and UCS-4 which are two- and four-octet encoding unit standards.
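The variable width is easy to observe. A small sketch, with the helper name `utf8_bytes` invented for the example, showing characters that take one to four bytes:

```python
def utf8_bytes(ch: str) -> str:
    """Return the UTF-8 bytes of one character as spaced hex."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))

# A plain letter, the copyright sign, a Japanese character, and an emoji.
for ch in ["A", "\u00a9", "\u65e5", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf8_bytes(ch)}")
```

The plain letter comes out as the same single byte it would be in an Ascii file, which is the backward compatibility described above.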

UTF-8 should always be spelled as shown. Referring to UTF-8 as utf-8, utf8, UTF8, etc. is considered bad form.

Unlike UCS-2 and UCS-4, which represent the higher order values of Unicode directly in binary with 16-bit and 32-bit data units, UTF-8 remains a "one octet encoding unit" (single byte). How does it represent Unicode values higher than 255? It encodes them with byte sequences of variable length. The values for these encoding bytes all lie within the range 128-255, perhaps not coincidentally the high bit portion of the 8-bit byte. In UTF-8, none of the values in the 128-255 range represent an actual character. This range is broken down to sub-ranges of bytes which are designated as "signifiers" for the beginning of 2-, 3-, 4-, 5- or 6-byte sequences. See the table below. Bytes in the red region, hexadecimal C2 through DF, are used to announce the beginning of a two-byte sequence; bytes in the blue region a three-byte sequence; and so on, to the orange region for 6-byte sequences. The characters in the purple region are the data values that can follow the signifiers. By using just these values, the complete set of Unicode characters that comprise two or more bytes can be constructed.

The greyed out areas are prohibited values in UTF-8 (although there is an exception that will be discussed later).

For single byte characters, UTF-8 simply uses the first 128 Unicode characters, the ones that correspond exactly to classic Ascii 7-bit ISO 646. No special signifier announces a single byte sequence.

An easier way to understand the multi-byte signifier approach is by looking at the bit patterns in their binary values, as shown below. Two-byte sequences can be announced with any byte having 110 as the top three bits; three-byte sequences can be announced with any byte having 1110 as the top four bits; and so on.
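The bit-pattern test translates almost directly into code. The following is a sketch only: `sequence_length` is a made-up name, and it looks purely at the top bits, so it does not filter out the prohibited values such as 0xC0 and 0xC1. The 5- and 6-byte patterns follow the original design described here; the current UTF-8 standard (RFC 3629) stops at four bytes.

```python
def sequence_length(byte: int) -> int:
    """Return the sequence length a byte announces, judging only by its
    top bits. Returns 0 for bytes that cannot start a sequence."""
    if byte & 0b10000000 == 0b00000000:
        return 1   # 0xxxxxxx: single byte, classic 7-bit Ascii
    if byte & 0b11000000 == 0b10000000:
        return 0   # 10xxxxxx: data byte that follows a signifier
    if byte & 0b11100000 == 0b11000000:
        return 2   # 110xxxxx
    if byte & 0b11110000 == 0b11100000:
        return 3   # 1110xxxx
    if byte & 0b11111000 == 0b11110000:
        return 4   # 11110xxx
    if byte & 0b11111100 == 0b11111000:
        return 5   # 111110xx (original design only)
    if byte & 0b11111110 == 0b11111100:
        return 6   # 1111110x (original design only)
    return 0       # 0xFE and 0xFF never appear in UTF-8

for byte in (0x41, 0xC2, 0xE6, 0xF0, 0x97):
    print(f"0x{byte:02X} ({byte:08b}) -> {sequence_length(byte)}")
```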

Cleaning a file of any UTF-8 multibyte sequences is complicated by the fact that you have to "look ahead" in the byte stream to determine whether a byte is the start of a multibyte UTF-8 sequence, or an isolated ISO 8859 character. When a UTF-8 sequence is found, the n-byte sequence must be replaced with a single ascii character (if the appropriate substitute is known), or simply removed (if no substitute is known). In practice, our clients use a very small repertoire of symbols that are easily replaced (trademark, registered, bullet, etc.).

If one wanted to extract the Unicode sequence numbers that fall in the above 255 range from a UTF-8 file, reverse computation would be required, since the Unicode sequence numbers above 255 are encoded.
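That reverse computation amounts to stripping the marker bits and concatenating what is left. Here is a minimal sketch for 2-, 3- and 4-byte sequences; the name `decode_code_point` is invented, and the function does not check for overlong encodings or other fine points a real decoder must handle.

```python
def decode_code_point(seq: bytes) -> int:
    """Recover the Unicode number from one 2-, 3- or 4-byte UTF-8 sequence."""
    lead = seq[0]
    if lead & 0xE0 == 0xC0:        # 110xxxxx: 5 payload bits
        length, value = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:      # 1110xxxx: 4 payload bits
        length, value = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:      # 11110xxx: 3 payload bits
        length, value = 4, lead & 0x07
    else:
        raise ValueError(f"0x{lead:02X} does not start a multi-byte sequence")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this signifier")
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:    # every following byte must be 10xxxxxx
            raise ValueError(f"0x{byte:02X} is not a valid following byte")
        value = (value << 6) | (byte & 0x3F)   # append 6 payload bits
    return value

print(f"U+{decode_code_point(bytes([0xC2, 0xA9])):04X}")
print(f"U+{decode_code_point(bytes([0xE6, 0x97, 0xA5])):04X}")
```

In day-to-day work, `bytes.decode("utf-8")` does all of this (and the validation this sketch skips).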


Working UTF-8 Sequences

Copyright symbol: 0xC2 0xA9

Not-equals sign: 0xE2 0x89 0xA0 (Unicode char U+2260)

Korean text: 0xED 0x95 0x9C 0xEA 0xB5 0xAD 0xEC 0x96 0xB4 (Unicode chars U+D55C U+AD6D U+C5B4)

Japanese text: 0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E (Unicode chars U+65E5 U+672C U+8A9E)

Working ISO 8859-1 (Latin1) Sequences

0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, æ, Ø, Æ)

Mangled Sequences

0x31 0x32 0x33 0x99 (1,2,3, winlatin1 trademark)

0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, three Latin1 chars, one winlatin1 char)

0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E 0xE6 0xE8 0x31 (UTF-8 Japanese text followed by misused signifiers)
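These sequences can be checked against Python's built-in decoders. A quick sketch, with `is_valid_utf8` as a made-up helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """True when the whole byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

korean = bytes([0xED, 0x95, 0x9C, 0xEA, 0xB5, 0xAD, 0xEC, 0x96, 0xB4])
japanese = bytes([0xE6, 0x97, 0xA5, 0xE6, 0x9C, 0xAC, 0xE8, 0xAA, 0x9E])
latin1 = bytes([0x31, 0x32, 0x33, 0xE6, 0xD8, 0xC6])
mangled = japanese + bytes([0xE6, 0xE8, 0x31])   # misused signifiers appended

print(korean.decode("utf-8"), japanese.decode("utf-8"))
print(latin1.decode("latin-1"))
print(is_valid_utf8(latin1), is_valid_utf8(mangled))
```

The Latin1 bytes fail as UTF-8 because 0xE6 announces a three-byte sequence and the bytes after it are not valid followers; the mangled example fails for the same reason at its tail.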

How Do You Detect a File's Encoding?

None of the character tables discussed so far (with a possible exception of UTF-8, discussed below) have any convention for a self-identifying start block or header. It has to be done based on some sort of pattern. In Unix, the 'file' utility (or 'type' on some versions of *nix) is a pretty good tool for identifying the character table type. Here's the pattern that 'file' seems to use:

When: | Unix 'file' Reports:
All characters are 7 bit | "ASCII text"
ISO 8859 8-bit characters (with or without 7-bit ascii mixed in) | "ISO-8859 text"
Any use of bytes in the ISO-8859 forbidden range (0x80-0x9F) which aren't part of valid UTF-8 sequences | "Non-ISO extended-ASCII text" (This is how WinLatin1 files will be identified.)
Consistent correct use of multi-byte UTF-8 sequences | "UTF-8 Unicode text"

How 'file' handles some edge cases:

When: | Unix 'file' Reports:
All 7-bit, but containing odd combinations of control characters from the 0x00 to 0x1F range | "Data"
Correct UTF-8 code mixed with any 8-bit ISO 8859 characters, forbidden or non forbidden | "Non-ISO extended-ASCII text"

What 'file' cannot do is read your intentions. It makes the simplest judgment call it can. For example:

  1. A file of all 7-bit characters is always "ASCII text"...never mind that you think of the file as being UTF-8 or ISO 8859 compliant (both of which are also true).
  2. ISO 8859 detection is based on byte values alone, not syntactic patterns; it can't tell whether you want it rendered as Latin1, Baltic, Cyrillic, etc. It reports ISO 8859, not ISO 8859-1, or ISO-8859-2, etc. Telling them apart would take some fairly elaborate dictionary lookups from multiple languages to accomplish.
  3. Any UTF-8 multi-byte sequence not containing bytes 0x80 to 0x9F is hypothetically correct ISO 8859. Because these are syntactically rare, 'file' chooses to identify this as UTF-8 (assuming everything else about the file is UTF-8 compliant). It cannot read your mind that you intended a series of weird ISO 8859 characters. For example you might want the sequence 0xEA 0xB5 0xAD to read ê, µ and a soft hyphen in Latin1, but since it is a valid UTF-8 sequence (U+AD6D, the second character of the Korean text above), 'file' will report it as "UTF-8 Unicode text" if everything else about the file is consistent with UTF-8.
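The decision rules described above can be approximated in a few lines. To be clear, this is a sketch of the behavior as observed in this post, not the actual algorithm inside 'file'; the name `classify` is invented, and the "Data" and mixed-content edge cases are not attempted.

```python
def classify(data: bytes) -> str:
    """Rough imitation of the observed 'file' verdicts, by byte values only."""
    if all(b < 0x80 for b in data):
        return "ASCII text"
    try:
        # Consistent, correct multi-byte sequences throughout.
        data.decode("utf-8")
        return "UTF-8 Unicode text"
    except UnicodeDecodeError:
        pass
    if any(0x80 <= b <= 0x9F for b in data):
        # Forbidden-zone bytes that are not part of valid UTF-8.
        return "Non-ISO extended-ASCII text"
    return "ISO-8859 text"

print(classify(b"hello"))
print(classify(bytes([0x31, 0x32, 0x33, 0xE6, 0xD8, 0xC6])))
print(classify(bytes([0x31, 0x32, 0x33, 0x99])))
print(classify(bytes([0xEA, 0xB5, 0xAD])))
```

Like 'file', it cannot read intentions: the last call reports UTF-8 even if the three bytes were meant as Latin1 characters.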

Regarding the potential for confusion between ISO 8859 and UTF-8, this is what others have to say:

  1. "the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length" (UTF-8 RFC).
  2. "The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII [sic] is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence." (Wikipedia UTF-8 page)

If you're dealing with a page you suspect is not rendering with the correct ISO 8859 subtable, both Firefox and Safari give you the option to change the interpretation. In Safari, check the View menu, Text Encoding. In Firefox, check the View menu, Character Encoding.

BOM: A UTF-8 File Format?

There is a convention called the "Byte Order Mark" (aka BOM) which (hypothetically) can be used to self-identify a UTF-8 file. This consists of beginning a file with the sequence 0xEF 0xBB 0xBF. Both Unix 'file' and 'vi' seem to be pretty good at interpreting it. When a file is pure 7-bit ASCII, but preceded by the BOM, it forces 'file' to report it as "UTF-8 Unicode text". An interesting edge case: if a file containing ISO 8859 8-bit characters that cannot be interpreted as UTF-8 multi-byte sequences is preceded by the BOM, neither Unix 'file' nor 'vi' can be fooled: 'file' will report the three BOM characters as "ISO-8859 text", and 'vi' will display the BOM characters (as ï»¿).
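Checking for the BOM in code is a three-byte comparison. A small sketch using Python's `codecs.BOM_UTF8` constant and the `utf-8-sig` codec, both part of the standard library; the helper names are invented for the example.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True when the data begins with 0xEF 0xBB 0xBF."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading BOM, if there was one."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data

with_bom = codecs.BOM_UTF8 + b"plain ascii"
print(has_utf8_bom(with_bom), strip_utf8_bom(with_bom))

# The "utf-8-sig" codec drops a leading BOM while decoding.
print(with_bom.decode("utf-8-sig"))

# Read as Latin1, the three BOM bytes show up as visible characters.
print(codecs.BOM_UTF8.decode("latin-1"))
```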

Other than the BOM, refrain from ever calling ASCII, UTF-8, ISO-8859 or WinLatin1 a "file format." None have a standardized header or starting block. They are more properly called "byte sequence conventions."

Unicode HTML Entities

The entire corpus of Unicode is hypothetically available to browsers. For example the Unicode characters of Japanese that were mentioned earlier, U+65E5 U+672C U+8A9E, should successfully render on your browser when I put &#x65E5 ; &#x672C ; &#x8A9E ; in the HTML. Here it goes: 日 本 語. These work for me on Firefox. In practice, I believe that no browsers are 100% capable of correctly rendering the entire 2^31 Unicode character set via HTML entities.

Question: does the presence of the foreign characters you see above effectively make this page Unicode or UTF-8 format? No. The characters I used to make them were merely ampersand, pound, x, and digits, all 7-bit ASCII characters. HTML Entities are merely instructions to browsers what to do with the characters.

Do we want to use HTML Entities in the cleaned feeds we send to search engines? Do we want to figure out the intent of various WinLatin1 manglings and UTF-8 sequences that come from client feeds and turn them into HTML entities that will render safely on a browser? This is a subject still open to question. Our YSSP search engines can handle them, while our Comparison Shopping Engines (CSE's) do not. Currently we are leaning towards aiming for the lowest common denominator, 7-bit ASCII for all search engines.

WinLatin1's tragic legacy has, unfortunately, been immortalized in HTML entities. HTML entities 128 through 159 will render as their corresponding WinLatin1 characters in many browsers. (Since Unicode entries 128-159 are prohibited, HTML entities are departing from their Unicode basis in this area.) However, you should not consider this reliable; the W3C HTML 4 recommendations exclude these codes from their list.
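Both points about entities can be seen with Python's standard `html.unescape`, which here stands in for what a browser does with entities; it is not a claim about any particular browser. The markup itself is pure 7-bit ASCII, and entity 153 comes back as the WinLatin1 trademark sign.

```python
import html

markup = "&#x65E5;&#x672C;&#x8A9E;"

# The entity markup consists of nothing but 7-bit ASCII characters.
assert all(ord(ch) < 128 for ch in markup)

# Resolving the entities yields the three Japanese characters.
text = html.unescape(markup)
print([f"U+{ord(ch):04X}" for ch in text])

# Entity 153 falls in the 128-159 range; html.unescape maps it the
# WinLatin1 way, to the trademark sign U+2122.
print(f"U+{ord(html.unescape('&#153;')):04X}")
```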

HTML Page Encoding

Can an HTML file be in actual UTF-8 format? "Yes we can." If it contains UTF-8 byte sequences, it gets interpreted by your browser as UTF-8. Here are two nice examples of "UTF-8 Test pages" out there that are just full of exotic Unicode in UTF-8 encoding: Markus Kuhn's sample file, and Frank da Cruz's UTF-8 Sampler. (How well your browser shows the page depends on how good the browser is. For me, Firefox and Safari got most of it, but IE6 failed on large numbers of the characters.)

After you look at one of these pages, try viewing source. What do you see? If your source viewer is UTF-8 capable, you'll see, well, the UTF-8 characters. Surprised? Expected to see the numeric bytes, did you? If you want to see the numeric value of a character under the cursor in vi, press ga. For UTF-8 characters higher than 8 bits you will see large values like 0xFFDD. Or if you want to get really deep into the binary encoding of a file, you could get a binary editor. On Unix, try 'od', or better yet, 'bvi' (you have to download it).

When an HTML page puts this in the <head> block, what does it do?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

For the most part, it simply communicates intent: a tip to the consuming client what to do with the page. It doesn't change the way the bytes go down the wire, or the way your HTML is stored locally on your disk. It's still just bytes. In practice, you can even leave the declaration out and even though the file is chock full of UTF-8 binary sequences, a good browser figures it out anyway. (Note that Frank da Cruz's file above had no meta direction at all.)

How do you compose (i.e. edit and save) an HTML file in UTF-8? Well, if you're using all 7-bit ascii characters you're fine, that's still valid UTF-8. But really we mean UTF-8 with higher level Unicode characters. Well, unless you want to enter the binary directly, you'll need a special editor for that; vi, textpad, etc. won't do it for you. Try Google.

UTF-8 and XML

XML is UTF-8 by default; the following declaration is actually redundant:

<?xml version="1.0" encoding="UTF-8"?>

You'd have to override it with a different value for the encoding attribute to make it non UTF-8. Like the HTML meta tag content charset setting mentioned above, the encoding declaration is essentially a tip to the consumer of the XML what to expect.
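Python's standard `xml.etree.ElementTree` parser shows both behaviors; it is used here only as one example of an XML consumer. The element name `name` and the sample text are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# No declaration at all: the parser assumes UTF-8.
no_declaration = "<name>\u65e5\u672c\u8a9e</name>".encode("utf-8")
print(ET.fromstring(no_declaration).text == "\u65e5\u672c\u8a9e")

# Overriding the default: the same parser reads Latin1 bytes when the
# declaration says so (0xC6 is the AE ligature in Latin1).
latin1_doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<name>\xc6ble</name>"
)
print(ET.fromstring(latin1_doc).text == "\u00c6ble")
```

Here the declaration does more than hint: without it, the lone 0xC6 byte in the second document would be rejected as malformed UTF-8.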

Some Fun Historical Facts

Bob Bemer (1920-2004) is said to be the "father of ASCII". His Ford Explorer's vanity plate read 'ASCII' surrounded by a plate holder reading "Yes! I'm the Father of".

Wikipedia has a picture of one of the earliest published ASCII tables, from 1968.

Ken Thompson, the Unix pioneer, invented UTF-8. "It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat" (source). You can also read the original AT&T UTF-8 Tech Report.