Friday, May 15, 2009

Solving Character Set Problems: ASCII, ISO-8859, WinLatin1 and UTF-8

Character Sets and Feeds

I work for a company which receives huge incoming text feeds from our clients, which we filter, transform and massage to go out to the major search engines. Much of the work of cleaning incoming feeds consists of replacing undesirable characters (e.g. fancy foreign characters and symbols) that come from the client. For search engines we have to provide the lowest common denominator of character type, plain ASCII. There are many ways these "bad" characters creep into incoming feeds. Some are mainstream ISO 8859 or UTF-8 characters that are accepted by most display systems, but are still undesirable for search engines. Less likely but still possible are errors caused by the use of a legacy non-printing control character. By far the most common offender comes from violation of ISO 8859 conventions that have come into practice, known as "Extended Ascii" (of which Microsoft's WinLatin1 table is the most notorious offender). This page explains the major character set types, along with a little history, and the problems that they cause to feeds.

7-bit Ascii (ISO 646)

The original ASCII table, a 128-value (7-bit) character set, is formally known as ISO 646. In fact it remains the only real "Ascii table" (more on this later). This portion has survived unchanged as the beginning portion of all major character sets (even the ones tampered with by Microsoft).

Below is the printable portion of ISO 646, characters 32 through 126 (position 127, the last in the table, is the DEL control character).

Below are the non-visible characters of ISO 646, the first 32 characters of the table. Note that only a handful are in modern use; the rest are historic relics.

In feeds from our clients, mistaken input of legacy control characters is rare, although recently one client reported seeing 0x1D ("Group separator GS Left Arrow") in a feed.

8-bit Character Tables and ISO 8859

Ascii remains to this day, strictly speaking, a 7-bit table of 128 characters and no more. However, since in modern times characters are rendered with 8-bit bytes, people have wanted to take advantage of what is available on their machine in the area of that top bit. Folklore, common practice, and whatever happens to be on the machine one is using have misled many into thinking some universal 8-bit ascii table exists. But in reality the 8th bit portion has been a playground for whatever people wanted it to be.

To answer this havoc, ISO 8859 defined a variety of 8-bit character tables to cover most of the world's needs. For every ISO 8859 table, the 7-bit portion is ISO 646 Ascii. The best-known of the ISO 8859 tables, at least among the English speaking and Western European world, is ISO 8859-1, also known as "Latin 1." To some people, this is "the" Ascii table. The eight bit portion is shown below.

Note that there is an unused portion (greyed out) of the table between characters 128 and 159. Take special note of this; as you'll see below this special area becomes a huge area of controversy (Extended Ascii).

Other tables in the ISO 8859 collection include Baltic, Cyrillic, Arabic, Greek, Hebrew and Turkish character sets, allowing each country to adopt the appropriate table, and software manufacturers to write compliant products. (You can see the various ISO 8859 tables here.)

At my company, a feed using ISO 8859 8-bit characters will need to be cleaned before going out to search engines. If they are ISO 8859-1 (Latin1) characters, the mapping to a substitute character is well-known and easily fixed. We do this with Perl scripts and a simple mapping table. Another possible problem would be a client using some ISO 8859 table other than Latin1 (for instance, a client from Scandinavia might use the Baltic table, ISO 8859-4). In that case we would have to create a new mapping table. Worse yet, files cannot identify themselves as being any particular ISO 8859 variant, so if a client doesn't announce the table they are using, we would only find that out by seeing unusual errors. For example, seeing frequent use of a copyright symbol © in unexpected places would be a clue that some non-Latin1 ISO 8859 table was in use.

It's not clear why ISO 8859 prohibited the area between 128 and 159. In one table I saw, each of the codes was described as being a control character of some sort.
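The mapping-table approach can be sketched roughly like this. It is a Python illustration rather than the Perl we actually run, and the table name LATIN1_TO_ASCII, its handful of entries, and the function clean_latin1 are all made up for the example.

```python
# Hypothetical mapping table: Latin1 byte value -> plain ASCII substitute.
# Only a few entries are shown; a real table would cover 0xA0-0xFF.
LATIN1_TO_ASCII = {
    0xA9: "(c)",   # copyright sign
    0xAE: "(R)",   # registered sign
    0xC6: "AE",    # capital AE ligature
    0xD8: "O",     # capital O with stroke
    0xE6: "ae",    # small ae ligature
    0xE9: "e",     # small e with acute accent
}

def clean_latin1(raw: bytes, unknown: str = "") -> str:
    """Replace every 8-bit byte with its ASCII substitute, or with `unknown`."""
    out = []
    for b in raw:
        if b < 0x80:
            out.append(chr(b))                      # 7-bit ASCII passes through
        else:
            out.append(LATIN1_TO_ASCII.get(b, unknown))
    return "".join(out)

print(clean_latin1(b"123\xe6\xd8\xc6"))   # 123aeOAE
```

A different ISO 8859 table would need its own dictionary, since the same byte value stands for a different character there.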

'Extended Ascii' and Windows 1252

Microsoft couldn't resist using that empty area of the ISO-8859 tables, so they went and created their own 'extended version' of the ISO 8859 table set; you can see them all here. Their version of ISO 8859's Latin-1 table goes under the name WinLatin1 or Windows 1252. The 8th-bit portion of this table is shown below, with the characters Microsoft inserted into ISO 8859's empty area marked in yellow.

Due to the popularity of the Microsoft Office family of software products, WinLatin1 has achieved an unfortunate hegemony in the computer world. In the competition for the elusive "8-bit Ascii" table, WinLatin1 is ISO 8859-1's chief competitor. However, a sizable portion of the computer world (Macs and Unix flavored systems) have settled on the real ISO 8859 standard, so problems occur when people paste, email or otherwise transmit text made from a Microsoft product to be displayed on a Unix or Mac box. When you see question marks or 'ascii garbage' in a file when you're not expecting it, you have probably been WinLatin1-rolled.

The term "Extended Ascii" has become the name to describe the format of any file which contains a mixture of ISO 8859 plus characters in the forbidden zone of bytes 128-159.

So what did Microsoft do with that precious block of 32 characters it stole from ISO 8859? Four characters comprise the notorious MS-Word 'smart quotes' (single and double: ‘, ’, “ and ”). On the other hand they were remarkably prescient in including the Euro currency symbol (€) years before the currency was adopted. There are a number of punctuation and editorial marks or symbols, such as daggers († and ‡), the ellipsis (…), the list bullet (•) and several others that perhaps are used in European languages (ˆ, ‹, ›, and „). Many choices are totally inscrutable, however. Why define the comma-lookalike "‚"? Of what use can the few foreign alphabetical characters of Œ, œ (used in old German and old Romance languages), Ÿ (used in Greek transcription and rarely in French), Š, š, Ž and ž (Estonian, Finnish and Czech characters) be?

Use of the "forbidden zone" of ISO 8859 accounts for most of the cleanup tasks in clean scripts at my company. The trademark sign (™) is a favorite of many of our clients, since it is unavailable in ISO 8859-1. The list bullet and daggers are also occasionally found in feeds. Most likely these characters come from clients' use of Microsoft Office software.
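To see what actually lives in the forbidden zone, here is a small Python sketch that decodes bytes 0x80 through 0x9F with Python's "cp1252" codec (its name for Windows 1252). The SUBSTITUTES table and both function names are hypothetical, picked only for illustration; note that this codec leaves five of the 32 positions undefined.

```python
def winlatin1_extras() -> dict:
    """Map each defined byte in 0x80-0x9F to its Windows-1252 character."""
    extras = {}
    for b in range(0x80, 0xA0):
        try:
            extras[b] = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in this codec
    return extras

# Hypothetical ASCII substitutes for the characters seen most often in feeds.
SUBSTITUTES = {
    "\u2122": "(TM)",                 # trademark sign
    "\u2022": "*",                    # list bullet
    "\u2018": "'", "\u2019": "'",     # single smart quotes
    "\u201c": '"', "\u201d": '"',     # double smart quotes
    "\u2026": "...",                  # ellipsis
}

def clean_winlatin1(raw: bytes) -> str:
    """Decode as Windows-1252, then swap known offenders for ASCII."""
    text = raw.decode("cp1252", errors="replace")
    for fancy, plain in SUBSTITUTES.items():
        text = text.replace(fancy, plain)
    return text

print(clean_winlatin1(b"123\x99"))   # 123(TM)
```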

UCS/Unicode - ISO 10646

The next evolutionary step after ISO 8859 is UCS (Universal Character Set, ISO 10646). It dispenses with the approach of multiple table sets for individual languages in favor of a single giant table of all possible characters. Its table size of 2^31 (2,147,483,648) characters encompasses the character sets of virtually all the languages of the world. It goes by the more common name of Unicode, after the name of the consortium that merged with UCS.

The first 256 characters of Unicode are completely backward compatible with ISO 8859-1. Unicode's first 128 characters are the classic ISO 646 7-bit Ascii set, while Unicode characters 128-255 are the top half of ISO 8859-1 Latin1.
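This compatibility is easy to check for yourself. The snippet below is just a quick Python sanity test, using Python's "latin-1" codec name for ISO 8859-1; the variable names are my own.

```python
# For every byte value 0-255, decoding as ISO 8859-1 yields the Unicode
# character with the same number.
latin1_matches = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# The 7-bit half agrees with ASCII as well.
ascii_matches = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))

print(latin1_matches, ascii_matches)   # True True
```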

File Formats vs. Byte Sequences

With a character set of Unicode's size, single byte sequences with a maximum numeric range of 0-255 can obviously no longer be the only way in which a file stores text. UCS has spawned two different byte-sequence conventions, UCS-2 and UCS-4. UCS-2 files have two-byte characters, and UCS-4 files have four-byte characters.
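Here is a rough Python look at fixed-width storage. Python has no codec literally named UCS-2 or UCS-4, so I'm assuming "utf-16-be" and "utf-32-be" as stand-ins; for the three characters used here they produce the same two-byte and four-byte units.

```python
text = "A\u00e6\u65e5"   # 'A', 'æ', '日': all inside the first 65,536 code points

two_byte = text.encode("utf-16-be")    # stand-in for UCS-2 (an assumption)
four_byte = text.encode("utf-32-be")   # stand-in for UCS-4 (an assumption)

print(two_byte.hex(" "))    # 00 41 00 e6 65 e5
print(four_byte.hex(" "))   # 00 00 00 41 00 00 00 e6 00 00 65 e5
```

Every character costs the same number of bytes, even plain 'A', which is the price of a fixed-width convention.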

UTF-8

UTF-8 is a byte-based sequence convention for representing Unicode. It can have 1-, 2-, 3-, 4-, 5- or 6-byte long characters, changing when and where it needs to. It is an efficient solution for English and other Western European languages that spend much of their time in the first 128, one-byte wide unicode characters. In fact, English UTF-8 files will often consist of nothing but one-byte sequences, no different than an Ascii file. Because of this UTF-8 is nicely backwards compatible to ASCII, meaning a file can be UTF-8 but still work on older systems that support ISO-8859 or earlier standards. Even when a UTF-8 file does use larger width Unicode characters, it is still a "one octet encoding unit" (single byte) encoding standard, as opposed to UCS-2 and UCS-4 which are two- and four-octet encoding unit standards.
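A quick way to watch the width change is to encode a few characters and count bytes. This is only a Python sketch with sample characters of my own choosing; also be aware that Python's encoder only ever produces sequences of one to four bytes, so the 5- and 6-byte forms cannot be demonstrated this way.

```python
samples = {
    "A": "A",                  # U+0041, plain ASCII
    "copyright": "\u00a9",     # U+00A9
    "not equals": "\u2260",    # U+2260
    "G clef": "\U0001d11e",    # U+1D11E, beyond the first 65,536 code points
}
widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(widths)   # {'A': 1, 'copyright': 2, 'not equals': 3, 'G clef': 4}

# An all-ASCII string is byte-for-byte identical in ASCII and UTF-8.
same_bytes = "plain text".encode("utf-8") == "plain text".encode("ascii")
print(same_bytes)   # True
```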

UTF-8 should always be spelled as shown. Referring to UTF-8 as utf-8, utf8, UTF8, etc. is considered bad form.

Unlike UCS-2 and UCS-4, which represent the higher order values of Unicode directly in binary with 16-bit and 32-bit data units, UTF-8 remains a "one octet encoding unit" (single byte). How does it represent Unicode values higher than 255? It encodes them with byte sequences of variable length. The values for these encoding bytes all lie within the range 128-255, perhaps not coincidentally the high bit portion of the 8-bit byte. In UTF-8, none of the values in the 128-255 range represent an actual character. This range is broken down to sub-ranges of bytes which are designated as "signifiers" for the beginning of 2-, 3-, 4-, 5- or 6-byte sequences. See the table below. Bytes in the red region, hexadecimal C2 through DF, are used to announce the beginning of a two-byte sequence; bytes in the blue region a three-byte sequence; and so on, to the orange region for 6-byte sequences. The characters in the purple region are the data values that can follow the signifiers. By using just these values, the complete set of Unicode characters that comprise two or more bytes can be constructed.

The greyed out areas are prohibited values in UTF-8 (although there is an exception that will be discussed later).

For single byte characters, UTF-8 simply uses the first 128 Unicode characters, the ones that correspond exactly to classic Ascii 7-bit ISO 646. No special signifier announces a single byte sequence.

An easier way to understand the multi-byte signifier approach is by looking at the bit patterns in their binary values, as shown below. Two-byte sequences can be announced with any byte having 110 as the top three bits; three-byte sequences can be announced with any byte having 1110 as the top four bits; and so on.
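The bit-pattern test translates almost directly into code. The function below is a Python sketch of my own (the name sequence_length is hypothetical); it checks only the top bits, so it does not reject the prohibited values 0xC0 and 0xC1, which happen to share the 110 prefix.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length a byte announces, or 0 if it cannot start one.

    Follows the 1- to 6-byte scheme described in the text, by top bits only.
    """
    if lead & 0b10000000 == 0b00000000:
        return 1          # 0xxxxxxx: single-byte (7-bit Ascii) character
    if lead & 0b11100000 == 0b11000000:
        return 2          # 110xxxxx
    if lead & 0b11110000 == 0b11100000:
        return 3          # 1110xxxx
    if lead & 0b11111000 == 0b11110000:
        return 4          # 11110xxx
    if lead & 0b11111100 == 0b11111000:
        return 5          # 111110xx
    if lead & 0b11111110 == 0b11111100:
        return 6          # 1111110x
    return 0              # 10xxxxxx (a data byte) or 0xFE/0xFF

print([sequence_length(b) for b in (0x41, 0xC2, 0xE6, 0x97)])   # [1, 2, 3, 0]
```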

Cleaning a file of any UTF-8 multibyte sequences is complicated by the fact that you have to "look ahead" in the byte stream to determine whether a byte is the start of a multibyte UTF-8 sequence, or an isolated ISO 8859 character. When a UTF-8 sequence is found, the n-byte sequence must be replaced with a single ascii character (if the appropriate substitute is known), or simply removed (if no substitute is known). In practice, our clients use a very small repertoire of symbols that are easily replaced (trademark, registered, bullet, etc.).
If one wanted to extract the Unicode sequence numbers that fall in the above 255 range from a UTF-8 file, reverse computation would be required, since the Unicode sequence numbers above 255 are encoded.
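The reverse computation amounts to stripping the signifier bits from each byte and gluing the remaining payload bits together. Here is a minimal Python sketch, limited to 2-, 3- and 4-byte sequences; the function name decode_code_point is my own, and it does no checking for overlong or otherwise prohibited encodings.

```python
def decode_code_point(seq: bytes) -> int:
    """Recover the Unicode number from one complete 2-, 3- or 4-byte sequence."""
    lead = seq[0]
    if lead & 0xE0 == 0xC0:
        value, expected = lead & 0x1F, 2    # 110xxxxx keeps five payload bits
    elif lead & 0xF0 == 0xE0:
        value, expected = lead & 0x0F, 3    # 1110xxxx keeps four
    elif lead & 0xF8 == 0xF0:
        value, expected = lead & 0x07, 4    # 11110xxx keeps three
    else:
        raise ValueError("not a multi-byte signifier")
    if len(seq) != expected:
        raise ValueError("wrong number of bytes for this signifier")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("data byte must look like 10xxxxxx")
        value = (value << 6) | (b & 0x3F)   # append the six payload bits
    return value

print(hex(decode_code_point(b"\xe6\x97\xa5")))   # 0x65e5
```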

Examples

Working UTF-8 Sequences

Copyright symbol: 0xC2 0xA9

Not-equals sign: 0xE2 0x89 0xA0 (Unicode char U+2260)

Korean text: 0xED 0x95 0x9C 0xEA 0xB5 0xAD 0xEC 0x96 0xB4 (Unicode chars U+D55C U+AD6D U+C5B4)

Japanese text: 0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E (Unicode chars U+65E5 U+672C U+8A9E)

Working ISO 8859-1 (Latin1) Sequences

0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, æ, Ø, Æ)

Mangled Sequences

0x31 0x32 0x33 0x99 (1,2,3, winlatin1 trademark)

0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, three Latin1 chars, one winlatin1 char)

0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E 0xE6 0xE8 0x31 (UTF-8 Japanese text followed by misused signifiers)
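To make the "look ahead" idea concrete, here is a Python sketch that walks a byte stream and labels each chunk as plain Ascii, a complete UTF-8 sequence, or a stray 8-bit byte. The names scan and is_utf8_sequence are hypothetical, and leaning on Python's own UTF-8 decoder to validate each window is a shortcut I chose for brevity, not how our scripts work.

```python
def is_utf8_sequence(candidate: bytes) -> bool:
    """True if the bytes form exactly one valid UTF-8 character."""
    try:
        return len(candidate.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False

def scan(raw: bytes) -> list:
    """Label each chunk of the stream as 'ascii', 'utf8' or 'stray'."""
    chunks, i = [], 0
    while i < len(raw):
        if raw[i] < 0x80:
            n, label = 1, "ascii"
        else:
            n, label = 1, "stray"          # assume an isolated 8-bit byte...
            for width in (2, 3, 4):        # ...unless looking ahead proves otherwise
                if is_utf8_sequence(raw[i:i + width]):
                    n, label = width, "utf8"
                    break
        chunks.append((label, raw[i:i + n]))
        i += n
    return chunks

mangled = bytes.fromhex("e697a5 e69cac e8aa9e e6 e8 31")
print([label for label, _ in scan(mangled)])
# ['utf8', 'utf8', 'utf8', 'stray', 'stray', 'ascii']
```

A cleaning script could then replace each 'utf8' or 'stray' chunk with a substitute from a mapping table, or drop it.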

How Do You Detect a File's Encoding?

None of the character tables discussed so far (with a possible exception of UTF-8, discussed below) have any convention for a self-identifying start block or header. It has to be done based on some sort of pattern. In Unix, the 'file' utility (or 'type' on some versions of *nix) is a pretty good tool for identifying the character table type. Here's the pattern that 'file' seems to use:

When: | Unix 'file' Reports:
All characters are 7 bit | "ASCII text"
ISO 8859 8-bit characters (with or without 7-bit ascii mixed in) | "ISO-8859 text"
Any use of bytes in the ISO-8859 forbidden range (0x80-0x9F) which aren't part of valid UTF-8 sequences | "Non-ISO extended-ASCII text". (This is how WinLatin1 files will be identified.)
Consistent correct use of multi-byte UTF-8 sequences | "UTF-8 Unicode text"

How 'file' handles some edge cases:

When: | Unix 'file' Reports:
All 7-bit, but containing odd combinations of control characters from the 0x00 to 0x1F range | "Data"
Correct UTF-8 code mixed with any 8-bit ISO 8859 characters, forbidden or non forbidden | "Non-ISO extended-ASCII text"

What 'file' cannot do is read your intentions. It makes the simplest judgment call it can. For example:

  1. A file of all 7-bit characters is always "ASCII text"...never mind that you think of the file as being UTF-8 or ISO 8859 compliant (both of which are also true).
  2. ISO 8859 detection is based on byte values alone, not syntactic patterns; it can't tell whether you want it rendered as Latin1, Baltic, Cyrillic, etc. It reports ISO 8859, not ISO 8859-1 or ISO 8859-2. Telling them apart would take some fairly elaborate dictionary lookups from multiple languages.
  3. Any UTF-8 multi-byte sequence not containing bytes 0x80 to 0x9F is hypothetically correct ISO 8859. Because these are syntactically rare, 'file' chooses to identify this as UTF-8 (assuming everything else about the file is UTF-8 compliant). It cannot read your mind that you intended a series of weird ISO 8859 characters. For example you might want the sequence 0xC3 0xA9 to read Ã© (A with tilde, copyright sign) in Latin1, but since it is a valid UTF-8 sequence (é, e with acute accent), 'file' will report it as "UTF-8 Unicode text" if everything else about the file is consistent with UTF-8.
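The decision pattern can be imitated in a few lines of Python. To be clear, this is my own rough approximation of the behavior I observed, not the real logic inside 'file'; the function name guess_encoding is hypothetical and the control-character "Data" case is left out.

```python
def guess_encoding(raw: bytes) -> str:
    """Rough imitation of the pattern described above (not file's actual code)."""
    if all(b < 0x80 for b in raw):
        return "ASCII text"                      # nothing uses the 8th bit
    try:
        raw.decode("utf-8")
        return "UTF-8 Unicode text"              # every 8-bit byte sits in a valid sequence
    except UnicodeDecodeError:
        pass
    if any(0x80 <= b <= 0x9F for b in raw):
        return "Non-ISO extended-ASCII text"     # forbidden zone in use
    return "ISO-8859 text"

print(guess_encoding(b"123\xe6\xd8\xc6"))   # ISO-8859 text
print(guess_encoding(b"123\x99"))           # Non-ISO extended-ASCII text
```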

Regarding the potential for confusion between ISO 8859 and UTF-8, this is what others have to say:

  1. "the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length" (UTF-8 RFC).
  2. "The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII [sic] is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence." (Wikipedia UTF-8 page)

If you're dealing with a page you suspect is not rendering with the correct ISO 8859 subtable, both Firefox and Safari give you the option to change the interpretation. In Safari, check the View menu, Text Encoding. In Firefox, check the View menu, Character Encoding.

BOM: A UTF-8 File Format?

There is a convention called the "Byte Order Mark" (aka BOM) which (hypothetically) can be used to self-identify a UTF-8 file. This consists of beginning a file with the sequence 0xEF 0xBB 0xBF. Both Unix 'file' and 'vi' seem to be pretty good at interpreting it. When a file is pure 7-bit ASCII, but preceded by the BOM, it forces 'file' to report it as "UTF-8 Unicode text". An interesting edge case: if a file containing ISO 8859 8-bit characters that cannot be interpreted as UTF-8 multi-byte sequences is preceded by the BOM, neither Unix 'file' nor 'vi' can be fooled: 'file' will report "ISO-8859 text", and 'vi' will display the BOM characters (as ï»¿).

Other than BOM, refrain from ever calling ASCII, UTF-8, ISO-8859 or WinLatin1 a "file format." None have a standardized header or starting block. They are more properly called "byte sequence conventions."
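Checking for the BOM yourself is simple. Below is a small Python sketch; the helper name strip_utf8_bom is my own, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of Python's standard library.

```python
import codecs

def strip_utf8_bom(raw: bytes):
    """Return (had_bom, remaining_bytes)."""
    if raw.startswith(codecs.BOM_UTF8):          # the bytes 0xEF 0xBB 0xBF
        return True, raw[len(codecs.BOM_UTF8):]
    return False, raw

print(strip_utf8_bom(b"\xef\xbb\xbfabc"))        # (True, b'abc')

# The "utf-8-sig" codec drops a leading BOM while decoding.
print(b"\xef\xbb\xbfabc".decode("utf-8-sig"))    # abc

# Read as Latin1, the same three bytes show up as three visible characters.
print(codecs.BOM_UTF8.decode("latin-1"))         # ï»¿
```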

Unicode HTML Entities

The entire corpus of Unicode is hypothetically available to browsers. For example the Unicode characters of Japanese that were mentioned earlier, U+65E5 U+672C U+8A9E, should successfully render on your browser when I put &#x65E5 ; &#x672C ; &#x8A9E ; in the HTML. Here it goes: 日 本 語. These work for me on Firefox. In practice, I believe that no browsers are 100% capable of correctly rendering the entire 2^31 Unicode character set via HTML entities.

Question: does the presence of the foreign characters you see above effectively make this page Unicode or UTF-8 format? No. The characters I used to make them were merely ampersand, pound, x, and digits, all 7-bit ASCII characters. HTML Entities are merely instructions to browsers what to do with the characters.
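The same point can be shown outside a browser. This Python sketch uses the standard library's html.unescape to play the role of the browser; the variable names are mine.

```python
import html

markup = "&#x65E5;&#x672C;&#x8A9E;"

# The markup itself is pure 7-bit ASCII...
print(markup.isascii())                   # True

# ...but a consumer that resolves the entities ends up with three characters.
text = html.unescape(markup)
print([hex(ord(ch)) for ch in text])      # ['0x65e5', '0x672c', '0x8a9e']
```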

Do we want to use HTML Entities in the cleaned feeds we send to search engines? Do we want to figure out the intent of various WinLatin1 manglings and UTF-8 sequences that come from client feeds and turn them into HTML entities that will render safely on a browser? This is a subject still open to question. Our YSSP search engines can handle them, while our Comparison Shopping Engines (CSE's) do not. Currently we are leaning towards aiming for the lowest common denominator, 7-bit ASCII for all search engines.
WinLatin1's tragic legacy has, unfortunately, been immortalized in HTML entities. HTML entities 128 through 159 will render as their corresponding WinLatin1 characters in many browsers. (Since Unicode entries 128-159 are prohibited, HTML entities are departing from their Unicode basis in this area.) However, you should not consider this reliable; the W3C HTML 4 recommendations exclude these codes from their list.

HTML Page Encoding

Can an HTML file be in actual UTF-8 format? "Yes we can." If it contains UTF-8 byte sequences, it gets interpreted by your browser as UTF-8. Here are two nice examples of "UTF-8 Test pages" out there that are just full of exotic Unicode in UTF-8 encoding: Markus Kuhn's sample file, and Frank da Cruz's UTF-8 Sampler. (How well your browser shows the page depends on how good the browser is. For me, Firefox and Safari got most of it, but IE6 failed on large numbers of the characters.)

After you look at one of these pages, try viewing source. What do you see? If your source viewer is UTF-8 capable, you'll see, well, the UTF-8 characters. Surprised? Expected to see the numeric bytes, did you? If you want to see the numeric value of a character under the cursor in vi, press ga. For UTF-8 characters higher than 8 bits you will see large values like 0xFFDD. Or if you want to get really deep into the binary encoding of a file, you could get a binary editor. On Unix, try 'od', or better yet, 'bvi' (you have to download it).

When an HTML page puts this in the <head> block, what does it do?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

For the most part, it simply communicates intent: a tip to the consuming client what to do with the page. It doesn't change the way the bytes go down the wire, or the way your HTML is stored locally on your disk. It's still just bytes. In practice, you can even leave the declaration out and even though the file is chock full of UTF-8 binary sequences, a good browser figures it out anyway. (Note that Frank da Cruz's file above had no meta direction at all.)

How do you compose (i.e. edit and save) an HTML file in UTF-8? Well, if you're using all 7-bit ascii characters you're fine, that's still valid UTF-8. But really we mean UTF-8 with higher level Unicode characters. Well, unless you want to enter the binary directly, you'll need a special editor for that; vi, textpad, etc. won't do it for you. Try Google.

UTF-8 and XML

XML is UTF-8 by default; the following declaration is actually redundant:

<?xml version="1.0" encoding="UTF-8"?>

You'd have to override it with a different value for the encoding attribute to make it non UTF-8. Like the HTML meta tag content charset setting mentioned above, the encoding declaration is essentially a tip to the consumer of the XML what to expect.
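Here is a small Python sketch of the default and the override, using the standard library's ElementTree parser. The two sample documents and the helper name parses are made up for illustration.

```python
import xml.etree.ElementTree as ET

# No declaration: the parser assumes UTF-8, so 0xC3 0xA6 is read as one character.
default_doc = b"<word>\xc3\xa6</word>"

# Declared as ISO-8859-1: the single byte 0xE6 is read as that same character.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>\xe6</word>'

print(ET.fromstring(default_doc).text == ET.fromstring(latin1_doc).text)   # True

def parses(doc: bytes) -> bool:
    """True if the parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# The Latin1 byte without a declaration is not valid UTF-8, so parsing fails.
print(parses(b"<word>\xe6</word>"))
```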

Some Fun Historical Facts

Bob Bemer (1920-2004) is said to be the "father of ASCII". His Ford Explorer's vanity plate read 'ASCII' surrounded by a plate holder reading "Yes! I'm the Father of".

Wikipedia has a picture of one of the earliest published ASCII tables, from 1968.

Ken Thompson, the Unix pioneer, invented UTF-8. "It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat" (source). You can also read the original AT&T UTF-8 Tech Report.


Sunday, May 10, 2009

You on a Diet: book review

During my various attempts to get in good shape over the years I've wasted my share of money on various diet & exercise books that try to tell you what exercise you should do, what muscles they benefit, what foods you should eat and why, and recipes that will make it more enjoyable. Big surprise, lack of willpower has tanked many of my efforts to stay on any kind of healthy eating program. But I really don't think that willpower is the whole story for me or other dieters. What about our ingrained habits of using food for reward, food to fix a poor mood, hedonistic eating, addiction to various kinds of foods (like my favorite, the Sticky Pecan Roll from Au Bon Pain)?

One diet approach that I thought took a step forward in talking about the psychology of diet was the Low-Carb diet (or "Atkins" in its most popular form). It talks about the role that carbs have in making us feel full, and thereby feel good. But I think you could do a lot better. I want to know the mechanics behind appetite, satisfaction and digestion: what actually occurs in the body to make you want to eat huge portions of your favorite foods, what happens as that urge becomes satisfied, and how you can use that knowledge to change your behavior.

I discovered the book You On a Diet while standing in line at a Jamba Juice one day. Using entertaining illustrations, it nails exactly what I was looking for. Some of the points it makes on the mechanics of body chemistry, food and appetite include:
  • The lower gut contains 95% of the body's serotonin, which suggests that eating is self-medicating.
  • Our caveman ancestors were more fit than us because the stresses of survival kept them lean. Stress, like fight-or-flight stress, means that Peptide NPY is inhibited and you don't feel like eating.
  • When fat makes the liver work extra hard, it prevents glucose from getting to our cells, and produces hunger.
  • Fiber is good for diet because it slows "the transit of food across the ileocecal valve, keeping your stomach fuller for longer." (p. 68)
The book identifies these foods that make you feel full, or suppress appetite in some way:
  • Nuts
  • Cinnamon
  • Whole Grains
  • Fiber
  • Red Peppers
  • Smell of grapefruit (p. 88)
  • Brightly-colored food
  • Mint breath strips (p. 239)
  • Fiber supplements (e.g. 1 tbsp Psyllium Husks with a glass of water)
Other eating strategies the book recommends revolve around easing the process of digestion. When the intestines are breaking down food, separating the good nutrients (to go into the bloodstream) from the bad nutrients (to be eliminated), a natural process of inflammation occurs. The intestines do their job well when that inflammation is kept to a healthy level. Bad foods increase the inflammation, and the separation of good vs. bad is impaired. Inflammation-fighting foods include:
  • Lactobacillus CG, a healthy bacterium found in yogurt
  • Omega-3 fatty acids, such as Fish oil, walnuts
  • Green tea
  • Beer (hops)
  • Turmeric
  • Jojoba beans (available in supplements)
  • Soybeans (isoflavones)
  • Lignans, such as Flaxseed oil, rye bread
  • Polyphenols, such as tea, coffee, fruit, vegetables
  • Glucosinolates, such as broccoli, kale, cauliflower
  • Carnosol, found in Rosemary
  • Resveratrol, found in red grapes, juice and wine
  • Dark Cacao
  • Quercetin, as found in cabbage, spinach, garlic, capers, apples, tea, red onion, red grapes, citrus fruit, tomato, broccoli, leafy green vegetables, cherry, raspberry, and lingonberry
  • Antioxidants, as in vegetables and fruits (especially bananas), Vitamin E, Vegetable oils, Tea, coffee, soy, fruit, olive oil, chocolate, cinnamon, oregano and red wine
The book covers some interesting facts on the mechanics of exercise and weight loss:
  • Weight loss improves cholesterol by a factor of three. For example, a 7% weight loss leads to a 20% improvement in cholesterol levels. (p. 120)
  • The beneficial effect of exercise in producing weight loss is greater than the detrimental effect of eating in producing weight gain. So even if you're getting in only a little exercise each day, the effect is significant. (pp. 141-142)
  • Without exercise, we lose 5% of our muscle mass every 10 years after the age of 35. If you don't exercise (rebuild muscle) every 10 years, you need to eat 120-420 fewer calories a day to maintain your current weight. (p. 142)
  • Focus on reducing the size of your belly, not weight loss. Exercise not only reduces fat but bulks up muscles, which can result in a net loss of zero.
The material on "good fats" vs. "bad fats" is helpful:

  • Bad Fats are the ones that stay solid at room temperature: animal fat, butter, stick margarine & lard. Food manufacturers push these because they have a long shelf life.
  • Good Fats are the ones that are liquid at room temperature, the omega-3 and -6 fats: olive, vegetable, sesame & canola oil, fish oil, flaxseed, avocados, nuts (especially walnuts). Nutrients that fight bad fats are niacin and vitamin B5.
Everyone knows that whole grains are good for you, although we use the term so frequently it's helpful to review what this actually means. In a whole grain, "the grain still has all three of its original elements: the outer shell or bran, which contains fiber and B vitamins; the germ, which contains phytochemicals and B vitamins; and the endosperm which contains carbohydrates and protein." (p. 256) "'Refined' grains means only endosperm is enclosed." (p. 257)
Here's a fun subject the book covers: what gives you gas? It's important because it's a byproduct of the way you eat and what you are forcing digestion to do for you:

  • Gas is a normal result of the intestinal inflammation during digestion, as good nutrients go to the bloodstream and bad nutrients go to the lower intestines (and produce gas). So bad gas can be attributed to too many bad foods in your diet. Also, when inflammation is too high, some bad nutrients get into the bloodstream, leading to cholesterol.
  • Sulfur-rich foods such as eggs, meat, beer, beans and cauliflower make gas smell worse
  • Drinking cola means swallowing more air, which means more gas
Other interesting things I found in the book:

  • No evidence yet shows that artificial sweeteners are unhealthy (p. 97)
  • It's the fat around our waist that gets us into trouble. Fat in other parts of your body cause relatively little harm to your health or eating chemistry. (p. 102)
  • Alcoholic drinks fight bad fats (p. 123)
  • The liver is the heaviest organ in the body (p. 77)
  • Your deodorant can make you gain weight, if it contains aluminum or polychlorobiphenols (p. 92)
  • The more brightly colored the food, the better it is for you (p. 95)
So I can give a strong recommendation for You On a Diet if you want to learn a lot about the mechanics behind appetite, satisfaction and digestion.

Sunday, April 12, 2009

iCrossing is hiring a Java developer in Chicago

My employer, iCrossing, has opened a search for a new member for the Merchantize team. Here's the description. To apply for the position, visit http://www.iCrossing.com/careers, select U.S. Career Offerings, Jobs by Location, then Jobs in Chicago, IL.

Java Software Engineer (Open Source / Web Analytics / ETL)

JOB DESCRIPTION

We’re a people business.

People are the heart and soul of our company, working every day to make our clients’ marketing programs successful.

At iCrossing, we combine experienced talent with world-class technologies to efficiently create marketing programs that truly perform. With more than 620 professionals in 15 offices in the U.S. and Europe, we are equipped to service the digital marketing needs of large enterprises and growing companies alike.

We’re seeking the talented, the experienced and the exceptional to give our clients the most creative and successful solutions for an ever-changing industry. When we find them, we offer a dynamic working environment, competitive compensation, the opportunity to work on exciting client programs, and occasional bagels.
We are seeking a highly motivated and technically proficient JEE Software Engineer / Software Developer to work on our industry leading and mission critical Paid Media Management (Search Engine Marketing, bid management) product.



Features of the position:
• Work on a high-visibility, high performance product that supports iCrossing’s industry leading SEM practice in a growing and fast moving industry.
• Work closely with all of the major search engines (Google, Yahoo, MSN, Ask, AOL) and their APIs.
• Work in a fast moving and forward thinking development environment that is constantly researching and rapidly implementing the latest technologies.
• Research and participate in the advancement and implementation of open source frameworks and architectures such as SOA/ESB, MapReduce, Grid and Cloud computing, and others.
• Work with an experienced Agile Software Development team in a highly collaborative environment.
• Modern Java Enterprise open source based product stack, Java 6, Spring, Hibernate 3, Webworks/Struts 2, JMS, JUnit, MySQL and more.
• Learn current software development best practices (continuous integration, build automation, test driven development, pair programming, agile estimating and planning, etc)
• Apple MacBook Pro, 24” widescreen monitor, IntelliJ or Eclipse.
• A casual, fun, and creative work environment
Major Job Responsibilities / Accountabilities:
• Write test driven quality code.
• Work closely with your dev team.
• Follow and encourage development best practices.
• Develop knowledge of Search Engine Marketing (SEM) principles and techniques.

Skills/Requirements:
Required Technologies (At least one or more of the following)
• Spring
• Hibernate
• SQL scripts
• Shell Scripting
• Webwork (Struts 2.0)
• Linux / Unix admin
• Junit (required) or TDD (preferred)
• Grid Computing (GridGain preferred)

Bonus Technologies (Preferred any of these)
• MySQL (especially advanced knowledge of replication, storage engines, backup and recovery)
• PERL
• Data warehousing design concepts, ETL
• Mondrian OLAP
• JMS
• Amazon EC2 / S3 / AWS

Knowledge / Skills / Abilities:
• BS in Computer Science or equivalent level of experience
• Understanding and/or appreciation for Agile software development methodologies.
• 1+ yrs of professional development experience.
• Familiarity with source control using Subversion
• Familiarity with IDE tools such as Eclipse or IntelliJ
• Must possess effective interpersonal and communication skills and ability to work successfully in a team environment.
• Good organizational and time-management skills.

Do Not Apply if you:
• Do not know Java
• Have no interest in Agile, TDD or Unit testing
• Are close-minded and don’t want to learn new technologies.
• Are more comfortable working on the same technology you did last year.

*ICROSSING IS NOT ACCEPTING RESUMES FROM STAFFING AGENCY PARTNERS AT THIS TIME. THANK YOU.

Friday, March 27, 2009

Hacking a Linksys NSLU2



I bought a Linksys NSLU2 a while back, which is a low cost (about $99) appliance for turning two USB disk drives into a Network Attached Storage (NAS) system. This lets me set up file storage centrally located on my LAN (as opposed to attaching it to one computer on the LAN and setting up a share).

What's inside is a small Linux computer mounted on a single circuit board. And that's where the fun comes in. As the device's Wikipedia page points out, since the internal Linux is licensed with a GNU General Public License, Linksys was required to release their source code. This has enabled third parties to develop firmware upgrades to the device. One popular upgrade is the Unslung firmware, which among various things, enables the device to accept telnet connections.

Here is my network cabinet at home. From left is my DSL modem, a 360 GB USB disk drive, the NSLU2, and my Linksys WRT54G Wireless router. If you know this router, you can see by the size that the NSLU2 is not much bigger than a deck of cards.

Like a router, an NSLU2 hooked up to your LAN has its own web administration page, which is reachable at http://192.168.1.77.

Upgrading to the Unslung firmware went exactly as the directions described. After restarting, the NSLU2's admin page had a few additions, as you can see in the screen grab below. It added an Unslung logo on the upper left, and a "Manage Telnet" link on the right. Once I enabled telnet I was able to log in and get a prompt by telnetting to 192.168.1.77.

A short tour of what is inside the box is shown in the telnet session below. There's cool stuff inside! A nice handful of basic Unix commands, the web server that serves up the admin site (above), and even a wget command (which I demonstrate by getting the yahoo.com homepage HTML).

The next thing I wanted to do was to "de-underclock" the device. The CPU is rated at 266 MHz but, for unknown reasons, Linksys clocked it down by half with a tiny little resistor. Here's the board:

Simply removing the resistor clocks it up to 266 MHz. Following the helpful instructions on the Unslung site, I geared myself up with some needle nose pliers, a static wrist guard and gloves, a magnifying glass and a geeky miner's light (see pic at left). All I had to do was get a grip on that tiny, tiny resistor (about 1/4 the size of a grain of rice)...and I crunched it! After all, I wasn't going to need it again.

When I put the board back inside the case, reconnected it and restarted it...it worked! And I got proof that the clock speed had doubled in my telnet session:


Whoa, it says it is 2.22 MHz short of 266 MHz. I wonder why? Such are the mysteries of computer hardware.

Tuesday, March 17, 2009

Apple iPhone OS 3.0 Announcement summary

Here are highlights of what was announced for the iPhone OS 3.0 release early this afternoon from Apple:
  • Cut and paste: it was worth the wait, the touch interaction to do this looks very cool (see picture at right). Works across applications and does undo.
  • Multimedia messaging: you can attach a picture to a text message
  • Ability to choose a group of photos and send them in a single email
  • Push email notification
  • Landscape mode text entry (so what)
  • Turn-by-turn GPS navigation.
  • Available in the summer. That's as detailed as it gets. No doubt it will be linked to the new iPhone model coming out in July.
  • Virtually all of the new features will work with the original, pre-3G iPhone (exceptions: multimedia messaging and stereo Bluetooth)
  • Peer-to-peer linkups between individual iPhones for games, file sharing, etc. This exists now with things like AirSharing and Holdem, but those companies probably rolled their own; now it's part of the API
  • API support for applications that connect to external devices. Demonstrations with medical devices were given (see pic at right). Medical applications of new technology are always a big win in corporate presentations; the real news here is that this will open up remoting of all sorts of sophisticated devices for music, video, information systems, anything you can imagine.
  • Ability to search in your emails on the server side, and search in your calendar items
  • Search your iPhone contents with Spotlight (well-known to Mac users)
  • The Sims 3 will run on the iPhone (see pic at right)

Monday, March 16, 2009

iPhone SDK Presentation at CJUG 2/17/09

Our local Java Users Group chapter, CJUG, recently hosted a presentation by Rakesh Vidyadharan titled iPhone SDK: Java Developers Perspective (link to PDF). It wasn't so much an immersion in iPhone development per se as an introduction to development in Objective-C.

Some of the things I found interesting:

  • The SDK requires an Intel-based Mac
  • The MVC approach is pretty much baked into the framework. Not everyone likes that.
  • Development with an emulator is a breeze, but pushing an app to a real iPhone is time-consuming
  • Getting apps considered for inclusion in the App Store is a well-documented but convoluted process
Some things about Objective-C:
  • Very similar to TCL scripting
  • Weak typing and dynamically bound variables, like JavaScript, Ruby and PHP
  • There are no namespaces or packages, which means every class has to have a completely unique name. To group classes, developers adopt prefixes, like CUreader, CUwriter, CUcreator, etc. And the language has several prefixes reserved for the language core. For example, you can't define any classes or variables beginning with NS, IB, or UI.
  • Parameter names are part of the signature of a method. For example, foo(first_name: "Greg", last_name: "Sandell")
  • The code is visibly very different from Java, or even C and C++. Many lines start with a plus or minus sign.
Object-oriented characteristics of Objective-C:
  • Much less a "real" object-oriented language than C++
  • Objects don't automatically inherit from a base Object as in Java. You have to explicitly extend NSObject
  • Objects automatically have setters and getters, like Ruby
  • Like C, you completely manage your own memory. OS X since Tiger has a garbage collector, but Objective-C doesn't use it
  • Memory is managed by a reference-counting approach called refcount. Each alloc increments refcount, each release decrements it
  • Dealloc is like finalize in Java
  • Messaging is a big part of the language. For example, methods are invoked via messages.
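
For Java folks who have never had to count references by hand, here is a rough sketch of the refcount idea from the notes above. It is written in Python rather than Objective-C, and it is only a toy model: the RefCounted class and its method names are made up for illustration. The retain step is not in my notes; I am assuming it from how manual memory management in Objective-C is usually described (creating an object gives it a count of 1, retain adds an owner, release removes one).

```python
class RefCounted:
    """Toy model of manual retain/release reference counting (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1          # creating ("alloc") the object starts the count at 1
        self.deallocated = False

    def retain(self):
        """Register one more owner of the object (assumed step, see lead-in)."""
        self.refcount += 1
        return self

    def release(self):
        """Give up one ownership claim; clean up when no owners remain."""
        if self.deallocated:
            raise RuntimeError(f"{self.name}: release after dealloc")
        self.refcount -= 1
        if self.refcount == 0:
            self.dealloc()

    def dealloc(self):
        """Cleanup hook, the part the notes compare to finalize in Java."""
        self.deallocated = True


reader = RefCounted("CUreader")    # count is 1 after creation
reader.retain()                    # a second owner: count is 2
reader.release()                   # first owner is done: count is 1
reader.release()                   # last owner is done: count is 0, dealloc runs
```

The point of the sketch is the bookkeeping burden: every retain needs a matching release, and one release too many is an error (modeled here as an exception).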

Friday, March 13, 2009

More ways to lose Twitter followers

I read an article today called How to lose Twitter followers in 10 steps. I've probably read about 20 such Twitter etiquette articles, and most are pretty good. They don't even all say the same things, which means there are many, many ways to be an obnoxious Twit. Twitterer. Ahem.

And yet there are still more ways to lose followers, because they have yet to cover MY list. Here are things you can do to get me to stop following you:
  1. Tweet long sentences over multiple tweets. Listen, it's a 140-character medium, okay? Don't try to bypass it with a continuation.
  2. Be a celebrity-turned-Twitterholic. Hey, I love your movies, and hearing about your glorious life was fun for a while, but now you twitter every tiny thought and movement you make as if I have nothing else to do but follow your 140-character fanzine.
  3. Tweet a link without information. People do this...a naked link, its identity obscured by a tinyurl, and you think we worship you so much that we'll follow you into a blind alley.
  4. Tweet linkless information that needs a link. The other day someone announced a really cool-sounding conference but gave no link.
  5. Make your followers listen to your side of the one-on-one conversation you're having with a friend. Hey, get a room. A chat room. This is Twitter.
  6. Twitter using a context that only a tiny number of your followers know about. Recently I stopped following a colleague who kept tweeting things like "is excited about the new website" and "posted my first article to the new site. Can't wait for the reactions." What website would that be?
  7. Tweet with the same hyped-up, caffeine-fueled excitement and optimism day after day. You got tickets to SXSW, FTW! You ate at a new five star on Randolph, FTW! You got tickets to Coldplay, FTW! Okay, okay, I give up, I hate myself and I want to be you!
  8. DM everyone on your list with generic messages. People tend to have their twitter set up to get emails when they are DM'd. How out of it can you be about Twitter that you immediately turn it into a vehicle for spam?
Read on Twitter today from someone at SXSW: "Everyone starts out on the internet as a douchebag. Then you do something to move above or below that line."