
Friday, May 15, 2009

Solving Character Set Problems: ASCII, ISO-8859, WinLatin1 and UTF-8

Character Sets and Feeds

I work for a company which receives huge incoming text feeds from our clients, which we filter, transform and massage to go out to the major search engines. Much of the work of cleaning incoming feeds consists of replacing undesirable characters (e.g. fancy foreign characters and symbols) that come from the client. For search engines we have to provide the lowest common denominator of character type, plain ASCII. There are many ways these "bad" characters creep into incoming feeds. Some are mainstream ISO 8859 or UTF-8 characters that are accepted by most display systems, but are still undesirable for search engines. Less likely but still possible are errors caused by the use of a legacy non-printing control character. By far the most common offender comes from violation of ISO 8859 conventions that have come into practice, known as "Extended Ascii" (of which Microsoft's WinLatin1 table is the most notorious offender). This page explains the major character set types, along with a little history, and the problems that they cause to feeds.

7-bit Ascii (ISO 646)

The original ASCII table, a 128-value (7-bit) character set, is formally known as ISO 646. In fact it remains the only real "Ascii table" (more on this later). This portion has survived unchanged as the beginning portion of all major character sets (even the ones tampered with by Microsoft).

Below is the printable portion of ISO 646, characters 32 through 127 (the end of the table).

Below is the non-visible characters section of ISO 646, the first 32 characters of the table. Note that only a handful are in modern use; the rest are historic relics.

In feeds from our clients, mistaken input of legacy control characters is rare, although recently one client reported seeing 0x1D ("Group separator GS Left Arrow") in a feed.
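A check along these lines could flag stray control characters before a feed goes out. This is only a minimal Python sketch: the function name find_control_bytes and the decision to allow tab, line feed and carriage return are my assumptions for illustration, not a description of our production scripts.

```python
# Hypothetical helper: report legacy control bytes (0x00-0x1F and 0x7F)
# other than the handful still in everyday use.
ALLOWED = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return

def find_control_bytes(data: bytes):
    """Return (offset, byte value) pairs for unexpected control bytes."""
    return [(i, b) for i, b in enumerate(data)
            if (b < 0x20 or b == 0x7F) and b not in ALLOWED]

sample = b"blue widget\x1d$9.99\n"
print(find_control_bytes(sample))  # [(11, 29)]  -> 0x1D at offset 11
```

The offsets make it easy to report the exact position of the offending byte back to the client.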

8-bit Character Tables and ISO 8859

Ascii remains to this day, strictly speaking, a 7-bit table of 128 characters and no more. However, since in modern times characters are stored in 8-bit bytes, people have wanted to take advantage of what is available on their machine in the area of that top bit. Folklore, common practice, and whatever happens to be on the machine one is using have misled many into thinking some universal 8-bit Ascii table exists. But in reality the 8th bit portion has been a playground for whatever people wanted it to be.

To answer this havoc, ISO 8859 defined a variety of 8-bit character tables to cover most of the world's needs. For every ISO 8859 table, the 7-bit portion is ISO 646 Ascii. The best-known of the ISO 8859 tables, at least among the English speaking and Western European world, is ISO 8859-1, also known as "Latin 1." To some people, this is "the" Ascii table. The eight bit portion is shown below.

Note that there is an unused portion (greyed out) of the table between characters 128 and 159. Take special note of this; as you'll see below this special area becomes a huge area of controversy (Extended Ascii).

Other ascii tables in the ISO 8859 collection include Baltic, Cyrillic, Arabic, Greek, Hebrew and Turkish character sets, allowing each country to adopt the appropriate table, and software manufacturers to write compliant products. (You can see the various ISO 8859 tables here.)

At my company, a feed using ISO 8859 8-bit characters will need to be cleaned before going out to search engines. If they are ISO 8859-1 (Latin1) characters, the mapping to a substitute character is well-known and easily fixed. We do this with Perl scripts and a simple mapping table. Another possible problem would be if we had a client that is using some ISO-8859 table other than Latin1 (for instance, a client from Scandinavia might use the Baltic table, ISO 8859-4). In that case we would have to create a new mapping table. Worse yet, files cannot identify themselves as being any particular ISO-8859 subversion, so if a client doesn't announce the table they are using, we would only find that out by seeing unusual errors. For example, seeing frequent use of a copyright symbol © in unexpected places would be a clue that some non Latin1 ISO-8859 table was in use.
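To give an idea of what such a mapping table looks like, here is a minimal sketch in Python (our real scripts are in Perl). The table name LATIN1_TO_ASCII, the function latin1_to_ascii and the particular substitutions chosen are all hypothetical; a real table would cover the whole 0xA0-0xFF range.

```python
# Hypothetical substitution table: Latin-1 byte value -> ASCII stand-in.
# Only a few entries are shown for illustration.
LATIN1_TO_ASCII = {
    0xA9: "(c)",  # copyright sign
    0xAE: "(R)",  # registered sign
    0xC6: "AE",   # capital ligature AE
    0xD8: "O",    # capital O with stroke
    0xE6: "ae",   # small ligature ae
    0xE9: "e",    # small e with acute
}

def latin1_to_ascii(data: bytes, unknown: str = "?") -> str:
    """Replace every 8-bit byte with its ASCII substitute."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                           # 7-bit ASCII passes through
        else:
            out.append(LATIN1_TO_ASCII.get(b, unknown))  # mapped, or flagged
    return "".join(out)

print(latin1_to_ascii(b"Caf\xe9 \xa9 2009"))  # Cafe (c) 2009
```

A client using a different ISO 8859 table would need a different dictionary, which is exactly the maintenance problem described above.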

It's not clear why ISO 8859 prohibited the area between characters 128 and 159. In one table I saw, each of the codes was described as being a control character of some sort.

'Extended Ascii' and Windows 1252

Microsoft couldn't resist using that empty area of the ISO-8859 tables, so they went and created their own 'extended version' of the ISO 8859 table set; you can see them all here. Their version of ISO 8859's Latin-1 table goes under the name WinLatin1 or Windows 1252. The 8th-bit portion of this table is shown below, with the characters Microsoft inserted into ISO 8859's empty area marked in yellow.

Due to the popularity of the Microsoft Office family of software products, WinLatin1 has achieved an unfortunate hegemony in the computer world. In the competition for the elusive "8-bit Ascii" table, WinLatin1 is ISO 8859-1's chief competitor. However, a sizable portion of the computer world (Macs and Unix flavored systems) have settled on the real ISO 8859 standard, so problems occur when people paste, email or otherwise transmit text made from a Microsoft product to be displayed on a Unix or Mac box. When you see question marks or 'ascii garbage' in a file when you're not expecting it, you have probably been WinLatin1-rolled.

The term "Extended Ascii" has become the name to describe the format of any file which contains a mixture of ISO 8859 plus characters in the forbidden zone of bytes 128-159.

So what did Microsoft do with that precious block of 32 characters it stole from ISO 8859? Four characters comprise the notorious MS-Word 'smart quotes' (single and double: ‘, ’, “ and ”). On the other hand they were remarkably prescient in including the Euro currency symbol (€) years before the currency was adopted. There are a number of punctuation and editorial marks or symbols, such as daggers († and ‡), the ellipsis (…), the list bullet (•) and several others that perhaps are used in European languages (ˆ, ‹, ›, and „). Many choices are totally inscrutable, however. Why define the comma-lookalike "‚"? Of what use can the few foreign alphabetical characters of Œ, œ (used in old German and old Romance languages), Ÿ (used in Greek transcription and rarely in French), Š, š, Ž and ž (Estonian, Finnish and Czech characters) be?

Use of the "forbidden zone" of ISO 8859 accounts for most of the cleanup tasks in clean scripts at my company. The trademark sign (™) is a favorite of many of our clients, since it is unavailable in ISO 8859-1. The list bullet and daggers are also occasionally found in feeds. Most likely the source of these characters is from client use of Windows Office software.
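Spotting these bytes is simple, since they all fall between 0x80 and 0x9F. The following is a small Python sketch, assuming the offending bytes really are WinLatin1; the function names are made up for illustration. Python's cp1252 codec is used to show what each byte would mean in Microsoft's table.

```python
def forbidden_zone_bytes(data: bytes):
    """Return the sorted set of byte values in 0x80-0x9F found in data."""
    return sorted({b for b in data if 0x80 <= b <= 0x9F})

def describe_as_winlatin1(b: int) -> str:
    # Python's cp1252 codec leaves a few slots in this range undefined,
    # so decode with errors="replace" rather than risk an exception.
    return bytes([b]).decode("cp1252", errors="replace")

feed = b"Acme Widget\x99 \x95 now 20% off"
for b in forbidden_zone_bytes(feed):
    print(hex(b), describe_as_winlatin1(b))
```

For the sample above this reports 0x95 (the list bullet) and 0x99 (the trademark sign).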

UCS/Unicode - ISO 10646

The next evolutionary step after ISO 8859 is UCS (Universal Character Set, ISO 10646). It dispenses with the approach of multiple table sets for individual languages in favor of a single giant table of all possible characters. Its original design allowed for 2^31 (2,147,483,648) code positions, far more than needed for the character sets of virtually all the languages of the world; the current standard restricts the range to U+0000 through U+10FFFF. It goes by the more common name of Unicode, after the Unicode Consortium, whose standard is kept in step with ISO 10646.

The first 256 characters of Unicode are completely backward compatible with ISO-8859-1. Unicode's first 128 characters are the classic ISO 646 7-bit Ascii set, while Unicode characters 128-255 are the top half of ISO-8859-1 Latin1.
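This compatibility is easy to confirm. The quick Python check below decodes every possible byte value as Latin1 and compares the result with the Unicode numbers 0-255; it relies only on Python's built-in "latin-1" and "ascii" codecs.

```python
# Every byte value 0-255, decoded as ISO 8859-1, lands on the Unicode
# character with the same number.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

assert [ord(ch) for ch in decoded] == list(range(256))
# The lower half is plain 7-bit ASCII.
assert all_bytes[:128].decode("ascii") == decoded[:128]

print(hex(ord(b"\xe6".decode("latin-1"))))  # 0xe6
```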

File Formats vs. Byte Sequences

With a character set of Unicode's size, single byte sequences with a maximum numeric range of 0-255 can obviously no longer be the only way in which a file stores text. UCS has spawned two different byte-sequence conventions, UCS-2 and UCS-4. UCS-2 files have two-byte characters, and UCS-4 files have four-byte characters.
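To see the difference in width, here is a small Python sketch. Python has no codec named UCS-2 or UCS-4, so as a stand-in I use "utf-16-be" and "utf-32-be", which produce the same bytes for the characters in this example; that substitution is my assumption, not something from the standards discussion above.

```python
text = "A\u00e6\u65e5"  # Latin A, Latin1 ae ligature, a Japanese character

two_byte = text.encode("utf-16-be")   # two bytes per character here
four_byte = text.encode("utf-32-be")  # four bytes per character

print(two_byte.hex(" "))   # 00 41 00 e6 65 e5
print(four_byte.hex(" "))  # 00 00 00 41 00 00 00 e6 00 00 65 e5
```

Notice how much of the output is zero padding for plain English text, which is part of the motivation for UTF-8.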

UTF-8

UTF-8 is a byte-based sequence convention for representing Unicode. As originally designed it can have 1-, 2-, 3-, 4-, 5- or 6-byte long characters, changing when and where it needs to (the current definition, RFC 3629, stops at four bytes). It is an efficient solution for English and other Western European languages that spend much of their time in the first 128, one-byte wide Unicode characters. In fact, English UTF-8 files will often consist of nothing but one-byte sequences, no different than an Ascii file. Because of this UTF-8 is nicely backwards compatible with ASCII, meaning a file can be UTF-8 but still work on older systems that support ISO-8859 or earlier standards. Even when a UTF-8 file does use larger width Unicode characters, it is still a "one octet encoding unit" (single byte) encoding standard, as opposed to UCS-2 and UCS-4 which are two- and four-octet encoding unit standards.
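The variable width is easy to observe. This short Python sketch encodes a few characters of my own choosing and prints how many bytes each takes; note that Python's codec follows the current UTF-8 definition, so it never produces more than four bytes.

```python
# One sample character for each sequence length Python will produce.
samples = ["A", "\u00e6", "\u2122", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
assert "plain English".encode("utf-8") == "plain English".encode("ascii")
```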

UTF-8 should always be spelled as shown. Referring to UTF-8 as utf-8, utf8, UTF8, etc. is considered bad form.

Unlike UCS-2 and UCS-4, which represent the higher order values of Unicode directly in binary with 16-bit and 32-bit data units, UTF-8 remains a "one octet encoding unit" (single byte). How does it represent Unicode values higher than 127? It encodes them with byte sequences of variable length. The values for these encoding bytes all lie within the range 128-255, perhaps not coincidentally the high bit portion of the 8-bit byte. In UTF-8, none of the values in the 128-255 range represents an actual character on its own. This range is broken down into sub-ranges of bytes which are designated as "signifiers" for the beginning of 2-, 3-, 4-, 5- or 6-byte sequences. See the table below. Bytes in the red region, hexadecimal C2 through DF, are used to announce the beginning of a two-byte sequence; bytes in the blue region a three-byte sequence; and so on, to the orange region for 6-byte sequences. The characters in the purple region are the data values that can follow the signifiers. By using just these values, the complete set of Unicode characters that comprise two or more bytes can be constructed.

The greyed out areas are prohibited values in UTF-8 (although there is an exception that will be discussed later).

For single byte characters, UTF-8 simply uses the first 128 Unicode characters, the ones that correspond exactly to classic Ascii 7-bit ISO 646. No special signifier announces a single byte sequence.

An easier way to understand the multi-byte signifier approach is by looking at the bit patterns in their binary values, as shown below. Two-byte sequences can be announced with any byte having 110 as the top three bits; three-byte sequences can be announced with any byte having 1110 as the top four bits; and so on.
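The bit patterns translate directly into a few mask-and-compare tests. The Python sketch below is only an illustration: the function name utf8_byte_role and its return labels are invented here, it follows the current four-byte limit rather than the original six-byte design, and it looks at the bit pattern alone (so it does not reject prohibited signifier values such as 0xC0 and 0xC1).

```python
def utf8_byte_role(b: int) -> str:
    """Classify one byte by its top bits."""
    if b & 0x80 == 0x00:   # 0xxxxxxx: ordinary 7-bit ASCII
        return "single"
    if b & 0xC0 == 0x80:   # 10xxxxxx: data byte following a signifier
        return "continuation"
    if b & 0xE0 == 0xC0:   # 110xxxxx: announces a two-byte sequence
        return "lead-2"
    if b & 0xF0 == 0xE0:   # 1110xxxx: announces a three-byte sequence
        return "lead-3"
    if b & 0xF8 == 0xF0:   # 11110xxx: announces a four-byte sequence
        return "lead-4"
    return "invalid"

print([utf8_byte_role(b) for b in b"\xe6\x97\xa5"])
# ['lead-3', 'continuation', 'continuation']
```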

Cleaning a file of any UTF-8 multibyte sequences is complicated by the fact that you have to "look ahead" in the byte stream to determine whether a byte is the start of a multibyte UTF-8 sequence, or an isolated ISO 8859 character. When a UTF-8 sequence is found, the n-byte sequence must be replaced with a single ascii character (if the appropriate substitute is known), or simply removed (if no substitute is known). In practice, our clients use a very small repertoire of symbols that are easily replaced (trademark, registered, bullet, etc.).
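The look-ahead logic might be sketched like this in Python. Everything named here (SUBSTITUTES, sequence_length, clean_to_ascii) is hypothetical, the substitution list is just a sample, and I am assuming that any 8-bit byte that is not part of a valid UTF-8 sequence should be read as WinLatin1. Our real cleaning scripts differ in detail.

```python
# Sample substitutions; anything not listed is simply removed.
SUBSTITUTES = {"\u2122": "(TM)", "\u00ae": "(R)", "\u2022": "*"}

def sequence_length(lead: int) -> int:
    """Length announced by a signifier byte, or 0 if it is not one."""
    if lead & 0xE0 == 0xC0:
        return 2
    if lead & 0xF0 == 0xE0:
        return 3
    if lead & 0xF8 == 0xF0:
        return 4
    return 0

def clean_to_ascii(data: bytes) -> str:
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII passes through
            out.append(chr(b))
            i += 1
            continue
        n = sequence_length(b)
        ch = None
        if n:
            try:
                # Look ahead: do the next n-1 bytes complete the sequence?
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                ch = None
        if ch is not None:
            i += n                        # consumed a whole UTF-8 sequence
        else:
            # Isolated 8-bit byte: interpret it as WinLatin1 (assumption).
            ch = bytes([b]).decode("cp1252", errors="replace")
            i += 1
        out.append(SUBSTITUTES.get(ch, ""))
    return "".join(out)

print(clean_to_ascii(b"Acme\xe2\x84\xa2 Widget\x99 \xae"))
# Acme(TM) Widget(TM) (R)
```

The same trademark sign arrives once as a three-byte UTF-8 sequence and once as the single WinLatin1 byte 0x99, and both end up as the same ASCII substitute.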

If one wanted to extract the Unicode sequence numbers that fall above the one-byte range from a UTF-8 file, reverse computation would be required, since those Unicode sequence numbers are encoded.
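The reverse computation amounts to stripping the marker bits from each byte and gluing the remaining bits together. Here is a minimal Python sketch; the function name decode_code_point is my own, and for brevity it does not check that the number of data bytes matches what the signifier announced.

```python
def decode_code_point(seq: bytes) -> int:
    """Recover the Unicode number from one UTF-8 multi-byte sequence."""
    lead = seq[0]
    if lead & 0xE0 == 0xC0:      # 110xxxxx: keep the low 5 bits
        value = lead & 0x1F
    elif lead & 0xF0 == 0xE0:    # 1110xxxx: keep the low 4 bits
        value = lead & 0x0F
    elif lead & 0xF8 == 0xF0:    # 11110xxx: keep the low 3 bits
        value = lead & 0x07
    else:
        raise ValueError("not a multi-byte signifier")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:     # data bytes must look like 10xxxxxx
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (b & 0x3F)   # append 6 more bits
    return value

print(f"U+{decode_code_point(bytes([0xE6, 0x97, 0xA5])):04X}")  # U+65E5
```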

Examples

Working UTF-8 Sequences

Not-equals sign: 0xE2 0x89 0xA0 (Unicode char U+2260)

Korean text: 0xED 0x95 0x9C 0xEA 0xB5 0xAD 0xEC 0x96 0xB4 (Unicode chars U+D55C U+AD6D U+C5B4)

Japanese text: 0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E (Unicode chars U+65E5 U+672C U+8A9E)

Working ISO 8859-1 (Latin1) Sequences

0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, æ, Ø, Æ)

Mangled Sequences

0x31 0x32 0x33 0x99 (1,2,3, winlatin1 trademark)

0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, three Latin1 chars, one winlatin1 char)

0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E 0xE6 0xE8 0x31 (UTF-8 Japanese text followed by misused signifiers)

How Do You Detect a File's Encoding?

None of the character tables discussed so far (with a possible exception of UTF-8, discussed below) have any convention for a self-identifying start block or header. It has to be done based on some sort of pattern. In Unix, the 'file' utility (or 'type' on some versions of *nix) is a pretty good tool for identifying the character table type. Here's the pattern that 'file' seems to use:

When:	Unix 'file' Reports:
All characters are 7 bit	"ASCII text"
ISO 8859 8-bit characters (with or without 7-bit ascii mixed in)	"ISO-8859 text"
Any use of bytes in the ISO-8859 forbidden range (0x80-0x9F) which aren't part of valid UTF-8 sequences	"Non-ISO extended-ASCII text". (This is how WinLatin1 files will be identified.)
Consistent correct use of multi-byte UTF-8 sequences	"UTF-8 Unicode text"

How 'file' handles some edge cases:

When:	Unix 'file' Reports:
All 7-bit, but containing odd combinations of control characters from the 0x00 to 0x1F range	"Data"
Correct UTF-8 code mixed with any 8-bit ISO 8859 characters, forbidden or non forbidden	"Non-ISO extended-ASCII text"

What 'file' cannot do is read your intentions. It makes the simplest judgment call it can. For example:

A file of all 7-bit characters is always "ASCII text"...never mind that you think of the file as being UTF-8 or ISO 8859 compliant (both of which are also true).
ISO 8859 detection is based on byte values alone, not syntactic patterns; it can't tell whether you want it rendered as Latin1, Baltic, Cyrillic, etc. It reports ISO 8859, not ISO 8859-1, or ISO-8859-2, etc. Telling them apart would take some fairly elaborate dictionary lookups from multiple languages to accomplish.
Any UTF-8 multi-byte sequence not containing bytes 0x80 to 0x9F is hypothetically correct ISO 8859. Because such combinations are rare in real ISO 8859 text, 'file' chooses to identify this as UTF-8 (assuming everything else about the file is UTF-8 compliant). It cannot read your mind that you intended a series of weird ISO 8859 characters. For example you might want the sequence 0xEA 0xB5 0xAD to read êµ followed by a soft hyphen in Latin1, but since it is a valid UTF-8 sequence (국, U+AD6D, the middle character of the Korean example above), 'file' will report it as "UTF-8 Unicode text" if everything else about the file is consistent with UTF-8.
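The pattern in the tables above can be approximated in a few lines of Python. To be clear, this is my own reconstruction of the observed behavior, not the actual logic inside 'file'; the function name guess_encoding is hypothetical, and the "Data" edge case for odd control characters is not handled.

```python
def guess_encoding(data: bytes) -> str:
    """Rough imitation of the labels 'file' appears to hand out."""
    if all(b < 0x80 for b in data):
        return "ASCII text"
    try:
        data.decode("utf-8")              # strict: any bad sequence raises
        return "UTF-8 Unicode text"
    except UnicodeDecodeError:
        pass
    if any(0x80 <= b <= 0x9F for b in data):
        return "Non-ISO extended-ASCII text"
    return "ISO-8859 text"

print(guess_encoding(b"123"))               # ASCII text
print(guess_encoding(b"123\xe6\xd8\xc6"))   # ISO-8859 text
print(guess_encoding(b"123\x99"))           # Non-ISO extended-ASCII text
print(guess_encoding(b"\xea\xb5\xad"))      # UTF-8 Unicode text
```

The last line is the ambiguous case just described: the bytes are legal Latin1, but the UTF-8 reading wins.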

Regarding the potential for confusion between ISO 8859 and UTF-8, this is what others have to say:

"the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length" (UTF-8 RFC).
"The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII [sic] is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence." (Wikipedia UTF-8 page)

If you're dealing with a page you suspect is not rendering with the correct ISO 8859 subtable, both Firefox and Safari give you the option to change the interpretation. In Safari, check the View menu, Text Encoding. In Firefox, check the View menu, Character Encoding.

BOM: A UTF-8 File Format?

There is a convention called the "Byte Order Mark" (aka BOM) which (hypothetically) can be used to self-identify a UTF-8 file. This consists of beginning a file with the sequence 0xEF 0xBB 0xBF. Both Unix 'file' and 'vi' seem to be pretty good at interpreting it. When a file is pure 7-bit ASCII, but preceded by the BOM, it forces 'file' to report it as "UTF-8 Unicode text". An interesting edge case: if a file containing ISO 8859 8-bit characters that cannot be interpreted as UTF-8 multi-byte sequences is preceded by the BOM, neither Unix 'file' nor 'vi' can be fooled: 'file' will report "ISO-8859 text" (treating the three BOM bytes as ordinary ISO 8859 characters), and 'vi' will display the BOM characters (as ï»¿).
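Checking for the BOM yourself takes one line. The Python sketch below uses the standard codecs.BOM_UTF8 constant and the "utf-8-sig" codec, which strips the mark while decoding; the helper name has_utf8_bom is my own.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the bytes 0xEF 0xBB 0xBF."""
    return data.startswith(codecs.BOM_UTF8)

sample = codecs.BOM_UTF8 + b"plain ascii"

print(has_utf8_bom(sample))               # True
print(sample.decode("utf-8-sig"))         # plain ascii (BOM removed)
print(codecs.BOM_UTF8.decode("latin-1"))  # the three bytes read as Latin1
```

The last line shows where the ï»¿ garbage comes from when a tool reads the BOM as ISO 8859-1.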

Other than BOM, refrain from ever calling ASCII, UTF-8, ISO-8859 or WinLatin1 a "file format." None have a standardized header or starting block. They are more properly called "byte sequence conventions."

Unicode HTML Entities

The entire corpus of Unicode is hypothetically available to browsers. For example the Unicode characters of Japanese that were mentioned earlier, U+65E5 U+672C U+8A9E, should successfully render on your browser when I put &#x65E5 ; &#x672C ; &#x8A9E ; in the HTML. Here it goes: 日本語. These work for me on Firefox. In practice, I believe that no browsers are 100% capable of correctly rendering the entire 2^31 Unicode character set via HTML entities.

Question: does the presence of the foreign characters you see above effectively make this page Unicode or UTF-8 format? No. The characters I used to make them were merely ampersand, pound, x, and digits, all 7-bit ASCII characters. HTML Entities are merely instructions to browsers what to do with the characters.
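You can see the same thing outside a browser. In this small Python illustration, the standard html.unescape function plays the part of the browser: the markup is pure 7-bit ASCII, and only the result of interpreting it contains Japanese characters.

```python
import html

source = "&#x65E5;&#x672C;&#x8A9E;"

# Encoding as ASCII succeeds, so the markup itself is 7-bit clean.
source.encode("ascii")

rendered = html.unescape(source)
print(rendered)  # 日本語
```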

Do we want to use HTML Entities in the cleaned feeds we send to search engines? Do we want to figure out the intent of various WinLatin1 manglings and UTF-8 sequences that come from client feeds and turn them into HTML entities that will render safely on a browser? This is a subject still open to question. Our YSSP search engines can handle them, while our Comparison Shopping Engines (CSE's) do not. Currently we are leaning towards aiming for the lowest common denominator, 7-bit ASCII for all search engines.

WinLatin1's tragic legacy has, unfortunately, been immortalized in HTML entities. HTML entities 128 through 159 will render as their corresponding WinLatin1 characters in many browsers. (Since Unicode entries 128-159 are prohibited, HTML entities are departing from their Unicode basis in this area.) However, you should not consider this reliable; the W3C HTML 4 recommendations exclude these codes from their list.
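Python's standard html.unescape function, which is documented as following the HTML 5 rules, shows the same behavior, so it makes a handy way to check what a lenient consumer will do with these codes. This is just an illustration; it says nothing about what any particular browser or search engine does.

```python
import html

# Numeric entity 153 sits in the prohibited 128-159 range, yet it is
# resolved to the WinLatin1 meaning of byte 0x99, the trademark sign.
print(html.unescape("&#153;"))

# The trademark sign's real Unicode number is 8482 (U+2122).
print(html.unescape("&#8482;"))
```

Both lines print the same ™ character.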

HTML Page Encoding

Can an HTML file be in actual UTF-8 format? "Yes we can." If it contains UTF-8 byte sequences, it gets interpreted by your browser as UTF-8. Here are two nice examples of "UTF-8 Test pages" out there that are just full of exotic Unicode in UTF-8 encoding: Markus Kuhn's sample file, and Frank da Cruz's UTF-8 Sampler. (How well your browser shows the page depends on how good the browser is. For me, Firefox and Safari got most of it, but IE6 failed on large numbers of the characters.)

After you look at one of these pages, try viewing source. What do you see? If your source viewer is UTF-8 capable, you'll see, well, the UTF-8 characters. Surprised? Expected to see the numeric bytes, did you? If you want to see the numeric value of a character under the cursor in vi, press ga. For UTF-8 characters higher than 8 bits you will see large values like 0xFFDD. Or if you want to get really deep into the binary encoding of a file, you could get a binary editor. On Unix, try 'od', or better yet, 'bvi' (you have to download it).

When an HTML page puts this in the <head> block, what does it do?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

For the most part, it simply communicates intent: a tip to the consuming client what to do with the page. It doesn't change the way the bytes go down the wire, or the way your HTML is stored locally on your disk. It's still just bytes. In practice, you can even leave the declaration out and even though the file is chock full of UTF-8 binary sequences, a good browser figures it out anyway. (Note that Frank da Cruz's file above had no meta direction at all.)

How do you compose (i.e. edit and save) an HTML file in UTF-8? Well, if you're using all 7-bit ascii characters you're fine, that's still valid UTF-8. But really we mean UTF-8 with higher level Unicode characters. Well, unless you want to enter the binary directly, you'll need a special editor for that; vi, textpad, etc. won't do it for you. Try Google.

UTF-8 and XML

XML is UTF-8 by default; the following declaration is actually redundant:

<?xml version="1.0" encoding="UTF-8"?>

You'd have to override it with a different value for the encoding attribute to make it non UTF-8. Like the HTML meta tag content charset setting mentioned above, the encoding declaration is essentially a tip to the consumer of the XML what to expect.
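Here is a small Python illustration using the standard library's xml.etree.ElementTree parser; the element name and sample bytes are invented for the example. The same character is supplied once as UTF-8 bytes with no declaration at all, and once as a single Latin1 byte with an overriding declaration.

```python
import xml.etree.ElementTree as ET

# No declaration: the parser assumes UTF-8 (0xC3 0xA6 is the ae ligature).
utf8_doc = b"<name>\xc3\xa6</name>"

# Overriding declaration: the lone byte 0xE6 is read as ISO-8859-1.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>\xe6</name>'

assert ET.fromstring(utf8_doc).text == "\u00e6"
assert ET.fromstring(latin1_doc).text == "\u00e6"
print("both documents yield the same character")
```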

Some Fun Historical Facts

Bob Bemer (1920-2004) is said to be the "father of ASCII". His Ford Explorer's vanity plate read 'ASCII' surrounded by a plate holder reading "Yes! I'm the Father of".

Wikipedia has a picture of one of the earliest published ASCII tables, from 1968.

Ken Thompson, the Unix pioneer, invented UTF-8. "It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat" (source). You can also read the original AT&T UTF-8 Tech Report.

Sunday, April 12, 2009

iCrossing is hiring a Java developer in Chicago

My employer, iCrossing, has opened a search for a new member for the Merchantize team. Here's the description. To apply for the position, visit http://www.iCrossing.com/careers, select U.S. Career Offerings, Jobs by Location, then Jobs in Chicago, IL.

Java Software Engineer (Open Source / Web Analytics / ETL)

JOB DESCRIPTION

We’re a people business.

People are the heart and soul of our company, working every day to make our clients’ marketing programs successful.

At iCrossing, we combine experienced talent with world-class technologies to efficiently create marketing programs that truly perform. With more than 620 professionals in 15 offices in the U.S. and Europe, we are equipped to service the digital marketing needs of large enterprises and growing companies alike.

We’re seeking the talented, the experienced and the exceptional to give our clients the most creative and successful solutions for an ever-changing industry. When we find them, we offer a dynamic working environment, competitive compensation, the opportunity to work on exciting client programs, and occasional bagels.
We are seeking a highly motivated and technically proficient JEE Software Engineer / Software Developer to work on our industry leading and mission critical Paid Media Management (Search Engine Marketing, bid management) product.

Features of the position:
• Work on a high-visibility, high performance product that supports iCrossing’s industry leading SEM practice in a growing and fast moving industry.
• Work closely with all of the major search engines (Google, Yahoo, MSN, Ask, AOL) and their APIs.
• Work in a fast moving and forward thinking development environment that is constantly researching and rapidly implementing the latest technologies.
• Research and participate in the advancement and implementation of open source frameworks and architectures such as SOA/ESB, MapReduce, Grid and Cloud computing, and others.
• Work with an experienced Agile Software Development team in a highly collaborative environment.
• Modern Java Enterprise open source based product stack, Java 6, Spring, Hibernate 3, Webworks/Struts 2, JMS, JUnit, MySQL and more.
• Learn current software development best practices (continuous integration, build automation, test driven development, pair programming, agile estimating and planning, etc)
• Apple MacBook Pro, 24” widescreen monitor, IntelliJ or Eclipse.
• A casual, fun, and creative work environment
Major Job Responsibilities / Accountabilities:
• Write test driven quality code.
• Work closely with your dev team.
• Follow and encourage development best practices.
• Develop knowledge of Search Engine Marketing (SEM) principles and techniques.

Skills/Requirements:
Required Technologies (At least one or more of the following)
• Spring
• Hibernate
• SQL scripts
• Shell Scripting
• Webwork (Struts 2.0)
• Linux / Unix admin
• Junit (required) or TDD (preferred)
• Grid Computing (GridGain preferred)

Bonus Technologies (Preferred any of these)
• MySQL (especially advanced knowledge of replication, storage engines, backup and recovery)
• PERL
• Data warehousing design concepts, ETL
• Mondrian OLAP
• JMS
• Amazon EC2 / S3 / AWS

Knowledge / Skills / Abilities:
• BS in Computer Science or equivalent level of experience
• Understanding and/or appreciation for Agile software development methodologies.
• 1+ yrs of professional development experience.
• Familiarity with source control using Subversion
• Familiarity with IDE tools such as Eclipse or IntelliJ
• Must possess effective interpersonal and communication skills and ability to work successfully in a team environment.
• Good organizational and time-management skills.

Do Not Apply if you:
• Do not know Java
• Have no interest in Agile, TDD or Unit testing
• Are close-minded and don’t want to learn new technologies.
• Are more comfortable working on the same technology you did last year.

*ICROSSING IS NOT ACCEPTING RESUMES FROM STAFFING AGENCY PARTNERS AT THIS TIME. THANK YOU.

Friday, March 27, 2009

Hacking a Linksys NSLU2

I bought a Linksys NSLU2 a while back, which is a low cost (about $99) appliance for turning two USB disk drives into a Network Attached Storage (NAS) system. This lets me set up file storage centrally located on my LAN (as opposed to attaching it to one computer on the LAN and setting up a share).

What's inside is a small Linux computer mounted on a single circuit board. And that's where the fun comes in. As the device's Wikipedia page points out, since the internal Linux is licensed under the GNU General Public License, Linksys was required to release their source code. This has enabled third parties to develop firmware upgrades to the device. One popular upgrade is the Unslung firmware, which, among other things, enables the device to accept telnet connections.

Here is my network cabinet at home. From left is my DSL modem, a 360 GB USB disk drive, the NSLU2, and my Linksys WRT54G Wireless router. If you know this router, you can see by the size that the NSLU2 is not much bigger than a deck of cards.

Like a router, an NSLU2 hooked up to your LAN will have its own web administration page, which is reachable by http://192.168.1.77.

Upgrading to the Unslung firmware went exactly as the directions described it. After restarting, the NSLU2's admin page had a few additions, as you can see in the screen grab below. It added an unslung logo on the upper left, and a "Manage Telnet" link on the right. Once I enabled telnet I was able to log in and get a prompt by telnetting to 192.168.1.77.

A short tour of what is inside the box is shown in the telnet session below. There's cool stuff inside! A nice handful of basic unix commands, the web server that serves up the admin site (above), and even a wget command (which I demonstrate by getting the yahoo.com homepage HTML).

The next thing I wanted to do is to "de-underclock" the device. The CPU is 266 MHz but for unknown reasons, Linksys clocked it down by half with a tiny little resistor. Here's the board:

Simply removing the resistor clocks it up to 266 MHz. Following the helpful instructions on the Unslung site, I geared myself up with some needle nose pliers, a static wrist guard and gloves, a magnifying glass and a geeky miner's light (see pic at left). All I had to do was get a grip on that tiny, tiny resistor (about 1/4 the size of a grain of rice)...and I crunched it! After all, I wasn't going to need it again.

When I put the card back inside the case, reconnected it and restarted it...it worked! And I got the proof that the clock speed doubled in my telnet session:

Whoa, it says it is 2.22 MHz short of 266 MHz, I wonder why? Such are the mysteries of computer hardware.

Tuesday, March 17, 2009

Apple iPhone OS 3.0 Announcement summary

Here are highlights of what was announced for the iPhone OS 3.0 release early this afternoon from Apple:

Cut and paste: it was worth the wait, the touch interaction to do this looks very cool (see picture at right). Works across applications and does undo.
Multimedia messaging: you can attach a picture to a text message
Ability to choose a group of photos and send them in a single email
Push email notification
Landscape mode text entry (so what)
Turn-by-turn GPS navigation.
Available in the summer. That's as detailed as it gets. No doubt will be linked to the new iPhone model coming out in July.
Virtually all of the new features will work with the original, pre-3G iPhone (exceptions: multimedia messaging and stereo bluetooth)
Peer-to-peer linkups between individual iPhones for games, file sharing, etc. This exists now with things like AirSharing and Holdem, but those companies probably rolled their own; now it's part of the API
API support for applications that connect to external devices. Demonstrations with medical devices were given (see pic at right). Medical applications of new technology are always a big win in corporate presentations; the real news here is that this will open up remoting of all sorts of sophisticated devices for music, video, information systems, anything you can imagine.
Ability to search in your emails on the server side, and search in your calendar items
Search your iPhone contents with Spotlight (well-known to Mac users)
The Sims 3 will run on the iPhone (see pic at right)

Monday, March 16, 2009

iPhone SDK Presentation at CJUG 2/17/09

Our local Java User's Group chapter, CJUG, recently hosted a presentation by Rakesh Vidyadharan titled iPhone SDK: Java Developers Perspective (link to PDF). It wasn't so much an immersion in iPhone development per se as an introduction to development in Objective C.

Some of the things I found interesting:

The SDK requires an Intel-based Mac
The MVC approach is pretty much baked into the framework. Not everyone likes that.
Development with an emulator is a breeze, but pushing an app to a real iPhone is time consuming
Getting apps considered for inclusion in the app store is well-documented, but a convoluted process

Some things about Objective C:

Very similar to TCL scripting
Weak typing and dynamically-bound variables like javascript, ruby and php
There are no namespaces or packages, which means every class has to have a completely unique name. To group related names, developers adopt prefixes, like CUreader, CUwriter, CUcreator, etc. And the language has several prefixes reserved for the language core. For example, you can't define any classes or variables beginning with NS, IB, or UI.
Parameter names are part of the signature of a method. For example, foo(first_name: "Greg", last_name: "Sandell")
The code is visibly very different from Java, or even C and C++. Many lines start with a plus or minus sign.

Object-Oriented characteristics of Objective C:

Much less a "real" Object-Oriented language than C++
Objects don't automatically inherit a base Object as in java. You have to explicitly extend NSObject
Objects automatically have setters and getters, like ruby
Like C, you completely manage your own memory. OS X since Tiger has a garbage collector, but Objective C doesn't use it
Memory is managed by an incremental counting approach called refcount. Each alloc increments refcount, each release decrements it
Dealloc is like finalize in java
Messaging is a big part of the language. For example, methods are invoked via messages.
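For Java people who have never counted references by hand, here is a toy model of the refcount idea. It is written in Python purely as an illustration, it is not Objective C, and the class and method names are stand-ins I picked to echo the terms from the talk.

```python
class Counted:
    """Toy model of manual reference counting (not real Objective C)."""

    def __init__(self):
        self.refcount = 1         # creating the object gives one reference
        self.deallocated = False

    def retain(self):
        self.refcount += 1        # another owner wants to keep the object
        return self

    def release(self):
        self.refcount -= 1        # one owner is done with it
        if self.refcount == 0:
            self.dealloc()        # last reference gone: clean up

    def dealloc(self):
        self.deallocated = True   # plays the role of finalize in Java

obj = Counted()
obj.retain()
obj.release()
print(obj.refcount, obj.deallocated)  # 1 False
obj.release()
print(obj.refcount, obj.deallocated)  # 0 True
```

Forgetting a release leaks the object, and one release too many frees it while someone still needs it, which is why this style takes discipline.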

Friday, October 24, 2008

CloudCamp Chicago meeting Oct 21, 2008

Last Tuesday (Oct 21, 2008) was the Cloud Computing one-night conference "CloudCamp" hosted by Tech Cocktail. My company has some really time consuming web analytics tasks that take days to run, and we're exploring using GridGain to distribute the work over several servers, so it was a good chance for me to get an acquaintance with this field.

I've included a few photos from the event, that come from the conference's Flickr photogallery.

The meeting was alright. I didn't see anyone I knew there but had reasonably enjoyable small talk with mostly non-technical people during the drinking hour.

It wasn't a conference in any traditional sense of the word. There were no scheduled topics and no scheduled speakers. They did use the format used by O'Reilly's "Foo Camps". They have a grid of sessions and meeting rooms on a whiteboard. All the squares are empty. Then they ask everybody who has a topic they are interested in to write it in a square on the board. Presto...you are now the moderator of that session.

I volunteered the topic "Software Engineering and Grid Computing". Eight really smart people showed up, including two physicists from Italy, two doctoral students, a guy from UBS, and a consultant from CohesiveFT, a Chicago company specializing in cloud computing.

Physics people have been doing grid computing for years, so they were levels above me. But interestingly, a lot of their problems have to do with resource sharing. There can be other research teams that also want to use the grid, and maybe they don't want the nodes installed with the same software you do, and the people with the biggest grants tend to win out.

The most interesting guy there was the consultant from CohesiveFT, Pat Kerpan. He had two pieces of memorable wisdom. (1) Rule of thumb: count on a 30% performance penalty from the overhead of grid-enabling your problem. (2) It's easier to bring the computing to the data than to bring the data to the computing.

He talked about the stuff their company uses for their clients called "Open Source Sun Hypervisor". This has an interface that allows you to trick out your nodes with whatever setup you want (e.g. pick and choose between java, tomcat, flavors of linux, struts, etc.) and get a multinode environment all set up in six minutes.

Several of the people spoke knowingly of "paravirtualization". Pat distinguished between problems that are "compute bound" vs. "data bound".

A few people referred to Hadoop. No one had ever heard of GridGain, but I don't think that Java development was strongly represented in that collection of people.

People have different aims in cloud computing. A lot of people don't mind if many virtual nodes are spread over one machine.

Virtualization was recommended as a convention even when you are doing one node per machine.

In many commercial applications, 4 virtual nodes per machine is typical.

Many people responded to my description of what we are trying to do at iCrossing with "Why don't you just use Amazon's cloud computing?" To hear them describe it, Amazon gives you the flexibility to do whatever you want.

I could have attended some of the other sessions if I wanted to stay two more hours, but I split after mine. The other sessions were on pretty soft- or business-focussed topics. One guy led a session called “What color is your cloud?” There were two Microsoft people who found each other and made their own Microsoft-focussed session ("Cloud Computing in Windows 8 and SQL Server").

Thursday, October 23, 2008

Recruiter tips: finding software developers

Recently a friend from a company outside of Chicago asked me for some advice on how they should go about hiring a Java developer. I found myself offering advice on screening techniques for technical people and how not to blow it, and thinking, damn, these are things that professional recruiters should hear.

Despite this economy, the market for programmers seems very hot right now. (Seller's market.) I get contacted by a lot of recruiters all the time. When I'm actually on the job market, I get more of these cold-call emails/phonecalls than I can possibly consider returning. So unfortunately I have to judge from their voicemail or email whether they are worth my time or not. (First impression is a big deal.) Many recruiters send bad signals: the job description doesn't make sense (they want strange mixtures of technical and business skills), or there's too much detail on company operations and history at the expense of core details (e.g., what the software development environment is like, whether it's Windows vs Linux vs Mac, what webserver they use, etc.), or they want you to complete some coding test before you even get a how-do-you-do. When recruiters get the message wrong like this, the more senior and experienced people know to stay away, and you get more junior or unqualified people who are willing to go along...and you waste a lot of time discovering that, or even end up making a bad hiring choice.

So here's my advice. Before you contact any software developer, get a complete job description. Keep the description of the nature of the company's business to no more than 1/3 of the description; the rest should all be technical specifics of the core programming responsibilities. Make sure that part is written by, or at least in collaboration with, a person qualified to manage the new hire. Get coached by that person on how to describe the position over the phone. When you find a person to contact, either get the description into their hands before the phone call, or try to send it to their email address during the phone call. On the phone, don't try to say more than you know. Don't bore the candidate with the history of your recruiting firm; trust me, all recruiting firms' histories and missions sound completely identical. Don't ask the candidate to read their resume over the phone to you; do your due diligence and show them that you've digested it already. If you meet the candidate first, don't insult them by making the meeting be about nothing. The recruiters who did the best work for me never asked me to meet them first. Hope that helps!

Wednesday, September 24, 2008

Job hunter tips: questions to ask employers

If you're a web software developer and in the market for a job right now, this list of questions to ask your potential employer could be helpful:

http://docs.google.com/Doc?id=dcntd65k_129dshbbncs

Now it's not a perfect list, since it's obviously skewed towards an open-source/Java enterprise web developer. But if you find it useful, here it is! Good luck.

Monday, April 28, 2008

iPodia - iPhone-optimized WikiPedia

I just replaced the Wikipedia Safari link on my iPhone with a link to iPodia. It will bring up a Wikipedia gateway that renders everything in iPhone-friendly output (see the pic of the home page). It's not perfect...it loads kind of slowly and it interprets input a little differently than the real Wikipedia...but definitely very useful.

Another one I just found is Wapedia.

While I was searching for this, I also found an iPhone app called Geopedia that will use the internal GPS to bring up Wikipedia references to items that are currently geographically close to you.

Saturday, April 26, 2008

Becoming a Google convert

I've just "got religion" on Google Docs, and I think I'm going to go whole hog into Google photos, Google spreadsheets and Blogspot now. I am dead sick of the hassle it has been to manage my own webserver & Movable Type server for my blogs OnTheCode, Atom Smashing, White Hot Thing, and Unsuspecting Air Molecules. I'm turning it over to Google, and here's my first post!

Friday, November 2, 2007

The Zune, Surprisingly Enjoyable

I bought a Zune off of woot.com at a huge discount and tried it out. It's really a nice little box that compares favorably with the iPod.

The prices of Zunes have dropped to the point where I couldn't resist buying a 30 gig refurbished model off of woot.com the other day. (My 2004 vintage iPod is only 10 gig.) They've always seemed like attractive little devices to me, and lately the buzz has been that Microsoft did a pretty nice job with the latest software release.

Packaging and Out-of-Box Experience

Ever seen the parody of what Microsoft would do to the iPod packaging if their marketing had a chance? (I think it is here on YouTube.) Word has it that Microsoft itself made that parody. Well, that bit of self-knowledge shows in the packaging for the Zune, which for Microsoft is surprisingly subtle, artsy and tasteful (except maybe for the silly "Come to the Social" slogan that appears on the inside). The Zune itself is in the center of the box, and the headphones, USB cable and CD-ROM are tucked away in clever little compartments of their own.

Hardware and Casing

It comes with no separate AC cord/plug...you use the USB cable for recharging.

Of course it has a big color display which you have to see to appreciate. When it shows pictures or movies, it flips to portrait mode, and the axes of the control buttons switch too (nice). There are also many ways in which the display shows multiple dimensions of information at once (see my description of scrolling below), or overlays one set of information with another (like putting track information on top of the album cover). Very attractive.

It's lighter weight and has less heft than the iPod. The audio quality seems like it might be a notch better than my iPod, although in fairness my iPod is not a very recent model.

I deliberately chose the brown model. Although woot.com seems to think brown is absurdly ugly (and they slashed $20 off their price recently if you chose a brown one), I think the design is quite beautiful. The brown plastic is translucent and has green trim that is differently visible depending on how you hold it. As their marketing material shows, it is quite nice looking against a natural background (e.g. grainy woods).

Controls

The buttons pretty much replicate what an iPod has. There is a pause button that also turns the unit off (although you have to hold it down much longer than on an iPod). A left-arrow button replicates the Menu button on the iPod. There is a large center button that resembles the iPod wheel. The left, right, up, down and center parts of the wheel are clickable like the iPod, but there is no wheel control itself (which must be patented on the iPod?).

While you are in track mode, right/left are forward/back one track, or fast forward/reverse when held down (but you can't hear the music in fast mode); up/down change the volume. Center button gives you more detail on the song and gives you access to ratings. It's nice that control over volume, track change and track location are always there for you, as opposed to the three-way mode on the iPod. On the down side, setting a rating (if you're into that kind of thing) is more of a bother, requiring more clicks than the iPod.

In other modes, the wheel-like button provides customary click-to-select, up/down, left/right navigation through the various menus.

At first I thought the lack of a wheel control would mean slower, more tedious navigation through a long list of songs or artists. Not so. The longer you hold the button up or down, the faster it goes (just like twirling the iPod wheel) and...here's a brilliant creative stroke...a large display shows the first letter of the item at the current scroll position, so you see the alphabet flying past as you scroll. It's a bit hard to describe. But I find it much easier to use than the iPod's scroll.

The on/off functionality of the pause button has slightly different behavior from the iPod: one click turns the device back on in a neutral state, while a second click resumes to the track you were last playing before you shut down.

Pressing the next or previous buttons to change songs changes the audio right away, but leaves the information of the previous song on the screen for a few seconds. This seemed strange at first, but I appreciate it now as a smart feature: usually you change tracks after your player has been idle for a while, and therefore the screen is unilluminated; this way, you see where you were at first before switching.

The battery charge icon, when being charged, not only shows a "now charging" animation, but it shows the amount of charge currently attained as well; I haven't seen that on other portable devices.

The Zune gives you control over randomizing the same way as the iPod (randomize song or artist or album, etc.) but it improves on one thing that drives me crazy about the iPod. When you start a playlist, the iPod always picks the same starting song to randomize on; the Zune always picks a new first song.

Synchronizing and the Zune Client App

The Zune client app is not as easy to use as iTunes, or at least it seems that way to me so far. It has the ability to distinguish between songs that are in your "library" (i.e. the sum total of all your music on your hard drives) and what you actually put on your Zune, but controlling that is somewhat mysterious. With iTunes, adding new tracks to your library or the iPod defaults to being essentially the same thing; with the Zune, you add to the library by dragging files to one place, and add to the Zune by dragging to somewhere else. When something is on your Zune, and you don't want it there any more, exactly how you're supposed to remove it is also obscure. I tried to remove some tracks several times in a row...they appeared to go away at first, but then they came right back again. Eventually through trial and error I got the tracks removed, but the app is definitely lacking in intuitiveness.

Synchronization speed seems strangely slow compared to the iPod, even when I've confirmed that USB 2 is in use. The Zune app provides more information during sync, in the form of animated progress bars for each song, which is visually cool. But it made me wonder if iTunes batches its songs to sync, and Zune is slower because it does each one in sequence.

On the very first sync, the Zune copied my playlists from my iTunes...obviously it snuck into iTunes' external XML file to do that...which was kind of a nice way for me to bridge my listening habits from iPod to Zune. But for this to work, you have to allow Zune to process your entire hard drive(s) of music into its Zune library, which is amazingly slow.

iTunes provides nice Finder or Windows Explorer style navigation panes and sorting capabilities. The Zune app seems to offer less information. For example, I can't see any way to sort by ratings.

Speaking of ratings, there are five stars available like the iPod, but only the iPod offers a "no stars" setting. So effectively the Zune has one less rating possibility than the iPod. For a new track, the Zune defaults to a three star rating, whereas the iPod defaults to no stars. I happen to be particularly attached to ratings...I use them to manage which tracks I want to eventually remove from my iPod...so I'm not excited about having to re-tool my habits for the Zune.

Radio and Wireless

The onboard FM radio is a nice touch that gives your player extra possibilities. I've noticed that when I show my Zune to people who are frequent radio listeners, they immediately begin coveting the Zune.

I can't quite imagine what Microsoft had in mind when they put wireless capabilities on the Zune. All it lets you do is (1) find if there are other Zunes near you and (2) share songs with them. Their slogan "Come to the Social" seems to imply that they envision people gathering at Starbucks and squirting songs to each other...uh, right. It seems to me that they should have put texting ability on there as well; that way you could actually make friends with your Zune.

Am I Joining the Other Team?

The Zune is just about nice enough for me to make it my fulltime player. Certainly the cheap price, the 30 gig capacity and the nifty visual features are a temptation over my older 10 gig iPod. But when all is said and done, iTunes is a better app and makes more sense to me. I may use this experience as an excuse to upgrade to a better, larger-capacity iPod. But I wouldn't have any trouble recommending the Zune as a player of choice to someone who wasn't habituated to iTunes.

Friday, October 19, 2007

The SHARC Timbre Dataset v. 2.0: XML Format

SHARC is a dataset of musical timbre information that I collected by analyzing over 1300 orchestral musical instrument notes. Specifically, the information is amplitude and phase data from a selected steady-state portion of each note. The dataset is now available in XML format.

Some time ago, when I was a grad student, and while holding various fellowships after I got my PhD, I did research in music, human hearing and digital audio (see my publications). One of the projects I undertook was to compile a collection of information on musical instrument tones, which I called SHARC ("Sandell Harmonic Archive").

I've described SHARC in a few places before: in an article from 1991, and in the release notes from the original distribution. Briefly, though, what I did was this. I had a collection of CDs consisting of individually performed notes of all the standard instruments of the orchestra, one recording for each note in the respective instrument's playable range. For each note, I chose a middle portion of the recording, during the note's steady state, and performed a spectrum analysis. I saved the amplitudes and phases of all the harmonics of the pitch's fundamental frequency up to a ceiling of 10kHz.

In my first version of the distribution (which you can still download in compressed tar format), SHARC consisted of a series of files, one for each note that was analyzed, organized into directories by instrument. That was 1994; since then, XML has come into being and I've now released SHARC in an XML format.

I'm calling this SHARC's "2.0" release, and back-versioning the original distribution to "1.0" (even though I timidly referred to it as version 0.921 at the time). In this blog article, I'll describe the design of this 2.0 version, for the convenience of anyone who would like to work with it.

Let's consider the XML that specifies a single instrument and all of its notes, and their harmonics. The rough outline of the XML is:


<instrument>
  <note>   <!-- first note -->
    <a/>   <!-- harmonic 1 -->
    <a/>   <!-- harmonic 2 -->
    ...etc...
  </note>
  <note>   <!-- second note -->
    <a/>
    <a/>
    ...etc...
  </note>
...etc...
</instrument>

The <instrument> element has the following attributes:

id: the instrument's short name, containing no spaces, suitable for variable names and querystring parameters
name: the instrument's longer, more descriptive name
source: the CD collection from which the tone originated
cd: the volume number of the CD within that collection
track: the track on the cd
numNotes: the number of notes for this instrument

Here is a sample <instrument> element:


<instrument
    id="CB_pizz" name="Contrabass (pizzicato)"
    source="McGill" cd="1" track="18"
    numNotes="41">

The <note> element has the following attributes:

pitch: the note's pitch and octave number, e.g. c4 = middle C. Sharps are specified with the letter s, e.g. 'fs4' rather than 'f#4'.
seq: the sequential order number of the note in the series (i.e. starting at 1 with the first note)
keyNum: numerical location of the pitch on a piano keyboard, where middle C = 48
fundHz: the frequency of the note's fundamental (e.g. a4 = 440)
numHarms: the number of harmonics (i.e. the number of <a> elements to follow)

Here is a sample note element:


<note pitch="cs1" seq="2" keyNum="13"
    fundHz="34.648" numHarms="287">
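
The pitch, keyNum and fundHz attributes are related by simple arithmetic, which the following Python sketch illustrates. It assumes 12-tone equal temperament with a4 = 440 Hz (an assumption on my part here, though it reproduces the sample values shown in this article), and it assumes single-digit octave numbers. The function names key_num and fund_hz are made up for this example and are not part of SHARC.

```python
# Pitch-class names in the SHARC spelling, where "s" means sharp.
PITCH_CLASSES = ["c", "cs", "d", "ds", "e", "f", "fs", "g", "gs", "a", "as", "b"]


def key_num(pitch):
    """Convert a name such as 'cs1' to its keyNum (middle C, 'c4', is 48)."""
    name, octave = pitch[:-1], int(pitch[-1])   # assumes a one-digit octave
    return 12 * octave + PITCH_CLASSES.index(name)


def fund_hz(pitch, a4=440.0):
    """Equal-tempered fundamental frequency, tuned so that a4 = 440 Hz."""
    semitones_from_a4 = key_num(pitch) - key_num("a4")
    return a4 * 2 ** (semitones_from_a4 / 12)


for p in ("c1", "cs1", "c4", "a4"):
    print(p, key_num(p), round(fund_hz(p), 3))
```

For the sample note above, key_num("cs1") gives 13 and fund_hz("cs1") comes out within a thousandth of the 34.648 stored in the fundHz attribute.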

Finally we have the harmonic data itself, contained in the <a> element. The harmonic amplitude value is the text node of the element, expressed as a linear value (i.e. not in dB). The attributes for the <a> element are:

n: the sequential order number of the harmonic in the series (i.e. starting at 1 with the first harmonic)
p: phase, expressed in the range between negative and positive pi

Here is a sample sequence of a few <a> elements:


  <a n="1" p="-1.686">32.91</a>
  <a n="2" p="0.309">2131.69</a>
  <a n="3" p="1.764">5878.0</a>

Using the brief names 'n' and 'p' keeps the size of the XML document lower. For similar reasons, the frequency of each harmonic is not given. To obtain the frequency of a harmonic, you simply multiply the value of its 'n' attribute by the value of the 'fundHz' attribute of the enclosing 'note' element.
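
Here is a minimal Python sketch of that multiplication, using the standard library's xml.etree.ElementTree. The SAMPLE string is stitched together from the snippets in this article purely for illustration (numHarms is cut down to 3, and the three <a> values are not necessarily the real harmonics of cs1); the function name harmonics is my own.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: a note element trimmed to three harmonics.
SAMPLE = """
<note pitch="cs1" seq="2" keyNum="13" fundHz="34.648" numHarms="3">
  <a n="1" p="-1.686">32.91</a>
  <a n="2" p="0.309">2131.69</a>
  <a n="3" p="1.764">5878.0</a>
</note>
"""


def harmonics(note):
    """Return (frequency_hz, amplitude, phase) for each <a> child of a note."""
    fund = float(note.get("fundHz"))
    return [
        (int(a.get("n")) * fund, float(a.text), float(a.get("p")))
        for a in note.findall("a")
    ]


note = ET.fromstring(SAMPLE)
for freq, amp, phase in harmonics(note):
    print(f"{freq:9.3f} Hz  amplitude {amp:9.2f}  phase {phase:6.3f}")
```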

As I said, that is a rough sketch of the XML; to simplify the explanation I left out some of the detail. In addition to what I have discussed so far, each instrument element, and each note element, has a child element <ranges> which contains useful metadata. Here is a sample <ranges> element for an <instrument> element:


<ranges>
  <lowest>
      <harmonicFreq harmNum="1" keyNum="12"
           pitch="c1">32.7
      </harmonicFreq>
      <pitch
         fundHz="32.7" keyNum="12">c1</pitch>
      <amplitude freqHz="8449.15" keyNum="22"
         pitch="as1" fundHz="58.27"
         harmNum="145">0.0</amplitude>
  </lowest>
  <highest>
      <pitch fundHz="349.22"
          keyNum="53">f4</pitch>
      <harmonicFreq harmNum="151" keyNum="25"
         pitch="cs2">10463.69</harmonicFreq>
      <amplitude freqHz="261.62" keyNum="48"
          harmNum="1" pitch="c4"
          fundHz="261.626">15389.0</amplitude>
  </highest>
  <pitches>c1 cs1 d1 ds1 e1 f1 fs1 gs1 a1 as1
     b1 c2 cs2 d2 ds2 e2 f2 fs2 g2 gs2 a2 as2
     b2 c3 cs3 d3 ds3 e3 f3 fs3 g3 gs3 a3 as3
     b3 c4 cs4 d4 ds4 e4 f4
  </pitches>
</ranges>

The logic behind the <ranges> element is mostly convenience for applications that will be constructing graphic plots from the data. For example, having the highest and lowest frequency specified here, rather than making it necessary to traverse through the data to find it, makes it easier for a program to set up the minimum and maximum for a graphic plot. The <pitches> element is another convenience that keeps the user from having to issue a thorny xpath query just to get a list of all the instrument's pitches.
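
As a rough sketch of how a plotting program might use this metadata, the Python below reads the axis limits and the pitch list straight from <ranges> without visiting any harmonic data. The SAMPLE document is a cut-down, made-up instrument (the id "demo" and the empty notes are placeholders), and the function names plot_limits and pitch_list are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up miniature instrument, just enough structure for the lookups below.
SAMPLE = """
<instrument id="demo" name="Demo instrument" numNotes="2">
  <ranges>
    <lowest>
      <harmonicFreq harmNum="1" keyNum="12" pitch="c1">32.7
      </harmonicFreq>
    </lowest>
    <highest>
      <harmonicFreq harmNum="151" keyNum="25" pitch="cs2">10463.69</harmonicFreq>
    </highest>
    <pitches>c1 cs1
    </pitches>
  </ranges>
  <note pitch="c1" seq="1" keyNum="12" fundHz="32.7" numHarms="0"/>
  <note pitch="cs1" seq="2" keyNum="13" fundHz="34.648" numHarms="0"/>
</instrument>
"""


def plot_limits(instrument):
    """Lowest and highest harmonic frequency, read from the metadata."""
    lo = float(instrument.findtext("ranges/lowest/harmonicFreq"))
    hi = float(instrument.findtext("ranges/highest/harmonicFreq"))
    return lo, hi


def pitch_list(instrument):
    """All pitches of the instrument, from the whitespace-separated list."""
    return instrument.findtext("ranges/pitches").split()


inst = ET.fromstring(SAMPLE)
print(plot_limits(inst))
print(pitch_list(inst))
# The same list, obtained the long way by visiting every note element:
print([n.get("pitch") for n in inst.findall("note")])
```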

Let's drill down into the details of this <ranges> element. The text node of the ranges/lowest/harmonicFreq element is the lowest frequency of any harmonic in the entire instrument's collection. Obviously, this is always harmonic 1 of the instrument's lowest note. The attributes for harmonicFreq convey this, as well as the pitch (c1) and keyNum (12). The element ranges/lowest/pitch contains the same information, but described in terms of the lowest pitch and its fundamental frequency. This redundancy has little impact since it occurs just once for the instrument. Information about the lowest amplitude harmonic to be found in the instrument is given in the ranges/lowest/amplitude element. For the instrument in question, this honor goes to a#1 (keyNum of 22, fundamental frequency of 58.27 Hz), 10 semitones above the instrument's lowest note, the 145th harmonic (frequency of 8449.15 Hz).

The ranges/highest element provides equivalent data for the highest harmonic frequency, highest pitch and highest amplitude.

Here is a sample <ranges> element for a <note> element:


<ranges>
  <lowest>
      <amplitude freqHz="6475.19"
          harmNum="198">0.0</amplitude>
      <harmonicFreq
           harmNum="1">32.7</harmonicFreq>
  </lowest>
  <highest>
      <amplitude freqHz="98.1"
          harmNum="3">2335.0</amplitude>
      <harmonicFreq
          harmNum="303">9909.0</harmonicFreq>
  </highest>
</ranges>

This element provides data similar to instrument/ranges, but in terms of the highest/lowest frequency and amplitude harmonics for the note in question.

The XML was designed so that the entire SHARC dataset could be combined into a single XML file (i.e. as a series of instrument elements), and this file is in fact available for download in zip format. However, this file is quite large (nearly 3 MB), which will put quite a burden on parsers, and especially DOM parsers. For more efficient processing, I have placed each instrument into its own dataset file.
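
If you do want to work with the combined file, a streaming parse avoids holding the whole document in memory. The Python sketch below uses ElementTree's iterparse and discards each instrument once it has been processed. The SAMPLE bytes, the wrapper element name "root", the instrument ids and the function name peak_amplitudes are all invented for the example; the code does not depend on what the real wrapper element is called.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the combined file: a series of instrument elements.
SAMPLE = b"""<root>
  <instrument id="inst_a" numNotes="1">
    <note pitch="c1" fundHz="32.7" numHarms="2">
      <a n="1" p="0.0">10.0</a>
      <a n="2" p="0.5">30.0</a>
    </note>
  </instrument>
  <instrument id="inst_b" numNotes="1">
    <note pitch="c4" fundHz="261.626" numHarms="1">
      <a n="1" p="-1.0">5.0</a>
    </note>
  </instrument>
</root>"""


def peak_amplitudes(source):
    """Map each instrument id to its largest harmonic amplitude."""
    peaks = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "instrument":
            amps = [float(a.text) for a in elem.iter("a")]
            peaks[elem.get("id")] = max(amps)
            elem.clear()   # drop this instrument's children to free memory
    return peaks


print(peak_amplitudes(io.BytesIO(SAMPLE)))
```

With a real file you would pass the filename (or an open file) to peak_amplitudes instead of the BytesIO wrapper.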

For a summary, here is a shorthand showing the overall design of the XML, with attributes in parentheses and text nodes in square brackets:


tree

  instrument (id, name, source, cd, track, numNotes)
      ranges
          lowest
              harmonicFreq (harmNum, keyNum, pitch) [frequency]
              pitch (fundHz, keyNum) [pitch]
              amplitude (freqHz, keyNum, pitch, fundHz, harmNum) [amplitude]
          highest
              (all same as lowest)
      note (pitch, seq, keyNum, fundHz, numHarms)
          ranges
              lowest
                  amplitude (freqHz, harmNum) [amplitude]
                  harmonicFreq (harmNum) [frequency]
              highest
                  (all same as lowest)
          a (n, p) [amplitude]

I'm not attached to this particular XML design, and I may come out with a 3.0 version some day. One change I expect to make in a future version is to move a lot of information that is in attributes to elements, which means that more queries would return element nodes that could be further processed. Another idea I have is to make a secondary, "bare bones" release, that would have no metadata, for quicker processing.

Enjoy playing with the data!

Wednesday, August 22, 2007

Syncing a handheld with Vista

I own an iPaq rz1710 running Windows Mobile 2003 Second Edition and I have been syncing it to Outlook data on Win XP for several years now. Before that, in the Win2k days, I had Win CE devices that I synced up. With Vista I couldn't get the Outlook data to sync up...at first. After fighting with it for a while, I finally figured it out.

A few basics: ActiveSync on the desktop has been replaced by "Windows Mobile Device Center" (WMDC), while ActiveSync remains on the handheld. Your Vista machine may already have WMDC, but there is a critical upgrade you need to get from Microsoft (easily found if you search on WMDC). Whether you're installing WMDC or upgrading it, have your handheld connected by USB (or whatever you use) during the upgrade.

After getting WMDC upgraded, my desktop and handheld formed a partnership right away, but the Outlook data was not coming over. I may be a special freak case because I maintain multiple .pst files. My main .pst file actually resides on a jump drive, while the .pst that resides in the conventional location on the C: drive is just an empty shell. I noticed that when I added a new contact on the handheld and then sync'd up, the new contact did show up on my desktop Outlook, but in the C: drive .pst file. That's because Outlook regards the C: location of the .pst as primary and anything else secondary.

So I fixed it by (1) setting the jump drive .pst file as the primary; that wasn't enough to fix it so I also had to (2) delete the reference to the C: drive .pst file from Outlook. Microsoft has you do (1) in a funny way. In Outlook, go to the Tools menu, Email Accounts, View or change existing accounts, then select the preferred .pst file from the dropdown titled "Deliver new email to the following location". Step (2) is done by navigating: Tools -> Options -> Mail Setup -> Data Files, selecting the .pst file whose reference you want to remove, and clicking Remove.

After that I had to shut down WMDC and Outlook, then restart WMDC. Now, when I pressed sync, I saw the little bugger go to work and download all my Outlook contacts, calendar items, notes and tasks to the handheld.

One other annoying thing about the Vista approach to handhelds is that the client software is really split between two areas, the "Sync Center" of the Control Panel, and WMDC itself. A lot of tasks seem to start in the Sync Center, but the real work (such as selecting what items you want sync'd up) is done in WMDC. Microsoft makes it hard for you to launch WMDC from the Sync Center, but here's the secret: go to View Sync Partnerships, then double click on your device.

Good luck!

Wednesday, August 15, 2007

IDEA with Tomcat 6 Integration

If you like running Tomcat from within IDEA and you want to be on Tomcat version 6, you need to stick with Tomcat 6.0.10 for a while. Any later version causes this complaint to come up when you launch Tomcat: "Error running Tomcat6: Cannot find configuration of jsp built-in servlet in C:\Users\greg\.IntelliJIdea60\system\tomcat_Unnamed_7dbqbe5b1\web.xml". I noticed just the other day that Tomcat 6.0.14 came out and I confirmed that this version has the problem too.

Friday, July 27, 2007

Vista Guinea Pig

I just bought myself a Lenovo desktop machine for my home office, and it came with Vista Business. This is the first time I've submitted myself to being a guinea pig for a new, pre-service-pack OS. Here are a few reactions, gripes and maybe even some left-handed praises.

It started out of the box okay, after answering all the usual first-time-start questions of name, timezone, etc. Early on, I started transferring files from one of my USB drives to the new disk, and I was appalled at how slowly it was going. Despite the disk being a 7200 rpm drive, the file transfer seemed to take about four times longer than it would have on my Toshiba laptop, which has a 5400 rpm drive. Eventually I figured out that indexing was turned on for optimizing search and the disk was churning constantly. Since turning it off, file transfers are much more reasonable, although I have yet to try a side-by-side comparison. You can find instructions on disabling indexing on the web.

In the course of loading up my customary developer software, I had to use the Explorer a lot, set environment variables, etc. (Note that whenever I say 'Explorer' I always mean the file browsing app 'Windows Explorer', not the web browser 'Internet Explorer.') The customary alienation that one gets trying to do routine things in a new OS's GUI was running pretty high for me. Like every MS-Windows incarnation before it, if you don't want to blindly follow Microsoft's vision of where your files should be (i.e. "My Documents"), you have to work a lot harder. After a few evenings' work, I know how to get around, in the course of which I learned two disappointments about Vista.

Disappointment 1: Vista is just a big shiny wrapper around MS Windows XP. Once you've dug deep enough, you find that the Explorer does little more than it did before, and all the Control Panel applets offer all the same functionality as before.

Disappointment 2: I'm guessing that the motivation for the Shiny Wrapper came out of a need to "keep up with the Jobses" :-) and give Windows a glassy, 3-d look like the Mac. But the imitation is so shallow and naive. I get the impression that it was designed by people who don't actually "get" the Mac. It's like they made decisions like "the Mac uses shiny red buttons in the lower corners, let's do that and then they'll like us too"...but the end result is an incoherent mess. With the clever GUIs that Apple makes for iPods, Macs, iPhones and the like, you immerse, understand and say Wow. The Vista folks wanted Wow, but all they're going to get is, "Sigh. Why?"

Okay, having gotten that gripe out of the way, I've noticed a few good things. I'm having no trouble loading open source and developer software on the machine. I've got Tomcat 6 with JDK 1.5 running. Ant, Vim, Cygwin, Gimp and Intellij IDEA are fine. I installed all of Office 2003 and so far Word, Excel and Outlook run correctly. But I've had some problems too. My cheap-o Visioneer scanner won't load. A favorite convenience app of mine, Shortcuts Map, will load and run, but I can't close the app without using the task manager.

My user 'home' directory is now c:\Users\greg instead of the old, space-character plagued c:\Documents and Settings\greg. As far as names go, I can see actually using that as my 'home' directory, except that it is filled with the usual junk that is unrelated to what I actually use my computer for: 'My Documents', 'My Music', etc. And not surprisingly, Microsoft still presents it in Explorer as though it's a special entity, like Desktop and My Computer, and not just an ordinary folder, which it is.

Another good thing is that Explorer is now remembering recently used locations. It makes it much faster to get to your stuff that way. Nice to know that Microsoft finally found a way to do something the Mac has been doing for 20 years already.

Back to what I wrote about at the top, the indexing that slowed down the hard drive by a factor of four...I guess Microsoft, showing its usual insecurity over competitors' innovations, figured they needed to make Vista like Google, i.e. searchable. And they bet the farm on it to the point that they hoped that users wouldn't mind spending the first 7 hours of their Vista experience with a disk drive constantly churning and taking away productivity. Can they really be so clueless? Indexing, whether for a 160 gigabyte drive, or a giant corporate website, should be done in the early morning hours, when no one is at work, or at least on dedicated machines. Oh, they could have included some instructions to this effect: "After you finish using your new PC for the day, we suggest that you run Index Manager (tm) and leave your machine on overnight. The next time you use your machine you will find that you can search the entire computer quickly and easily." But I don't think that fits in with Microsoft's estimation of their user base's intelligence.

The conventional wisdom I've read on the net about Vista, and with which I now agree, is: don't be a guinea pig, stick with XP until Vista's first service pack comes out. But if you're buying a new machine, and Vista is forced upon you, and you can afford a few days to re-tool, Vista is fine. You'll just be that much more on top of things when the first service pack comes out and you'll be wanting to switch...because presumably Vista has a bunch of features that we'll be wanting. As I discover what they are, I'll write another blog entry about it.

Friday, July 13, 2007

Maven2 Introduction part 1: the Coordinate System

Maven is gaining traction as the premier way of organizing Java code and managing builds. The entire world won't convert overnight, but adoption is likely to steadily increase once more people get over the learning curve and the conceptual difference from ant scripts, the previous prevailing model of build management. In this article I'm going to try to take a bit of the steepness off of the learning curve for you.

The big sell for Maven is the dependency management and the coherence it brings to both your individual projects, and your code development overall. And by dependency management, I mostly mean where and which jar files you use. For ant users reading this, this is way more than what the "depends" attribute of an ant target gets you. It's a way of:

Avoiding confusion between different versions of the same jars

Maintaining only one copy of the same jar on your computer (instead of having one copy for each of your projects that use it)

Having a mechanism that retrieves jars (and making sure it is the right version as well) from the internet for you without you having to think about where it should be stored

Maven does this by having a highly structured approach to dependencies and, importantly, through adherence to this framework by the community that uses Maven. In this part of the article I'll start by covering the cornerstone of the Maven approach, the "Coordinate System," then we'll move on to Maven repositories.

The Coordinate System: groupId, artifactId and version

At the core of Maven 2 is its method of identifying resources (mostly jar files) by a strictly followed practice of file and directory naming. The goal is similar to that of XML namespaces and the java packaging conventions: to define items as distinct points in space according to a universally followed set of conventions. It's simply these four identifiers:

groupId: usually a reversed domain name such as com.lowagie.

artifactId: a common name for the resource, such as itext.

version: a version indicator such as 1.4. Numbers and decimals are typical, but not required, values for the version.

packaging: the type of end product which could be ear or war, but is most often jar (and therefore the default, so packaging need not be specified)

(A side note about the groupId: there are many jars out there that do not use their organization's reverse domain name. In fact, they comprise some of the most widely used jars out there: log4j, jdom, ant, and xalan to name a few. All they use for their groupId is their simple well-known name (log4j, jdom, ant, and xalan for the examples just mentioned), and their artifactId is the same. These famous jars just happen to have been on the scene during an earlier version of Maven before the convention of using reversed domain names took hold; they held on to their old coordinate locations instead of updating.)

Here is the location of a jar file named itext-1.4.jar in a proper Maven repository. The initial part is chosen by the individual user (c:/.m2/repository) but everything following that is dictated by the coordinate system:

c:/.m2/repository/com/lowagie/itext/1.4/itext-1.4.jar
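The mapping from coordinates to a repository path is mechanical enough to sketch in a few lines. The Python below is only an illustration of the layout just described, not part of Maven itself; the function name repo_path is made up for this example, and it assumes the file extension is the same as the packaging (true for jar, war and ear).

```python
def repo_path(group_id, artifact_id, version, packaging="jar"):
    """Build the repository-relative path for a set of Maven coordinates."""
    # Each dot in the groupId becomes a directory separator.
    group_dirs = group_id.split(".")
    file_name = f"{artifact_id}-{version}.{packaging}"
    return "/".join(group_dirs + [artifact_id, version, file_name])


print(repo_path("com.lowagie", "itext", "1.4"))
# com/lowagie/itext/1.4/itext-1.4.jar
```

Prefix the result with your own repository root (c:/.m2/repository in the example above) to get the full location on disk.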

In the same directory as itext-1.4.jar, you will find the file itext-1.4.pom. This is an XML file containing the following:
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.lowagie</groupId>
  <artifactId>itext</artifactId>
  <packaging>jar</packaging>
  <version>1.4</version>
</project>

In the root directory of a project that uses this jar, you will find a file called pom.xml, and that file will contain the same coordinates, this time wrapped inside a <dependency> element. (A <dependency> does not carry a <modelVersion>, which belongs only to <project>, and it specifies <type> rather than <packaging>; since jar is the default type, it can be left out.)
<dependency>
  <groupId>com.lowagie</groupId>
  <artifactId>itext</artifactId>
  <version>1.4</version>
</dependency>

Retrieval of resources over the internet is integral to maven. A jar can be referred to by its URL on a known repository. The first part of the URL is specific to the repository, whereas the rest follows the file structure of the coordinate system. Here is a URL for the location of a jar at the well-known repository ibiblio:
http://mirrors.ibiblio.org/pub/mirrors/maven2/com/lowagie/itext/1.4/itext-1.4.jar

Now here is a Maven command line statement. Its purpose is to install a jar in a repository, but don't worry about that right now; just notice how the coordinate system manifests itself on a typical command line statement.
mvn install:install-file -DgroupId=com.lowagie \
    -DartifactId=itext \
    -Dversion=1.4 \
    -Dpackaging=jar \
    -Dfile=itext-1.4.jar

Occasionally a point in space is referenced with a single line of text, using the format

groupId:artifactId:packaging:version, as in:

com.lowagie:itext:jar:1.4
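To see how the single-line form relates to the repository URL shown earlier, here is a small Python sketch that splits the string and rebuilds the download location. It is an illustration only: the function name coordinate_url is made up, the base URL is the ibiblio mirror quoted above, and the code assumes the four-part groupId:artifactId:packaging:version form with the packaging doubling as the file extension.

```python
# The ibiblio mirror named earlier in this article.
REPO_BASE = "http://mirrors.ibiblio.org/pub/mirrors/maven2"


def coordinate_url(coordinate, base=REPO_BASE):
    """Turn 'groupId:artifactId:packaging:version' into a download URL."""
    group_id, artifact_id, packaging, version = coordinate.split(":")
    group_path = group_id.replace(".", "/")  # dots become directories
    file_name = f"{artifact_id}-{version}.{packaging}"
    return f"{base}/{group_path}/{artifact_id}/{version}/{file_name}"


print(coordinate_url("com.lowagie:itext:jar:1.4"))
```

The printed URL is the same one given above for itext-1.4.jar.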

These 5 situations show you most of the ways in which jars are referenced in the Maven world. There is a maddening consistency and pervasiveness to the Coordinate System throughout Maven. The more you learn about Maven, the more you discover you've already learned it.

Sunday, October 22, 2006

S-Corps for Software Contractors

Do you prefer 1099 or corp-corp?

Ever been asked that by a recruiter or HR? Did you freeze like a deer in the headlights?

If so, it's probably because you've spent your entire career up until now in Full-Time, 'Permanent' positions, and you're about to do your first real independent contract. In many situations going 1099 is the least costly and least complicated route to take. But, if a client ever insists that you do "corp-to-corp" (as happened to me last year), you'll need to incorporate and take on a few added responsibilities and expenses.

If you ever find yourself needing to do corp-to-corp, this guide will give you a real leg-up on the process. Before I start though, I need to lay out some disclaimers, which I assume you will keep in mind at every point as you read this article:

I am neither a lawyer nor accountant. I am not giving professional advice, only sharing information from my personal experience. I provide no assurance that it is accurate or complete. Before you replicate anything from my experience you should get input from a qualified lawyer and a qualified accountant and consider their advice primary. I am not liable for any legal or financial woes you incur from following or failing to follow this.

My company is in Illinois and is incorporated in that state. I did no research to find how tax & corporate law works in other states. Your accountant and lawyer will have to advise you on the peculiarities of your state.

That being said, let's get started.

The Gory Truth, All At Once

Here is an executive summary of incorporating and running a corporation, at lightspeed. The usual choice for an individual is the S-Corp. The idea is you are setting up a corporation in which you are the sole employee. The legal process of drawing up Articles of Incorporation must be undertaken; your lawyer can do it, or you can do it yourself on the web. You're going to put yourself on payroll. You're going to do all the deductions, and file and pay them yourself to the Fed and your State. Some of these deductions have to be matched by the company, which means you pay them twice. You also pay the state and the fed unemployment insurance. The filing and payment of all these various taxes happen on different timescales: some are monthly, some are quarterly, and one of them is yearly. Plus, big corporate clients will require you to take out insurance as well.

Why Incorporate

You heard right: double the usual Medicare & Social Security, state and federal unemployment, fees for filing articles of incorporation, and business insurance. So why take this on, then? Because an employer may require corp-to-corp: it's how they keep their nose clean with the IRS. The IRS comes down pretty hard on tax-evading companies who try to hide part-time employees by calling them "contractors". Companies know that this all costs you extra money, so your hourly rate is adjusted appropriately.

Or you may want to be an S-corp because you expect your business to grow (i.e. hire employees), or perhaps the legal formality of the corp. status raises your standing in the profession. S-corp is not the only choice; an LLC (Limited Liability Company) is also a possibility that meets the corp-to-corp requirements. LLCs are usually for when you have one or more partners in addition to yourself, so if incorporation is really just for you to run a company of one, S-corp is the obvious choice.

Articles of Incorporation

It is not strictly necessary to engage a lawyer. You can actually incorporate online with sites like www.BizFilings.com. I think there are several good reasons, however, for choosing a lawyer. Your lawyer will have the best advice for your state, and he/she will be available to answer your questions. A website that serves a national clientele can't answer questions about your state. Unless you are comfortable with legalese, plowing your way through all the documents will be a major distraction. The cost of a lawyer and the online costs are pretty much the same. It cost me $600 to incorporate.

The process itself is not a big deal, really. With a lawyer you'll take care of the whole thing with one phone call, a few days' wait, and one in-person meeting. The lawyer asks you a few questions, you sign a few forms, you get a stack of papers, and your 9-digit FEIN (Federal Employer Identification Number).

Your FEIN becomes the key to everything. It shows your client that you're incorporated; it sets you up for paying taxes; it allows you to create an account with a bank in your company's name. Pay close attention to that last one: if you bill your client for two weeks, and he writes the check to XYZ Corp, the bank ain't gonna let you cash it on your personal account. Get your FEIN and your company bank account in order before you start billing.

Creating Your Payroll

Running a payroll is more record-keeping than anything else. You do not need to buy some Quicken product or engage a professional to handle payroll; you want to roll up your sleeves, handle this yourself, and know exactly what is going on with your money. First make a template for a pay stub. Keep it simple: just copy a paystub from one of your recent W2 employers, including deductions for Medicare, Social Security, Federal taxes, and State taxes. Distinguish between gross and net pay. You can do all those Year-to-Date columns on the gross pay and deductions too.

Pick a frequency of pay. What makes sense here? First of all keep in mind that your invoicing (billing) to your client is completely decoupled from your salary, and there's no need or benefit to keeping them in lock step. I chose bi-weekly just so I could recreate the amount of cashflow I had at my last W2 position. Whether you do bi-weekly or something else, it would be wise to stick to it and pay yourself on every pay period; otherwise, having payroll tax filings that change wildly from month-to-month doesn't look like you're really running a business.

Pick your salary. All your taxes combined are going to run about 36% of your gross, so for sure don't pay yourself more than that. But you should still probably pay yourself less than the remaining 64% because you'll have other business expenses (which I'll cover later). One self-employed computer contractor I know said to me "I pay myself like a secretary." Why not pay yourself cheap? The money that stays in the company account is still yours anyway. You can give yourself a quarterly bonus to make up for shortfalls. (And yes, all the taxes apply to bonuses as well.) The key advice is, be conservative on this from the start, and you won't find yourself in a panic later.

Your employee taxes

Here are the figures for calculating the various taxes that I used in 2006 (in my state of Illinois) for each pay stub. Social Security is 6.2% of gross; Medicare 1.45% of gross; Federal tax is (25% of gross) - $336.55; State tax is (3% of (gross - $1000)) + $30. Where do these figures come from?

Federal tax figures: The IRS website, www.irs.gov, gives you the "Tables for Percentage Method of Withholding" (see links below); apply the one that matches your frequency of pay. These tables cover circumstances such as being single or married and number of exemptions.

State tax figures: The calculation process is similar, and possibly simpler. In Illinois, once your gross for the pay period is over $1000 you follow one simple formula. See the links at the end for the location of Illinois tables.

SS & Medicare: These figures rarely change; the two rates of 6.2% and 1.45% have been constant since at least 1997.
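The four formulas above translate directly into a spreadsheet row or a few lines of code. Here is a Python sketch of a single paystub calculation. It hard-codes the 2006 figures I quoted for my own situation in Illinois, so treat the constants as examples rather than current rates; the function name withholdings and the $2500 sample gross are made up for illustration, and the sketch assumes the pay falls in the same federal bracket those figures came from.

```python
def withholdings(gross):
    """Employee withholdings for one pay period, using the 2006 figures above."""
    if gross <= 1000:
        # The Illinois formula quoted above only applies once gross exceeds $1000.
        raise ValueError("state formula assumes gross over $1000 per pay period")
    taxes = {
        "social_security": round(0.062 * gross, 2),
        "medicare": round(0.0145 * gross, 2),
        "federal": round(0.25 * gross - 336.55, 2),
        "state": round(0.03 * (gross - 1000) + 30, 2),
    }
    total = round(sum(taxes.values()), 2)
    return {**taxes, "total": total, "net": round(gross - total, 2)}


stub = withholdings(2500)
for name, amount in stub.items():
    print(f"{name:16s} {amount:10.2f}")
```

For a gross of $2500 this gives $155.00 Social Security, $36.25 Medicare, $288.45 federal and $75.00 state, leaving a net of $1945.30.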

Those "other" taxes

Now as a business you have to duplicate those Medicare and Social Security taxes for each employee (i.e. you) and pay them AGAIN as the "company contribution". Your W2 employers have been doing this all along for you, and now you're doing it for your own company.

Next there is unemployment insurance. I don't mean a policy that you take out from State Farm; I mean a mandatory tax you pay to the government. In case you didn't know this, your most recent employer is the one footing the bill if you go on unemployment. If you work for ABC Corp for a year, then lose your job and go on unemployment and get a few hundred dollars in benefits from the state every two weeks, it is ABC Corp that is paying for the lion's share of that through their unemployment insurance contributions. And your corporation will have to make those contributions now. Just think of it, taking money out of your own gross pay to pay yourself if you go on unemployment! (Seriously though, you might want to talk to your local Department of Employment Security before you try that one.)

There are both State and Federal unemployment taxes to pay, but the Federal one is negligible...a fixed amount in the area of $60 for the whole year. In Illinois, the rate is 4.2% and figured quarterly, on a maximum of $11,000. So let's say you gross $8000 one quarter; your State unemployment tax is 4.2% of $8000. If you gross over $11,000, say $12,500, you only pay 4.2% on $11,000.
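The cap is the only wrinkle in the state calculation, and it comes down to a single min(). A minimal Python sketch, using the 4.2% rate and $11,000 maximum exactly as I described them above (the function name is made up, and your state's rate and wage base will differ):

```python
STATE_RATE = 0.042           # Illinois rate quoted above
QUARTERLY_WAGE_CAP = 11_000  # maximum gross the rate is applied to, as described above


def state_unemployment_tax(quarter_gross):
    """State unemployment tax for one quarter, capped at the wage maximum."""
    taxable = min(quarter_gross, QUARTERLY_WAGE_CAP)
    return round(STATE_RATE * taxable, 2)


print(state_unemployment_tax(8_000))   # 336.0
print(state_unemployment_tax(12_500))  # 462.0
```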

So What Should My Hourly Rate Be?

Very good question! Perhaps learning about all these extra costs has made you run away screaming from the idea of doing corp-to-corp. Don't. These are all known expenses of running a business. Work them into your hourly. If a company has you doing corp-to-corp then you've let them off the hook for Medicare, Social Security and unemployment insurance, and they know full well that you're shouldering it in their place. Plus that company is not paying medical insurance or retirement benefits costs for you either. So don't be shy about upping your hourly. But how much is reasonable?

Perhaps you know your market value in terms of a yearly salary in a normal W2 job. What is the reasonable mathematical equivalent for a corp-to-corp? That will be useful in establishing a baseline. Perhaps there are reasons for a contract to pay you much more than that; maybe it is a short contract with huge demands and a high penalty for failure, and located 1000 miles away. But that's for you to figure out. For now, we'll just talk about an hourly that leaves you just as well off as if you had a W2 at your current market value.

Imagine a hypothetical worker named Sue earning $80k gross on a W2. (We're going to leave medical and retirement benefits out of the picture for now, but more on that later.) Over a 2080-hour work year, that works out to about $38.46/hour. Sue's net is coming to about $59.5k/year because her Medicare, SS, State & Federal taxes come to about $20.5k/year. Now, let's transform that into a corp-to-corp position. The company portion of Medicare & SS, and the state and federal unemployment taxes will come to an additional $8.9k/year and Sue's net annual salary has gone down to $38.6k/year. So her $38.46/hour puts Sue way below what she was earning on a W2. Boost her hourly up to $53.45, though, and then Sue hits that net target of $59.5k/year.

But now Sue wants medical coverage, which, say, costs her $1000 a month. At her last W2 she had a nice medical plan that she paid $200 a month toward, so the value of what she is sacrificing by going corp-to-corp is $800 a month. Translate to an hourly for a year's worth of coverage, and that's $4.62/hour. On a W2, medical coverage is pre-tax, but Sue will have to pay tax on an extra $4.62/hour of income, so it should be bumped up to $6.28/hour. Sue is now asking her client for a rate of $59.73/hour.
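The arithmetic behind Sue's medical adjustment can be sketched in Python as follows. Two constants here are assumptions that the text does not state outright: a 2080-hour work year (40 hours x 52 weeks) and a 36% markup for taxes, taken from the combined tax estimate given earlier. With those two assumptions the sketch reproduces the $4.62, $6.28 and $59.73 figures.

```python
HOURS_PER_YEAR = 2080  # assumption: 40 hours x 52 weeks
TAX_MARKUP = 1.36      # assumption: the ~36% combined tax figure used earlier


def monthly_cost_to_hourly(monthly_cost):
    """Spread a monthly cost over a full year of billable hours."""
    return monthly_cost * 12 / HOURS_PER_YEAR


# Sue pays $1000/month now versus $200/month on her old W2 plan.
pre_tax = round(monthly_cost_to_hourly(1000 - 200), 2)
grossed_up = round(pre_tax * TAX_MARKUP, 2)  # extra income is taxed, so ask for more
new_rate = round(53.45 + grossed_up, 2)      # 53.45 is the break-even rate from above

print(pre_tax, grossed_up, new_rate)  # 4.62 6.28 59.73
```

The same helper works for any other recurring cost you want to fold into your rate.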

Now Sue is still out customary W2 benefits like Retirement Plan, Life Insurance, Disability Insurance, paid vacation, and who knows whatever perks her client offers their full-timers, like Health Club membership. And she's got her business expenses, like indemnification insurance, keeping her laptop and home network in good condition, printer toner cartridges, and so on. Sue might feel justified in adding $3-5 more to her rate. But I would advise Sue against sharing these calculations with the client up front. Clients do not feel they are responsible for every aspect of your financial safety net and company operations; what they would say is, "hey, you're a contractor, those are your issues." Instead, Sue keeps the information in reserve. If they balk at her rate of, say, $62/hour, she can just refer to all those other things in passing: "I haven't even mentioned all the other expenses I am shouldering like business costs and insurance, so I think my rate is fair."

Filing and Paying Taxes

Okay now. Decoupling the money your company makes from the salary you pay yourself is a real trip, right? It gets better still when it comes to filing and paying taxes.

Looking at the paystub from your last W2 employer, you might get the feeling that paying taxes is all sort of automatic and done for each paycheck. Not at all. "Payroll taxes" include Federal & state withholdings and both the employee and employer Medicare & Social Security contributions. The State portion covers only the state withholdings; everything else is considered the Federal portion. You file these figures once a quarter, but pay them once a month. If you fall behind on either (I think the grace period is about 15 days), you'll be penalized and have to file separate forms to cover the penalty...nothing crippling, but unpleasant still. The fact that you pay earlier than you tell them how much you owe them, and then can be penalized for getting it wrong, underscores how important it is to keep extremely clear payroll records.

At the time of this writing, Federal filing can only be done with a paper & snail-mail process. Go to the IRS website and download Federal Form 941. You'll notice that it does not distinguish between the employee and employer contributions for SS & medicare; you lump them together. You'll see that you can also send your payment along with this form, but you have the option to handle the payment online, which I'll cover in a moment.

State tax withholdings can be filed in Illinois via the Illinois TaxNet system. (Hopefully your state has a similar system.)

Payment of payroll taxes is once per month. When your W2 employer takes deductions for tax, Medicare, SS, etc., they're actually banking that money until payroll tax time at the end of the month. Both state and federal can be paid online. The federal portions (tax, Medicare and SS) can be paid with the EFTPS system (see links below) and Illinois state tax can be paid with Illinois TaxNet.

Now for unemployment insurance. In Illinois, you both file and pay state unemployment taxes quarterly, using Illinois TaxNet. For Federal, you also file and pay together on EFTPS, and only once, at the end of the year. And remember, it's a very small amount (0.8% of the first $7,000 of gross earnings, which is where the figure of about $60 comes from).

Summary of Times and Activities

Here's a quick summary of everything that happens over time at various points:

Invoices are submitted to your client, frequency of your choosing.

Client auto-deposits to your company bank account, or mails you a check written to your company.

You generate a paystub for yourself every two weeks indicating the gross earnings, the tax withholdings, and the net amount. You write a company check to yourself (your personal name) for the net amount and deposit it into your personal account.

At the end of each month (with 15 days grace) you pay state and federal payroll taxes for the amounts you withheld in salary to yourself during that month.

At the end of each quarter (March 31, June 30, September 30, and December 31, with 15 days grace for each), you file the state and federal payroll taxes deducted during that period. You also file, and pay, state unemployment tax at each of these quarter ends.

At the end of the year, you file and pay federal unemployment tax.
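If you want to put the quarterly dates on a calendar, the list above can be turned into dates with Python's datetime module. This sketch uses the quarter ends listed above and the approximate 15-day grace period I mentioned; the function names are made up, and you should confirm the real due dates with the IRS and your state rather than rely on my recollection.

```python
from datetime import date, timedelta

GRACE = timedelta(days=15)  # the approximate grace period mentioned above


def quarter_ends(year):
    """The four quarter-end dates listed above."""
    return [date(year, 3, 31), date(year, 6, 30), date(year, 9, 30), date(year, 12, 31)]


def quarterly_filing_deadlines(year):
    """Quarter end plus grace period for each quarter of the given year."""
    return [end + GRACE for end in quarter_ends(year)]


for end, due in zip(quarter_ends(2006), quarterly_filing_deadlines(2006)):
    print(f"quarter ending {end}: file by {due}")
```

Under that assumption the deadlines land on the 15th of April, July, October and the following January.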

Why Incorporate, More Reasons

With an S-corp you've got all the apparatus for growing into a larger company. Your payroll mechanism scales to multiple employees with no necessary changes.

You come to understand how companies work, and your role vis-a-vis clients and the middlemen that separate you.

If you go back to W2 employment at some future time, when you get your paystubs, you can go over the figures to see if their withholdings are correct, because you know how they're supposed to be calculated.

Using your company as a formality for keeping track of job-related expenses. Need a technical book, or a new wireless router, or some paper clips? Charge it to the company card, then write it off. Incorporation gives you some added authenticity in the eyes of the IRS.

Spreadsheet Tips

My major advice here is to develop a passion for keeping very exact records, and get handy with MS-Excel if you aren't already. Whatever flavor of spreadsheet you use, here are some ideas for you to get started.

In one worksheet, I maintain one row for each pay period, with formulae for converting each gross pay to each of the withholdings in separate columns. I copy this row to another worksheet that uses those fields to generate a paystub.

Divide the pay period rows into four separate groups corresponding to the quarter of the year. Your state unemployment insurance payments are computed according to quarter.

Use another part of your worksheet to keep track of each of the categories of tax payments that you make. Label them according to the month and quarter they were for. The steady bi-weekly, monthly and quarterly paperwork, filings and payments become a blur after a while, so keep good records. The 15th day of the month into a new quarter can be a panicky day if you're unsure of your filings and payments.

Use another part of your worksheet to show what taxes are currently owed by comparing the payroll rows with the Tax Payments Made records. And finally you can calculate the financial state of your company: money in the bank after you've made payroll, minus the taxes not yet paid.
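The "taxes owed" and "financial state" calculations are just sums and differences, whatever tool you do them in. Here is the idea as a Python sketch rather than a spreadsheet; the function names and all of the sample figures are hypothetical, chosen only to show the bookkeeping.

```python
def taxes_owed(withheld_per_period, payments_made):
    """Taxes withheld on payroll so far, minus what has already been paid."""
    return round(sum(withheld_per_period) - sum(payments_made), 2)


def company_position(bank_balance, withheld_per_period, payments_made):
    """Money in the bank after payroll, minus the taxes not yet paid."""
    return round(bank_balance - taxes_owed(withheld_per_period, payments_made), 2)


# Hypothetical figures: two pay periods withheld, one tax payment made so far.
withheld = [554.70, 554.70]
paid = [554.70]

print(taxes_owed(withheld, paid))                   # 554.7
print(company_position(10_000.00, withheld, paid))  # 9445.3
```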

Other Tips

For each transaction relating to salary, keep together copies of the pay stub, check, cancelled check and bank statement line item. Staple them all together and file them away. Do the same for each transaction relating to invoicing and client payment. You'll have so much order in your records, you'll be wanting to dare the IRS to audit you.

Keep track of your expenditures using your company card or checks for things other than salary. Now here is a case where I recommend using something like Quicken instead of rolling your own Excel spreadsheet. Don't go overboard and invest in QuickBooks or anything, although Quicken "Premier Home and Business" is worth the extra few bucks.

It's inevitable that your personal and business bank accounts will intermingle on occasion. Here are two things that happen to me a few times a year: I buy groceries with the company card because I forgot the personal card at home. Or, I absent-mindedly use my personal card to buy a piece of networking hardware that I want to be a business expense. I de-mingle these as they occur, and keep a paper trail. I have separate forms that I use for payment back to the company, or payment from the company, and I copy the check, cancelled check, and bank statement line item for each one, as I do with other expenses.

Helpful keywords

Words and phrases for search engines that can help you find documents that you need are:
payroll tax, bi-weekly payroll period, FICA withholdings, Social Security tax rate, Medicare tax rate,