Thursday, February 23, 2017
Thursday, July 8, 2010
Friday, May 15, 2009
Character Sets and Feeds
I work for a company which receives huge incoming text feeds from our clients, which we filter, transform and massage to go out to the major search engines. Much of the work of cleaning incoming feeds consists of replacing undesirable characters (e.g. fancy foreign characters and symbols) that come from the client. For search engines we have to provide the lowest common denominator of character type, plain ASCII. There are many ways these "bad" characters creep into incoming feeds. Some are mainstream ISO 8859 or UTF-8 characters that are accepted by most display systems, but are still undesirable for search engines. Less likely but still possible are errors caused by the use of a legacy non-printing control character. By far the most common offender comes from violation of ISO 8859 conventions that have come into practice, known as "Extended Ascii" (of which Microsoft's WinLatin1 table is the most notorious offender). This page explains the major character set types, along with a little history, and the problems that they cause to feeds.
7-bit Ascii (ISO 646)
The original ASCII table, a 128-value (7-bit) character set, is formally known as ISO 646. In fact it remains the only real "Ascii table" (more on this later). This portion has survived unchanged as the beginning portion of all major character sets (even the ones tampered with by Microsoft).
|In feeds from our clients, mistaken input of legacy control characters is rare, although recently one client reported seeing 0x1D ("Group separator GS Left Arrow") in a feed.
Ascii remains to this day, strictly speaking, a 7-bit table of 128 characters and no more. However, since in modern times characters are stored in 8-bit bytes, people have wanted to take advantage of what is available on their machine in the area of that top bit. Folklore, common practice, and whatever happens to be on the machine one is using have misled many into thinking some universal 8-bit ascii table exists. But in reality the 8th bit portion has been a playground for whatever people wanted it to be.
To answer this havoc, ISO 8859 defined a variety of 8-bit character tables to cover most of the world's needs. For every ISO 8859 table, the 7-bit portion is ISO 646 Ascii. The best-known of the ISO 8859 tables, at least among the English speaking and Western European world, is ISO 8859-1, also known as "Latin 1." To some people, this is "the" Ascii table. The eight bit portion is shown below.
Note that there is an unused portion (greyed out) of the table between characters 128 and 159. Take special note of this; as you'll see below this special area becomes a huge area of controversy (Extended Ascii).
Other ascii tables in the ISO 8859 collection include Baltic, Cyrillic, Arabic, Greek, Hebrew and Turkish character sets, allowing each country to adopt the appropriate table, and software manufacturers to write compliant products. (You can see the various ISO 8859 tables here.)
|At my company, a feed using ISO 8859 8-bit characters will need to be cleaned before going out to search engines. If they are ISO 8859-1 (Latin1) characters, the mapping to a substitute character is well-known and easily fixed. We do this with Perl scripts and a simple mapping table. Another possible problem would be if we had a client that is using some ISO-8859 table other than Latin1 (for instance, a client from Scandinavia might use the Baltic table, ISO 8859-4). In that case we would have to create a new mapping table. Worse yet, files cannot identify themselves as being any particular ISO-8859 subversion, so if a client doesn't announce the table they are using, we would only find that out by seeing unusual errors. For example, seeing frequent use of a copyright symbol © in unexpected places would be a clue that some non Latin1 ISO-8859 table was in use.|
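To make the "simple mapping table" idea concrete, here is a minimal sketch in Python (our real scripts are Perl, and their table is not reproduced here). The table name `LATIN1_TO_ASCII`, its handful of entries, and the `?` fallback are all assumptions made for illustration.

```python
# Hypothetical substitution table: Latin-1 byte value -> plain ASCII text.
# Only a few entries are shown; a real table would cover 0xA0-0xFF.
LATIN1_TO_ASCII = {
    0xA9: "(c)",  # copyright sign
    0xAE: "(R)",  # registered sign
    0xC6: "AE",   # capital AE ligature
    0xD8: "O",    # capital O with stroke
    0xE6: "ae",   # small ae ligature
    0xE9: "e",    # small e with acute
}


def clean_latin1(raw, unknown="?"):
    """Return an ASCII-only string for a feed assumed to be ISO 8859-1."""
    out = []
    for b in raw:
        if b < 0x80:
            out.append(chr(b))  # 7-bit ASCII passes through untouched
        else:
            out.append(LATIN1_TO_ASCII.get(b, unknown))
    return "".join(out)


print(clean_latin1(b"123\xe6\xd8\xc6"))  # 123aeOAE
```

A client on a different ISO 8859 table would need a different dictionary, since the same byte value stands for a different character there.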
It's not clear why ISO 8859 prohibited this area. In one table I saw, each of the codes was described as being a control character of some sort (these are the so-called C1 control codes).
'Extended Ascii' and Windows 1252
Microsoft couldn't resist using that empty area of the ISO-8859 tables, so they went and created their own 'extended version' of the ISO 8859 table set; you can see them all here. Their version of ISO 8859's Latin-1 table goes under the name WinLatin1 or Windows 1252. The 8th-bit portion of this table is shown below, with the characters Microsoft inserted into ISO 8859's empty area marked in yellow.
Due to the popularity of the Microsoft Office family of software products, WinLatin1 has achieved an unfortunate hegemony in the computer world. In the competition for the elusive "8-bit Ascii" table, WinLatin1 is ISO 8859-1's chief competitor. However, a sizable portion of the computer world (Macs and Unix flavored systems) have settled on the real ISO 8859 standard, so problems occur when people paste, email or otherwise transmit text made from a Microsoft product to be displayed on a Unix or Mac box. When you see question marks or 'ascii garbage' in a file when you're not expecting it, you have probably been WinLatin1-rolled.
The term "Extended Ascii" has become the name to describe the format of any file which contains a mixture of ISO 8859 plus characters in the forbidden zone of bytes 128-159.
So what did Microsoft do with that precious block of 32 characters it stole from ISO 8859? Four characters comprise the notorious MS-Word 'smart quotes' (single and double: ‘, ’, “ and ”). On the other hand they were remarkably prescient in including the Euro currency symbol (€) years before the currency was adopted. There are a number of punctuation and editorial marks or symbols, such as daggers († and ‡), the ellipsis (…), the list bullet (•) and several others that perhaps are used in European languages (ˆ, ‹, ›, and „). Many choices are totally inscrutable, however. Why define the comma-lookalike "‚"? Of what use can the few foreign alphabetical characters of Œ, œ (used in old German and old Romance languages), Ÿ (used in Greek transcription and rarely in French), Š, š, Ž and ž (Estonian, Finnish and Czech characters) be?
|Use of the "forbidden zone" of ISO 8859 accounts for most of the cleanup tasks in clean scripts at my company. The trademark sign (™) is a favorite of many of our clients, since it is unavailable in ISO 8859-1. The list bullet and daggers are also occasionally found in feeds. Most likely the source of these characters is from client use of Windows Office software.|
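As a rough illustration of spotting forbidden-zone bytes, the sketch below scans a byte string for values in the 0x80-0x9F range and then shows what those bytes mean under Windows 1252. The sample bytes and the function name `forbidden_zone_offsets` are made up for this example; `cp1252` is Python's name for the WinLatin1 codec.

```python
raw = b"Acme\x99 \x95 widgets"  # hypothetical feed fragment


def forbidden_zone_offsets(data):
    """Return (offset, byte) pairs for bytes in the ISO 8859 forbidden zone."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


print(forbidden_zone_offsets(raw))     # [(4, 153), (6, 149)]

# Interpreted as WinLatin1, 0x99 is the trademark sign and 0x95 the bullet.
print(ascii(raw.decode("cp1252")))     # 'Acme\u2122 \u2022 widgets'
```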
The next evolutionary step after ISO 8859 is UCS (Universal Character Set, ISO 10646). It dispenses with the approach of multiple table sets for individual languages in favor of a single giant table of all possible characters. Its table size of 2^31 (2,147,483,648) characters encompasses the character sets of virtually all the languages of the world. It goes by the more common name of Unicode, after the name of the consortium that merged with UCS.
The first 256 characters of Unicode are completely backward compatible to ISO-8859-1. Unicode's first 128 characters are the classic ISO 646 7-bit Ascii set, while Unicode characters 128-255 are the top half of ISO-8859-1 Latin1.
With a character set of Unicode's size, single byte sequences with a maximum numeric range of 0-255 can obviously no longer be the only way in which a file stores text. UCS has spawned two different byte-sequence conventions, UCS-2 and UCS-4. UCS-2 files have two byte characters, and UCS-4 files have 4-byte characters.
UTF-8 is a byte-based sequence convention for representing Unicode. It can have 1-, 2-, 3-, 4-, 5- or 6-byte long characters, changing when and where it needs to. (The original design allowed sequences of up to six bytes; the current specification, RFC 3629, limits them to four.) It is an efficient solution for English and other Western European languages that spend much of their time in the first 128, one-byte wide Unicode characters. In fact, English UTF-8 files will often consist of nothing but one-byte sequences, no different than an Ascii file. Because of this UTF-8 is nicely backwards compatible to ASCII, meaning a file can be UTF-8 but still work on older systems that support ISO-8859 or earlier standards. Even when a UTF-8 file does use larger width Unicode characters, it is still a "one octet encoding unit" (single byte) encoding standard, as opposed to UCS-2 and UCS-4 which are two- and four-octet encoding unit standards.
|UTF-8 should always be spelled as shown. Referring to UTF-8 as utf-8, utf8, UTF8, etc. is considered bad form.|
Unlike UCS-2 and UCS-4, which represent the higher order values of Unicode directly in binary with 16-bit and 32-bit data units, UTF-8 remains a "one octet encoding unit" (single byte). How does it represent Unicode values higher than 255? It encodes them with byte sequences of variable length. The values for these encoding bytes all lie within the range 128-255, perhaps not coincidentally the high bit portion of the 8-bit byte. In UTF-8, none of the values in the 128-255 range represent an actual character. This range is broken down to sub-ranges of bytes which are designated as "signifiers" for the beginning of 2-, 3-, 4-, 5- or 6-byte sequences. See the table below. Bytes in the red region, hexadecimal C2 through DF, are used to announce the beginning of a two-byte sequence; bytes in the blue region a three-byte sequence; and so on, to the orange region for 6-byte sequences. The characters in the purple region are the data values that can follow the signifiers. By using just these values, the complete set of Unicode characters that comprise two or more bytes can be constructed.
For single byte characters, UTF-8 simply uses the first 128 Unicode characters, the ones that correspond exactly to classic Ascii 7-bit ISO 646. No special signifier announces a single byte sequence.
An easier way to understand the multi-byte signifier approach is by looking at the bit patterns in their binary values, as shown below. Two-byte sequences can be announced with any byte having 110 as the top three bits; three-byte sequences can be announced with any byte having 1110 as the top four bits; and so on.
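The bit-pattern rule can be sketched in a few lines of Python. This is only an illustration: the function names are made up, it stops at four-byte sequences, and it looks at the top bits alone, so it does not reject the handful of signifier values (such as 0xC0 and 0xC1) that a strict UTF-8 decoder refuses.

```python
def utf8_sequence_length(lead):
    """Return the sequence length a lead byte announces, or 0 if the byte
    cannot start a sequence (for example, a continuation byte)."""
    if lead & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: plain 7-bit ASCII, no signifier needed
    if lead & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx announces a two-byte sequence
    if lead & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx announces a three-byte sequence
    if lead & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx announces a four-byte sequence
    return 0


def is_continuation(byte):
    """True for 10xxxxxx, the data bytes that follow a signifier."""
    return byte & 0b1100_0000 == 0b1000_0000


print(utf8_sequence_length(0xC2))  # 2
print(utf8_sequence_length(0xE6))  # 3
print(is_continuation(0xA9))       # True
```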
|Cleaning a file of any UTF-8 multibyte sequences is complicated by the fact that you have to "look ahead" in the byte stream to determine whether a byte is the start of a multibyte UTF-8 sequence, or an isolated ISO 8859 character. When a UTF-8 sequence is found, the n-byte sequence must be replaced with a single ascii character (if the appropriate substitute is known), or simply removed (if no substitute is known). In practice, our clients use a very small repertoire of symbols that are easily replaced (trademark, registered, bullet, etc.).|
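The look-ahead described above might be sketched like this in Python. It is not our production script: the `SUBSTITUTES` table, the choice to treat an isolated 8-bit byte as WinLatin1 (`cp1252`), and the decision to silently drop anything without a known substitute are all assumptions made for the example.

```python
# Hypothetical replacements for the small repertoire of symbols clients use.
SUBSTITUTES = {
    "\u2122": "(TM)",  # trademark sign
    "\u00ae": "(R)",   # registered sign
    "\u00a9": "(c)",   # copyright sign
    "\u2022": "*",     # list bullet
}


def announced_length(lead):
    """Length announced by a UTF-8 signifier byte, or 0 if it is not one."""
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 0


def clean_feed(raw):
    """Reduce a byte string to 7-bit ASCII, looking ahead for UTF-8 sequences."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(chr(b))  # plain ASCII passes through
            i += 1
            continue
        n = announced_length(b)
        char = None
        if n:
            try:
                # Look ahead: do the next n bytes form one valid sequence?
                char = raw[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                char = None
        if char is None:
            # Not UTF-8 after all: treat it as one isolated 8-bit character.
            char = raw[i:i + 1].decode("cp1252", errors="replace")
            n = 1
        out.append(SUBSTITUTES.get(char, ""))  # unknown symbols are removed
        i += n
    return "".join(out)


print(clean_feed(b"Acme\xe2\x84\xa2 and Acme\x99"))  # Acme(TM) and Acme(TM)
```

The same trademark sign arrives here once as a three-byte UTF-8 sequence and once as a single WinLatin1 byte, and both end up as the same ASCII substitute.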
Copyright symbol: 0xC2 0xA9
Not-equals sign: 0xE2 0x89 0xA0 (Unicode char U+2260)
Korean text: 0xED 0x95 0x9C 0xEA 0xB5 0xAD 0xEC 0x96 0xB4 (Unicode chars U+D55C U+AD6D U+C5B4)
Japanese text: 0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E (Unicode chars U+65E5 U+672C U+8A9E)
0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, æ, Ø, Æ)
0x31 0x32 0x33 0x99 (1,2,3, winlatin1 trademark)
0x31 0x32 0x33 0xE6 0xD8 0xC6 (1,2,3, three Latin1 chars, one winlatin1 char)
0xE6 0x97 0xA5 0xE6 0x9C 0xAC 0xE8 0xAA 0x9E 0xE6 0xE8 0x31 (UTF-8 Japanese text followed by misused signifiers)
None of the character tables discussed so far (with a possible exception of UTF-8, discussed below) have any convention for a self-identifying start block or header. It has to be done based on some sort of pattern. In Unix, the 'file' utility (or 'type' on some versions of *nix) is a pretty good tool for identifying the character table type. Here's the pattern that 'file' seems to use:
|When:||Unix 'file' Reports:|
|All characters are 7 bit||"ASCII text"|
|ISO 8859 8-bit characters (with or without 7-bit ascii mixed in)||"ISO-8859 text"|
|Any use of bytes in the ISO-8859 forbidden range (0x80-0x9F) which aren't part of valid UTF-8 sequences||"Non-ISO extended-ASCII text". (This is how WinLatin1 files will be identified.)|
|Consistent correct use of multi-byte UTF-8 sequences||"UTF-8 Unicode text"|
How 'file' handles some edge cases:
|When:||Unix 'file' Reports:|
|All 7-bit, but containing odd combinations of control characters from the 0x00 to 0x1F range||"Data"|
|Correct UTF-8 code mixed with any 8-bit ISO 8859 characters, forbidden or non forbidden||"Non-ISO extended-ASCII text"|
What 'file' cannot do is read your intentions. It makes the simplest judgment call it can. For example:
- A file of all 7-bit characters is always "ASCII text"...never mind that you think of the file as being UTF-8 or ISO 8859 compliant (both of which are also true).
- ISO 8859 detection is based on byte values alone, not syntactic patterns; it can't tell whether you want it rendered as Latin1, Baltic, Cyrillic, etc. It reports ISO 8859, not ISO 8859-1, or ISO-8859-2, etc. Telling them apart would take some fairly elaborate dictionary lookups from multiple languages.
- Any UTF-8 multi-byte sequence not containing bytes 0x80 to 0x9F is hypothetically correct ISO 8859. Because these are syntactically rare, 'file' chooses to identify this as UTF-8 (assuming everything else about the file is UTF-8 compliant). It cannot read your mind that you intended a series of weird ISO 8859 characters. For example you might want the sequence 0xEA 0xB5 0xAD to read êµ followed by a soft hyphen (e circumflex, micro sign, soft hyphen) in Latin1, but since it is a valid UTF-8 sequence (국, U+AD6D, the middle character of the Korean example above), 'file' will report it as "UTF-8 Unicode text" if everything else about the file is consistent with UTF-8.
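The decision pattern in the tables above can be approximated in Python. To be clear, this is a guess at the observable behavior, not the actual logic inside 'file': the function name `classify` is made up, it ignores the control-character "Data" case, and it may disagree with 'file' on some mixed files.

```python
def classify(raw):
    """Roughly imitate the categories that 'file' reports for text."""
    if all(b < 0x80 for b in raw):
        return "ASCII text"
    try:
        raw.decode("utf-8")  # strict: every multi-byte sequence must be valid
        return "UTF-8 Unicode text"
    except UnicodeDecodeError:
        pass
    if any(0x80 <= b <= 0x9F for b in raw):
        return "Non-ISO extended-ASCII text"  # forbidden-zone bytes present
    return "ISO-8859 text"


print(classify(b"123"))                                   # ASCII text
print(classify(b"123\xe6\xd8\xc6"))                       # ISO-8859 text
print(classify(b"123\x99"))                               # Non-ISO extended-ASCII text
print(classify(b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"))  # UTF-8 Unicode text
```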
Regarding the potential for confusion between ISO 8859 and UTF-8, this is what others have to say:
- "the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length" (UTF-8 RFC).
- "The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII [sic] is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence." (Wikipedia UTF-8 page)
If you're dealing with a page you suspect is not rendering with the correct ISO 8859 subtable, both Firefox and Safari give you the option to change the interpretation. In Safari, check the View menu, Text Encoding. In Firefox, check the View menu, Character Encoding.
There is a convention called the "Byte Order Mark" (aka BOM) which (hypothetically) can be used to self-identify a UTF-8 file. This consists of beginning a file with the sequence 0xEF 0xBB 0xBF. Both Unix 'file' and 'vi' seem to be pretty good at interpreting it. When a file is pure 7-bit ASCII, but preceded by the BOM, it forces 'file' to report it as "UTF-8 Unicode text". An interesting edge case: if a file containing ISO 8859 8-bit characters that cannot be interpreted as UTF-8 multi-byte sequences is preceded by the BOM, neither Unix 'file' nor 'vi' can be fooled: 'file' will report the three BOM characters as "ISO-8859 text", and 'vi' will display the BOM characters (as ï»¿).
Other than the BOM, refrain from ever calling ASCII, UTF-8, ISO-8859 or WinLatin1 a "file format." None have a standardized header or starting block. They are more properly called "byte sequence conventions."
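Checking for the BOM yourself is a three-byte comparison. A small Python sketch follows; the function name `strip_utf8_bom` is made up, while `codecs.BOM_UTF8` is the standard library's constant for the 0xEF 0xBB 0xBF sequence.

```python
import codecs


def strip_utf8_bom(raw):
    """Return (had_bom, remaining_bytes) for a byte string."""
    if raw.startswith(codecs.BOM_UTF8):
        return True, raw[len(codecs.BOM_UTF8):]
    return False, raw


print(strip_utf8_bom(b"\xef\xbb\xbfhello"))  # (True, b'hello')
print(strip_utf8_bom(b"hello"))              # (False, b'hello')

# Read as Latin1 instead, the same three bytes are the characters vi shows.
print(ascii(codecs.BOM_UTF8.decode("latin-1")))  # '\xef\xbb\xbf'
```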
The entire corpus of Unicode is hypothetically available to browsers. For example the Unicode characters of Japanese that were mentioned earlier, U+65E5 U+672C U+8A9E, should successfully render on your browser when I put &#x65E5; &#x672C; &#x8A9E; in the HTML. Here it goes: 日 本 語. These work for me on Firefox. In practice, I believe that no browsers are 100% capable of correctly rendering the entire 2^31 Unicode character set via HTML entities.
Question: does the presence of the foreign characters you see above effectively make this page Unicode or UTF-8 format? No. The characters I used to make them were merely ampersand, pound, x, and digits, all 7-bit ASCII characters. HTML Entities are merely instructions to browsers what to do with the characters.
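The point that entities are pure ASCII until something interprets them can be checked with Python's standard `html` module, which here stands in for the browser. The variable names are just for illustration.

```python
import html

source = "&#x65E5;&#x672C;&#x8A9E;"  # what is actually stored in the page

print(source.isascii())              # True: only 7-bit characters on disk

rendered = html.unescape(source)     # what a browser does with the entities
print(len(rendered))                 # 3 characters

# Only if you then save those characters as UTF-8 do the multi-byte
# sequences from the earlier Japanese example appear.
print(rendered.encode("utf-8").hex(" "))  # e6 97 a5 e6 9c ac e8 aa 9e
```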
|Do we want to use HTML Entities in the cleaned feeds we send to search engines? Do we want to figure out the intent of various WinLatin1 manglings and UTF-8 sequences that come from client feeds and turn them into HTML entities that will render safely on a browser? This is a subject still open to question. Our YSSP search engines can handle them, while our Comparison Shopping Engines (CSE's) do not. Currently we are leaning towards aiming for the lowest common denominator, 7-bit ASCII for all search engines.|
Can an HTML file be in actual UTF-8 format? "Yes we can." If it contains UTF-8 byte sequences, it gets interpreted by your browser as UTF-8. Here are two nice examples of "UTF-8 Test pages" out there that are just full of exotic Unicode in UTF-8 encoding: Markus Kuhn's sample file, and Frank da Cruz's UTF-8 Sampler. (How well your browser shows the page depends on how good the browser is. For me, Firefox and Safari got most of it, but IE6 failed on large numbers of the characters.)
After you look at one of these pages, try viewing source. What do you see? If your source viewer is UTF-8 capable, you'll see, well, the UTF-8 characters. Surprised? Expected to see the numeric bytes, did you? If you want to see the numeric value of a character under the cursor in vi, press ga. For UTF-8 characters higher than 8 bits you will see large values like 0xFFDD. Or if you want to get really deep into the binary encoding of a file, you could get a binary editor. On Unix, try 'od', or better yet, 'bvi' (you have to download it).
When an HTML page puts a meta tag such as <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in the <head> block, what does it do? For the most part, it simply communicates intent: a tip to the consuming client what to do with the page. It doesn't change the way the bytes go down the wire, or the way your HTML is stored locally on your disk. It's still just bytes. In practice, you can even leave the declaration out and even though the file is chock full of UTF-8 binary sequences, a good browser figures it out anyway. (Note that Frank da Cruz's file above had no meta direction at all.)
How do you compose (i.e. edit and save) an HTML file in UTF-8? Well, if you're using all 7-bit ascii characters you're fine, that's still valid UTF-8. But really we mean UTF-8 with higher level Unicode characters. Well, unless you want to enter the binary directly, you'll need a special editor for that; vi, textpad, etc. won't do it for you. Try Google.
XML is UTF-8 by default; the following declaration is actually redundant: <?xml version="1.0" encoding="UTF-8"?> You'd have to override it with a different value for the encoding attribute to make it non UTF-8. Like the HTML meta tag content charset setting mentioned above, the encoding declaration is essentially a tip to the consumer of the XML what to expect.
Some Fun Historical Facts
Bob Bemer (1920-2004) is said to be the "father of ASCII". His Ford Explorer's vanity plate read 'ASCII' surrounded by a plate holder reading "Yes! I'm the Father of".
Wikipedia has a picture of one of the earliest published ASCII tables, from 1968.
Ken Thompson, the Unix pioneer, invented UTF-8. "It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat" (source). You can also read the original AT&T UTF-8 Tech Report.
Sunday, May 10, 2009
During my various attempts to get in good shape over the years I've wasted my share of money on various diet & exercise books that try to tell you what exercise you should do, what muscles they benefit, what foods you should eat and why, and recipes that will make it more enjoyable. Big surprise, lack of willpower has tanked many of my efforts to stay on any kind of healthy eating program. But I really don't think that willpower is the whole story for me or other dieters. What about our ingrained habits of using food for reward, food to fix a poor mood, hedonistic eating, addiction to various kinds of foods (like my favorite, the Sticky Pecan Roll from Au Bon Pain)?
One diet approach that I thought took a step forward in talking about the psychology of diet was the Low-Carb diet (or "Atkins" in its most popular form). It talks about the role that carbs have in making us feel full, and thereby feel good. But I think you could do a lot better. I want to know the mechanics behind appetite, satisfaction and digestion: what actually occurs in the body to make you want to eat huge portions of your favorite foods, what happens as that urge becomes satisfied, and how you can use that knowledge to change your behavior.
I discovered the book You On a Diet while standing in line at a Jamba Juice one day. Using entertaining illustrations, it nails exactly what I was looking for. Some of the points it makes on the mechanics of body chemistry, food and appetite include:
- The lower gut contains 95% of the body's serotonin, which suggests that eating is self-medicating.
- Our caveman ancestors were more fit than us because the stresses of survival kept them lean. Stress, like fight-or-flight stress, means that Peptide NPY is inhibited and you don't feel like eating.
- When fat makes the liver work extra hard, it prevents glucose from getting to our cells, and produces hunger.
- Fiber is good for diet because it slows "the transit of food across the ileocecal valve, keeping your stomach fuller for longer." (p. 68)
- Whole Grains
- Red Peppers
- Smell of grapefruit (p. 88)
- Brightly-colored food
- Mint breath strips (p. 239)
- Fiber supplements (e.g. 1 tbsp Psyllium Husks with a glass of water)
- Lactobacillus CG, a healthy bacterium found in yogurt
- Omega-3 fatty acids, such as Fish oil, walnuts
- Green tea
- Beer (hops)
- Jojoba beans (available in supplements)
- Soybeans (isoflavones)
- Lignans, such as Flaxseed oil, rye bread
- Polyphenols, such as tea, coffee, fruit, vegetables
- Glucosinolates, such as broccoli, kale, cauliflower
- Carnosol, found in Rosemary
- Resveratrol, found in red grapes, juice and wine
- Dark Cacao
- Quercetin, as found in cabbage, spinach, garlic, capers, apples, tea, red onion, red grapes, citrus fruit, tomato, broccoli, leafy green vegetables, cherry, raspberry, and lingonberry
- Antioxidants, as in vegetables and fruits (especially bananas), Vitamin E, Vegetable oils, Tea, coffee, soy, fruit, olive oil, chocolate, cinnamon, oregano and red wine
- Weight loss improves cholesterol by a factor of three. For example, a 7% weight loss leads to a 20% improvement in cholesterol levels. (p. 120)
- The beneficial effect of exercise in producing weight loss is greater than the detrimental effect of eating in producing weight gain. So even if you're getting in only a little exercise each day, the effect is significant. (pp. 141-142)
- Without exercise, we lose 5% of our muscle mass every 10 years after the age of 35. If you don't exercise (rebuild muscle) every 10 years, you need to eat 120-420 fewer calories a day to maintain your current weight. (p. 142)
- Focus on reducing the size of your belly, not weight loss. Exercise not only reduces fat but bulks up muscles, which can result in a net weight loss of zero.
- Bad Fats are the ones that stay solid at room temperature: animal fat, butter, stick margarine & lard. Food manufacturers push these because they have a long shelf life.
- Good Fats are the ones that are liquid at room temperature, the omega-3 and -6 fats: olive, vegetable, sesame & canola oil, fish oil, flaxseed, avocados, nuts (especially walnuts). Nutrients that fight bad fats are niacin and vitamin B5.
Here's a fun subject the book covers: what gives you gas? It's important because it's a byproduct of the way you eat and what you are forcing digestion to do for you:
- Gas is a normal result of the intestinal inflammation during digestion, as good nutrients go to the bloodstream and bad nutrients go to the lower intestines (and produce gas). So bad gas can be attributed to too many bad foods in your diet. Also, when inflammation is too high, some bad nutrients get into the bloodstream, leading to cholesterol.
- Sulfur-rich foods such as eggs, meat, beer, beans and cauliflower make gas smell worse
- Drinking cola means swallowing more air, which means more gas
- No evidence yet shows that artificial sweeteners are unhealthy (p. 97)
- It's the fat around our waist that gets us into trouble. Fat in other parts of your body causes relatively little harm to your health or eating chemistry. (p. 102)
- Alcoholic drinks fight bad fats (p. 123)
- The liver is the heaviest organ in the body (p. 77)
- Your deodorant can make you gain weight, if it contains aluminum or polychlorobiphenols (p. 92)
- The more brightly colored the food, the better it is for you (p. 95)
Sunday, April 19, 2009
I'm reading George Soros' 'The Crash of 2008 and What it Means; the New Financial Paradigm' . George Soros is the 29th richest person in the world, and a champion of progressive causes.
The thesis of this book isn't easy to quickly summarize, but it's a really good one. He says there are two ways of interacting with the financial world, as an observer or as a participant. The observational role entails looking at phenomena, and drawing rational conclusions. The participatory role involves manipulating the system and acting in your own self-interest. Now you would think that the two states of interaction could operate completely independently, so (for example) a stock trader would read the news and draw rational conclusions that would inform their trading activities. But the reality, according to Soros, is that human participants can't decouple the two. If you participate, your ability to be a detached observer is affected by what you experience firsthand. You come to believe that the rules that work for you are better universal rules than anything a detached observer would come up with. Maybe that works fine as long as you keep getting richer. But what if you and all the other participants screw up the whole market, and you've got a George Bush style administration that still thinks that these bozos are the 'experts' on the economy? The way out of the mess is to restore a balance with rational observation, and clear the air of old beliefs.
That's about as few lines as I could express it in!!
Soros includes some autobiography over the course of the book. He studied philosophy a lot when he was younger, especially Karl Popper, and he claims to not be a philosopher, but the book comes off as philosophy...actually quite good philosophy. I think Soros' way of viewing financial interaction as a kind of "bicameral mind" is pretty original stuff.
Sunday, April 12, 2009
My employer, iCrossing, has opened a search for a new member for the Merchantize team. Here's the description. To apply for the position, visit http://www.iCrossing.com/careers, select U.S. Career Offerings, Jobs by Location, then Jobs in Chicago, IL.
We’re a people business.
People are the heart and soul of our company, working every day to make our clients’ marketing programs successful.
At iCrossing, we combine experienced talent with world-class technologies to efficiently create marketing programs that truly perform. With more than 620 professionals in 15 offices in the U.S. and Europe, we are equipped to service the digital marketing needs of large enterprises and growing companies alike.
We’re seeking the talented, the experienced and the exceptional to give our clients the most creative and successful solutions for an ever-changing industry. When we find them, we offer a dynamic working environment, competitive compensation, the opportunity to work on exciting client programs, and occasional bagels.
We are seeking a highly motivated and technically proficient JEE Software Engineer / Software Developer to work on our industry leading and mission critical Paid Media Management (Search Engine Marketing, bid management) product.
Features of the position:
• Work on a high-visibility, high performance product that supports iCrossing’s industry leading SEM practice in a growing and fast moving industry.
• Work closely with all of the major search engines (Google, Yahoo, MSN, Ask, AOL) and their APIs.
• Work in a fast moving and forward thinking development environment that is constantly researching and rapidly implementing the latest technologies.
• Research and participate in the advancement and implementation of open source frameworks and architectures such as SOA/ESB, MapReduce, Grid and Cloud computing, and others.
• Work with an experienced Agile Software Development team in a highly collaborative environment.
• Modern Java Enterprise open source based product stack, Java 6, Spring, Hibernate 3, Webworks/Struts 2, JMS, JUnit, MySQL and more.
• Learn current software development best practices (continuous integration, build automation, test driven development, pair programming, agile estimating and planning, etc)
• Apple MacBook Pro, 24” widescreen monitor, IntelliJ or Eclipse.
• A casual, fun, and creative work environment
Major Job Responsibilities / Accountabilities:
• Write test driven quality code.
• Work closely with your dev team.
• Follow and encourage development best practices.
• Develop knowledge of Search Engine Marketing (SEM) principles and techniques.
Required Technologies (At least one or more of the following)
• SQL scripts
• Shell Scripting
• Webwork (Struts 2.0)
• Linux / Unix admin
• Junit (required) or TDD (preferred)
• Grid Computing (GridGain preferred)
Bonus Technologies (Preferred any of these)
• MySQL (especially advanced knowledge of replication, storage engines, backup and recovery)
• Data warehousing design concepts, ETL
• Mondrian OLTP
• Amazon EC2 / S3 / AWS
Knowledge / Skills / Abilities:
• BS in Computer Science or equivalent level of experience
• Understanding and/or appreciation for Agile software development methodologies.
• 1+ yrs of professional development experience.
• Familiarity with source control using Subversion
• Familiarity with IDE tools such as Eclipse or IntelliJ
• Must possess effective interpersonal and communication skills and ability to work successfully in a team environment.
• Good organizational and time-management skills.
Do Not Apply if you:
• Do not know Java
• Have no interest in Agile, TDD or Unit testing
• Are close-minded and don’t want to learn new technologies.
• Are more comfortable working on the same technology you did last year.
*ICROSSING IS NOT ACCEPTING RESUMES FROM STAFFING AGENCY PARTNERS AT THIS TIME. THANK YOU.
Monday, April 6, 2009
I've been listening to music outside my usual comfort zones to see if there's anything I missed. For example, I never paid the slightest notice to the group Queen, but am listening to all their classics and finding there's some good stuff in there. I'm also going through Sinatra.
I've always associated Sinatra with more affluent school friends' Dads who had paneled, shuttered studies, wore golf sportswear (which was pretty atrocious in the time period I'm thinking of) on the weekends and showed you their smug liking of Sinatra in their leather chair with a glass of scotch. Now that I'm actually listening, I see what about Sinatra imparted a feeling of "this music makes me feel upper class" to these guys. The orchestral introductions, for the ballads at least, have really brilliant orchestration and are often reminiscent of late 19th/early 20th century Strauss, Mahler and Debussy. Because they're so immaculately performed and recorded, and more closely mic'd than classical recordings, they're pretty interesting to listen to. Then Sinatra comes in, and he might as well be singing "Hey buddy, remember the Music Appreciation classes we had to take in college, and how all the classy broads were suckers for that longhair stuff? Now you and me, buddy, we got class too." Now if you liked Sinatra's song as well...which still doesn't do much for me...you'd be in heaven.
There's usually a middle section with more of the orchestral material, and a few bars at the end. It would be interesting to splice the orchestral parts of all the songs together just to appreciate the fine orchestral writing.