Phishing through homographs: Letters that look alike but lead you astray in some browsers

An old, unsolved problem with non-Roman characters in domain names rears its head again, but you can deter it.


When is an “a” not necessarily the “a” you think it is? When a browser shows it as part of the URL in the location or smart-search field. Because non-Roman characters came late to domain names, the backwards-compatible method of representing them aids phishing.

Unicode allows the representation of nearly all the glyphs (characters, symbols, ideograms, script elements, and more) that form the basis of language and other written subjects, like math and games, in use around the world. The Unicode Consortium started its work decades ago, but it’s only in the last few years that Unicode has permeated operating systems, browsers, and apps to the point where you can rely on it working almost everywhere.

But the Domain Name System (DNS) that operating systems use to turn human-readable location and resource names into the numeric and other data needed to make a connection dates back even before Unicode. And because of its ubiquity, making any change could break compatibility for hundreds of millions of people and devices, maybe more. This is why some sensible improvements, like adding a cryptographic component to a domain name to prevent its being spoofed by a party that doesn’t own the domain, have still not been rolled out.

Domains were encoded in a very tiny subset of all possible glyphs: the 26 Roman letters (capitalization aside) plus the digits 0 to 9 and the hyphen. (The period or “dot” is used to separate elements of a domain name.) This rankled people who live or write or do commerce in scripts other than Roman English; even folks who just needed an ñ or an é were left out. Since that’s the vast majority of people on the planet, something had to give.

Hide one script in another

In 2003, the patch job was “punycode,” a funny name for using ASCII characters to represent glyphs outside of the DNS-allowed set. This would let me register, say, 🤡🍑.com, which gets converted to xn--ri8h56g.com. The codes are a compressed way to represent Unicode characters, which a browser can then interpret and display.
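
As a rough illustration of that conversion, here is a minimal Python sketch using the language’s built-in idna codec. The codec follows the original 2003-era rules, so it may not match what a current browser does for every name. The domain bücher.com and the helper names to_ascii and to_unicode are stock examples chosen for this sketch, not taken from any browser.

```python
# Sketch: converting an internationalized domain name to its ASCII
# (punycode) form and back with Python's built-in "idna" codec.
# The codec implements the 2003-era IDNA rules; browsers use newer ones.


def to_ascii(domain: str) -> str:
    """Return the ASCII form of a domain, the one that is looked up in DNS."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Return the Unicode form a browser might choose to display."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(to_ascii("bücher.com"))           # ASCII form with an xn-- label
    print(to_unicode("xn--bcher-kva.com"))  # back to the readable form
    print(to_ascii("apple.com"))            # plain ASCII passes through unchanged
```

Every converted label starts with xn--, which is the marker to look for when a suspicious domain is shown in its raw form.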

But here’s the long-running problem: many scripts contain letters similar to those in other scripts. In fact, they may have exactly the same design in the font being used in a location/search bar; in this context, such a lookalike is called a homograph. A homograph usually means words spelled the same but having different meanings; here, it’s really words spelled differently (in different scripts) while having the same appearance.

That has let phishers register domains that, when converted from punycode into Unicode and shown in a browser field, look like, say, “apple.com” (or similar enough to it). These domains can also receive legitimate TLS certificates, so a savvy user looking for validation that it’s a “real” domain will spot the lock icon showing an https connection is in place.

Browsers have added limitations to punycode conversion over the years to reduce the opportunity for phishing, but the issue has reared its head again as researchers probed the limits of how browsers handle cases like a domain that uses only Roman-identical letters from another script. One found that an all-Cyrillic “apple.com” could be registered and would appear in Chrome and Firefox as rendered Unicode.
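
To see why such a name fools the eye but not the computer, the sketch below builds an all-Cyrillic lookalike of “apple” from explicit code points and compares it with the Roman word. The particular Cyrillic letters are picked here for illustration and are not necessarily the ones the researcher used.

```python
import unicodedata

LATIN = "apple"
# Cyrillic a, er, er, palochka, ie: an illustrative choice of lookalikes.
LOOKALIKE = "\u0430\u0440\u0440\u04cf\u0435"


def describe(text: str) -> list:
    """List each character's code point and official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    # The two strings may render alike, but they are different data.
    print("Same string?", LATIN == LOOKALIKE)
    for line in describe(LOOKALIKE):
        print(line)
    # The ASCII label that would actually be registered and looked up.
    print("xn--" + LOOKALIKE.encode("punycode").decode("ascii"))
```

Because the two strings share no code points, DNS treats them as unrelated names, and each can be registered separately.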

All the major browsers show the ASCII punycode domain without conversion if there’s a mix of scripts in a single domain, an easy way to prevent someone from using an “a” from another script in the middle of Roman letters. Some browsers won’t render punycode if it’s in a language/script combination that isn’t installed in some fashion on the device in use, or if the script doesn’t match a country-code top-level domain, like .jp for Japan. (Wikipedia has a detailed list of these mitigations.)
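
The mixed-script rule can be sketched in a few lines of Python. This is an approximation under stated assumptions, not any browser’s actual code: Python’s standard library doesn’t expose the Unicode Script property, so the first word of each character’s Unicode name stands in for its script, and the function names scripts_in and display_form are made up for this example.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the set of scripts in a label via Unicode character names."""
    scripts = set()
    for ch in label:
        if ch in "0123456789-":
            continue  # ASCII digits and the hyphen are allowed alongside any script
        # First word of the name, e.g. "LATIN" or "CYRILLIC" (a rough proxy).
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def display_form(label: str) -> str:
    """Show Unicode for a single-script label, raw punycode for a mixed one."""
    if len(scripts_in(label)) > 1:
        return "xn--" + label.encode("punycode").decode("ascii")
    return label


if __name__ == "__main__":
    print(display_form("apple"))       # all Roman: shown as is
    print(display_form("\u0430pple"))  # Cyrillic "a" + Roman "pple": punycode
```

Note that a label written entirely in Cyrillic contains only one script, so a check like this one lets it through. That is the gap the all-Cyrillic “apple.com” exploited.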

And some top-level domains (TLDs) only accept names in a given script. For .рф (Russian Federation), for instance, domains must be entirely in Cyrillic. More general TLDs, like .com, avoid policies this strict in order to allow greater support for international purposes and compatibility. Thus browsers have to do the vetting.

Safari and Internet Explorer appear immune to this new approach. Chrome has been updated to avoid rendering punycode when a domain name uses only homograph characters in Cyrillic (letters that look identical to Roman ones) in domains that aren’t script-limited, like .com. That change will be rolled out in Chrome 58.
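
A check in the spirit of the Chrome change described above might look like the following sketch. It is an assumption-laden illustration, not Chrome’s implementation: the lookalike set is a small hand-picked sample, the script-limited TLD list contains only the .рф example from this article, and the function name should_show_punycode is hypothetical.

```python
# Hand-picked Cyrillic letters that resemble Roman a, c, e, o, p, x, y, s, i, j
# and l. Illustrative only; this is not Chrome's actual list.
CYRILLIC_LATIN_LOOKALIKES = set(
    "\u0430\u0441\u0435\u043e\u0440\u0445\u0443\u0455\u0456\u0458\u04cf"
)

# TLDs that only accept one script; just the .рф example from the text.
SCRIPT_LIMITED_TLDS = {"\u0440\u0444"}


def should_show_punycode(label: str, tld: str) -> bool:
    """True when a label should be displayed as raw punycode, not Unicode."""
    if tld in SCRIPT_LIMITED_TLDS:
        return False  # the registry already restricts the script
    # Flag labels made up entirely of Roman-lookalike Cyrillic letters.
    return len(label) > 0 and all(ch in CYRILLIC_LATIN_LOOKALIKES for ch in label)


if __name__ == "__main__":
    fake_apple = "\u0430\u0440\u0440\u04cf\u0435"
    print(should_show_punycode(fake_apple, "com"))           # lookalikes only
    print(should_show_punycode(fake_apple, "\u0440\u0444"))  # script-limited TLD
```

An ordinary Cyrillic word contains letters with no Roman twin, so it falls outside the lookalike set and would still be rendered normally on .com.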

Firefox remains the odd person out here, partly due to concerns that showing unrendered punycode unfairly makes Roman characters the default and penalizes domain holders who have no phishing purpose at all.

Solve and monitor for punycode phishing

Fortunately, if you’re a Firefox user and you don’t visit internationalized (IDN) domains or care how these URLs are displayed, you can turn off the rendering:

  1. Go to about:config.
  2. Enter punycode in the search field. This will show the network.IDN_show_punycode setting.
  3. Double-click false and it changes to true.

Firefox still shows problematic punycode. Change one of its settings, and the underlying domain ASCII is always shown.

Here are three more tips to help you avoid punycode phishing, in case other tricks slip past browsers’ homograph defenses.

  • Before clicking a URL, copy and paste it into a text-editing app, like TextEdit or BBEdit. Even if the URL is rendered in the location/search bar, when you copy it, the underlying punycode is copied, and revealed when you paste.

  • Check the certificate for an https connection. This generally involves clicking the lock icon in the location/search bar and then following the steps to view the underlying certificate. Any major site, like Apple’s, will have a directly registered certificate. The certificate authorities that create these certificates are supposed to perform certain kinds of validation (more for “extended validation” versions) that make the details in a certificate relatively reliable. If the certificate’s owner doesn’t match your expectation, use the copy/paste trick above.

  • Only fill in account information via a password manager like 1Password or LastPass. They aren’t fooled by homographs; they rely on the actual domain name’s underlying details for autofilling. If your password manager won’t fill in login information, you’re being phished (or the site is misconfigured and referred you to a domain other than the one at which you stored the password).
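
The first and third tips both come down to looking at the real, ASCII hostname rather than what is drawn on screen. Here is a minimal Python sketch of that idea. The function names, the sample URLs, and the exact-hostname matching are assumptions made for illustration; real password managers use more elaborate matching rules.

```python
from urllib.parse import urlsplit


def punycode_labels(url: str) -> list:
    """Return the labels of a URL's hostname that are punycode-encoded."""
    host = urlsplit(url).hostname or ""
    return [label for label in host.split(".") if label.startswith("xn--")]


def can_autofill(url: str, saved_domains: set) -> bool:
    """Mimic a password manager: fill only on an exact hostname match."""
    host = urlsplit(url).hostname or ""
    return host in saved_domains


if __name__ == "__main__":
    saved = {"www.apple.com"}  # hypothetical saved login
    pasted = "https://xn--bcher-kva.com/login"  # illustrative pasted URL
    print(punycode_labels(pasted))      # any xn-- labels are worth a second look
    print(can_autofill(pasted, saved))  # no match, so nothing is filled in
```

A pasted URL that contains an xn-- label isn’t proof of phishing, since plenty of legitimate internationalized domains exist, but it is a cue to check where the link really leads.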
