HowToNotFuckUpArabic

So recently a friend pointed me at the excellent site Nope, Not Arabic, which is basically Engrish for Arabic. Except worse, because Engrish is often at least vaguely comprehensible. Now, I really don’t actually know Arabic, but I took a couple years of Arabic classes in high school and while I’ve forgotten most of the language through disuse, I still utterly love the script. I can look around from my desk and see two pieces of Arabic calligraphy hanging on my walls right now, and that’s honestly not enough. So, it makes me sad to see all these places that should know better screw up Arabic script so fundamentally.

Now, the person who runs the Nope Not Arabic site also has a very good article on how to make a computer render Arabic text properly, so I’m not going to try to do that. What I want to produce here is a short, easy tutorial for programmers, graphic designers and other such people who don’t speak Arabic and don’t really want to, but also might have to touch Arabic text and have enough pride in their work to not want to gratuitously screw it up. So this is a short primer on how Arabic text is written, so you can look at Nope Not Arabic and understand why a particular example is wrong. It is NOT going to have anything to do with the actual Arabic language, grammar or vocabulary, apart from where absolutely necessary.

Also, as I said, I barely actually know Arabic, so if I screw something up, please correct me. This is a wiki after all, go ahead and make an account.

Disclaimer: I’m just going to refer to “the Arabic script” as “Arabic” from here on. Remember that it’s used for various other languages as well. If I mean the language, I’ll write “the Arabic language”. Also, unfortunately in some browsers the default font for Arabic on this site is itty bitty, so I recommend you zoom in a bit so you can see the details of various letters that your eyes aren’t used to looking at. Sorry.

Fundamental stuff

Arabic is written right to left

You’d think this would be obvious, as Arabic is one of the most common right-to-left languages, but apparently not. There’s several examples in Nope Not Arabic of this getting messed up.
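One thing that trips up programmers here: a string stores Arabic letters in reading order (first letter read comes first), and the right-to-left layout only happens when the text is displayed. Here's a minimal Python sketch using the standard `unicodedata` module to peek at the direction class Unicode assigns to each character. The helper name `bidi_classes` is mine, purely for illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# "marhaba" stored in reading order: meem, reh, hah, beh, alef
word = "\u0645\u0631\u062d\u0628\u0627"

print(bidi_classes(word))         # Arabic letters report "AL" (Arabic Letter, right-to-left)
print(bidi_classes("abc"))        # Latin letters report "L" (left-to-right)
print(unicodedata.name(word[0]))  # the first stored character is the first letter you read
```

So if your Arabic shows up backwards, reversing the string is probably the wrong fix. More likely whatever is drawing the text isn't doing right-to-left layout at all.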

Arabic is made of letters

To the untrained eye, Arabic basically just looks like a pile of squiggles and dots. Fear not. Each of those squiggles is a letter, and you can take them apart and move them around pretty freely. We’ll go over that in a bit.

Arabic is always cursive

The letters in an Arabic word are written to connect to each other, like cursive handwriting in Latin script. Always. You cannot write a bunch of disconnected letters and still have it be correct. This is the prime sin committed in various images on the Nope Not Arabic site: someone copy-pastes some Arabic text and the program mangles the letter connections in the process.

Repeat after me: Arabic text is always cursive!

The annoying thing is, it’s not even particularly hard to make a computer program select the right form of a letter based on where it is in a word and connect it up with the other letters. It just takes a little work. The letters basically always start and end at the baseline of the word, or can be made to do so by the font, so there are no irritations like the common cursive o that starts at the baseline and ends off of it. What is hard is dealing with a mixture of left-to-right and right-to-left text in a sane way while still connecting the letters up correctly. If you’re dealing with right-to-left text only, it isn’t terribly hard.

Arabic has lots of fluff

There’s a big difference in complexity between “Arabic as written in the Koran, dictionaries, and academic texts”, and “Arabic as written by and for actual humans”. Arabic is visually defined by a plethora of dots, flourishes and other diacritics in odd places, which makes it look very busy and confusing. The thing is that a lot of these are optional, and are left out to a greater or lesser degree in everyday stuff like newspapers, handwriting and signs/advertisements.

As an example, take a look at the logo for Arabic Wikipedia, which TOOK SOME HECKIN’ DECIPHERING but which I think says “Welcome to Wikipedia”:

Welcome to Wikipedia - Fancy

Now here’s the same text except I’ve scratched out all the stuff that doesn’t need to be there:

Welcome to Wikipedia - Simple

The dots are part of the letters, the same way the dot on an i or the cross-stroke on a t are. All that other stuff is mostly just a built-in pronunciation guide, the equivalent of writing little IPA glyphs above and between your letters. See the appendix for details. All you actually need is the dots. You’ll often find one or two extra pronunciation marks (diacritics) to accent sounds or disambiguate words that are pronounced differently but otherwise spelled the same, such as English “lead” (not follow) versus “lead” (soft, heavy metal). But you basically never need to go ham and write in every possible diacritic in every possible spot. In this case it’s there just because it looks cool. This is pretty common in stuff like titles and logos.

The optional diacritics are always above, below or sometimes beside a letter, and generally rendered in a lighter line than letters and their associated dots. Usually words can be written entirely without them, occasionally specific words have one or two for clarity. But if you see text that has one or more of these marks for every letter in the word, realize you’re reading a super fancy document or a dictionary definition.
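If you'd like to see the "fluff versus letters" split in code, here's a rough Python sketch. It leans on the fact that Unicode stores the optional marks as separate combining characters (category `Mn`), while the dots are baked into the letter itself. The function name `strip_marks` is made up for this example, and the sample word is a fully marked-up spelling of "marhaban".

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (Unicode category Mn); letters keep their built-in dots."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# meem+fatha, reh+sukun, hah+fatha, beh, alef+fathatan
fancy = "\u0645\u064e\u0631\u0652\u062d\u064e\u0628\u0627\u064b"
plain = strip_marks(fancy)

print(fancy, "->", plain)
print(len(fancy), "code points ->", len(plain), "code points")
```

This is deliberately crude: it also throws away the one or two marks a careful writer left in on purpose, so treat it as a way to look at text rather than a way to clean it up.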

The grammar and occasionally spelling of the Arabic language is also very complex, adding or mutating letters based on the sentence structure… but that generally doesn’t actually change how things are written, so we don’t have to worry about that. You might run into weird diacritics or stuff that looks like letters with extra bits tacked on. We’ll go over a few of the common ones later, but for the most part you can just treat them like the letters they most resemble and you’ll probably do okay.

Letters

The Arabic language has 28 letters. Here they are, shamelessly stolen from Wikipedia, with my… ahem, interpretation of what English letter/sound they most resemble:

Arabic alphabet
Letter Approximate English sound
ا AA
ب B
ت T
ث TH as in “throw”
ج J
ح H
خ KH like the Klingons use it
د D
ذ TH as in “though”
ر R
ز Z
س S
ش SH
ص S but different
ض D but different
ط T but different
ظ Z but different
ع A with more phlegm
غ GH like the Klingons use it
ف F
ق Q as in Iraq
ك K
ل L
م M
ن N
ه H
و OO
ي EE

Good thing I’m not trying to write about the actual language, isn’t it? The only letters that we have to really recognize are the vowels, because they often have a couple extra bits attached, plus a handful of characters we’ll talk about in a minute. Note that there’s only really three vowels, AA and EE and OO, or rather, ا and ي and و. These are “long vowels” and you pronounce them as such. (Though at the beginning of a word, و has the W sound and ي has the Y sound.)

For our purposes you don’t have to worry about the pronunciation, but take some time to see the patterns in how all the letters are written. Get a piece of actual paper or something and write each of them down a couple times to get a feel for it. Remember, right to left.

The thing is that each of these letters has four forms, depending on whether it’s standing alone or it’s at the beginning, middle or end of the word. Lots of people freak out about this and get frustrated at the idea of having to memorize 112 entirely distinct letters. But it’s not actually a huge deal, because they’re not entirely distinct. As you can see, the letters are mostly variations on fewer than 10 actual shapes, plus a few standalones. And the starting, middle and ending forms for these shapes are basically the same. It’s not 100% consistent, ن is written more rounded than ب for example so there’s a bit more than just moving the dot, but it is pretty close. So here are all the forms, grouped vaguely by how I think they look:

Isolated form Final Medial Initial
ب ـب ـبـ بـ
ت ـت ـتـ تـ
ث ـث ـثـ ثـ
ن ـن ـنـ نـ
ي ـي ـيـ يـ
ف ـف ـفـ فـ
ق ـق ـقـ قـ
س ـس ـسـ سـ
ش ـش ـشـ شـ
ص ـص ـصـ صـ
ض ـض ـضـ ضـ
ط ـط ـطـ طـ
ظ ـظ ـظـ ظـ
ح ـح ـحـ حـ
خ ـخ ـخـ خـ
ج ـج ـجـ جـ
ع ـع ـعـ عـ
غ ـغ ـغـ غـ
ل ـل ـلـ لـ
ك ـك ـكـ كـ
م ـم ـمـ مـ
ه ـه ـهـ هـ

All of these letters connect up to each other at the baseline of the letter, so if you wanted to write a word spelled “ف ث ك” (which doesn’t spell anything I think), you just choose the “starting word” form of ف which is فـ, the “middle word” form of ث which is ـثـ, and the “end word” form of ك which is ـك. Mush them together, and you get فثك. See? Easy. If you’re typing these things into a computer program and it doesn’t do this for you, your computer program is broken and you should use one that isn’t broken.
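For the curious: Unicode happens to carry a separate compatibility character for each positional form of each letter (the Arabic Presentation Forms-B block), which makes it easy to look the forms up one at a time. Normal text should not be stored this way; you store the plain letter, and the font and text renderer pick the form. Here's a small Python sketch that digs the forms out of the standard `unicodedata` tables. `positional_forms` is a hypothetical helper name, not anything official.

```python
import unicodedata

def positional_forms(letter):
    """Map form name -> presentation-form character for a single Arabic letter."""
    forms = {}
    # Arabic Presentation Forms-B block, U+FE70..U+FEFF
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like "<initial> 0641"; ligature entries list more code points, so skip them
        if len(parts) == 2 and chr(int(parts[1], 16)) == letter:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

feh = "\u0641"
for form, ch in positional_forms(feh).items():
    print(form, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

If you ever find these U+FExx characters in text someone handed you, that's a hint it went through an old program that did its own shaping.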

You may have noticed some letters were missing from that chart, so I’m about to make it a little more complicated: There’s six letters that only connect to the right, they don’t connect to anything on the left side. These are:

Isolated/Initial form Medial/Final
ا ـا
و ـو
ر ـر
ز ـز
د ـد
ذ ـذ

Again, there’s only like 3-4 actual unique forms depending on how you look at it. If one of these letters happens at the end of the word, you just use the end-of-word form. If it occurs at the beginning of a word, you use the isolated form; if it occurs in the middle, you use the end-of-word form (assuming the letter before it connects). Either way, you then start off again with the initial form of the next letter. You leave a gap, but a smaller one than the space between words.

So, let’s investigate some more nonsense words. What if “ف ث ك” was “ف ر ك”? Then you write it “فرك”. If it were “ذ ث ك” then, similarly, it becomes “ذثك”. And if there’s one of these letters at the end of the word, like “ف ث ا”? You get “فثا”. ezpz. I strongly suggest noodling around with this on paper to play with the various forms. Try writing your name, names of places you like, swear words, whatever nonsense you want. It will stick in your brain better if you do it by hand.
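The rules above are simple enough to sketch in a few lines of Python. This is a toy, not a real text shaper: it assumes the word contains only the 28 plain letters (no diacritics, no hamza, no spaces), and it only names which form each letter should take. `choose_forms` and `NO_FORWARD_JOIN` are names I made up for the example.

```python
# The six letters that never connect to the letter after them (i.e. on their left)
NO_FORWARD_JOIN = set("\u0627\u0648\u0631\u0632\u062f\u0630")  # alef, waw, reh, zain, dal, thal

def choose_forms(word):
    """Return (letter, form) pairs; form is isolated, initial, medial or final."""
    result = []
    for i, letter in enumerate(word):
        # Joined to the previous letter only if that letter is able to connect forward
        joins_prev = i > 0 and word[i - 1] not in NO_FORWARD_JOIN
        # Joined to the next letter only if there is one and this letter connects forward
        joins_next = i < len(word) - 1 and letter not in NO_FORWARD_JOIN
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        result.append((letter, form))
    return result

# The four nonsense words from the text: feh-theh-kaf, feh-reh-kaf, thal-theh-kaf, feh-theh-alef
for word in ("\u0641\u062b\u0643", "\u0641\u0631\u0643", "\u0630\u062b\u0643", "\u0641\u062b\u0627"):
    print(word, [form for _, form in choose_forms(word)])
```

In real software you'd leave this job to the font and the text rendering library, which also handle ligatures and all the marks this sketch ignores.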

One last thing: often in handwritten Arabic and some fancy fonts, two dots get merged into a single dash (or apparently a tilde, though I haven’t seen that myself before), and three dots get merged into a little caret-looking thing. Here’s an excellent example from this blog, which saves you from my horrible handwriting:

Dot variants

Ligatures and other weird stuff

Arabic is infamous for these. However, this also is not actually a big deal in my experience, because there’s only a handful that you’re likely to actually encounter often.

First off, the most important ligature is “L A”, that is, ل ا . You put these together and you get لا. There’s a couple different ways to write it, same way there’s a couple different ways to write a lower case “a”, but the difference is only stylistic. Since it ends with ا, it only connects to the right, so “ل ا م” is “لام”, and you can see it in the middle of other words like “تعديلات” (a real word for once!). You basically have to do this ligature, in printed text and handwriting, and this letter sequence is pretty common. Not doing it will Look Wrong. However, it is also the only ligature that you have to do; any others are just for style points. For extra credit, see how many ligatures you can find in the Arabic Wikipedia logo. Here’s one for free: The first word is مرحبا, not وحبا.
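A quick way to convince yourself that this ligature really is just two ordinary letters drawn together: Unicode has a compatibility character for the joined-up shape, and compatibility normalization splits it back into lam followed by alef. A small Python sketch with the standard `unicodedata` module (the constant and variable names are mine):

```python
import unicodedata

LAM_ALEF_ISOLATED = "\ufefb"  # compatibility character for the lam-alef ligature

print(unicodedata.name(LAM_ALEF_ISOLATED))

# NFKC normalization splits the ligature back into its two letters
letters = unicodedata.normalize("NFKC", LAM_ALEF_ISOLATED)
print([unicodedata.name(ch) for ch in letters])

# In stored text you only ever look for the plain two-letter sequence
word = "\u062a\u0639\u062f\u064a\u0644\u0627\u062a"  # the real word from the text above
print("\u0644\u0627" in word)
```

As with the other presentation forms, the ligature character is for compatibility only; text should contain the two plain letters and let the font draw the ligature.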

The Arabic language also has some weird… things that it pretends aren’t letters but sure as heck act like them. Again, I don’t actually know Arabic per se, so I kinda know what these are but sure can’t explain how they work. Fortunately, as far as I can tell there’s only a few of them:

  • Ta marbutah, which literally means “tied-up T”. It’s the H letter, ه, with two dots over it, like so: ة – or rather, it’s the T letter ت, bent into a circle. It works the same way that ه does though, and only appears at the end of words, so you’ll usually see it as ـة, and it’s pronounced sometimes like a final h or sometimes like a final t or sometimes like not much at all. It’s not considered as a letter in the alphabet for whatever reason, I think it’s used only to modify words based on grammar. But it’s pretty common in text, so you’ll have to know it when you see it.
  • Alif maqsura, which means “limited A”, is also written only at the end of words, and is a dotless ي, so, ى or ـى. It’s a weird “A”, and is uncommon. That’s all I know about it.
  • Hamza, which is a little squiggle like a backwards 2: ء. It represents a glottal stop. Occasionally you’ll see it alone like that at the end of a word, but more often it gets attached to a vowel, so you’ll see إ or أ or ؤ or ئ . Note that in the last case, it’s a ي but the dots beneath it vanish when you add the ء, for whatever reason. Also for whatever reason, there’s something that looks like it in the standalone and terminal forms of ك. Those aren’t a separate hamza, those are just part of the letter the same way the dots are, because why not.
  • There’s also آ. Not sure what the deal with that is. It’s an ا with a squiggly bit.

All of these weird not-letters are just treated the same way as letters or parts of letters, as far as I know. If you leave them out or screw them up, your text may be misspelled but people will generally still be able to read it.
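If you run into one of these and want to know what to search for, Unicode gives each of them its own code point and an official name, and for the hamza and squiggle combinations it also records which base letter and which mark they're built from. A minimal Python sketch using the standard `unicodedata` module; it tells you names, not grammar. `ODDITIES` is just my label for the list.

```python
import unicodedata

# teh marbuta, alef maksura, hamza, then the alef/waw/yeh + hamza combos, then alef with madda
ODDITIES = "\u0629\u0649\u0621\u0623\u0625\u0624\u0626\u0622"

for ch in ODDITIES:
    # NFD splits a precomposed character into base letter + combining mark, if it has any
    pieces = [unicodedata.name(p) for p in unicodedata.normalize("NFD", ch)]
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "=", " + ".join(pieces))
```

Note that Unicode's spellings differ a bit from mine (it says "ALEF MAKSURA", for instance), so searching by code point is the safer bet.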

Case study

I found this image on Nope Not Arabic particularly painful:

Ouch

Now, the word for “hello” in Arabic is “marhaba”; even I remember that. But if you squint at the Arabic text a bit, it sure isn’t that. In fact… Not only are the letters disconnected, but they’re written left-to-right. That’s… about as embarrassing a travesty as you can get. They’re trying to write “hello” and they actually write something more like “olleЧ”. In an image demonstrating how to write non-Latin scripts correctly. It’s hard to fuck up Arabic more than that.

So, how do we do it right?

The actual letters are correct: M R H B A. No funny grammar or spelling here. So first we arrange them from right to left, and get “م ر ح ب ا”. We find the initial form of م and connect it to the ر, and get “مر ح ب ا”. Off to a good start. ر doesn’t connect to the letter left of it, so we take the initial form of the ح, connect it to the middle form of the ب, and connect that to the final form of the ا to get “مرحبا”.

Looks plausible. Did we do that right? Let’s try to look it up. That dictionary translates “hello” as “مَرْحَباً”, but we recognize the little extra dashes and circles over the letters as diacritics we can leave out, so without those it just says “مرحبا”. We did it right!

See, Arabic isn’t so hard! Until you try to learn the Arabic language and get to third-year grammar, that is.

Conclusion

Don’t be scared of complicated scripts. Write letters, from right to left, and connect them together properly. If you do that then you still might not be writing what you actually want to write, but chances are good that you’ll just have something spelled wrong. At least you won’t be actively butchering the language, and a native speaker will probably look at it and say “lol they spelled it wrong”, and not “YOU DENSE MOTHERFUCKER”. And short of taking a few years to actually learn the language well, or having a friend who has, that’s sometimes the best one can really hope for.

Getting a computer to do this, of course, is left as an exercise to the reader. But hopefully the reader can at least now tell when the computer is doing it wrong.

Appendices

In which we talk about some of the aforementioned fluff.

Short vowels and other deelyboppers

Written Arabic is mostly just consonants, with a few vowels where they really matter. But there are vowel sounds between consecutive consonants, they are pronounced, and they do matter. Usually you just don’t bother writing them because it’s perfectly clear what the word is without them. The Arabic language was originally written mostly without these and even without dots, and people just glarked the right word from context. However, during the Islamic Golden Age a lot of people who weren’t native Arabic speakers were suddenly interested in learning Arabic and pronouncing it correctly, and this ambiguity caused issues for them. People started adding dots to mark some vowels, but it was still ambiguous and not great. Then a total baller genius named Al-Farahidi refined and simplified the system to add extra markers to the letters to explain exactly how you pronounce all parts of the word, and the most common of these are the “short vowels”. This is basically the system that gets used today.

Short vowels and other diacritics take the form of little letter-like strokes above or below a letter. There are three short vowels, matching the long vowels: A and E and O. They are written ـَ and ـِ and ـُ respectively, which makes sense to me. ا is tall so a short A goes above the letter, ي goes below the baseline of the word and so short E goes below the letter, and short و is just a mini version of it. Then there’s ـْ which means “no short vowel”. They can be attached to vowels or other consonants, and are used for diphthongs as well. Then there are other things that mark “double consonant”, “terminal N” or other weird mutations of words. Frankly, Wikipedia will do a far better job of explaining the whole list than I will, so check that out for the full list.
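In a text file, each of these marks is its own character stored right after the letter it decorates. Here's a small Python sketch that prints the official Unicode names of the marks mentioned above, each one stuck onto a ب so you can see where it lands. The names `MARKS` and `BEH` are just mine for the example.

```python
import unicodedata

# short A, short E, short O, "no vowel", "double consonant", one of the "terminal N" marks
MARKS = "\u064e\u0650\u064f\u0652\u0651\u064b"
BEH = "\u0628"

for mark in MARKS:
    # A combining mark is stored after its letter and has a nonzero combining class
    print(BEH + mark, f"U+{ord(mark):04X}", unicodedata.name(mark),
          "combining class", unicodedata.combining(mark))
```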

Non-Arabic Arabic letters

The Arabic script actually gets used for a lot of different languages, and a lot of them have sounds that aren’t in the Arabic language. It’s used in Persian, it’s used in a bunch of African and Central Asian languages, there’s even Chinese languages written in Arabic script. So, much like English uses things like “kh” when spelling out Arabic words or “x” when spelling out Chinese ones to represent sounds that the English language doesn’t have, various people have mutated Arabic letters a bit and made new letters that represent sounds that the Arabic language doesn’t have. You can find a whole ocean of them on Wikipedia.

Easy example: Arabic has no letter P. So if you were trying to write “Pepsi” in Arabic, the P becomes a B, and you might write something like بيبسي, “Beebsee”. However, sometimes people use پ‎ for P instead, that is, a ب with three dots instead of one. It’s not really a formal or proper thing to do, but occasionally you see it when someone wants to be cool and write a foreign word more accurately.

Actually now that I look it up, apparently in Persian and some other languages, پ is a real letter that is used for the P sound, so maybe this usage isn’t as made up as I first thought. So, just be aware that the same way that English might casually steal the ñ in “El Niño” or the é in “élan” and pretend those words were theirs all along, Arabic may randomly swipe letters from other languages.
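For programmers, the upshot is that پ is a separate character from ب, so the two won't match each other in a search. Here's a tiny Python sketch of the P-becomes-B substitution described above. The function `fold_peh` is hypothetical, and the peh spelling of "Pepsi" is one I built just for illustration, not something I've checked against real usage.

```python
import unicodedata

PEH = "\u067e"  # the three-dot letter used for P
BEH = "\u0628"

def fold_peh(text):
    """Replace peh with beh, the Arabic-language approximation of P described above."""
    return text.replace(PEH, BEH)

print(unicodedata.name(PEH))

# Illustrative spelling only: peh, yeh, peh, seen, yeh
pepsi = PEH + "\u064a" + PEH + "\u0633\u064a"
print(pepsi, "->", fold_peh(pepsi))
```

Only do this kind of folding if you know the text is meant to be the Arabic language; in Persian you'd be introducing a spelling mistake.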

Numbers

Or, “How to read license plates in Jordan”.

  • “Arabic” numbers (used in European languages): 0 1 2 3 4 5 6 7 8 9 10
  • “Indian” numbers (used in Arabic languages): ١٠ ٩ ٨ ٧ ٦ ٥ ٤ ٣ ٢ ١ ٠

As you can see, these things are written pretty similarly to each other, often just twisted or rotated or redone a bit. This is of course not a coincidence. As the names imply, this style of number started in India, migrated through the Middle East, and ended up in Europe. Maybe in China they call these things “European” numbers?

The thing is, take a look at that last number, ١٠, which is of course 10. You’ll notice it’s written in the same order as the European version, despite Arabic text being written the opposite direction. So the number 5193 is written ٥١٩٣, in the same order as European numbers are written, even when part of a full sentence you’re reading right to left such as “الرقم ٥١٩٣ رائع”. And if that sort of nonsense doesn’t give the computer people writing bidirectional text systems fits, I don’t know what will.
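Here's what that looks like from the programmer's side, as a minimal Python sketch. The digits are stored most-significant-first, same as European ones, and Unicode marks them with their own direction class so the renderer knows not to flip them. `to_arabic_indic` is a helper name I made up.

```python
import unicodedata

# The ten digits as used in Arabic text, U+0660..U+0669, in order 0-9
ARABIC_INDIC = "".join(chr(0x0660 + d) for d in range(10))
TO_ARABIC_INDIC = str.maketrans("0123456789", ARABIC_INDIC)

def to_arabic_indic(number):
    """Render a non-negative integer with Arabic-Indic digits, most significant digit first."""
    return str(number).translate(TO_ARABIC_INDIC)

text = to_arabic_indic(5193)
print(text)
print(int(text))                           # Python's int() accepts these digits directly
print(unicodedata.bidirectional(text[0]))  # "AN", Arabic Number: digit order is not reversed
```

So there's no need to reverse anything yourself when dropping a number into Arabic text; if the digits come out in the wrong order, the renderer's bidirectional handling is the thing to suspect.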

Of course, it could be argued that Europeans write the numbers the wrong way around, and ٥٦, 56, should actually be written as 65. After all, that’s how many languages such as German actually say it: “six and fifty”. Fortunately though, we’re just stuck with it now and nobody’s going to change it.