cyber.dic 2.0: Expand Your Computer’s Vocabulary
We’ve just released a major update of cyber.dic, the spell checker add-on specializing in cybersecurity terms. It’s the latest resource to come out of a need that we, as editors at Bishop Fox, identified for more consistency in the language used across the cybersecurity industry.
The Bishop Fox editorial team initially made the Cybersecurity Style Guide as an attempt to make sense of the dynamic world of software and cybersecurity terminology for ourselves. As the guide was adopted by users across the internet, we learned that what had begun as a way to keep ourselves accurate, consistent, and forward-thinking within the company was also a useful tool for writers across all fields interacting with technology — journalists, developers, sci-fi writers, and so on.
Some researchers requested an open source version of the style guide, but allowing multiple versions would have quickly diluted the guide’s power as a tool for resolving language conflicts. Instead, we turned to a more adaptive format to address this request: the spellcheck dictionary. We made a companion file, called the cyber.dic, that would add spellcheck support for industry-specific terms in people’s word processors. This way, they wouldn’t need to check the style guide for basic questions or second-guess their own expertise on technical spellings.
The guide, and the cyber.dic that evolved from it, have been a way to extend editors’ ability to communicate farther than we ever could by just working with individual authors. The guide is an aid for people to actively reference outside the document they are working on; the cyber.dic instills confidence and helps writers within the document as they are writing. The only catch is, the cyber.dic has limited communication with a writer: It can only indicate a word’s correctness by adding or not adding a squiggly red underline.
WHAT SPELL CHECKERS USUALLY DO
Practically every word processor in use today has some kind of spell checker that determines if you’ve misspelled a word or committed some sort of grammar sin. They are usually based on some established dictionary. For example, Microsoft Word’s documentation shows that its proprietary dictionary pulls from the American Heritage Dictionary and World English Dictionary. Meanwhile, LibreOffice uses an open source format and approach to crowdsource continuous refinement.
Our point is that spell checkers aren’t a panacea for spelling problems: Like any software, they can be as flawed as the people who create them. If you’ve heard of the Cupertino effect, then you’ll remember how some early spellcheck dictionaries supported the spelling “co-operation” but not “cooperation.” Anyone who typed the latter risked having it autocorrected to “Cupertino,” the name of the California city where Apple is headquartered.
Installed spell checkers aren’t typically at the forefront of technological developments either. While they correctly highlight misspellings and give valid suggestions for many common types of writing, documents about tech and cybersecurity typically end up riddled with red underlines that may or may not be necessary or accurate. As a result, the underlining may be an unwarranted distraction, undermine your confidence in the subject matter, and slow your momentum when drafting a document.
The same problem applies to words that don’t get underlined in technical documents; the spell checker may not properly point out misspellings that happen to be dictionary words. Because automatic spell checkers have to cater to the entire range of possible topics someone might write about, awkward situations can arise with technological subject matter. For instance, an accountant might write about an asset’s value depreciating over time, but a security analyst more typically discusses deprecating an old, vulnerable piece of software.
SPELL CHECKING WITH CYBER.DIC
The cyber.dic is meant to account for the limitations of default word processor dictionaries. We’ve provided a supplemental list of over 3,000 terms that adds onto your regular spell checker, along with an exclusion file that acts as an “anti-dictionary” to underline anything that should be flagged as potentially wrong. Here’s a brief overview of how cyber.dic enhances the built-in spellcheck dictionary.
What does it mean when there’s a red underline?
- You misspelled a term that is in the cyber.dic-augmented spellcheck dictionary.
- You spelled a real word that isn’t typically used in tech writing and that you should double-check (e.g., depreciate, breech).
- You spelled a term correctly that isn’t yet in cyber.dic. (Email [email protected] with your suggestions.)
With cyber.dic installed, a red underline is more likely to be valid. Maybe you spelled the name of a technology slightly differently from how it’s meant to be spelled, or you used a version of a compound word that isn’t consistent with how most people in the industry spell it. The exclusion file also tailors the spell checker to catch typos that are valid words used in the wrong context. Meanwhile, if a rare term is actually correct but isn’t in the cyber.dic, then you can easily add it to your custom dictionary.
What does it mean when there’s no underline?
- You spelled a term correctly that is in your cyber.dic-augmented spellcheck dictionary.
Success! Your document is no longer littered with unnecessary distractions, and you can confidently navigate through the landscape of acronyms like GDPR, SMTP, and SNMP; objects like YubiKeys and Torx screws; and activities like Zoombombing, sinkholing, and safelisting.
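To make the underline logic above concrete, here is a minimal Python sketch of how a built-in dictionary, a custom dictionary, and an exclusion file might combine. It is an illustration under stated assumptions, not how Word or LibreOffice is actually implemented: the function name `is_underlined` and the toy word sets are hypothetical, matching is exact (capitalization is ignored), and no word appears in both the custom dictionary and the exclusion file, because the two word processors resolve that conflict differently.

```python
def is_underlined(word, builtin, custom, exclusion):
    """Simplified model: True if `word` would get a red underline."""
    if word in exclusion:
        # The exclusion file acts as an "anti-dictionary": flag the word
        # even though the built-in dictionary accepts it.
        return True
    # Otherwise the word passes if either dictionary knows it.
    return word not in builtin and word not in custom


# Toy word sets (hypothetical, for illustration only).
builtin = {"depreciate", "deprecate", "breech", "breach"}
custom = {"YubiKey", "sinkholing", "safelisting"}
exclusion = {"depreciate", "breech"}

for word in ["sinkholing", "depreciate", "deprecate", "sinkholling"]:
    print(word, is_underlined(word, builtin, custom, exclusion))
```

In this model, "sinkholing" passes because the custom list knows it, "depreciate" is flagged because the exclusion list overrides the built-in dictionary, and the typo "sinkholling" is flagged because no list contains it.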
STYLE GUIDE VS. CYBER.DIC
The Cybersecurity Style Guide is a manual in two ways:
- It is analog and meant for humans to use.
- It is meant to guide writers with explicit instructions on what words to use and how to use them.
It might seem at first glance that turning a style guide’s word list into a spellcheck dictionary would be a simple copy/paste action, but a lot more actually went into that transition.
The cyber.dic is not like the style guide in two important ways:
- It is not meant for a person to read. It is meant to be consumed by a computer program and used as a set of instructions with binary implications (red line or no red line).
- It only implicitly instructs a writer on spelling, filtered through the spellcheck software in use.
The table below gives an overview of the changes and additions that went into the cyber.dic’s word list:
| |THINGS STYLE GUIDE HAS|THINGS STYLE GUIDE DOESN’T HAVE|
|---|---|---|
|THINGS CYBER.DIC NEEDS|A big, curated word list of terms used in security, programming, and corporate discussions|Plural, conjugated, and possessive forms of every word that has those forms; words categorized based on whether they contain spaces, hyphens, punctuation, etc.|
|THINGS CYBER.DIC DOESN’T NEED|Supplementary information: pronunciation, meaning, related terms; terms we recommend against using (“abuse,” “segregate”)|Most things in this universe|
THE MAKING OF THE DICTIONARY
After adapting the initial style guide word list, there was still a lot left to do. We also had to learn a bit about how spellcheck software works.
Step 1: Research the Mechanics
Spellcheck dictionaries designed for specific users aren’t uncommon — academics and scientists have passed around field-specific word lists for years.
However, the resources for actually creating a processor-specific, user-defined dictionary were surprisingly sparse. We gathered tidbits of information about the inner workings of various spell checkers by wading through existing sources on how to make a dictionary, ranging from the banal (right-click and add word) to full-blown language building with Hunspell. Some of our most useful resources were blog posts by a few intrepid superusers who had previously researched the two word processors we wanted to support: LibreOffice Writer and Microsoft Word. In particular, we have to give a shout-out to Bob Mesibov* for his detailed technical guidance for the former and Suzanne S. Barnhill** for her expert documentation on the latter.
The process of getting the final product on GitHub would be simple once we found the right instructions online, right? We pieced together export processes for the word lists that got down to the technical detail of how to encode the text file and how to order the terms. Then, we wrote out installation instructions by aggregating information from existing sources. Done?
Step 2: Test the Limits
There was still one important technical problem that we could not find answers for anywhere: Seemingly no one had documented how spellcheck dictionaries would react to non-traditional terms that may or may not look like typical words. Our cybersecurity-specific word list included words with unexpected symbols (ATT&CK), words with hard capitalization rules (MitM), and terms that consisted of more than one word (RC4 NOMORE). Before finalizing terms in the dictionary, we needed to make sure they didn’t cause some unexpected disaster like trapping the spell checker in an infinite loop.
We began throwing real and nonsense words into dictionary files, typing weird sentences into documents, and observing the spell checkers’ reactions. The charts we built to decipher the results got longer as we found more edge cases to test, and inconsistent results from Microsoft’s spell checker were sometimes confounding.
We first had to figure out the general logic of spellcheck functionality, considering the following questions for each spell checker:
- How does it resolve conflicts? If the same word is included in a custom dictionary and exclusion file, does it get underlined as incorrect? (In Microsoft Word, no. In LibreOffice, yes.) If a real word that already exists in the built-in dictionary is added to the exclusion file, will the spell checker honor the exclusion? (Yes, but you have to include every form: initial caps, plurals, verb forms, etc.)
- How does it treat capitalization? If a dictionary includes an all-lowercase string, will the spell checker automatically accept the version of it with an initial uppercase character? (In LibreOffice, yes. In Word, dictionaries accepted it but exclusion files did not.) What about the all-caps version? (Sometimes.) What about the all-lowercase version of a term with initial caps? (No.)
- Do the same rules apply to dictionaries and exclusion files? (A resounding “no.”)
- What happens to non-letter characters in a word?
The question of non-letter characters was particularly important to figure out. Would a spell checker break when it encountered terms like ATT&CK, C#, HTTP/2, and ASN.1? We categorized the possible non-letter characters like this:
- Standard punctuation marks: periods, commas, apostrophes
- Special characters: symbols like @, &, #, /
The conclusions were, again, complicated. Word accepted the weirdest characters but did not allow spaces. LibreOffice, meanwhile, was perfectly fine with spaces but did not understand special characters or punctuation inside a term.
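The two character categories and the findings above can be sketched as a small Python check. This is a rough illustration only: the function name `compatibility` is hypothetical, the character sets contain just the examples named in the text (a real list would likely be longer), and the check treats a symbol anywhere in the term the same way.

```python
# Character classes named in the text. The members of each set are taken
# from the article's examples; a fuller list is an open assumption.
STANDARD_PUNCTUATION = set(".,'’")
SPECIAL_CHARACTERS = set("@&#/")


def compatibility(term):
    """Report which word processor could take `term` as a single entry."""
    has_space = " " in term
    has_symbol = any(
        ch in STANDARD_PUNCTUATION or ch in SPECIAL_CHARACTERS for ch in term
    )
    return {
        "word": not has_space,          # Word: odd characters OK, spaces not
        "libreoffice": not has_symbol,  # LibreOffice: spaces OK, symbols not
    }


for term in ["ATT&CK", "C#", "HTTP/2", "ASN.1", "RC4 NOMORE", "MitM"]:
    print(term, compatibility(term))
```

Under these assumptions, a term like ATT&CK fits only the Word list, RC4 NOMORE fits only the LibreOffice list, and MitM fits both.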
After verifying the technical capabilities of each word processor, it was time to codify the rules as conditional statements in Excel. We imported our already culled word list and began noting special cases. Here’s a comparison of how that spreadsheet looked between preparing for initial release and the more robust version with automation that we used to develop cyber.dic v2:
Our formulae have changed as we’ve discovered more exceptions over time and as updates to the software have caused subtle changes in how spell checkers function.
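We don't reproduce the Excel formulas here, but the capitalization findings listed earlier can be written as conditional statements in much the same spirit. The Python sketch below is an assumption-laden illustration, not the actual spreadsheet logic: the function name `entry_covers` and its parameters are hypothetical, and it returns None wherever the testing gave no clear answer.

```python
def entry_covers(entry, typed, processor, list_kind):
    """Does a list entry apply to the typed form?

    processor: "word" or "libreoffice"; list_kind: "dictionary" or "exclusion".
    Returns True, False, or None where the findings give no clear answer.
    """
    if typed == entry:
        return True
    initial_cap = entry[:1].upper() + entry[1:]
    if entry.islower() and typed == initial_cap:
        if list_kind == "dictionary":
            return True   # both processors accepted the initial-cap form
        if processor == "word":
            return False  # Word exclusion files did not extend to it
        return None       # LibreOffice exclusion behavior is not stated
    if typed == entry.upper():
        return None       # all caps: "Sometimes"
    # Includes the lowercase form of an initial-cap entry, which was rejected.
    return False


print(entry_covers("honeytoken", "Honeytoken", "word", "dictionary"))
print(entry_covers("webserver", "Webserver", "word", "exclusion"))
print(entry_covers("Torx", "torx", "libreoffice", "dictionary"))
```

Returning None instead of guessing keeps the inconsistent cases visible, which matters because these behaviors can shift with software updates.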
Step 3: Complete the Word Lists
The last part of the technical puzzle was how to make our dictionary and exclusion word lists as complete as possible while working inside the technical boundaries of the spellcheck software.
We went through the entire list multiple times to add variants of terms. On launch, we only included plurals for some singular terms (honeytoken/honeytokens), but with v2 we have considered all forms for all terms, adding verb forms (spidered/spidering) and possessives for proper nouns (GitHub/GitHub’s). We also paid more attention to terms with spaces that were somewhere in between a word and a phrase, splitting up those terms into two entries to account for Word’s notable inability to handle words with spaces in them (CIS CSC).
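As a rough illustration of that expansion, the Python sketch below generates variant forms and splits multi-word terms into separate entries for Word. It is not the process we used, which involved reviewing the list by hand: the helper names and the `kind` labels are hypothetical, the suffix rules are deliberately naive (they ignore irregular plurals and doubled consonants), and the straight apostrophe in the possessive is an assumption.

```python
def expand_forms(term, kind):
    """Return `term` plus naive inflected forms for the given kind."""
    if kind == "noun":
        return [term, term + "s"]
    if kind == "verb":
        return [term, term + "s", term + "ed", term + "ing"]
    if kind == "proper":
        return [term, term + "'s"]  # possessive; straight apostrophe assumed
    return [term]


def entries_for_word(forms):
    """Split multi-word terms into separate entries, dropping duplicates."""
    entries = []
    for form in forms:
        for piece in form.split():
            if piece not in entries:
                entries.append(piece)
    return entries


print(expand_forms("honeytoken", "noun"))
print(expand_forms("spider", "verb"))
print(expand_forms("GitHub", "proper"))
print(entries_for_word(["CIS CSC"]))
```

Any automatically generated form would still need a human check before it goes into the list, since a naive suffix rule can easily produce a word that does not exist.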
We also used the mighty exclusion file to overcome some limitations. For example, the style guide uses the term web server with a space, but Word’s built-in dictionary allows webserver without the space. Because we couldn’t simply add web server as a single term, we instead included variations of webserver in the exclusion file to override Word’s automatic spelling. This included webserver, webservers, and, because it wouldn’t account for capitalization automatically, Webserver and Webservers.
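The webserver case can be sketched in a few lines of Python. This is a minimal illustration assuming only what is described above, namely that each form needs its own initial-capital entry; the function name `exclusion_entries` is hypothetical.

```python
def exclusion_entries(forms):
    """List each form plus its initial-capital variant, in order, no repeats."""
    entries = []
    for form in forms:
        for variant in (form, form[:1].upper() + form[1:]):
            if variant not in entries:
                entries.append(variant)
    return entries


print(exclusion_entries(["webserver", "webservers"]))
```

Called with the singular and plural forms, it returns the same four entries named above: webserver, Webserver, webservers, and Webservers.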
Working with and around the strange host of limitations sometimes involved extreme mental gymnastics, and there are still some things that are easily understood by one word processor that we simply cannot add to the other processor’s version of cyber.dic. We like to think that with v2, we’ve managed to troubleshoot most of the problems in v1 and added a richer word list to make the cyber.dic more useful for everyone.
We started out imagining the cyber.dic as an extension of the style guide, but ultimately a spellcheck dictionary is a very different tool for a very different use case. Our approach to creating the latest version of cyber.dic has matured greatly since the initial release. As creators and users of the dictionary over the past year, we have learned its flaws, removed what doesn’t work, and fixed our methodology.
The original cyber.dic included 1,786 terms for LibreOffice Writer and 1,314 terms for Microsoft Word. Since then, it has more than doubled in size with terms from style guide updates and from our experiences using it, and we have devised a more standardized, systematic way to determine how to add terms. We stopped trying to pack in all of the style guide’s nuances and embraced the simpler communication style of the spellcheck dictionary format so that we could maximize our tool’s flexibility and give you a smoother user experience. (It looks like Microsoft has been tweaking its built-in spellcheck dictionaries, too — we like to think we may have given them a little nudge.) We hope that you find cyber.dic to be a valuable asset and that it helps you feel more confident as you write about tech and security.
*Build a Scientific Names Dictionary for LibreOffice and How to Build and Edit LibreOffice Dictionaries by Bob Mesibov.
**Create an Exclusion Dictionary and Mastering the Spelling Checker by Suzanne S. Barnhill.