Introducing jsluice: The Why Behind JavaScript Gold Mining (Part 1)
JavaScript. Depending on who you are it's a word that can instil fear, joy, or curiosity. Regardless of your opinions on Brendan Eich's polarising creation, it's hard to deny its influence on the web. Once a simple HTML salad garnished with gifs, the modern web is a bubbling cauldron of complexity with JavaScript as its thickening agent.
jsluice is an attempt to make it a little easier to find some of the tasty morsels bobbing around in that cauldron. To talk about why it needs to exist, we first need to talk about The Old Web. If you know what a webring is, or you've ever had to pick the exact right "under construction" gif, you might be able to skip some of this next bit.
The Old Web and The New
The old web was based mostly on HTML. Hyperlinks were clicked and rocketed you through cyberspace from one place to another; JavaScript was mostly reserved for adding a bit of flair along the way. This was a world quite easily understood by fairly simple programs. If you wanted to write some code that crawled the web you could get good coverage from just following those links:
<a href=/guestbook.html>Sign my guestbook!</a>
Extract that href attribute, attach the path to the site's domain, and you get a new URL to look at:
http://example.com/guestbook.html
Recursively repeat the process for the page you just loaded, and you've got yourself a pretty good picture of that whole website. Sure, you might want to grab copies of the images loaded with the <img> tag, or maybe even follow the trail left by the action attribute on <form> tags, but for the most part you could follow the regular links and you'd find ninety percent of what the web had to offer.
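To make that concrete, here's a minimal sketch of that kind of old-web crawling in JavaScript. The regular expression and the example URL are just illustrative assumptions, not how any particular crawler works:

// Fetch a page, pull out every href attribute, and resolve each one
// against the page's own URL (so /guestbook.html becomes an absolute URL).
async function extractLinks(pageUrl) {
    const html = await (await fetch(pageUrl)).text();
    const links = new Set();
    for (const match of html.matchAll(/<a\s[^>]*href=["']?([^"'\s>]+)/gi)) {
        links.add(new URL(match[1], pageUrl).toString());
    }
    return [...links];
}

extractLinks('http://example.com/').then(console.log);

Feed each new URL back into the same function and you've got the recursive crawl described above.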
New Powers for JavaScript
Around the turn of the millennium, Microsoft brought us something that would change the web forever: the pithily named XMLHttpRequest.
Those who looked beyond the inconsistent capitalization realized the true potential of this thing we had been given. The ability to make HTTP requests with JavaScript meant that new data could be fetched and displayed to the user without reloading an entire web page as had been the case for the history of the web up until that point. And so, people renamed it AJAX and set about coming up with new and interesting ways to load ads, track users, and generally make life a little more difficult for web crawlers.
It wasn't long before the Single Page Application (SPA) was born. The SPA eschewed the full-page loads of old in favor of loading just about everything with JavaScript. Somewhere along the way we got the fetch API to replace XMLHttpRequest, but the result was the same: the web browser could load a web page that was not much more than a single <script> tag, leaving JavaScript to build the structure of the page and fetch the data it needed from APIs.
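To make that concrete, here's a caricature of the pattern. The endpoint and field names are made up, but the shape is what a crawler now has to contend with: an almost empty page where JavaScript fetches the data and builds the markup itself.

// The HTML ships nearly empty; this script builds the page from API data.
async function renderGuestbook() {
    const entries = await fetch('/api/comments').then(r => r.json());
    const list = document.createElement('ul');
    for (const entry of entries) {
        const item = document.createElement('li');
        item.textContent = `${entry.name}: ${entry.msg}`;
        list.appendChild(item);
    }
    document.body.appendChild(list);
}

renderGuestbook();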
APIs
In the days of The Old Web, web servers generally served only a handful of types of data: HTML, Images, JavaScript files, CSS files, and maybe the odd Java applet. That made a lot of sense because those were the kinds of things that web browsers understood natively. With JavaScript suddenly able to make its own HTTP requests, the requirement for a browser to understand the data being returned was gone and a new requirement introduced: make it easier for JavaScript to understand the data being returned.
Early HTTP APIs were largely XML-based (and Microsoft seemed to think it would stay that way, given the name XMLHttpRequest). XML is complicated, so Douglas Crockford discovered a thing he called JavaScript Object Notation (JSON). JSON is based on a subset of the JavaScript syntax used to define objects. It looks a bit like this:
{"name": "A. Crawler", "age": 3, "likes": ["html", "hyperlinks"]}
The fact that this syntax already existed within JavaScript itself is why I say it was discovered rather than invented. That fact was also a big part of why JSON caught on so quickly: it made it super simple for JavaScript to understand JSON. Pass some JSON you got from an API to the eval function (or, once it was available, the more secure JSON.parse) and you've got a native JavaScript datatype you can manipulate with all your favorite JavaScript functions and facilities.
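As a quick illustration, both approaches turn the text into a native object, but eval will happily run any JavaScript hiding in the response, which is why JSON.parse won out:

const body = '{"name": "A. Crawler", "age": 3, "likes": ["html", "hyperlinks"]}';

// The old way: wrap the JSON in parentheses and hand it to eval
const viaEval = eval('(' + body + ')');

// The safer way: JSON.parse only accepts JSON, not arbitrary code
const viaParse = JSON.parse(body);

console.log(viaParse.likes[0]); // "html"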
With the data for web pages now being fetched from APIs by JavaScript, life got harder for web crawlers. JavaScript is complicated, and really the only things that understand it fully are the JavaScript engines used in web browsers.
The shift from using APIs on the server-side to the client-side also had security implications. Among those implications was the dramatically increased chance that the secrets sometimes used to authenticate with APIs would accidentally show up in JavaScript files from time to time. These secrets can sneak into JavaScript files for other reasons too. We'll come back to this later.
Crawling The New Web
There are a few reasons you might want to crawl the web. The first, and perhaps most obvious, is to build a search engine. Another reason, and the reason that we're most interested in, is to build a map of a web application for security testing. The more you know about a web application, the more endpoints you can probe for weaknesses, and the more vulnerabilities you are likely to find. If we want to crawl The New Web, the most obvious solution is to use a headless browser.
A headless browser is the same as a regular browser; you just don't get to see the web page. All the same things happen though: the JavaScript runs, builds the page structure, makes calls to APIs, and that kind of thing. Tools like chromedp let you inspect the final state of the web page after the JavaScript has done its thing: the links that are now in the page, the HTTP requests that were made, and so on. You do often have to "stimulate" the web page running in the browser though. Writing code to click buttons, scroll, and interact with other elements in the page causes more JavaScript events to fire, hopefully increasing the amount of the application you discover. That's kind of tricky to do well, but it's probably good enough if you want to build a regular search engine.
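By way of example, here's a minimal single-page version of that idea, sketched with Puppeteer rather than chromedp purely to stay in JavaScript. It records the requests the page makes and the links that exist once the page's own JavaScript has run:

const puppeteer = require('puppeteer'); // npm install puppeteer

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Record every HTTP request the page makes while it loads
    const requests = [];
    page.on('request', req => requests.push(req.url()));

    await page.goto('https://example.com/', { waitUntil: 'networkidle0' });

    // Collect the links that are in the page after the JavaScript has run
    const links = await page.$$eval('a[href]', anchors => anchors.map(a => a.href));

    console.log({ requests, links });
    await browser.close();
})();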
For security testing though, we want to know a web application's deepest, darkest secrets. Buried in the multi-megabyte monstrosities that some applications call JavaScript files are often swathes of functionality that are only triggered in rare circumstances, or sometimes not at all. The kind of analysis you can do with a headless browser, what you could call dynamic analysis - i.e. analyzing a program while it runs - is very powerful, but it requires that code is actually running to analyze it. If you can't make all the code run, you can't analyze all the code. This limitation makes room for static analysis - analyzing code that isn't running.
Static Analysis
In many ways static analysis isn't as capable as dynamic analysis. There are variables you don't know the value of, functions you don't know about, and it can be nearly impossible to figure out what the code will actually do without running it. That's not so much of a problem for us though, because we mostly care about extracting specific kinds of information:
- URLs, paths and their ilk – so we can probe them for weaknesses
- Juicy secrets that were left there by mistake
Let's focus on the first of those for now. There's a bunch of different ways JavaScript might use a URL or path.
It could make the browser navigate to a new page:
document.location = '/guestbook.html'
It could open a new window:
window.open('/help')
Or it could make an HTTP request:
fetch('/api/comments').then(handleResponse)
There are several other variations on that theme: strings assigned to different properties, passed to different functions, and so on.
Our goal is to extract the /guestbook.html, the /help, and the /api/comments from those examples. Now, I know what you might be thinking - let's use regular expressions. There's a popular quote about regular expressions along the lines of "if you use 'em to solve a problem, now you have two problems". That quote is really about the overuse of regular expressions, particularly in situations where they're not the best solution. Regular expressions are powerful, and jsluice does make use of them, but let's look at some of the reasons they're not a great fit for this task.
Regular Expressions and JavaScript
Rather than consider all the different scenarios we might encounter, we'll just focus on one example: making an HTTP request using the fetch function:
fetch('/api/comments').then(handleResponse)
The bit we're after is the first argument to the fetch function: /api/comments.
This is just about the simplest invocation of fetch you could have, and a simple regular expression would do the job in this case, say:
/fetch\('([^']+)'\)/
Despite looking a bit like a mangled kaomoji, that's not the worst first stab at the problem we could have had. It's not going to handle this case using double quotes, though:
fetch("/api/comments").then(handleResponse)
We need to update our regular expression to account for that. At first glance it seems easy to deal with: we just swap the quotes for a character class containing both kinds of quotes:
/fetch\(['"]([^'"]+)['"]\)/
That matches now, but what about this case, where a single quote appears inside double quotes?
fetch("/search/users/o'neill").then(handleResponse)
You might think I'm being unreasonable with this example, and you're kind of right about that, but the thing is that at scale, the seemingly unlikely becomes practically commonplace. The web contains just about every variation of how you could do something right, and even more examples of how you can do something wrong. Trying to account for even a decent proportion of them is going to be difficult.
As it happens, you can solve this particular problem with backreferences:
/fetch\((['"])(.*?)\1\)/
Don't forget to handle backticks for template literals, and whitespace before and after the string, and cases where fetch has a second argument, and cases where there are two strings concatenated together, and... you get the picture.
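You can see the problem for yourself by running the patterns we've built so far against those examples in a browser console or Node: the single-quote version misses the double-quoted call entirely, the character-class version chokes on the stray apostrophe, and only the backreference version extracts the path from all three.

const samples = [
    "fetch('/api/comments').then(handleResponse)",
    'fetch("/api/comments").then(handleResponse)',
    `fetch("/search/users/o'neill").then(handleResponse)`,
];

const singleQuotesOnly = /fetch\('([^']+)'\)/;
const eitherQuote      = /fetch\(['"]([^'"]+)['"]\)/;
const withBackref      = /fetch\((['"])(.*?)\1\)/;

for (const code of samples) {
    console.log(
        code.match(singleQuotesOnly)?.[1] ?? null,
        code.match(eitherQuote)?.[1] ?? null,
        code.match(withBackref)?.[2] ?? null,
    );
}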
Let's say you manage the gargantuan task of writing good, robust regular expressions for this scenario and all the others you want to handle. You now have to maintain that pile of ASCII-put-through-a-blender, and presumably debug it from time to time. I wish you luck, because you'll need it if you also want to add some context to the data you've extracted.
Extracting Context
Having a list of URLs and paths you pulled out of JavaScript is useful for a security-oriented crawler. It could be even better though. Here's a slightly more complicated fetch example to illustrate what I mean:
fetch('/api/v2/guestbook', {
    method: "POST",
    headers: {
        "Content-Type": "application/json"
    },
    body: JSON.stringify({msg: "..."})
})
Knowing that the /api/v2/guestbook endpoint exists is valuable, but knowing that it expects JSON sent with the HTTP POST method is even more valuable.
Now, I'm not going to claim that this is impossible to do with regular expressions. In fact, some Perl-addled lunatic may well be trying to prove it can be done at this very moment (please email me if this is you).
So, if regular expressions aren't the best tool for the job, what is?
Syntax Trees
When a programming language like JavaScript is executed, one of the first things to happen is that the text is parsed into a thing called an Abstract Syntax Tree (AST). An AST represents the structure of the source code and makes it easier for other programs to understand.
For the first time in this blog post I want to introduce you to the jsluice command-line tool, but only as a little preview, because it has a mode for printing the AST for any JavaScript file. Let's use a slightly cut-down version of one of our previous examples as the input:
fetch("/api/guesbook", {method: "POST"})
I've saved that as fetch.js and then run jsluice tree fetch.js to see the AST:
▶ jsluice tree fetch.js
fetch.js:
program
  expression_statement
    call_expression
      function: identifier (fetch)
      arguments: arguments
        string ("/api/guestbook")
        object
          pair
            key: property_identifier (method)
            value: string ("POST")
As you can probably see, things get verbose quite quickly, so it's a good job we only used a small example.
Having an AST has removed the requirement for us to concern ourselves with whitespace, quoting styles, and that sort of thing. It has also added context that lets us much more easily write code to extract the information we want. If we want to extract the request path and HTTP method from the above example, we can write relatively simple code to do so. Here's some pseudocode that should do the job:
Find call_expression branches in the tree
For each call_expression:
    If the function is not 'fetch': move on to the next one
    If the first argument is not a string: move on to the next one
    Save the first argument as $PATH
    If the second argument is an object:
        Find pairs where the key identifier is 'method'
        If one exists: Save the value as $METHOD
    Print the $PATH and any $METHOD
This process should work regardless of what kind of quotes were used, whether there's random whitespace scattered throughout, whether the method key was quoted or not, whether the second argument exists, and so on. It also gives us tons of opportunity to do any other clever things we might think of along the way, because instead of writing incantations full of backslashes and brackets we're just writing regular code in our language of choice.
This is how jsluice does its thing. It uses the absolutely superb MIT-licensed Tree-sitter library to get a syntax tree for the input JavaScript, and then walks that tree to pull out the gold that's just waiting there to be discovered.
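If you want to experiment with the same idea yourself before reaching for jsluice, here's a rough sketch of that pseudocode using Tree-sitter's Node.js bindings. The package names and traversal calls here are my own illustrative choices; jsluice itself is written in Go and works differently under the hood.

// npm install tree-sitter tree-sitter-javascript
const Parser = require('tree-sitter');
const JavaScript = require('tree-sitter-javascript');

const parser = new Parser();
parser.setLanguage(JavaScript);

const source = 'fetch("/api/guestbook", {method: "POST"})';
const tree = parser.parse(source);

// Find every call_expression in the tree
for (const call of tree.rootNode.descendantsOfType('call_expression')) {
    // If the function is not 'fetch', move on to the next one
    if (call.childForFieldName('function').text !== 'fetch') continue;

    // If the first argument is not a string, move on to the next one
    const args = call.childForFieldName('arguments').namedChildren;
    if (!args[0] || args[0].type !== 'string') continue;
    const path = args[0].text.slice(1, -1); // strip the quotes

    // If the second argument is an object, look for a 'method' pair;
    // fetch defaults to GET when no method is given
    let method = 'GET';
    if (args[1] && args[1].type === 'object') {
        for (const pair of args[1].namedChildren) {
            if (pair.type === 'pair' && pair.childForFieldName('key').text === 'method') {
                method = pair.childForFieldName('value').text.slice(1, -1);
            }
        }
    }

    console.log(method, path); // POST /api/guestbook
}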
Check out Part 2 of this series to learn how to extract important information from JavaScript code and files in your own projects.