I talked to Dr. Hariri the other day, and it turns out that the last feature wasn’t necessary for me to do, thank god. The next step of the project is threat testing, which is basically building a model for what can go wrong in terms of potential weaknesses in the software and how those weaknesses can be defended. I also pulled up the malicious XML files Doug gave me a month or so ago to test my completed program on actual malicious files, because up to this point I had been using a combination of the benign test files Chintan gave me and XML files I wrote myself for the purposes of testing different aspects of my program.
So, I ran the parser on the malicious files. For all you programmers out there, you know exactly what I’m about to say. The very first malicious file I tested the parser on broke the program.
Fast forward through a day or two and a number of hours of research, and I had discovered what the issue was (also, if you haven’t gathered yet, programming is an endless loop of running into and fixing errors). The Python module I used to navigate the XML files, ElementTree, wasn’t designed to navigate malicious files, so when a recursive payload attack was run against it (I’ll explain that one in a second, it’s called the Billion Laughs attack >:)) the parser more or less caved in on itself. If I let it run too long, it would even freeze my computer. So I did some research (aka begging the benevolent god that is Google for help) and found that there was another module, defusedxml, that did the same file navigation that ElementTree did, but this one had built-in detectors for four common attacks, one of them being the Billion Laughs Attack.
Okay, what is a module and what is this virus that sounds like it was developed by the Joker? I’ll keep you in suspense and go with the less interesting one first. Programming is essentially developing functions based on the functions available to you through the language (print, split, ”.join, etc.) and having the computer use them to do what you want it to do. A module is basically a class (the word ‘class’ as is used in programming isn’t relevant at the moment) which contains more functions you can utilize than the language had before. You have to download them, but they’re generally free and in my case very necessary. I’ve already had to download five extra modules for this program alone. Now on to the more interesting bit.
So the Billion Laughs attack is actually a pretty ingenious bit of code, even if it is a virus used to execute DoS attacks (overloading a system with junk information). It’s also known as the XML Bomb or more formally as the Exponential Entity Expansion Attack. Enough with the technicalities, here’s what it looks like (I pulled this one from Wikipedia, but the one Doug gave me is essentially identical):
<!ENTITY lol "lol">
<!ELEMENT lolz (#PCDATA)>
<!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
<!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
<!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
<!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
<!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
<!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
<!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
<!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
<!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
Yes I’m sure you understand what that bit of code does. I totally did too when I first saw it (<= sarcasm, if you couldn’t tell). Basically, each element in the next line refers to all the lols in the previous lines, ie the lol1s in entity lol2 refer to the whole line of lol1s. The lol2s in lol3 all refer to the lol2 entity, which refer to all the lol1 entities each of which refer to the whole line of lols in lol1. Each element in one line refers to the full line of all the previous, and it waterfalls down until at lol9 you have 100,000 lols that the parser has to go through. One more line and you’ve got your Billion Laughs attack and you can easily overload a system with just handful of these. It effectively makes a file of a few kilobytes seem like a file of a few gigabytes. That tiny chunk of code is a full-fledged virus.
Thankfully, the defusedxml module throws an error when it runs this and three other well-known attacks (which stops the program rather than overloading my CPU). Unfortunately, it throws an error and I had to deal with the fit my program threw for the single error. One error and the program stops. This time Google wasn’t much help and I had to ask Chintan after a few hours of failing to find a solution to said error. A little tinkering and my program was back on track.
Another thing this module fixed was an error I couldn’t figure out earlier, one in which for some reason my parser could only handle small samples of the 533 test files Chintan gave me. Now my program can churn through all 533 files at once and extract 23 attributes from each file and do so at an average of one file for every half second. Do the math, the program takes while to run but the parser works on every. single. file. This is one of those moments where I’m in awe at what computers can do rather than wanting to tear my hair out at what computers can do.
I very well may post again this week, but I just wanted to celebrate the fact that my program is running so well for the moment. It’s actually pretty simple to test the Billion Laughs attack since all browsers can run XML files. On a browser it doesn’t do much other than show you a hundred thousand lols though, so it’s not too bad. Parsers are the ones that really hate it.
Anyway, see you later!
Here’s the main article I used to figure out the Joker attack: https://cytinus.wordpress.com/2011/07/26/37/