Update: I’ve basically completed my XML parser, though there are some discrepancies in the data that I need to fix. Changing the module mitigated the Billion Laughs attack, but after testing and retesting the program, I’ve discovered that the module throws an error every time it detects an embedded reference in XML. I explained how references work in my previous post; in short, a reference is like a link on a website (or in the case of XML, a pointer to another bit of data in the file) that may or may not be malicious. Once I discovered this, I developed a way to test for false positives.
The 533 test files are all benign, and I had two resources for logging data on them: the console in PyCharm (the text editor/IDE I use for writing Python code), which prints data directly from the program, and the database, which stores 12,259 data points in total for the test files. The console rapidly spits out a dozen or so lines of text for every file, so counting false positives there wouldn’t be easy. The database stores data in relation to each individual file, so that’s not very useful for counting either. Neither would be effective at logging what I’m looking for.
Then I remembered, I can edit files. I’ve already had to work with reading files for the parser, so how hard could writing to them be? So that’s what I did.
Essentially, with a few lines of code it’s pretty simple to write a variable or a string (a line of text) to a file. The parameters required are the path name of the file and a letter that indicates what you want to do with the file: read it, erase it and write to it, or add to it. I tinkered with the last two. After a few Google searches and help from Stack Overflow, the program did what I wanted it to: log every instance of a false positive. Out of 533 files, I had nine false positives. Not too bad. It’s information that is definitely useful and needs to be addressed before it goes through the data analysis.
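For anyone curious, here is a minimal sketch of what that kind of logging could look like. The file name, the helper functions, and the sample entries are all made up for illustration; this is not the project's actual code.

```python
import os
import tempfile


def log_false_positive(log_path, xml_name, reason):
    """Add one line to the log for a benign file that was flagged."""
    # The second argument to open() is the mode letter:
    # "r" reads, "w" erases the file and writes, "a" adds to the end.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{xml_name}\t{reason}\n")


def count_logged(log_path):
    """Count how many lines (false positives) the log holds."""
    with open(log_path, "r", encoding="utf-8") as log:
        return sum(1 for _ in log)


def demo():
    # Hypothetical log location in a throwaway temp directory.
    log_path = os.path.join(tempfile.mkdtemp(), "false_positives.log")
    # "w" creates the file, or empties it if it already exists,
    # so each test run starts with a clean log.
    open(log_path, "w", encoding="utf-8").close()
    for name in ["sample_a.xml", "sample_b.xml"]:
        log_false_positive(log_path, name, "embedded reference rejected")
    return count_logged(log_path)
```

Starting the run with "w" and then switching to "a" for each entry means the count at the end reflects only the current run.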
Then I remembered an error I had to smooth over earlier: because XML is encoded in UTF-8 by default rather than in ASCII, one of the functions that was given to Chintan and me didn’t work on XML files if they had non-ASCII characters. ASCII is an encoding for basic letters and symbols, most of which (if you’re reading this in English) you have on your keyboard. Accented letters and characters from many other writing systems fall outside ASCII, but they can be represented in the UTF-8 encoding that XML uses. Add that to the list of reasons web applications are moving towards XML for development.
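To make the difference concrete, here is a small illustration using only Python's built-in string methods. The sample word and the helper name are my own and have nothing to do with the function we were given.

```python
def decodes_as_ascii(data: bytes) -> bool:
    """Return True if the bytes can be read as plain ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        # Raised when a byte falls outside the 0-127 ASCII range.
        return False


plain = "cafe".encode("utf-8")     # only ASCII letters, one byte each
accented = "café".encode("utf-8")  # the accented letter takes two bytes

ascii_ok = decodes_as_ascii(plain)
accented_ok = decodes_as_ascii(accented)
```

Text made of ASCII characters is byte-for-byte identical in UTF-8, which is why code that assumes ASCII only breaks once an accented or non-Latin character shows up.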
Anyway, back to the error: I had thrown a try-catch statement and a counter around it (sorry Mr. B, I can practically hear you telling me to never use try-catch statements) but never fully solved the issue. I’ve picked the function apart piece by piece and still can’t figure out how to modify it for XML. Turns out, because of how it is written, even my counter wasn’t working correctly. I applied a detection similar to the one I used for the false positives, and found that 111 of the 533 test files had non-ASCII characters in them. That’s a problem, because it means the function doesn’t work on roughly one fifth (about 21%) of the files.
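A check along these lines could be sketched as follows. It assumes the test files can simply be read as raw bytes, and the function names and sample files are hypothetical stand-ins rather than the real detection code.

```python
import os
import tempfile


def file_has_non_ascii(path):
    """Return True if the file contains any byte outside the ASCII range."""
    with open(path, "rb") as f:  # "rb" reads raw bytes, so no decoding can fail
        return not f.read().isascii()


def count_non_ascii_files(paths):
    """Count how many of the given files contain non-ASCII bytes."""
    return sum(1 for path in paths if file_has_non_ascii(path))


def demo():
    # Build three tiny stand-in XML files in a temp directory.
    folder = tempfile.mkdtemp()
    samples = {
        "plain.xml": "<name>Zoe</name>",
        "accent.xml": "<name>Zoë</name>",
        "kanji.xml": "<name>東京</name>",
    }
    paths = []
    for name, text in samples.items():
        path = os.path.join(folder, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return count_non_ascii_files(paths)
```

Reading in binary mode sidesteps the decoding error entirely, so the count does not depend on a try-catch block catching the right exception.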
So I’m stuck yet again. I don’t see any solution to the false positives because they stem from the source code of the defusedxml module, nor do I see a solution for the non-ASCII characters because I didn’t write the function and I can’t change how XML files are encoded.
Greg, the data analysis guy, had jury duty today so we didn’t have the weekly meeting. He might be able to help with the encoding issue since he wrote the function, but I’ll have to wait until next week for that. Shubha, a grad student who works here, is also working on the software to run the analytics on the database, which means Chintan and I are slightly at a loss for what to do until she’s done with that program. Chintan said it’d hopefully be done by the end of the week. I worked overtime these past few days, so I don’t have to come in tomorrow. Hopefully these errors will get solved and the project will move forward next week. Until then, I’m going to enjoy my weekend. KFMA Day is this Sunday, and I’m looking forward to seeing Blink-182. The other bands look interesting, but I think last year’s lineup was better.
See you next week!