I learned from Dr. Hariri that the goal of my project is to build a working microcosm of the methodology that Avirtek uses to classify malicious XML files. The project has three parts: feature selection, data analysis, and rule development/classification.
As for the parser, the false positives I was getting turned out to be a bigger issue than I realized, because those errors kept the data for the affected files from being stored in the database. I talked to Dr. Hariri, and he said that it would be better to disregard the Billion Laughs attack for this model and focus on getting the data to establish a baseline for a normal XML file. So I’ve switched back to the ElementTree module and left the Billion Laughs attack on my desktop rather than keep it in my IDE (the place where I develop code and where all the relevant files are). All that time spent on defusedxml, but in the end it doesn’t even matter (some of you will get that reference). Thankfully, ElementTree still works for the other malicious files I have. That brings the number of malicious files to 5 out of 541, and that is what the data analytics will be trained on. With that, my parser (the program for feature selection) is complete, and I have begun working on the data analytics phase of the project.
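For the curious, here is a minimal sketch of what feature selection with ElementTree can look like. This is not my actual parser: the function name `extract_features` and the four features it collects are assumptions I made up for illustration, and the sample document is a toy. The only real library calls are `ET.fromstring`, `Element.iter`, and the `ET.ParseError` exception from the standard library.

```python
import xml.etree.ElementTree as ET


def extract_features(xml_text):
    """Return a few simple structural features of an XML document.

    Returns None when the document cannot be parsed, so the caller can
    decide what to do with files that raise errors.
    The feature names here are hypothetical, chosen only for illustration.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None

    def depth(elem):
        # Depth of the subtree rooted at elem (a lone element has depth 1).
        return 1 + max((depth(child) for child in elem), default=0)

    elements = list(root.iter())  # root plus every descendant
    return {
        "element_count": len(elements),
        "attribute_count": sum(len(e.attrib) for e in elements),
        "max_depth": depth(root),
        "text_length": sum(len(e.text or "") for e in elements),
    }


# Toy input, not one of the real files.
sample = "<order id='7'><item qty='2'>pen</item><item qty='1'>ink</item></order>"
features = extract_features(sample)
broken = extract_features("<a><b></a>")  # mismatched tags do not parse
```

Each file becomes one row of numbers like this, which is the kind of data that could then be stored in a database and handed to the analytics phase.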
I have found that coding is applied theory, and this very much holds true for the data analytics phase. At the end of yesterday, Dr. Hariri went over the theory for the discretization of data, which is basically condensing data down into manageable, processable blocks. Remember Shubha, who is currently working on the data analytics? Dr. Hariri asked her to teach me the actual code for discretization, which she is learning on the fly. (I’ve discovered that this is pretty common: once you have programming skills, you’re expected to use them to figure out how to solve various problems even if you don’t have much background. Sound familiar?) I understood the majority of it, and she sent me both her code and the tutorials for the necessary math modules.
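To make "discretization" concrete, here is a small sketch of one common flavor, equal-width binning, where a range of numbers is chopped into a fixed number of same-sized buckets. This is my own illustration and not Shubha's code: the function name `equal_width_bins`, the choice of equal-width binning, and the sample numbers are all assumptions. I wrote it in plain Python so it runs without any extra modules.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index from 0 to n_bins - 1.

    The range [min, max] is split into n_bins intervals of equal width.
    Hypothetical helper for illustration only.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so they all land in the first bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would compute to index n_bins, so clamp it
    # into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up element counts for six files.
element_counts = [3, 5, 8, 12, 40, 41]
bins = equal_width_bins(element_counts, 4)
print(bins)  # [0, 0, 0, 0, 3, 3]
```

Instead of six different raw numbers, the analysis now sees only "bin 0" and "bin 3", which is the condensing into manageable blocks described above.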
All I did today was pick apart her code in an attempt to understand it. First I had to download all the necessary modules. Since the analysis involves so much math, it relies on a number of Python modules that can work as a substitute for something like R. For those of you who went through econometrics with me, you remember R. If I had known I could do pretty much everything R can do in Python, I would have done it in Python. Indices start at 0 and the syntax actually makes sense.
After I had the modules installed (which, by the way, is all done through the command line since I’m using a Linux machine), I began going through Shubha’s code line by line and googling every function I didn’t know, along with its parameters (think x in f(x)) and what it returns (think the result of putting a number or equation into f(x)). Her program is around 70 lines of code, and I’ve been going through it all day. I think I have a decent handle on it, but there are still a few strings of logic I’m unsure about. Tomorrow’s mission: pseudocode her program (put it in understandable form using pencil and paper), then ask her my dozen remaining questions.
*cool thing while I’m still at the office: Chintan left his computer open when he went somewhere earlier, and it freaked me out for a second when the mouse and keyboard started operating on their own. Turns out, he left it open so he could work on it remotely.