Dr. Hariri was back last week, but for most of the week I had to write and practice my SRP presentation, which left me little time for coding. On the other hand, what little coding I did get done, along with the meeting I attended (and instigated), turned out to be incredibly productive. Kind of. On the classifier and machine-learning front, I worked out the errors, then promptly got results that didn’t make sense. After an email to Greg this week, I figured out how the classifier is supposed to function and what the odd results actually meant. Unfortunately, the sample size I’m working with is much smaller than the HTML sample size, so I’ve had to recopy and reformat the XML data to get the machine learning to work, which I’m pretty sure is throwing off the accuracy of the model. The machine learning here essentially works by training a program to recognize a normal baseline and then flag any deviation, which separates normal files from anomalies. Or at least that’s the idea. Currently, though, it models both malicious and normal files and classifies each as such. A few weeks ago, there was a solid half-hour debate between Dr. Hariri and Greg over how the project was supposed to function. The rest of us in that meeting just sat there and watched it unfold. To be honest, it was really interesting listening to them debate, and it ended up illuminating the trajectory of the project.
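To make the baseline-vs-deviation idea concrete, here’s a tiny sketch of what that style of anomaly detection looks like in principle. This is just my own illustration, not Avirtek’s actual code: it learns the mean and standard deviation of one made-up numeric feature from “normal” samples only, then flags anything that deviates too far. The contrast with the current setup is that a two-class classifier would need labeled malicious examples at training time, while this only ever sees normal data.

```python
# A minimal sketch of baseline anomaly detection (illustration only,
# not the project's real model): profile "normal" with mean/stdev,
# then flag values that deviate beyond a threshold.
import statistics


def train_baseline(normal_samples):
    """Learn a simple normal profile: mean and stdev of the feature."""
    return statistics.mean(normal_samples), statistics.stdev(normal_samples)


def is_anomaly(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` stdevs away from the baseline."""
    mean, stdev = baseline
    return abs(value - mean) > threshold * stdev


# Train only on normal data -- no malicious examples needed.
normal = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
baseline = train_baseline(normal)

print(is_anomaly(10.1, baseline))  # within the baseline -> False
print(is_anomaly(42.0, baseline))  # far outside the baseline -> True
```

Real systems would use many features and a proper model rather than a single mean/stdev pair, but the training story is the same: model “normal” once, and everything that doesn’t fit gets flagged.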
In terms of my project, I asked Dr. Hariri how it was all supposed to fit together, and he called a meeting with Chintan, Shubha, Doug, and me. We developed the diagram above, which is a pretty good theoretical outline of the HTML project and how my project (XML is in red and green) runs in tandem with it. Apparently Avirtek is working on a demo that they need to present in the next week or two, which coincides with the end of my internship.
Outlines and diagrams only reveal so much, however, and after talking to Chintan about how everything fits together, it’s nice to know I’m not the only one who’s a little confused. Discretization is part of the diagram, but in practice the machine learning uses the raw data, and I’ve already explained a bit of the classifier confusion. Apparently there are different versions of what’s going on in practice, and it’s all a bit scattered at the moment. Dr. Hariri will be in tomorrow, and I have questions for him about both the seams of the project and tying up loose ends for my internship. I’m curious how this is going to play out and what my role will be. If I continue working at Avirtek (which I really hope to), I’ll either keep updating this blog (if Winkelman is okay with that) or move the content to a new site and link it to this one.
I’ll keep you posted,