
Weeks 11+12: Classifying the Chaos


Dr. Hariri was back last week, but for most of the week I had to write and practice my SRP presentation, which left me little time for coding. On the other hand, what little coding I did get done, and the meeting I attended/caused, turned out to be incredibly productive. Kind of. In terms of the classifier and the machine learning, I worked out the errors, then promptly got results that didn’t make sense. After an email to Greg this week, I figured out how the classifier is supposed to function and what the odd results actually meant. Unfortunately, the sample size I’m working with is much smaller than the HTML sample size, so I’ve had to recopy and reformat the XML data to get the machine learning to work, which I’m pretty sure is throwing off the accuracy of the model. Machine learning in this case means training a program to recognize a baseline of normal behavior, then flag any deviation, which separates normal files from anomalies. Or at least that’s the idea. Currently, the classifier models both malicious and normal files, then assigns new files to one class or the other. A few weeks ago there was a solid half-hour debate between Dr. Hariri and Greg over how the project was supposed to function. The rest of us in that meeting just kind of sat there and watched it unfold. To be honest, it was really interesting listening to them debate, and it ended up illuminating the trajectory of the project.
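The two approaches can be contrasted with a toy sketch (illustrative only, with made-up feature values, definitely not Avirtek’s actual code): anomaly detection learns a baseline from normal files alone, while the current setup models both classes and picks the nearer one.

```python
from statistics import mean, stdev

# One-class (anomaly detection): learn a baseline from NORMAL files only,
# then flag anything that deviates too far from that baseline.
normal_feature = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]   # made-up feature values
mu, sigma = mean(normal_feature), stdev(normal_feature)

def is_anomaly(x, k=3.0):
    """Flag values more than k standard deviations from the normal baseline."""
    return abs(x - mu) > k * sigma

# Two-class (what the classifier currently does): model BOTH normal and
# malicious, then assign a new file to whichever class mean is closer.
malicious_feature = [42.0, 39.5, 44.1]                # made-up malicious values
mu_mal = mean(malicious_feature)

def classify(x):
    return "malicious" if abs(x - mu_mal) < abs(x - mu) else "normal"
```

The practical difference: `is_anomaly` never needs malicious samples to train, which matters when (as in my case) the malicious sample size is tiny.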

In terms of my project, I asked Dr. Hariri how it was all supposed to fit together, and he called a meeting of himself, Chintan, Shubha, Doug and me. We developed the diagram above, which is a pretty good theoretical outline of the HTML project and how my project (XML is in red and green) runs in tandem with it. Apparently Avirtek is working on a demo that they need to present in the next week or two, which coincides with the end of my internship.

Outlines and diagrams only reveal so much however, and after talking to Chintan about how everything fits together it’s nice knowing that I’m not the only one a little confused. Discretization is part of the diagram, but in practice the machine learning uses the raw data, and I’ve already explained a little bit of the classifier confusion. Apparently there are different versions of what’s going on in practice, and it’s a bit scattered at the moment. Dr. Hariri will be in tomorrow, and I have questions for him in terms of both the seams of the project and in terms of tying up loose ends for my internship. I’m curious as to how this is going to play out and what my role will be. If I continue working at Avirtek (which I really hope to), I will either keep updating this blog (if Winkelman is okay with that) or move the content to a new site and link it to this one.

I’ll keep you posted,

-Cameron

Week 10: Almost There

I have a little over two more weeks to go before I present my SRP. That hit me last Tuesday. I’ve had a lot of fun with this project, and there is a possibility I will be able to continue it after the SRP portion is over. At this point, I know everyone (all seven others) in the company and I understand about 90% of what they’re talking about in the weekly meetings. I feel like I’ve gone from an awkward coexistence here to being an actual functioning part of Avirtek. Presenting at the meeting that one time helped; I think it clarified my project for some of the other employees who had no idea what I was doing here. It also helped that I’ve had to work with Chintan, Shubha, and Fabian to model the aspects of the project they’re familiar with. Dr. Hariri helps me along and gives me the next step in my project, and Doug is friendly and easy to talk to. I was even able to build a rapport with the (Turkish?) guy who works here (I actually don’t know his name or what he does here; all I know is that he joked about stealing my sandwich from the fridge and afterwards I accidentally locked him out). I got to hear Doug’s unusual career story, learned that Fabian takes quick trips to random places around the world, that Chintan’s going for his master’s soon, and that Shubha sadly won’t be here for more than another month. Both the technical work and the office atmosphere have been fascinating to see and be a part of.

Enough reminiscing, what have I been up to this week? Well I finished working through the discretization code and learned more about the algorithms it uses. This is also the point after which I’m hesitant to explain things due to security issues. There is a reason the data needed to be discretized and her code does something really cool at the end, but I don’t think I can actually go into it. The last phase of the process is what’s secure.

I’m also getting to the point where I’m at a loss for what to do, since I’m pretty close to completing my project. I finished the discretization on Tuesday, so I’m at the third and final phase. Dr. Hariri is out of town for the week, and if you haven’t gathered he’s the one that’s been overseeing my project and giving me incremental tasks and research papers. I finished the paper he gave me for this week, and Fabian sent me the code for the last phase of the project. It was designed for HTML, and I ran into an error trying to run it on XML. The code was literally written by Greg, the stats guy with a PhD, so to say modifying it for XML would be over my head is a gross understatement. Doug said he’s not surprised it didn’t work for XML since that code was intended for HTML, and that brings me back to having nothing to do on the Avirtek side until next week. On the BASIS side however, I need to rewrite my IEEE presentation into an SRP presentation, so that’s what I’ll be working on until Dr. Hariri gets back.

Til next time,

-Cameron

Week 9: Be Discrete


I discovered (from Dr. Hariri) that the goal for my project is to build a working microcosm of the methodology that Avirtek uses to classify malicious XML files. The project is three parts: feature selection, data analysis and rule development/classification.

In terms of the parser, the false positives I was getting turned out to be more of an issue than I realized, because those errors prevented the data for those files from being stored in the database. I talked to Dr. Hariri, and he said that it would be better to disregard the Billion Laughs attack for this model and focus on getting the data to establish a baseline for a normal XML file. In other words, I’ve switched back to the ElementTree module and left the Billion Laughs attack on my desktop rather than keep it in my IDE (the place I develop code and where all the relevant files are). All that time spent on defusedxml, but in the end it doesn’t even matter (some of you will get that reference). Thankfully, ElementTree still works for the other malicious files I have. That brings the number of malicious files to 5/541, and that is what the data analytics will be trained on. With that, my parser (the program for feature selection) is complete and I have begun working on the data analytics phase of the project.
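As a flavor of what feature selection looks like (generic structural features on a made-up file, not Avirtek’s actual feature list), ElementTree makes this kind of extraction straightforward:

```python
import xml.etree.ElementTree as ET

# A tiny made-up XML document standing in for a real test file.
sample = """<?xml version="1.0"?>
<order id="17">
  <item qty="2">widget</item>
  <item qty="1">gadget</item>
</order>"""

root = ET.fromstring(sample)

# Walk the whole tree once and count simple structural features.
elements = list(root.iter())
features = {
    "element_count": len(elements),
    "attribute_count": sum(len(e.attrib) for e in elements),
}

def depth(elem, d=1):
    """Depth of the deepest element below (and including) elem."""
    return max([d] + [depth(child, d + 1) for child in list(elem)])

features["max_depth"] = depth(root)
print(features)  # → {'element_count': 3, 'attribute_count': 3, 'max_depth': 2}
```

A real parser would compute a couple dozen features like these per file and write them to the database.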

I have found that coding is applied theory, and this very much holds true for the data analytics phase. At the end of yesterday, Dr. Hariri went over the theory for the discretization of data, which is basically condensing data down into manageable, processable blocks. Remember Shubha, who is currently working on the data analytics? Dr. Hariri asked her to teach me about the actual code for discretization, which she is learning on the fly (I’ve discovered that this is pretty common, once you have programming skills you’re expected to just use those skills to figure out how to solve various problems even if you don’t have much background. Sound familiar?). I understood the majority of it, and she sent me both her code and the tutorials for the necessary math modules.
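Not Shubha’s actual code, but the core idea of discretization can be sketched in a few lines: equal-width binning maps each continuous value to the index of the bin it falls into.

```python
def discretize(values, n_bins):
    """Map each continuous value to a bin index 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        bins.append(i)
    return bins

print(discretize([1.0, 2.5, 4.0, 9.9, 10.0], 5))  # → [0, 0, 1, 4, 4]
```

The raw numbers become a handful of manageable, processable categories, which is exactly the “condensing data down” idea Dr. Hariri described.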

All I did today was pick apart her code in an attempt to understand it. First I had to download all the necessary modules. Since there’s so much math, there are a number of Python modules that can work as a substitute for something like R. For those of you who went through econometrics with me, you remember R. If I had known I could do pretty much everything R can do in Python, I would have done it in Python. Indices start at 0 and the syntax actually makes sense.

After I had the modules installed (which, by the way, is all done through the command line since I’m using a Linux machine), I began going through Shubha’s code line by line, googling every function I didn’t know, its parameters (think x in f(x)) and what it returns (think the result of putting a number or equation into f(x)). Her program is around 70 lines of code, and I’ve been going through it all day. I think I have a decent handle on it, but there are still a few threads of logic I’m unsure about. Tomorrow’s mission: pseudocode her program (put it in understandable form using pencil and paper), then ask her my dozen remaining questions.

*Cool thing while I’m still at the office: Chintan left his computer open when he went somewhere earlier, and it freaked me out for a second when it randomly started operating the mouse and keyboard on its own. Turns out, he left it open so he could work on it remotely.

Week 8: Waiting and a Surprise Presentation


Sorry I’m posting this a week late, but here’s what I was up to last week:

Remember those two errors I’ve been trying to fix? The false positives and the encoding issue? That’s what I’ve been working on this week. Tuesday I did research on Internet protocols as they pertain to XML, and Wednesday I spent the entire day researching a module that could be used in place of defusedxml to mitigate the false positives issue. Because there was no meeting last week, this was a pretty slow week. I did more research than coding, and coding is the fun part. Research is definitely necessary to write quality code though.

I also emailed my presentation from Big Sky to Dr. Hariri, Chintan, and Mr. B to get some feedback. I thought they would look at it and respond in an email. Dr. Hariri had another idea.

I’ve had trouble sleeping this week. I think it’s a combination of not exercising as much as I have the last few weeks but still being as hungry as I have been the last few weeks. Regardless, today was one of those days I hit my alarm too many times, rushed through my morning routine and got to work still trying to wake myself up. Thursday is meeting day, which means I usually sit quietly and listen. Dr. Hariri had another idea.

What better way to get feedback on a presentation than presenting? This was Dr. Hariri’s idea: present your paper at the meeting. That woke me up. I was going to present a paper on cyber security in front of a half-dozen people whose experience levels exceeded mine by anywhere from four to forty years. No pressure. Oh, and I hadn’t even looked at this presentation in two weeks. By the end of the meeting and the beginning of my presentation, I was running on nervous energy. Thankfully, I didn’t make a complete fool of myself. Dr. Hariri gave me feedback and suggestions on which slides to fix, add or remove, and I now have a better foundation for my SRP presentation. Greg even offered some encouragement. All the critiques were definitely helpful, even if that presentation was one of the most nerve-wracking 20 minutes of my life.

The meeting was also important because I was planning on asking Greg about the encoding error in my borrowed function. I sat down and went over the function one more time before asking him, and realized something: the error was in the most deeply nested function (a function within a function), which was built into Python. That nested function couldn’t handle UTF-8 characters. I looked up its documentation and happened across a similar function as I was reading through it. This function was exactly the same but ran through the Unicode codec, which includes all the UTF-8 characters. I swapped out the function, and it worked like a (py)charm. My list of files the parser wasn’t effective for dropped from 119 to 9. This was another time I celebrated for a solid five minutes. I had fixed an error that had been plaguing me for weeks, and I was that much closer to completing the parser.

One more error for next week before I’m finished with the parser and can move on to data analytics: the defusedxml/elementtree module dilemma. Hopefully I’ll have as much luck with that as I did the borrowed function.

Week 7.8: Errors, Encoding, and Borrowed Functions

Update: I’ve basically completed my XML parser, though there are some discrepancies in the data that I need to fix. Changing the module mitigated the Billion Laughs attack, but after testing and retesting the program, I’ve discovered the module throws an error every time it detects an embedded reference in XML. I explained how the references work in my previous post, but it’s like a link to a website (or in the case of XML, another bit of data in the file) that may or may not be malicious. Once I discovered this, I developed a way to test for false positives.

The 533 test files are all benign, and I had two resources for logging data on them: the PyCharm console (PyCharm is the text editor/IDE I use for writing Python code), which prints data directly from the program, and the database, which stores 12,259 data points in total for the test files. The console rapidly spits out a dozen or so lines of text for every file, so counting false positives there wouldn’t be easy. The database stores data in relation to each individual file, so that’s not very useful for counting either. Neither would be effective at logging what I’m looking for.

Then I remembered: I can edit files. I’ve already had to work with reading files for the parser, so how hard could writing to them be? So that’s what I did.

Essentially, with a few lines of code it’s pretty simple to write a variable or a string (a line of text) to a file. The parameters required are the path name of the file and a letter that indicates what you want to do with the file: read it, erase it and write to it, or add to it. I tinkered with the last two. After a few Google searches and help from Stack Overflow, the program did what I wanted it to: log every instance of a false positive. Out of 533 files, I had nine false positives. Not too bad. It’s information that is definitely useful and needs to be addressed before it goes through the data analysis.
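A minimal sketch of that logging setup (the log file name and the flagged files here are made up):

```python
log_path = "false_positives.log"  # hypothetical log file

# 'w' truncates the file, so each run starts with a clean log.
with open(log_path, "w") as log:
    log.write("false positive log\n")

# 'a' appends, so each flagged file adds a line without erasing earlier ones.
for filename in ["a.xml", "b.xml"]:        # stand-ins for flagged test files
    with open(log_path, "a") as log:
        log.write(f"false positive: {filename}\n")

with open(log_path) as log:                # default mode is 'r' (read)
    print(log.read().count("false positive:"))  # → 2
```

Opening with 'w' once at the start means rerunning the parser gives a fresh count instead of piling runs on top of each other.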

Then I remembered an error I had to smooth over earlier: because XML is encoded in UTF-8 rather than in ASCII, one of the functions Chintan and I were given didn’t work on XML files if they had non-ASCII characters. ASCII is an encoding for basic letters and symbols, most of which (if you’re reading this in English) you have on your keyboard. Accented letters and many foreign characters are in the UTF-8 codec that XML uses. Add that to the list of reasons web applications are moving towards XML for development.

Anyway, back to the error: I had thrown a try-catch statement and a counter around it (sorry Mr. B, I can practically hear you telling me to never use try-catch statements) but never fully solved the issue. Turns out, because of how the function (which I’ve picked apart piece by piece, yet still can’t figure out how to modify it for XML) is written, even my counter wasn’t working correctly. I applied a similar detection to the one I used for the false positives, and found that 111 of the 533 test files had non-ASCII characters in them. That’s a problem, because it means the function doesn’t work on one fifth of the files.
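The detection itself is short in Python. A sketch of the idea (the sample strings are made up): try to encode the text as ASCII and count the failures.

```python
def has_non_ascii(text):
    """True if the text contains any character outside the ASCII range."""
    try:
        text.encode("ascii")
        return False
    except UnicodeEncodeError:
        return True

# Made-up stand-ins for lines read out of the test files.
samples = ["<name>Smith</name>", "<name>Müller</name>", "<city>東京</city>"]
flagged = sum(has_non_ascii(s) for s in samples)
print(flagged)  # → 2
```

Run over whole files instead of strings, the same counter is how I got the 111-of-533 figure.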

So I’m stuck yet again. I don’t see any solution to the false positives because that has to do with the source code of the defusedxml module, nor do I see a solution for the non-ASCII characters because I didn’t write the function and I can’t change how XML files are encoded.

Greg, the data analysis guy, had jury duty today so we didn’t have the weekly meeting. He might be able to help with the encoding issue since he wrote the function, but I’ll have to wait until next week for that. Shubha, a grad student that works here, is also working on the software to run the analytics on the database, which means Chintan and I are slightly at a loss for what to do until she’s done with that program. Chintan said it’d hopefully be done by the end of the week. I worked overtime these past few days, so I don’t have to come in tomorrow. Hopefully these errors will get solved and the project will move forward more next week. Until then, I’m going to enjoy my weekend. KFMA Day is this Sunday, and I’m looking forward to seeing Blink-182. The other bands look interesting, but I think last year’s lineup was better.

See you next week!

-Cameron

Week 7.2: Billion Laughs


I talked to Dr. Hariri the other day, and it turns out that the last feature wasn’t necessary for me to do, thank god. The next step of the project is threat testing, which is basically building a model for what can go wrong in terms of potential weaknesses in the software and how those weaknesses can be defended. I also pulled up the malicious XML files Doug gave me a month or so ago to test my completed program on actual malicious files, because up to this point I had been using a combination of the benign test files Chintan gave me and XML files I wrote myself for the purposes of testing different aspects of my program.

So, I ran the parser on the malicious files. For all you programmers out there, you know exactly what I’m about to say. The very first malicious file I tested the parser on broke the program.

Fast forward through a day or two and a number of hours of research, and I had discovered what the issue was (also, if you haven’t gathered yet, programming is an endless loop of running into and fixing errors). The Python module I used to navigate the XML files, ElementTree, wasn’t designed to navigate malicious files, so when a recursive payload attack was run against it (I’ll explain that one in a second, it’s called the Billion Laughs attack >:))  the parser more or less caved in on itself. If I let it run too long, it would even freeze my computer. So I did some research (aka begging the benevolent god that is Google for help) and found that there was another module, defusedxml, that did the same file navigation that ElementTree did, but this one had built-in detectors for four common attacks, one of them being the Billion Laughs Attack.

Okay, what is a module and what is this virus that sounds like it was developed by the Joker? I’ll keep you in suspense and go with the less interesting one first. Programming is essentially developing functions based on the functions available to you through the language (print, split, ''.join, etc.) and having the computer use them to do what you want it to do. A module is basically a bundle of extra functions (and classes, though the programming meaning of ‘class’ isn’t relevant at the moment) you can utilize beyond what the language gives you out of the box. You have to download them, but they’re generally free and in my case very necessary. I’ve already had to download five extra modules for this program alone. Now on to the more interesting bit.

So the Billion Laughs attack is actually a pretty ingenious bit of code, even if it is a virus used to execute DoS attacks (overloading a system with junk information). It’s also known as the XML Bomb or more formally as the Exponential Entity Expansion Attack. Enough with the technicalities, here’s what it looks like (I pulled this one from Wikipedia, but the one Doug gave me is essentially identical):

<?xml version="1.0"?>
<!DOCTYPE lolz [
 <!ENTITY lol "lol">
 <!ELEMENT lolz (#PCDATA)>
 <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
 <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
 <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
 <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
 <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
 <!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
 <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
 <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>

Yes, I’m sure you understand what that bit of code does. I totally did too when I first saw it (<= sarcasm, if you couldn’t tell). Basically, each entity refers to ten copies of the entity defined on the previous line: each lol1 reference inside lol2 expands to a full line of ten lols, each lol2 reference inside lol3 expands to a hundred lols, and so on. The count multiplies tenfold with every level, so by the time the parser expands lol9 it has to churn through a billion lols, hence the name. You can easily overload a system with just a handful of these. It effectively makes a file of a few kilobytes seem like a file of a few gigabytes. That tiny chunk of code is a full-fledged virus.
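The growth is easy to verify without ever expanding the entities, just by tracking how many lols each level stands for:

```python
expansions = [10]                 # lol1 expands to 10 "lol"s
for _ in range(8):                # lol2 through lol9: each is 10x the previous
    expansions.append(expansions[-1] * 10)

print(expansions[-1])             # lol9 → 1000000000, a billion lols
# "lol" is 3 characters, so that's roughly 3 GB of text hiding in ~1 KB of XML:
print(3 * expansions[-1])
```

That roughly-3-GB-from-1-KB ratio is why a parser that naively expands entities caves in on itself.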

Thankfully, the defusedxml module detects this and three other well-known attacks and throws an error (which stops the parse rather than overloading my CPU). Unfortunately, that same error initially killed my whole program: one error and the program stops. This time Google wasn’t much help, and I had to ask Chintan after a few hours of failing to find a solution. A little tinkering and my program was back on track.
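A standard way to keep a batch run alive (I won’t claim this is exactly what my tinkering amounted to) is to wrap each file’s parse in a try/except so a single bad file is logged and skipped. A sketch with the stdlib parser and a deliberately broken document; defusedxml raises its own exception types, but the pattern is the same:

```python
import xml.etree.ElementTree as ET

# Made-up stand-ins for files on disk; "bad.xml" has a mismatched tag.
documents = {
    "good.xml": "<root><a>1</a></root>",
    "bad.xml": "<root><a>1</root>",
    "also_good.xml": "<root/>",
}

parsed, failed = [], []
for name, text in documents.items():
    try:
        parsed.append((name, ET.fromstring(text)))
    except ET.ParseError as err:
        failed.append((name, str(err)))   # log the failure and keep going

print(len(parsed), len(failed))  # → 2 1
```

One bad file becomes a log entry instead of a crash, which is what lets the parser churn through all 533 files in one run.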

Another thing this module fixed was an error I couldn’t figure out earlier, one in which for some reason my parser could only handle small samples of the 533 test files Chintan gave me. Now my program can churn through all 533 files at once, extracting 23 attributes from each file at an average of one file every half second. Do the math: the program takes a while to run (around four and a half minutes), but the parser works on every. single. file. This is one of those moments where I’m in awe at what computers can do rather than wanting to tear my hair out at what computers can do.

I very well may post again this week, but I just wanted to celebrate the fact that my program is running so well for the moment. It’s actually pretty simple to test the Billion Laughs attack since all browsers can run XML files. On a browser it doesn’t do much other than show you a hundred thousand lols though, so it’s not too bad. Parsers are the ones that really hate it.

Anyway, see you later!

Here’s the main article I used to figure out the Joker attack: https://cytinus.wordpress.com/2011/07/26/37/

Week 6: How Long Will This Duration Take?

Last time I posted about the XML parser, I thought I was almost done with the feature selection part of the process. I had finished 22/25 features, how hard could the next few be?

Well, the duration feature alone took over a week.

There were really only two features I had to worry about, one of which is something called a duration: a set amount of time (three years, ten days, six minutes) as opposed to a date (March 16, 2017). In XML, durations are written in the format PnYnMnDTnHnMnS (for example, P1Y2M3DT4H5M6S). The P announces it as a duration, the T separates the calendar part from the clock part, and each n can be pretty much any number. The other letters represent years, months, days, hours, minutes, and seconds, respectively. Doesn’t seem like it would be that hard to isolate, run statistics on, and put into a database, right?

Ha. Ha. Ha.

Monday, my first day back, was composed of googling a thousand variations of “how do I convert durations into a usable format in Python?” without much luck. Then I discovered a module called isodate, which had a built in function to convert durations in the XML format into the isodate format that Python understands. For a moment, I thought I had it figured out. Now it’s just a matter of running the statistics.

Nope. Running the statistics requires being able to perform basic operations on the durations, such as addition, subtraction, multiplication and division. The isodate module, I discovered, cannot do that. The next step was figuring out how to convert the durations into something else Python could understand AND perform operations on. I found an object type called a timedelta, and the documentation said isodates could be converted into timedeltas. Eureka, right? Wrong. Again. It took me a while, but I finally found the algorithm to convert isodates into timedeltas. I took one look at it and knew it was way over my head. I spent an entire day discovering that isodate is useless for my purposes.

On the other hand, I found out that timedelta was the object type I needed to convert the durations into. That was my Tuesday and Wednesday. I had to split up the duration and put it into a list, ensure all the necessary elements were there to fit the duration format, take out the letters, then hard code in the months and years to days and weeks conversion because timedeltas don’t accept years or months. Finally, I had it converted into a usable format. Now it was down to running statistics.
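The conversion described above can be sketched like this (a simplified version of the approach, with the same hard-coded assumptions: a month becomes 30 days and a year 365, since timedeltas accept neither; real XML durations also allow fractions and signs, which this ignores):

```python
import re
from datetime import timedelta

# Split the duration into its six numeric pieces; T separates calendar from clock.
DURATION_RE = re.compile(
    r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)

def xml_duration_to_timedelta(text):
    """Convert an XML duration like 'P1Y2M3DT4H5M6S' into a timedelta."""
    m = DURATION_RE.match(text)
    if not m:
        raise ValueError(f"not a duration: {text!r}")
    years, months, days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return timedelta(
        days=years * 365 + months * 30 + days,  # hard-coded calendar assumptions
        hours=hours, minutes=minutes, seconds=seconds,
    )

print(xml_duration_to_timedelta("P1Y2M3DT4H5M6S"))  # → 428 days, 4:05:06
```

Once everything is a timedelta, addition, subtraction, and averaging all just work, which is what the statistics needed.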

Then I discovered that 15 days, 10:00:40 (hours, minutes, seconds) does not equal (!=) -15 days, 10:00:40. Python normalizes negative timedeltas, so the printed form can be misleading and the comparisons aren’t as simple as they look. Thankfully, the algorithm to compare timedeltas was pretty simple, and I had already found it when I was digging through the isodate documentation. Finally, at the end of yesterday, I completed the duration feature extraction.
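A quick demonstration of why the comparison is trickier than it looks: Python normalizes a negative timedelta so that only the day component is negative, and everything else is rolled forward.

```python
from datetime import timedelta

# A negative duration is normalized so only 'days' goes negative.
print(timedelta(hours=-5))                      # → -1 day, 19:00:00

# So these two are very different durations, despite looking symmetric:
a = timedelta(days=15, hours=10, seconds=40)
b = timedelta(days=-15, hours=10, seconds=40)
print(a != b)                                   # → True
print(a.total_seconds(), b.total_seconds())     # one positive, one negative
```

Comparing via total_seconds() sidesteps the confusing normalized representation entirely.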

Having the duration feature and statistics complete narrows the number of incomplete features down to one. I have one more feature to extract and I’m done with this phase of the project. I’m sure the final feature won’t be anywhere near as hard as the last one, right?

Week 5 – Interim and IEEE Aerospace


So I was gone for a week, but I’m back!! Apparently Tucson jumped from winter to summer while I was gone, though I use the term winter loosely. When I left, it was generally around fifty, sixty degrees for most of the day but now it’s around ninety in the afternoons. What the heck Tucson? I was enjoying what you called “cold.”

Anyway, I was gone last week for my final presentation at the IEEE Aerospace Junior Conference. The regular IEEE Aerospace Conference is held in Big Sky, Montana every year in early March, and it’s a gathering of a bunch of people with PhDs who want to go skiing and listen to papers that might as well be in Greek for all I understand. I’d say a good quarter of the six- or seven-hundred attendee conference was from Los Angeles’s Jet Propulsion Lab, though there are attendees from various tech companies and universities (such as the University of Arizona, my dad was one of two attendees from the UA). The junior conference on the other hand was set up to allow K-12 students (basically the adult conference’s kids) to present on topics they’ve researched. I’ve seen everything from basic melting point experiments to research on string theory and Mars mission developments. It’s always an enjoyable conference (once you get over public speaking nerves), and the parents love seeing the kids get up on stage and present. If you’re interested in the papers, here’s the link to this year’s conference:

https://www.aeroconf.org/junior-engineering

The IEEE Aerospace conference is held at the biggest ski resort in the country, and it’s a running joke that people use the conference as an excuse to go skiing. While we do get discounted lift tickets, it is a professional conference. I’ve heard talks on everything from quantum physics (I understood all 0% of that one) to NASA’s plans to put a man on Mars. Oh, and that last talk was given by Alan Stern, the head of the New Horizons (the Pluto flyby) mission. So yes, it’s fun to go skiing there, but it’s also an incredible delve into the world of big wig science geeks sipping red wine. Every time I go there I learn something about the professional world or about what projects companies like JPL, NASA, etc. are working on.

So remember how I said I left during what Tucson calls “cold”? Yeah, Big Sky’s version of cold is a tad, uh, colder. As in, the high was 19 degrees one day. Yes, in Fahrenheit. I did ski that day. I went from wearing thermals, ski pants and a gigantic red jacket in Big Sky to trying not to overheat in shorts and a t-shirt in Tucson.

(placeholder for ski picture)

I know this blog isn’t supposed to have anything unrelated to my senior project, so you’re probably wondering why this whole post seems like a tangent. The reason it isn’t a tangent is the same reason I went to this conference in the first place: to present a paper on what I’ve been working on. It was a fifteen minute presentation, but I covered the basics of XML files, what attributes of XML files allow for attackers to utilize it, what threat modeling is and how security professionals go about securing software from the ground up, and finally the general overview of what my parser does and the theory behind the feature selection. Above is the plaque I received for presenting, the gift I received for presenting all these years (this was my seventh presentation) and a thumb drive of all the junior conference papers. This was my last presentation, and I’m going to miss attending every year. Thank you to Mary, who runs the junior conference, and my friends Sophie and Ryan, for making this conference so enjoyable.

It’s been a blast


Week 4: Threat Testing, Research, and Feature Completion


Is it really March already? It feels like yesterday it was February and I was just starting at Avirtek. It’s been an eventful month. I went from knowing one programming language to about five, and I had to more or less teach the other four to myself. I now know Java (thank you Mr. B), Python (thank you Google/Stack Overflow/Python Documentation), SQL (thank you Code Academy), the Command Line (thank you again Code Academy), and XML (thank you W3 Schools). A month ago I only knew Java, which thankfully gave me an incredible foundation for learning the others. If I hadn’t had that, the rest would have been impossible.

Progress update on the XML parser: I finally (after a day or so of nearly tearing my hair out in frustration) connected my program to the local Avirtek database, so every time I run the parser it uploads all the XML file data. I had to teach myself regex, or Regular Expressions, in Python to extract one of the features, which was another day I wanted to pull my hair out. After I had successfully completed both of those tasks, I spent a solid five minutes celebrating because I had a working program and a full head of hair. That, my friends, is a small miracle.
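I can’t share the actual feature, but here’s the flavor of regex extraction in Python (a made-up example: pulling URLs out of href attributes):

```python
import re

# Made-up markup standing in for a slice of a real XML file.
sample = '<a href="http://example.com/x"/><a href="https://example.org/y"/>'

# findall returns every captured group, i.e. every URL, in one pass.
urls = re.findall(r'href="(https?://[^"]+)"', sample)
print(urls)  # → ['http://example.com/x', 'https://example.org/y']
```

One pattern and one call replace what would otherwise be a loop of string splits and index arithmetic, which is why regex was worth the hair-pulling.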

In terms of the data I had to extract from XML, I’m almost done extracting all the different features. I have about two more to go before that part of the project is finished. I have also discovered that feature extraction is only the first step in this process; after that, the real cyber security work begins with the data analysis.

I’ve had to do a decent amount of research (apart from Googling “how to do x in Python” every few seconds) as of late. Dr. Hariri gave me a paper that outlines what Avirtek was founded on and how they determine if files are malicious, and has been generous enough to go over it with me and help me understand it. He also gave me a textbook on threat modeling, which is basically a book on how to go about creating secure software. Finally, he gave me a list of common XML attacks. The original syllabus I had for this project went out the window a couple weeks ago, but I have plenty of research material to work with and the portions of the XML parser I can actually disclose will be my final product.

XML is the next step in what Avirtek is working on, which means I’m being phased in. I was actually mentioned in the meeting today. They’re moving along with the project, and the meetings are beginning to make sense. The more time passes, the clearer a picture I have of what’s actually being done.

In terms of my own project, once I finish with all the features I’ll move on to the data analysis and malicious file detection, which is the core of cyber security. Pulling attributes from XML was just the beginning. If you’re wondering why I’ve been talking about XML so openly, it’s because I’ve discovered that it isn’t sensitive information. Most of what I’ve done so far can be found in the research paper Dr. Hariri gave me (which is in the public domain; it’s actually an Israeli paper) or in the threat modeling textbook. There are portions I’ve had to omit, however, and for obvious reasons I can’t tell you which portions.

Finally, I’m going to be off next week. I’ll be in Montana attending the IEEE Aerospace Conference with my dad, and I’ll be presenting my own paper on XML at the Junior Conference. Did I mention the conference is held at a ski resort? It’s held at a ski resort. This is also my last year at the junior conference, so wish me luck!

To do list for when I get back: refine the parser (i.e. reduce runtime, consolidate functions, delete unnecessary variables, shift to object-oriented programming, ensure all necessary aspects of the features are detected and stored, etc.) and begin working on data analysis.
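
For the object-oriented shift, the rough idea is to wrap the scattered parsing functions into one class per file. This is only a skeleton of the plan; the class, method, and feature names here are placeholders, not the real ones.

```python
import xml.etree.ElementTree as ET

class XmlFeatureExtractor:
    """Sketch of the planned OOP refactor: one extractor object per XML file.

    Feature names are hypothetical placeholders.
    """

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def element_count(self):
        # Count the root element plus every descendant.
        return sum(1 for _ in self.root.iter())

    def features(self):
        # Returning one dict per file keeps the database upload step simple.
        return {"element_count": self.element_count()}

extractor = XmlFeatureExtractor("<a><b/><b/></a>")
print(extractor.features())  # {'element_count': 3}
```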

See you in two weeks!

 

Week 3: XML, Statistics, Meetings and Data

It’s week three at Avirtek, and I’m finally settling into my role in both the work Avirtek is doing and the new world of office life. This will probably be my most technical post so far, so bear with me.

Avirtek is a cyber security company, and I think every day I discover more about what both “cyber security” and “company” actually mean. The first thing I noticed when I started was that I wasn’t in school anymore. That might seem obvious, but the contrast between high school classes and an office existence is pretty sharp. I don’t have the pressure of tests and quizzes, nor do I have homework constantly hanging over my head. Rather than being tested on what I am supposed to understand, I’m being asked to figure out how to work on projects that I halfway understand. Google is my new best friend.

It took a week or so for both Avirtek and me to figure out what I should do, but Dr. Hariri began nudging me towards static analysis of XML from the very beginning. I sit at the computer 9-2 four days a week, working through a list of attributes I need to extract from the XML files. After meeting with Dr. Hariri last Friday and today, I’ve discovered that his plan is for me to complete the first phase of the XML analysis on my own (there are three major phases for the completion of a section of the project, and each phase takes about a month). He’s helped me read through the previous leading paper on XML, given me a book on software threat testing, and given me a list of common XML attacks. It boggles my mind that he thinks I can do this as a high schooler, but I’ll take my best shot.
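
To give a taste of what "common XML attacks" look like, one well-known class (XXE and the "billion laughs" expansion) relies on DOCTYPE and ENTITY declarations, so even a crude static check can flag their presence. This is an illustrative sketch, not how Avirtek actually detects anything; real detection is far more involved.

```python
import re

# Naive static check: DOCTYPE/ENTITY declarations are the building blocks
# of XXE and entity-expansion attacks, so their mere presence is a red flag.
SUSPICIOUS = re.compile(r'<!DOCTYPE|<!ENTITY', re.IGNORECASE)

def looks_suspicious(xml_text):
    """Return True if the raw XML contains DOCTYPE or ENTITY declarations."""
    return bool(SUSPICIOUS.search(xml_text))

benign = '<note><to>Ann</to></note>'
xxe = ('<?xml version="1.0"?>'
       '<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
       '<foo>&xxe;</foo>')
print(looks_suspicious(benign), looks_suspicious(xxe))  # False True
```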

Another thing I’ve discovered about cyber security is that it’s about half data analysis, though the data analysis has the end goal of recognizing malicious files. This became apparent because every week, a mathematician named Greg meets with Chintan, Doug, and Dr. Hariri to discuss the algorithms of the project, how to implement them in the code, etc. I sit there quietly and listen to them speak the language of statistics and algorithms that is WAY over my head. From what I gather, these meetings are used to bridge the gap between the data analysis side of cyber security and the programming side. Some of the jargon I understand, but most of it I don’t. Sitting at that table for the meetings is always interesting for me. Three of the people at the table have PhDs, and Chintan’s basically done with his undergraduate degree at the University of Arizona. Then there’s me, the student fresh out of (yes, BASIS, but still) high school. It’s as humbling as it is fascinating.

Then there’s office life: Avirtek has a fridge, a coffee machine, and a microwave. Just like Chintan, I eat my lunch at my desk while I work. I don’t have to move from class to class, and I can come and go as I please as long as I fill my required hours. The trade of freedom and trust for a higher level of responsibility has probably been the biggest shift to adjust to. Regardless, it’s one I greatly appreciate.

Next step: complete feature selection and begin database manipulation
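
The database side will look something like this. The real setup is the local Avirtek database, which I obviously can't reproduce, so this sketch uses an in-memory sqlite3 database and made-up column names just to show the one-row-of-features-per-file pattern.

```python
import sqlite3

# Stand-in for the local Avirtek database; table and column names are
# hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xml_features (filename TEXT, element_count INTEGER)")

def store_features(filename, features):
    """Insert one row of extracted features for a single XML file."""
    conn.execute(
        "INSERT INTO xml_features VALUES (?, ?)",
        (filename, features["element_count"]),
    )
    conn.commit()

store_features("sample.xml", {"element_count": 4})
print(conn.execute("SELECT * FROM xml_features").fetchall())
# [('sample.xml', 4)]
```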