Last time I posted about the XML parser, I thought I was almost done with the feature selection part of the process. I had finished 22 of 25 features; how hard could the last few be?
Well, the duration feature alone took over a week.
There were really only two features I had to worry about, one of which is something called a duration: a set amount of time (three years, ten days, six minutes) as opposed to a date (March 16, 2017). In XML, durations are written in this format: P nY nM nD T nH nM nS, with no spaces in the actual string (for example, P3Y6M4DT12H30M5S). The P announces it as a duration, the T separates the calendar from the clock, and the n’s can be pretty much any number to specify the duration. The other letters represent years, months, days, hours, minutes, and seconds, respectively. Doesn’t seem like it would be that hard to isolate, run statistics on, and put into a database, right?
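To make the format concrete, here is a minimal sketch that pulls the fields out of a duration string with a regular expression. The names DURATION_RE and split_duration are made up for illustration, and the pattern assumes non-negative durations with decimals allowed only in the seconds field.

```python
import re

# Pattern for the P nY nM nD T nH nM nS layout (written without spaces).
# Assumption: no leading minus sign, decimals only in the seconds field.
DURATION_RE = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def split_duration(text):
    """Return the fields present in a duration string as a dict of strings."""
    match = DURATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    fields = {k: v for k, v in match.groupdict().items() if v is not None}
    if not fields:
        # A bare "P" or "PT" names no amount of time at all.
        raise ValueError(f"duration has no fields: {text!r}")
    return fields


print(split_duration("P3Y"))             # three years
print(split_duration("P10D"))            # ten days
print(split_duration("PT6M"))            # six minutes: this M comes after the T
print(split_duration("P1Y2M3DT4H5M6S"))  # every field at once
```

Note that M means months before the T and minutes after it, which is why the T cannot simply be thrown away.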
Ha. Ha. Ha.
Monday, my first day back, was composed of googling a thousand variations of “how do I convert durations into a usable format in Python?” without much luck. Then I discovered a module called isodate, which had a built-in function to convert durations in the XML format into the isodate format that Python understands. For a moment, I thought I had it figured out. Now it was just a matter of running the statistics.
Nope. Running the statistics requires being able to perform basic operations with the durations, such as addition, subtraction, multiplication, and division. The isodate module, I discovered, cannot do that. The next step was figuring out how to convert the result into something else Python could understand AND perform operations with. I found an object type called a timedelta, and the documentation said isodates could be converted into timedeltas. Eureka, right? Wrong. Again. It took me a while, but I finally found the algorithm to convert isodates into timedeltas. I took one look at it and knew it was way over my head. I spent an entire day discovering that isodate is useless.
On the other hand, I found out that timedelta was the object type I needed to convert the durations into. That was my Tuesday and Wednesday. I had to split up the duration and put it into a list, ensure all the necessary elements were there to fit the duration format, take out the letters, then hard-code the conversion of months and years into days and weeks, because timedeltas don’t accept years or months. Finally, I had it converted into a usable format. Now it was down to running statistics.
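The steps above can be sketched roughly as follows. This is not the code from the project, just one way to do it with the standard library. The function name duration_to_timedelta is hypothetical, and the 365-day year and 30-day month are assumed conversion factors, since the post does not say which ones were used.

```python
import re
from datetime import timedelta

DAYS_PER_YEAR = 365   # assumption: hard-coded approximation
DAYS_PER_MONTH = 30   # assumption: hard-coded approximation


def duration_to_timedelta(text):
    """Convert a string like 'P1Y2M3DT4H5M6S' into a timedelta."""
    if not text.startswith("P"):
        raise ValueError(f"not a duration: {text!r}")

    # Split the calendar part from the clock part at the T.
    date_part, _, time_part = text[1:].partition("T")

    # Separate each number from its letter, e.g. "3D" -> ("3", "D").
    number = r"(\d+(?:\.\d+)?)"
    date_fields = {unit: float(n) for n, unit in re.findall(number + "([YMD])", date_part)}
    time_fields = {unit: float(n) for n, unit in re.findall(number + "([HMS])", time_part)}
    if not date_fields and not time_fields:
        raise ValueError(f"duration has no fields: {text!r}")

    # timedelta has no years or months, so fold them into days.
    days = (
        date_fields.get("Y", 0) * DAYS_PER_YEAR
        + date_fields.get("M", 0) * DAYS_PER_MONTH
        + date_fields.get("D", 0)
    )
    return timedelta(
        days=days,
        hours=time_fields.get("H", 0),
        minutes=time_fields.get("M", 0),
        seconds=time_fields.get("S", 0),
    )


print(duration_to_timedelta("P15DT10H40S"))  # 15 days, 10:00:40
print(duration_to_timedelta("P1Y2M"))        # 425 days, 0:00:00
```

Because the year and month lengths are approximations, anything computed from durations that include them is approximate too.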
Then I discovered that 15 days, 10:00:40 (hours, minutes, seconds) does not equal (!=) -15 days, 10:00:40. The negative values come from comparing the durations to reference datetimes. Thankfully, the algorithm to compare timedeltas was pretty simple and I had already found it when I was digging through the isodate documentation. Finally, at the end of yesterday, I completed the duration feature extraction.
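A small illustration of why those two values differ, using only the standard library. A timedelta keeps the sign in the days field alone, so "-15 days, 10:00:40" means minus fifteen days plus ten hours, not the mirror image of the positive value. The variable names and sample values here are just for illustration.

```python
from datetime import timedelta

positive = timedelta(days=15, hours=10, seconds=40)
odd = timedelta(days=-15, hours=10, seconds=40)

print(positive)          # 15 days, 10:00:40
print(odd)               # -15 days, 10:00:40
print(odd == -positive)  # False
print(-positive)         # -16 days, 13:59:20  (the true negation)
print(abs(odd))          # 14 days, 13:59:20

# Once everything is a timedelta, the basic statistics are plain arithmetic.
durations = [timedelta(days=1), timedelta(days=2), timedelta(days=6)]
mean = sum(durations, timedelta()) / len(durations)
print(mean)              # 3 days, 0:00:00
print(max(durations) - min(durations))  # 5 days, 0:00:00
```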
With the duration feature and its statistics complete, only one feature is left to extract before I’m done with this phase of the project. I’m sure the final feature won’t be anywhere near as hard as the last one, right?