I had an interesting discussion with a highly educated, self-proclaimed computer-literate professional about the process of deduping emails. The interesting part is that I couldn’t believe what I was hearing about his process for deduping the files.
https://www.merriam-webster.com/dictionary/self-taught
I’ll sanitize this story to protect the guilty. So, here is the scenario.
Step 1: Find exact duplicates in a batch of 3,000 emails (.msg format)
That’s it. No step 2 or 3 or 4. Simply find the duplicate emails from a folder of emails.
I know what you are thinking: you would just drop the files into an app like HashMyFiles (https://www.nirsoft.net/utils/hash_my_files.html), or maybe even get fancy by creating a case in your favorite forensic suite, adding the emails as evidence items, and outputting a formal report, which would add maybe 5 or 10 minutes to the process.
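If you want to see just how little effort the hashing approach takes, here is a minimal sketch in Python. It is my own illustration, not the process anyone in this story used, and the folder path is a placeholder. It hashes every .msg file in a folder with SHA-256 and reports the byte-for-byte duplicates, which is the same idea a tool like HashMyFiles implements:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Hash every .msg file under `folder` and return only the hash groups
    that contain more than one file (exact byte-for-byte duplicates)."""
    groups = defaultdict(list)
    for msg in Path(folder).rglob("*.msg"):
        groups[hash_file(msg)].append(msg)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    # "C:/emails" is a placeholder path -- point it at your own folder of .msg files.
    for h, paths in find_duplicates("C:/emails").items():
        print(h)
        for path in paths:
            print("    ", path)
```

On a folder of 3,000 messages, a script like this should finish in well under a minute on ordinary hardware.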
Either way, the total processing time to find the exact duplicates would be about a minute. Here is where it gets a little interesting. The process that was described to me was way more elaborate. It went something like this:
- Import the emails into MS Outlook.
- Print the inbox.
- Compare the titles of the printed inbox against emails in a folder.
- Export the emails to a spreadsheet.
- Use Excel to remove duplicates.
- Visually compare each email in the spreadsheet against the emails in a folder.
Deduping the emails this way took 60 hours, and strangely, the IT pro was bragging about how long it took.
Speed test!
This is what it looks like when compared to using a free file hashing utility.
| Nirsoft HashMyFiles | Microsoft Excel |
| --- | --- |
| 1 minute | 3,600 minutes (60 hours over several weeks) |
This would be fine if there were no resources available to know otherwise, if you had no training or education in technology, if you were physically unable to ask anyone for advice, and if you had never been exposed to file hashing before. However, in this instance, not a single resource was used. The IT professional didn’t use anything taught formally in either his BS or MS degree, nor anything from the CompTIA courses he completed; he didn’t ask anyone how to do this, and he didn’t even search the Internet to see how to find duplicate files. That might normally be ok, but not here.
The problem is that this IT pro intentionally didn’t ask for help or search online for a process, and boasted that “this is the way we do it in this field; by being self-taught.” With that statement, I figured that if one person thinks this is the right way, maybe others do too, and therefore this post needs to be written.
There are many right ways to self-learn. This was not one of them.
I am a big believer in self-learning. We learn better when we learn information on our own. It is as if we discovered the information ourselves; therefore, we “own” it and can be proud of it. But there is a line between self-learning and simply doing it wrong, and worse, doing it wrong on purpose.
Being self-taught means that you first look for the answers (or the processes) that others have discovered. You can modify and improve upon processes that exist, but you use these as a starting point of self-teaching.
An analogy
I once built a motorcycle from the frame up. I had no idea what I was getting into. This was years before the Internet, so my only resources were a friend who knew a lot about motorcycles and my local library. It took me a summer to build the bike, but I could not have done it without help from someone who knew what he was doing and the books I checked out of the library.
Had I not asked for help or researched in a stack of manuals, I would have ended up with boxes of parts for a garage sale. Instead, I had a bike that I fully built myself.
Self-taught means that you learned outside a classroom. It means that you used resources available to learn, such as books, Internet searches, and asking others to show you. Of course, being self-taught includes practice and experimentation, but even that requires some resources as a baseline of where to start.
Excel
It might not be a stretch to say that practically everyone in DFIR is competent with spreadsheets. Excel is a flexible and necessary tool in DFIR to view, analyze, and display data. But just because you dump data into Excel does not mean you are using it correctly.
Dumping emails into a spreadsheet to find duplicates because it seemed like the best way, when there are probably dozens of applications (free, open-source, and commercial) that can do the task more easily and without error, goes directly against the meaning of being self-taught. It would be the same as me buying every nut, bolt, and part of a motorcycle and trying to put it together blindly in order to call myself self-taught in building a motorcycle.
So now, when I hear that someone is self-taught, I have to dig a little deeper to get the details. If I hear that being self-taught involved deep research, replicating what others have done, and improving upon their work, only then will I believe that the person was truly self-taught. To do otherwise is to waste time and do the direct opposite of learning.
Self-teaching advocate
Once you become competent in any field, self-learning is what you do for the rest of your career. You will always “self-learn” a process new to you by seeing someone else do it or write about it. Then you replicate it. Eventually, you improve upon it. And if you share it, it will further be improved upon by others. If you are lucky, you have co-workers who share what they learned with each other, which takes team competence to much higher levels.
For managers, be aware of those who would rather learn absolutely everything on their own without some sort of process (research > ask > replicate > improve). Blindly trying anything is likely to waste time and make things worse. It will be a net negative and can border on intentional incompetence.
For practitioners, “trying something new” is all well and good, but before spending 60 hours on something, spend 6 minutes to see if what you want to do has already been done before. If it has, then you can replicate it. Use that 59 hours and 54 minutes of time you just saved to improve upon your replicated process.
Leaps and bounds
Do you ever wonder why some in DFIR jump so fast and so far ahead of others? It is not usually because they have a higher IQ. They are smarter, though. They are smarter in the sense that they know to RTFM (aka: research first). With a firm foundation, their experimentation starts at a higher level and propels them ahead as if they had booster rockets.
Those who start from scratch and intentionally choose not to do even the barest minimum of research not only have no foundation on which to build, but will also learn the wrong way to do DFIR work. This is not only not moving forward; it is moving backward.
The deduping emails story
The end result of this email-deduping story is that the IT pro was proud of the time spent because it was an “exhaustive effort”. Yet the emails were not actually deduped: the IT pro admitted he was unsure whether some emails were exact duplicates, so they were produced anyway (not a single email was even hashed). All of this wasted time could have been avoided with a phone call, a question to someone else in the IT shop, or just one Internet search. Instead, we have self-taught incompetence that wasted weeks of work and delivered a defective work product.