Speed Up Your Data Processing with Macros

Data engineers and data scientists are all about data. Often, however, it's those Linux guys who are really good at moving and converting data with their cool utilities like awk, sed, tr, and nc, to mention a few. When it comes to preprocessing, data scientists can jump a bit too quickly to a full-blown script in their favorite language (Python, R, Julia, etc.). I have recently been working on a project that requires importing many Biblical references, and I was able to use a quick Vim macro to transform some structured text into a Python dictionary that I could use for preprocessing references.

Here's how it works. Each line of the text looks like the one below. On the first line, I put the cursor on the first character and start recording a macro into register a with q a. Because some of the book names have spaces in them, I need to keep that in mind when choosing the keystrokes that get replayed by the macro.

| Ge     | Genesis         |

What I need to do here is surround the "Ge" with quotes without assuming I can jump by spaces. So I use d W to delete up to the "G", insert a quote, then jump to the next pipe with f | and get to the end of "Ge" using b (back a word) and e (end of the word). At that point, I can add the closing quotation mark, delete the whitespace up to the pipe with d t |, and replace that pipe with a colon using r :.

For the expanded book name, I can take a similar approach. I jump to the beginning of "Genesis" with w, insert a quote, jump to the final pipe with f |, and then get to the end of the final word (in this case just "Genesis") with b e. Then I insert a quote and delete to the end of the line with D. To prepare for the macro being run over and over, I also jump back to the beginning of the line with ^ and move down to the next line with j.

Phew! With the movements completed, I press q to stop recording, which saves the macro in register a. Then I simply tell Vim how many times to replay it. In this case, I repeated it 71 times with 7 1 @ a, and the magic begins immediately. Afterwards, I just clean up a couple of the books I don't want and add some curly braces to make it into an actual Python structure.
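Once the lines are wrapped in braces (with a comma after each entry), the result is an ordinary dictionary literal. Below is a minimal sketch of how such a dictionary might be used to expand abbreviated references. The "1 Sa" entry, the "Ge 1:1" reference shape, and the names BOOKS and expand_reference are all assumptions for illustration; only the Genesis row appears in the text above.

```python
import re

# Hypothetical excerpt of the finished dictionary. Only the "Ge" entry
# comes from the article; "1 Sa" is made up to show a name with a space.
BOOKS = {
    "Ge": "Genesis",
    "1 Sa": "1 Samuel",
}

# Assumed reference shape: "<abbreviation> <chapter>:<verse>".
REFERENCE = re.compile(r"^(?P<book>.+?)\s+(?P<chapter>\d+):(?P<verse>\d+)$")


def expand_reference(reference):
    """Replace a known book abbreviation with its full name."""
    match = REFERENCE.match(reference.strip())
    if match is None:
        return reference  # leave anything unrecognised untouched
    # Fall back to the original abbreviation if it is not in the table.
    book = BOOKS.get(match["book"], match["book"])
    return f"{book} {match['chapter']}:{match['verse']}"


print(expand_reference("Ge 1:1"))
print(expand_reference("1 Sa 17:4"))
```

Unknown abbreviations and strings that don't match the assumed shape are returned unchanged, so the function is safe to map over a mixed list of references.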

This process may seem slightly tedious, but I can promise that if you practice with a tool like Spacemacs (or plain Vim), you'll be ready for these tasks. It'll save you plenty of time otherwise spent fiddling around with open, map, strip, and the other Python functions you might need. If you're interested in installing Vim, I've got a one-line installation (macOS and Linux) that gets you set up with most of the things you need: https://github.com/rcdilorenzo/vimfiles. I actually use Spacemacs (http://spacemacs.org) for my everyday work, but the Vim commands I've discussed work the same way there.
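To make the comparison concrete, a scripted version of the same transformation might look roughly like the sketch below. It is not code from the project: the function name parse_book_table and the "1 Sa" sample row are assumptions, and only the Genesis row comes from the text above.

```python
def parse_book_table(text):
    """Turn lines like '| Ge | Genesis |' into {'Ge': 'Genesis'}."""
    books = {}
    for line in text.splitlines():
        # Drop the outer pipes, split on the inner one, and trim padding.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2 or not all(cells):
            continue  # skip blank or malformed lines
        abbreviation, name = cells
        books[abbreviation] = name
    return books


# Sample input: the first row is from the article, the second is a
# made-up row showing an abbreviation that contains a space.
SAMPLE = """\
| Ge     | Genesis         |
| 1 Sa   | 1 Samuel        |
"""

books = parse_book_table(SAMPLE)
print(books)
```

It's not a lot of code, but it still means opening a file, deciding how to split and trim each line, and running it, where the macro just edits the text in place.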

Happy data mining!
