Write your own! On having better habits as an R programmer
I contribute responses to Stack Overflow pretty frequently. I like answering well-written questions and enjoy that it keeps my skills sharp. However, one thing that annoys me on Stack Overflow is that many answers start with “You can do this using the <insert package name here> package” – even when the task at hand can be handled in base R. For many posters, that’s probably not a big deal, but I occasionally get those answers on my own questions, even when I explicitly ask for base R solutions.
“So what?” I can hear you asking it already, and it’s a valid question. After all, one of the great benefits of R is that you can tap into the collective talent of thousands of statistical programmers across the globe. In part, that’s what makes R such a powerful tool for data scientists and statisticians – the fact that it is, for all intents and purposes, the “bleeding edge” of statistical methods development. If you want to find someone working with a new type of analysis, you look for their R code. You can know for sure that it won’t be included in SAS for at least five years, if ever. (That’s not a slam on SAS, per se – it’s a recognition that the two tools are used for different things.)
But I suggest that there are many reasons to limit use of third-party packages, and that in the context of Stack Overflow, it is as much a detriment as it is a benefit. So, my proposal is this: the default position of all R programmers (and especially new R programmers) should be to “do it in base R” for a lot of bread-and-butter tasks, and external packages should be limited to a) specialized tasks that would take an inordinate amount of time to code manually, or b) analytic methods where the published packages are written by the people developing the methods. (To be clear, here, I define “base R” as including the packages that come with R in a clean install.) Forcing yourself to write your own solutions will make you a better R programmer, and will make your code more sustainable over the long run.
Using a nuclear reactor to power a lawn mower
A common question on Stack Overflow might come from a relatively new R programmer, who – perhaps coming from a different language – is trying to manipulate data in preparation for analysis. This is not “sexy” data science work. This is the “janitorial” duty that you do prior to doing some cool analysis, or some cool visualization. It is absolutely important and essential to doing good analysis, but it’s also not exactly the most exciting work.
Our new user is probably asking questions that show how little R programming they’ve done (no harm there – we were all beginners in R once). They might not understand how R is different from other languages. Instead of nice vector operations, they are writing nested loops and processing data line by line. Rather than defining their own functions, they are creating tons of temporary objects throughout the global environment. They encounter something that simply seems intractable to them, and they log into Stack Overflow and ask “How can I do X?”
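The contrast is easy to show with a minimal sketch (the data here is made up for illustration). The loop version works, but the vectorized version is what idiomatic R looks like:

```r
# Hypothetical data: a few numeric measurements
x <- c(4, 9, 16, 25, 36)

# Loop style, as a newcomer from another language might write it:
doubled <- numeric(length(x))
for (i in seq_along(x)) {
  doubled[i] <- x[i] * 2
}

# Idiomatic R: operate on the whole vector at once
doubled2 <- x * 2
flagged  <- x > 10   # a logical vector -- no loop, no temporary counters

identical(doubled, doubled2)  # TRUE
```

Both produce the same result, but the vectorized version is shorter, faster, and reads as a single statement of intent.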
Next, experienced R programmers look at the problem and say “Well, that’s really not nearly as hard as it’s being made – you can do it this way!” A few folks start writing answers, and inevitably, one of them will be like “Just use the dplyr function X to do this.” (Insert whatever package you want in place of dplyr.) Now, I want to make clear that I have no issues with dplyr. It is an awesome tool and it’s certainly published by someone with impeccable R credentials (the one and only Hadley Wickham). Using dplyr is a great option for some particularly annoying problems.
My problem with this approach is that it unnecessarily overcomplicates relatively simple things. Most of the time, users are being told to install dplyr, or reshape2, or another package, to do something that can be done quite easily in base R. It may not be quite as convenient as using a third-party package, but it is very doable. But, instead of being told to create appropriate R code to fulfill their needs, users are being told to create a dependency to an independent third-party package in order to merge some data.
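Merging is a good concrete case. Here is a sketch with two made-up data frames showing that base R’s merge() handles the common join patterns directly, with no package installation at all:

```r
# Two small made-up data frames sharing a key column
patients <- data.frame(id = 1:3, age = c(34, 51, 28))
visits   <- data.frame(id = c(1, 1, 3), bp = c(120, 118, 131))

# Base R does the join directly:
combined <- merge(patients, visits, by = "id")                 # inner join
all_pts  <- merge(patients, visits, by = "id", all.x = TRUE)   # left join
```

The inner join keeps only patients with visits; the left join keeps every patient, filling bp with NA where no visit exists.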
To me, this is akin to building a nuclear reactor, and then using it to power your lawn mower. You’ve created a much more complicated system to accomplish a relatively simple task. By all means, you should use a package when the task is very complicated or would take a long time to implement using only base R. But a third-party package shouldn’t be the first thing you reach for when encountering a programming problem; that encourages bad habits and does not force you to learn how things actually work in your data and your program.
Learning bad habits makes bad programmers
New users don’t know a lot about R, and teaching them to reach directly for a third-party package means that they never have to understand why things work the way they do in R. I will absolutely agree that dplyr makes a lot of menial tasks much easier to do, and if you are an experienced R programmer, that time savings can be very useful. The problem with teaching new users to use dplyr first is that they never learn that R is a vectorized language, or why you might use one of the *apply() functions, or why we try to avoid loops.
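The group-and-summarize pattern that sends so many new users to dplyr is a case in point. A minimal sketch with made-up data, using only base R:

```r
# Made-up data: compute a group mean without any packages
df <- data.frame(group = c("a", "a", "b", "b"),
                 value = c(1, 3, 10, 20))

# tapply and aggregate both cover the "group by + summarize" pattern:
means <- tapply(df$value, df$group, mean)            # a = 2, b = 15
agg   <- aggregate(value ~ group, data = df, FUN = mean)

# And sapply() applies a function over each element of a list:
sapply(split(df$value, df$group), mean)
```

Learning these first teaches a new user how R actually thinks about data; reaching for a package later is then an informed choice, not a crutch.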
To some extent, you might shrug and say “Who cares?” I posit that we should all care when answering questions for new R users, because making better R users requires that people understand how R works. Similarly, we should all care about how we program, because how we program defines what kind of programmers we become. If you teach people to always go to a third-party package, you let them circumvent learning how R works. If you reach for a third-party package to do everything, you circumvent needing to flex your skills to meet new challenges. It is, to put it bluntly, lazy. And when being lazy becomes a habit, we become bad programmers.
I started programming in R almost five years ago, but I didn’t take the plunge to using it full-time until around 2012. My abilities have grown a lot since I first started using R, and I’ve done a lot of data manipulation with the language. (My previous job, after all, was as a clinical data manager.) I am confident in saying that I am a substantially better R programmer because I have forced myself to write my own functions and to explore how to solve problems without a load of third-party packages.
Am I a purist who never uses an external package? Hardly. I use third-party packages for things that would be extremely time-consuming to implement myself. For instance, I used hwriter at one point while generating reports, because writing my own functions to generate HTML tables would have been dozens of hours of work for something that the package did very quickly. Similarly, when I was doing some random forest analyses, I downloaded and used the randomForest package, given that it was ported from the original Fortran written by the people who invented the method. (I’m guessing the port is better than anything I could produce…)
The message here is not “Avoid all packages!” The message is “Think long, hard, and carefully when including a third-party package in your code, to make sure it is the best choice given the situation.”
Reproducibility and good practices
As mentioned above, including a third-party package in your code makes it more complex, and that complexity increases with each new package. Revolution Analytics recognized this fact when they came up with the Reproducible R Toolkit, which lets you “lock” your code to a point in time so that the results are consistent. It’s a fantastic system that helps to make your code reproducible, which is especially important for programmers working in a research context, or with regulated studies (my old world).
Using the Reproducible R Toolkit (RRT) is a good way to deal with a situation where you need a third-party package for your analyses (e.g., using randomForest for an analysis) but you want to make sure your code is stable over time. But a lot of code doesn’t even need this – it just needs to be written well and with an eye toward long-term stability.
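For those unfamiliar with it, the RRT works through the checkpoint package, which installs packages from a dated CRAN snapshot. A minimal sketch of how a script pins itself to a point in time (the snapshot date here is hypothetical):

```r
# Environment-setup sketch: pin this script's packages to a CRAN
# snapshot so anyone re-running it later gets the same versions.
# The date below is a placeholder -- use the date your analysis ran.
library(checkpoint)
checkpoint("2015-04-26")

# Everything loaded after this point comes from that snapshot:
library(randomForest)
```

Note that this is extra machinery: if your script uses only base R, there is nothing to pin in the first place.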
Remember, this is about the habits you build as a programmer. If you make it a habit to write every program to be stable and to rely only on internal code, you find yourself in few situations where the RRT is needed. If your habit is reaching for a bunch of packages to do even the simplest things, you will end up needing the RRT a lot – and that’s the same problem as relying on a bunch of packages, only with extra complexity (after all, the RRT is itself just another package).
Ultimately, there’s nothing wrong or evil about third-party packages, and the goal of this post is not to convince anyone that using a third-party package is wrong. The point of this – and my plea to teach new users to do things in base R, before handing them third-party packages – is to help make us all better R programmers. Building a habit of solving problems rather than taking a lazy approach makes you a better R programmer. Teaching new users to solve their own problems when possible makes them better R programmers. And the better we all get, the better the R community becomes, and the more powerful R is as a tool for all of us.
So…to sum it up: write your own! Make yourself learn how R works, rather than taking the path of least resistance. Don’t settle for a package just because it did what you wanted quickly; make sure you understand why it worked. And make a habit of pushing yourself to be better. In the long run, you’ll be thankful you took this road. Like all tools, R is best used when you understand how it’s supposed to work.