My R Coding Convention

It seems like many R programmers (probably, many programmers in general) end up writing a post of this type, so I decided to jump on the bandwagon. I recently switched jobs, so I am at a nice point to make a “fresh start” with my coding conventions: I am not facing the need to refactor years of my programs to be consistent.

On top of this new start, over the last few months, I’ve caught my R conventions evolving – and also becoming inconsistent. The genesis of this was my realization, thanks to Hadley Wickham’s style guide, that embedding dots in the names of my user-defined functions was actually rather bad practice, since S3 builds method names by joining the generic and the class with a dot (e.g. plot.function, plot.ts, and so on). Prior to that, I had been fairly consistent in naming my user-defined functions using the prefix func. (for instance, func.query_model_data).
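
To make the S3 issue concrete, here is a minimal sketch. The function name print.report and the class name report are hypothetical, picked only to show how a dotted name can be mistaken for a method.

```r
# A hypothetical helper whose dotted name happens to read as <generic>.<class>
print.report <- function(x,...) {

    cat("I was only meant to be a helper named print.report\n")
    invisible(x)
}

# Any object with the class "report" now dispatches to that helper when printed
obj_report <- structure(list(value = 1),class = "report")
printed_lines <- capture.output(print(obj_report))
```

Nothing in the definition says "this is a method", yet print() picks it up anyway, which is why a dot-free prefix is the safer choice.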

Anyway – ever since I realized this issue, my programming style has been kind of fluctuating, because I never sat down and decided what to do now that my intuitive approach was not ideal. Today, I’m aiming to rectify that!

While this set of conventions is focused on my R programming, some aspects of it touch on the way I organize my Git repositories. I try to put all analytic code that I write into source code management (the vast majority of the time, this is Git), even if I am the only person working on it. This has three key benefits:

  • As the project evolves, I can see what I used to do, and I can always go back and reproduce an analysis (yeah reproducible research!)
  • If I need to hand off a project to someone else, it comes with pre-built documentation and history.
  • I often have to come back to something after months away from it. Having the repository there makes it easier to figure out what the hell I was doing when I last worked on it.

In any case, the way this affects my R programming is mostly in the world of file names, which is first up for discussion. I didn’t write this in a way that always makes a clean distinction between language rules and style rules, like the Google R Style Guide, but I’ve tried to group these logically.

File Names

  • File Extension Rules: Files should end with capitalized extensions, consistent with the defaults provided by RStudio.
    • R files should end with .R, not .r. (I used to struggle against this, preferring .r, but I’ve given up.)
      • GOOD: file.R
      • BAD: file.r
    • R Markdown files should end with .Rmd, not .rmd.
    • Sweave files should end with .Rnw, not .rnw.
    • R Presentation files should end with .Rpres, not .rpres.
  • File Naming Rules: Begin each file name with the type of file, followed by a short project name, and then a descriptive name of what it does. Separate words using underscores. Do not use capital letters in the file name, unless they are used in an acronym.
    • GOOD: rcode_dm3study_query_survey_results.R
    • GOOD: rcode_dm3study_PHQ9_calculator.R
    • GOOD: rmarkdown_dm3study_recruitment_report.Rmd
    • GOOD: sql_dm3study_refresh_reports.sql
    • BAD: querysurveyresults.R
    • BAD: rcode_QuerySurveyResults.R
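
The naming rules above can be checked mechanically. The following is only a sketch: the function name, the list of type prefixes (rcode, rmarkdown, sql), and the list of extensions are assumptions drawn from the examples, and "acronym" is approximated as a word written entirely in capitals and digits.

```r
# Hypothetical checker for the file naming rules; prefixes and extensions are assumed
uFunc_IsValidFileName <- function(file_names) {

    # type prefix, then at least two underscore-separated words (project + description),
    # each word either all lower-case or all upper-case, then a capitalized extension
    name_pattern <- "^(rcode|rmarkdown|sql)(_([a-z0-9]+|[A-Z0-9]+)){2,}\\.(R|Rmd|Rnw|Rpres|sql)$"

    grepl(name_pattern,file_names,perl = TRUE)
}

good_names <- c("rcode_dm3study_query_survey_results.R",
                "rcode_dm3study_PHQ9_calculator.R",
                "rmarkdown_dm3study_recruitment_report.Rmd",
                "sql_dm3study_refresh_reports.sql")
bad_names <- c("querysurveyresults.R",
               "rcode_QuerySurveyResults.R",
               "file.r")

good_results <- uFunc_IsValidFileName(file_names = good_names)
bad_results <- uFunc_IsValidFileName(file_names = bad_names)
```

With these inputs, every entry of good_results should be TRUE and every entry of bad_results should be FALSE.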

Program Structure

I am a big fan of using RStudio. As such, I make certain structural changes to my program to take advantage of some of RStudio’s features.

  • Sections: Separate the program into logical portions using sections. Sections in RStudio are collapsible, which can be a useful feature.
    • Define sections using # Section Name ----. Make section names descriptive without being excessively long.
    • Every program should include, at the start, the following sections in the following order:
      • # Program Description ----
        A place for the preamble, described below.
      • # Load Packages and Set Options ----
        All packages used in the program should be loaded here. General program options should be set here, though they can be changed throughout the program as needed. As a default practice, I generally set stringsAsFactors to FALSE.
      • # Obtain Runtime ----
        A block of code storing the execution time of the program. Used to create time-stamped outputs, as discussed below.
      • # Declare User Functions ----
        All user-defined functions should be declared here, rather than throughout the program. Absent very compelling reasons, programs should be self-contained and should not source() other programs. Programs without user-defined functions should note this in a single comment.
    • Programs should end by declaring a final section, called EOF. This allows us to support RStudio collapsing the last section header if needed. Place a single line below the section header.
  • Hashbang: If the program requires a hashbang (#!) to indicate what program should execute it, place it above all other lines. A blank line should separate it from the first header.
  • Preamble: All programs should have a preamble, set under the Program Description header.
    • Enclose the preamble in a box constructed out of hash symbols (#) on the sides and dashes (-) on the top and bottom. It should appear as follows:

      #-----#
      #     #
      #     #
      #-----#

    • Put a line between the end of the preamble and the next header.
    • Begin typing content two spaces to the right of the # that demarcates the left side of the box. Leave at least one space between the end of each line and the # that demarcates the right side of the box.
    • The preamble should include the elements below. Either place the element name, followed by a colon, and then the content (with a hanging indent to align text), or place the element name on one line and the content on the following line.
      • The program name, written in a simple statement.
      • The file name for the program.
      • The author, along with a job title if relevant, and the author’s email.
      • The function of the program. This is generally a short paragraph describing what the program is intended to do.
      • Anything on which the program depends to run. This could include libraries, connections, data sources, or anything that affects the usability of the program. I don’t generally include actual data files here, but I do consider database connections and the like.
  • Line Breaks and Carriage Returns: Put space between code that is not conceptually linked. Group together lines that are related.
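
Putting the structural rules together, a skeleton program might look like the sketch below. The program name, file name, author details, and the variable name run_timestamp are all placeholders invented for illustration.

```r
#!/usr/bin/env Rscript

# Program Description ----
#--------------------------------------------------------#
#  Program:   Example survey summary                     #
#  File:      rcode_dm3study_example_summary.R           #
#  Author:    A. N. Analyst, analyst@example.com         #
#  Function:  Illustrates the section layout.            #
#  Depends:   Base R only; no database connections.      #
#--------------------------------------------------------#

# Load Packages and Set Options ----
options(stringsAsFactors = FALSE)

# Obtain Runtime ----
# Stored once so that every output of this run shares the same stamp
run_timestamp <- format(Sys.time(),format = "%Y%m%d_%H%M%S")

# Declare User Functions ----
# This program has no user-defined functions.

# EOF ----

```

Each line ending in four dashes becomes a collapsible section in RStudio, and the EOF section gives the final real section something to fold against.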

Names and Object Management

  • Regular Variables and Objects: All lower-case words, separated by underscores. Data frames, data tables, matrices, and lists should be prefixed with an appropriate type identifier, such as df_. I use df_ for data frames, dt_ for data tables, mat_ for matrices, and lst_ for lists. Objects which are conceptually linked together can include a prefix indicating this linkage (for instance, site1_location, site1_projection). The type identifier should always precede the grouping prefix, if it is used.
    • GOOD: lst_site1_active_physicians
    • GOOD: df_all_patient_records
    • GOOD: dt_hospital1_discharge_notes
    • BAD: doctors
    • BAD: site1_dt_bills
  • Temporary Objects: Avoid when possible. If absolutely necessary, indicate their nature by using TEMP_ before any other name elements. Remove these as soon as possible.
    • GOOD: TEMP_carryover_value
    • BAD: tempnumber
    • BAD: mat_fitdata_temp
  • Looping Variables: If using single letters, consider avoiding x, as this is a common argument name in functions. When possible, use short but descriptive names. Remove looping variables immediately after the loop.
  • User-Defined Functions: Prefix user-defined functions with uFunc_. Write user function names in CamelCase, rather than using underscores. I generally hate CamelCase, but this does make it easier to see what each component is within your program.
    • GOOD: uFunc_QuerySystemDatabases()
    • BAD: doquery()
    • BAD: User_Func_Query_Data()

    Edit: It has been pointed out to me that formally, CamelCase requires that the first letter be lower-case, while PascalCase requires the first letter to be upper-case. The portion above containing the actual function name would be PascalCase, but the uFunc_ at the start is properly CamelCase. The inclusion of an underscore throws the entire thing into disarray, of course. Long story short: I use some bizarre hybrid of CamelCase, PascalCase, and my own thing, but I think the explanation above is clear.
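
As a short sketch of how these naming rules combine in practice (all object names, function names, and data values below are made up for illustration):

```r
# Type identifier first, then the grouping prefix, then the description
df_site1_visits <- data.frame(patient_id = c(101,102,103),
                              visit_count = c(2,5,1))
lst_site1_active_physicians <- list(cardiology = "Dr. A",oncology = "Dr. B")

# User-defined function: uFunc_ prefix, then a capitalized name without underscores
uFunc_CountFrequentVisitors <- function(df_visits,min_visits = 2) {

    sum(df_visits$visit_count >= min_visits)
}

site1_frequent_visitors <- uFunc_CountFrequentVisitors(df_visits = df_site1_visits)

# Temporary object flagged with TEMP_, looping variable given a descriptive name
TEMP_visit_total <- 0
for(visit_row in seq_len(nrow(df_site1_visits))) {

    TEMP_visit_total <- TEMP_visit_total + df_site1_visits$visit_count[visit_row]
}
rm(visit_row)

# Keep the result under a regular name, then remove the temporary object
site1_visit_total <- TEMP_visit_total
rm(TEMP_visit_total)
```

The loop is there only to show the naming and clean-up; in real code sum() would do the same job in one line.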

Code Layout

  • Line Length: Many people say you should keep your line length under 80 characters. I find this insane in the modern era of 1080p monitors…but there is merit to the idea of not letting your lines get too wide. I try to keep lines less than 200, and probably end up feeling uncomfortable intuitively around 120 – 150 characters.
  • Function Arguments: When giving a function arguments, place a space between the argument name, the equals sign, and then the value (for instance, time = 3). When nesting function calls, or using functions with many arguments, put each argument on a separate line, indented so they are vertically aligned.
    • GOOD: plot(x = c(1,2,3),y = c(4,5,6))
    • GOOD:
      plot(x = c(1,2,3),
           y = c(4,5,6))
    • BAD: plot(x=c(1,2,3),y=c(4,5,6))
  • Indenting: By default, use 4 spaces. It’s good enough for Python, so it’s good enough for me! Use indenting to vertically align things when this improves the readability and maintainability of the code.
  • Spacing: Use spaces around all mathematical operators. Also, place a space on either side of the assignment operator and equals signs. Do not place spaces after the commas that separate function arguments. Place a space before curly brackets; do not place a space before parentheses.
    • GOOD: add <- 3 + 2
    • GOOD: div <- 3 / 5
    • GOOD: while(z < 3) {...
    • BAD: mult <- 9*1
    • BAD: for (x in ...
  • Curly Brackets: Put the first curly bracket on the same line as the element using it (for, while, function and so on). Put the last curly bracket on its own line, unless you are using an else statement; in those cases, the curly bracket and the else may be on the same line, separated by a space. After an opening curly bracket, put an empty line before beginning code contained within the brackets. Indent the close curly bracket to the same level as the line where the curly brackets opened. Except for extraordinarily trivial code lines, do not put both open and close curly brackets on the same line.
    • GOOD:
      for(x in 1:3) {

          x2 <- x * x
      }
    • GOOD:
      if(x > 2) {

          print("More than 2")
      } else {

          print("2 or less")
      }
    • ACCEPTABLE: for(x in list_of_docs) {print(x)}
    • BAD:
      if(x > 2)
      {...
    • BAD:
      for(x in 1:3) {

          x2 <- x * x}
    • BAD:
      for(x in 1:3) {
          x2 <- x * x}
  • Assigning Values: Use <- exclusively – do not use = to make any assignments.
  • Semi-Colons: Half colon, all party, these guys are great. But R does not need them and you should not use them.

Program Output and Folder Setup

  • Folder Structures: For each distinct project, create a data/ folder and an outputs/ folder. Exclude these from your repository using the .gitignore file. Data files should be stored in the first folder, while outputs should be sent to the second folder. (In cases where outputs must also go somewhere else, either route the output both places using your R code, or create a wrapper script that moves files afterward.)
  • Timestamps: All output files produced by your program should be timestamped with the date and time, formatted as YYYYMMDD_HHMMSS_. Use 24-hour times. If time zones are relevant because your project covers multiple locations, make all of your timestamps UTC (GMT +0). Outputs should have descriptive names explaining what was done, and should include the short project name used to identify the code files. Avoid using version numbers in your names. In the case where you are producing many related outputs of the same type, you may omit the timestamp on the file if you place multiple outputs into a single output folder. In that case, your output folder should be timestamped in the same fashion as the files; a description is optional, but should be included if multiple different folder-based outputs are being produced. Files should maintain a descriptive name even if placed in output folders. Remember, if the file alone makes it to someone and they view it, they need to be able to figure out at least some of the context of it!
    • GOOD: 20150803_133211_arcblast_residual_heat_plot.png
    • GOOD: 20150803_133211/arcblast_residual_heat_plot.png
    • GOOD: 20150803_133211_arcblast_heatdisp/residual_heat_plot.png
    • BAD: plot.png
    • BAD: residual_heat_plot_Aug315.png
  • What Must Be Timestamped: Any output that is used or presented in any fashion outside of exploration should be created with a timestamp. For plots, consider including the date and time of run in the lower left of the plot.
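
One way to build timestamped output paths is sketched below. The function name and its arguments are hypothetical, and the run time is fixed only so that the example is reproducible; in a real program it would come from Sys.time() in the Obtain Runtime section.

```r
# Hypothetical helper that builds a timestamped path inside an output folder
uFunc_BuildOutputPath <- function(output_dir,project_name,description,extension,run_time = Sys.time()) {

    # 24-hour clock, expressed in UTC so that runs from different sites sort consistently
    time_stamp <- format(run_time,format = "%Y%m%d_%H%M%S",tz = "UTC")
    file_name <- paste0(time_stamp,"_",project_name,"_",description,".",extension)

    file.path(output_dir,file_name)
}

# Fixed run time used only to make the example reproducible
example_run_time <- as.POSIXct("2015-08-03 13:32:11",tz = "UTC")
example_path <- uFunc_BuildOutputPath(output_dir = "outputs",
                                      project_name = "arcblast",
                                      description = "residual_heat_plot",
                                      extension = "png",
                                      run_time = example_run_time)
# example_path is "outputs/20150803_133211_arcblast_residual_heat_plot.png"
```

Computing the timestamp once per run, rather than once per file, keeps all outputs from the same execution grouped under the same stamp.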

Other Notes

  • Clean-Up and Workspace: Do not allow “cruft” to accumulate in your workspace. Remove objects once they are no longer necessary for the program to execute.
  • Comments: Liberal use is recommended. I generally place comments above code, rather than in-line. Place the # for a comment on the first spot on that row. Then, place one space before beginning your comment. For instance, # This is a proper comment
  • Attach: Just don’t use attach and you will find life easier!
  • Indices: Do not use numbers in lieu of named columns, elements, etc. This helps to prevent future code breakage if you add, remove, or reorder the elements.
  • TRUE and FALSE: Spell the full word of TRUE or FALSE out completely. Do not just use T or F.
  • Leading Zeros: Any time decimal values between -1 and 1 are used, use a leading zero. For instance, 0.50 rather than .50.
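
Two of these notes are easy to demonstrate. The sketch below uses made-up data; it deliberately shows the discouraged forms (a numeric index, and T in place of TRUE) to illustrate how they can go wrong.

```r
df_patient_visits <- data.frame(patient_id = c(101,102),visit_count = c(2,5))

# Reordering the columns is harmless for named access...
df_reordered <- df_patient_visits[,c("visit_count","patient_id")]
visits_by_name <- df_reordered[["visit_count"]]

# ...but a numeric index now silently returns a different column (patient_id)
visits_by_position <- df_reordered[[2]]

# T and F are ordinary variables that can be reassigned; TRUE and FALSE are reserved words
T <- 0
t_is_still_true <- isTRUE(T)
rm(T)
```

After the reassignment, t_is_still_true is FALSE, and removing the masking variable restores the usual meaning of T.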

So, that’s my coding convention document. A few places that I reviewed as I considered my own guide are below…enjoy!
