Data Structures in R: A Primer

Posted by Cytel

Jan 19, 2017 7:05:00 AM

 datapicture.jpg

R is on the rise in biopharma, and as we have previously discussed on the blog, it is now time for SAS programmers to get up to speedwith this popular and powerful programming language.  Indeed, one of the advantages of R is its ability to integrate with other languages like C, C++, Python and SAS. Its strong graphical capabilities allow output in PDF, JPG, PNG, and SVG formats and table outputs for LaTeX and HTML. Importantly, as an open source resource, there is a strong community around R and extensive support for users in the form of forums like R-bloggers, StackOverflow and GitHub Repository

In this blog, we’ll provide an overview of R basic structures for programmers.  

Data types

R provides support for different kinds of data of the following types:

Logical – e.g., log.var <-  c(TRUE, TRUE, FALSE)

Integer – e.g., int.var <-  c(1L, 6L, 10L); L stands for “long integer”

Double (also called numeric) – e.g., dbl.var <-  c(1, 2.5, 3.6)

Character – e.g., chr.var <-  c(“hello”, “d”, “a”)

Factors - These are categorical variables.

e.g., fact.var <-  factor(c("red", "green", "blue"))

These data types have analogues in other common programming languages.

 

Data Structures

Factor Variables

Factor variables are a structure specific to R- there is not a direct comparison available in SAS. However languages like C and C++ do have enum variables that are somewhat similar to the factor variables in R.

In this structure, variable values are treated as factors – in other words, a fixed list of values which the variable can take.   A great deal of data used in statistics has fixed values which can thus be coded as factors in the R program, and used to store either nominal or ordinal variable types.

factorvariables1.png

Vectors

Another data structure of note provided in R are vectors.  One important aspect of vectors is that all elements must be of the same type. Common operations on vectors include:

  • typeof()
  • length()
  • is function e.g., is.character(),  is.double(), is.integer()

 vector1.png

Matrices

A matrix is a collection of elements in a two-dimensional rectangular format as in the example below.

 Matrices1.png

Data frames

This is one of the most important structures and very often used by statisticians and statistical programmers when analyzing data. A data frame is a data table with multiple columns. The columns could be of different data types, which is the prime difference between data frame and matrix.

 dataframes1.png

Lists

 A list is a very useful structure which allows the programmer to put different types of structures together in one output- for example you could compile several data frames, or a combination of structures such as one matrix, one data frame etc.  

 List2.png

 

In a follow up blog, we'll go on to overview other aspects of basic syntax and variables. Sign up for blog notifications and get updates direct to your inbox.

Subscribe

Want to learn more?  Further related reading is below:

 

The Rise of R- is it time for SAS programmers to get up to speed?

Harnessing the power of R API

The evolving role of the modern statistical programmer

 With many thanks to Cytel's Namrata Deshpande and Imran Hossein.

 

 

Topics: Statistical Programming, biostatistics, SAS, Rstats, R

The Cytel blog keeps you up to speed with the latest developments in biostatistics and clinical biometrics.  Sign up for updates direct to your inbox. You can unsubscribe at any time.

 

Posts by Topic

see all

Recent Posts