Working with Data in R – Part II: Data Types & Reading-In Data

This post is Part II of our series on working with data in R (you can find Part I here).

Next Steps

In this post, we will continue to learn about data classes in R, such as Vectors, Factors, Matrices and Data Frames. We will also look at a number of ways in which data can be loaded, or “read in,” to R for analysis. Enjoy!

Assigning Objects: Vectors

When we left off on Part I of this series, we had just created a new object by “assigning” values we had specified to it. For example:

vec <- c(2, 4, 6, 8, 10)

The code above creates a new object in our R environment called ‘vec’ which consists of the values 2, 4, 6, 8, and 10. In R, this object is known as a vector. Specifically, it is a numeric vector, since it contains numbers as opposed to characters, etc.

We can call this object by name within a variety of basic R functions to learn more about it:

class(vec)  # What data class does it belong to?
## [1] "numeric"
str(vec)  # What is its "structure"?
##  num [1:5] 2 4 6 8 10
sum(vec)  # What is the sum of its values?
## [1] 30
mean(vec)  # What is the mean of its values?
## [1] 6
summary(vec)  # Displays basic information, such as a variable's distribution characteristics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       4       6       6       8      10

Character Vectors

Of course, vectors don’t need to be numeric. We can also create character vectors using a variation on the code above:

w <- c("a", "b", "c", "d")
class(w)
## [1] "character"
w
## [1] "a" "b" "c" "d"

Lists

We can also create a special kind of vector called a list, which can hold elements from different data classes (see Part I of this series for a review of data classes in R). Below is an example of a list containing numeric and character data:

x <- list(1, 2, 3, 4, "a")
class(x)
## [1] "list"
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] "a"
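
To see why lists are useful, it can help to compare list() with c(), which converts mixed input to a single class. The sketch below illustrates this; the object names ‘mixed_vec’ and ‘mixed_list’ are made up for this example:

```r
# c() coerces everything to one class; here the numbers become character strings
mixed_vec <- c(1, 2, "a")
class(mixed_vec)
## [1] "character"

# list() keeps each element's own class
mixed_list <- list(1, 2, "a")
class(mixed_list[[1]])  # double square brackets extract a single element
## [1] "numeric"
class(mixed_list[[3]])
## [1] "character"
```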

Factors

Another special kind of vector in R is the factor. Factor variables are used to represent categorical data and can be ordered or unordered in nature. Factors allow R to make important distinctions and treat categorical data (such as Male/Female, Smoker/Non-Smoker, etc) properly in a wide variety of analytical procedures. In R, factors can be thought of as integer variables where each integer has a label.

We can create a factor variable explicitly by using the factor() function below:

f <- factor(c("yes", "yes", "no", "yes"))
f
## [1] yes yes no  yes
## Levels: no yes

One of the key features that allows R to treat factors as true categorical variables is the concept of levels. You can think of each level as a distinct category. Our sample factor contains 2 levels: “yes” and “no.” More complicated factor variables can contain dozens, or even hundreds, of distinct levels.
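
As a quick sketch of the “integers with labels” idea, we can ask R for the levels of our factor and for the integer codes stored behind them (the factor is re-created here so the example runs on its own):

```r
f <- factor(c("yes", "yes", "no", "yes"))

levels(f)       # the distinct categories, sorted alphabetically by default
## [1] "no"  "yes"
nlevels(f)      # how many levels are there?
## [1] 2
as.integer(f)   # the underlying integer codes: "no" is 1, "yes" is 2
## [1] 2 2 1 2
table(f)        # how many observations fall in each level?
## f
##  no yes 
##   1   3
```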

The Matrix

Next up in our exploration of data classes in R is the Matrix. Matrices are yet another special type of vector. Here, the key difference is the addition of a dimension attribute. Although this might seem foreign at first, you are probably already very familiar with working with similar data in Microsoft Excel!

Below, we create one matrix object (‘M’) by first specifying its values (1 through 6) and its dimensions (‘nrow’ for number of rows and ‘ncol’ for number of columns) inside the matrix() function:

M <- matrix(1:6, nrow = 2, ncol = 3)
M
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

By default, matrices are constructed column-wise: data will be populated down the first column, then down the second column, and so on. We can also build matrices ourselves by binding vectors together as columns or as rows with the cbind() or rbind() functions, as below:

x <- 1:3
y <- 10:12
cbind(x,y)
##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x,y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12

As we can see, the cbind() function forces R to build the matrix column-wise, and the rbind() function builds a matrix row-wise.
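
If we would rather keep using matrix() itself, its ‘byrow’ argument offers another way to fill a matrix row-wise. A minimal sketch (the name ‘M_byrow’ is just for illustration):

```r
# Same values as 'M', but filled across each row instead of down each column
M_byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
M_byrow
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
dim(M_byrow)  # rows, then columns
## [1] 2 3
```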

Attributes

As our R objects get more and more complex, it can also be handy to call the attributes() function to keep track of the various characteristics of the data we’re working with. Attributes can include names for named objects, dimensions for things like matrices and data frames (more on those below), and class. An object’s length is reported separately by the length() function. For example:

attributes(M)  # The Attributes of our Matrix
## $dim
## [1] 2 3
attributes(f)  # The Attributes of our Factor
## $levels
## [1] "no"  "yes"
## 
## $class
## [1] "factor"
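
Names work the same way. Here is a small sketch using a made-up named vector (‘scores’ is hypothetical and not part of our data); note that the length is queried with length() rather than appearing in the attributes() output:

```r
scores <- c(math = 90, art = 85)  # a hypothetical named numeric vector
attributes(scores)
## $names
## [1] "math" "art"
length(scores)
## [1] 2
```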

Missing Data

So far, all the examples we’ve looked at have been complete, that is, without any missing values. Unfortunately, data in the real world is rarely so well-behaved. As analysts, we will frequently encounter things like blanks, nulls, or N/A values. R has several special values for results that are missing or undefined. For example:

Some calculations can result in either positive or negative infinity:

5/0
## [1] Inf
-14/0
## [1] -Inf

Other calculations can lead to results that aren’t numbers at all. These are represented in R as NaN (“not a number”):

0/0
## [1] NaN
Inf/Inf
## [1] NaN

More commonly, missing data will be represented as NA. It’s important to remember there is a distinction between NaN and NA: NA marks a value that is missing, while NaN marks the result of an undefined calculation, and the two are not treated in the same way by R.
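
One way to see the difference is to compare the is.na() and is.nan() functions on a small made-up vector (‘vals’ is just an illustrative name):

```r
vals <- c(1, NA, NaN)

is.na(vals)   # TRUE for both NA and NaN
## [1] FALSE  TRUE  TRUE
is.nan(vals)  # TRUE only for NaN
## [1] FALSE FALSE  TRUE
```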

We can create a simple vector with a missing (NA) value, and then use the is.na() function to check for any missing values:

y <- c(1, NA, 3, 4, NA)
y
## [1]  1 NA  3  4 NA
is.na(y)
## [1] FALSE  TRUE FALSE FALSE  TRUE

If we want to exclude any missing values found, we can use a technique called sub-setting, which lets us specify criteria so that only certain parts or elements of an R object are returned. Sub-setting is typically done inside square brackets. For example:

y[!is.na(y)] # Remember from Part I, the "!" means "NOT", so here we want elements of 'y' that are NOT NA.
## [1] 1 3 4

NOTE: Certain R functions will not return a useful result if missing values are present, so be certain to check your data for missing values and understand their impact on your work. The mean() function is a classic example of this:

mean(y)
## [1] NA
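
Many summary functions, mean() included, accept an ‘na.rm’ argument that tells R to drop missing values before calculating. A short sketch using the same vector:

```r
y <- c(1, NA, 3, 4, NA)

sum(is.na(y))          # how many values are missing?
## [1] 2
mean(y, na.rm = TRUE)  # drop the NAs first: (1 + 3 + 4) / 3
## [1] 2.666667
```

Whether dropping missing values is appropriate depends on why they are missing, so this is a convenience rather than a default recommendation.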

Duplicates and Uniqueness

Another common task is to evaluate data for duplicate values, as well as unique / non-unique values. R treats these in different ways, and for good reason.

One possible scenario is wanting to ensure that a data file does not contain any duplicates (such as duplicate ID#s, names, or email addresses). R can check this for us using the duplicated() function. For example, imagine we had the following column of ID#s in a data element ‘D’:

D
##       ID
##  [1,]  1
##  [2,]  2
##  [3,]  3
##  [4,]  1
##  [5,]  7
##  [6,]  8
##  [7,]  9
##  [8,]  7
##  [9,]  8
## [10,] 10

where the 4th, 8th, and 9th rows represent duplicate records. We can call the duplicated() function to identify these rows for us (we could then use sub-setting to omit them from our data set):

duplicated(D)
##          ID
##  [1,] FALSE
##  [2,] FALSE
##  [3,] FALSE
##  [4,]  TRUE
##  [5,] FALSE
##  [6,] FALSE
##  [7,] FALSE
##  [8,]  TRUE
##  [9,]  TRUE
## [10,] FALSE
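
As a sketch of the sub-setting step mentioned above, assume the same ID#s are stored in a plain vector called ‘ids’ (a hypothetical name used here instead of the data element ‘D’ so the example runs on its own):

```r
ids <- c(1, 2, 3, 1, 7, 8, 9, 7, 8, 10)

duplicated(ids)        # TRUE marks a value that has already appeared earlier
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
ids[!duplicated(ids)]  # keep only the first occurrence of each ID
## [1]  1  2  3  7  8  9 10
anyDuplicated(ids)     # position of the first duplicate (0 would mean none)
## [1] 4
```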

If, however, we wanted to identify unique values, for example to turn a column of States into a distinct list, we could use R’s unique() function to display each distinct value only once:

p
##      State
## [1,] "AL" 
## [2,] "AK" 
## [3,] "AZ" 
## [4,] "AK" 
## [5,] "CA" 
## [6,] "AL" 
## [7,] "AZ"
unique(p)
##      State
## [1,] "AL" 
## [2,] "AK" 
## [3,] "AZ" 
## [4,] "CA"
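
If we also want to know how often each value occurs, table() is a natural companion to unique(). A small sketch, assuming the States are held in a plain character vector called ‘states’ (a hypothetical name):

```r
states <- c("AL", "AK", "AZ", "AK", "CA", "AL", "AZ")

unique(states)          # each distinct value once, in order of first appearance
## [1] "AL" "AK" "AZ" "CA"
length(unique(states))  # how many distinct values are there?
## [1] 4
table(states)           # how many times does each value occur?
## states
## AK AL AZ CA 
##  2  2  2  1
```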

Reading-in External Data

Ok, this is where things get exciting! Loading data, or “reading in” data as it’s known in R lingo, is one of the most versatile and necessary skills to learn when working with data in R. There is a broad range of R functions for reading in all manner of data (including tables, txt files, Excel spreadsheets, json objects, and xml files), so there’s no way we can cover everything here (Google is definitely your friend).

For our simple example, we’ll cover the basics using a data set of Portuguese student performance on language tests (available for you to download and try yourself from the very cool Machine Learning Data Repository at UC Irvine here).

R allows us to seamlessly download data directly from the internet and read it into our R session. To do that, we first need to make sure our Working Directory has been set so that the download ends up in the correct place. For example, something like this:

setwd("C:/Users/MyUsername/Documents/R_practice")

Next, we’ll:

1. Point R to the URL containing our data;

2. Ask R to download the .zip folder containing our data; and

3. Unzip the folder in our current Working Directory

dataset_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"
download.file(dataset_url, "student.zip")
unzip("student.zip")
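
If you expect to re-run your script, one option is to wrap the download in a small helper so the file is only fetched when it is not already on disk. The function below is a hypothetical sketch (the name ‘fetch_if_missing’ is ours, not part of R):

```r
# Hypothetical helper: download a file only when it is not already on disk
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes in binary mode, which matters for .zip files on Windows
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)  # return the file path without printing it
}

# Usage (not run here because it needs an internet connection):
# fetch_if_missing(dataset_url, "student.zip")
# unzip("student.zip", list = TRUE)  # list the archive's contents without extracting
```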

Now we’re ready to read in our student performance data using the read.csv() function. This function allows users to specify a wide variety of parameters relating to the file (for more information on this or any R function, type a question mark followed by the function name into your R console: ?read.csv), but since our file is fairly simple, we only need to include a few arguments:

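student <- read.csv("student-por.csv", header = TRUE, sep = ";",
                    stringsAsFactors = TRUE)  # since R 4.0.0, text columns are read as character unless we ask for factors

In the code above, we’ve pointed R to the csv file we want to read in, told R that our file contains a header row and that columns are separated by semi-colons, and asked for text columns to be stored as factors (this was the default before R 4.0.0, and it is what the summaries later in this post assume). Finally, we’ve named the new object containing our data file “student”.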

Congratulations, you’ve created your first data frame in R!

Checking Our Work

Now that we have our data loaded in, how do we know whether or not we did it right? There are a few quick ways to check our work so far:

First, we can review the description of our file available from the UCI Repository: https://archive.ics.uci.edu/ml/datasets/Student+Performance#. We can see that our file should contain 649 Instances (records) and 33 Attributes (columns). Further, the names and descriptions of each column as well as the data each column should contain are listed in the “Attribute Information” section of that page.

We can verify that our R object “student” contains the same information by looking in RStudio’s Global Environment pane and making sure our object is listed with “649 obs. of 33 variables” (“obs.” meaning observations). We can also click on the small dropdown arrow next to the object name to display the list of variables. Here, you should see the variable names “school” through “G3” listed along with the data class of each column and a few sample data points.
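
The same checks can be run from the console. So that the sketch below runs on its own, it reads two made-up rows in the same semicolon-separated layout (they are not the real data, and ‘demo’ is a hypothetical name):

```r
# Two made-up rows in the same semicolon-separated layout (not the real data)
sample_text <- "school;sex;age\nGP;F;18\nMS;M;17"
demo <- read.csv(text = sample_text, header = TRUE, sep = ";")

dim(demo)    # number of rows, then number of columns
## [1] 2 3
names(demo)  # the column names taken from the header row
## [1] "school" "sex"    "age"
str(demo)    # the class and first few values of every column
```

Run against the real data, dim(student) should report 649 rows and 33 columns, matching the UCI description.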

We can also get a good snapshot of our data by calling the summary() function on either the entire data set:

summary(student)
##  school   sex          age        address famsize   Pstatus
##  GP:423   F:383   Min.   :15.00   R:197   GT3:457   A: 80  
##  MS:226   M:266   1st Qu.:16.00   U:452   LE3:192   T:569  
##                   Median :17.00                            
##                   Mean   :16.74                            
##                   3rd Qu.:18.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home :135   at_home : 42  
##  1st Qu.:2.000   1st Qu.:1.000   health  : 48   health  : 23  
##  Median :2.000   Median :2.000   other   :258   other   :367  
##  Mean   :2.515   Mean   :2.307   services:136   services:181  
##  3rd Qu.:4.000   3rd Qu.:3.000   teacher : 72   teacher : 36  
##  Max.   :4.000   Max.   :4.000                                
##         reason      guardian     traveltime      studytime    
##  course    :285   father:153   Min.   :1.000   Min.   :1.000  
##  home      :149   mother:455   1st Qu.:1.000   1st Qu.:1.000  
##  other     : 72   other : 41   Median :1.000   Median :2.000  
##  reputation:143                Mean   :1.569   Mean   :1.931  
##                                3rd Qu.:2.000   3rd Qu.:2.000  
##                                Max.   :4.000   Max.   :4.000  
##     failures      schoolsup famsup     paid     activities nursery  
##  Min.   :0.0000   no :581   no :251   no :610   no :334    no :128  
##  1st Qu.:0.0000   yes: 68   yes:398   yes: 39   yes:315    yes:521  
##  Median :0.0000                                                     
##  Mean   :0.2219                                                     
##  3rd Qu.:0.0000                                                     
##  Max.   :3.0000                                                     
##  higher    internet  romantic      famrel         freetime   
##  no : 69   no :151   no :410   Min.   :1.000   Min.   :1.00  
##  yes:580   yes:498   yes:239   1st Qu.:4.000   1st Qu.:3.00  
##                                Median :4.000   Median :3.00  
##                                Mean   :3.931   Mean   :3.18  
##                                3rd Qu.:5.000   3rd Qu.:4.00  
##                                Max.   :5.000   Max.   :5.00  
##      goout            Dalc            Walc          health     
##  Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1             G2              G3       
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00

or on a single variable by specifying the name of the object and variable, separated by the dollar sign:

summary(student$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   16.00   17.00   16.74   18.00   22.00

If you prefer to “see” your data as you would in Microsoft Excel, RStudio offers the ability to do so: simply click on the object name “student” in the Global Environment pane to open a new tab which displays your data in a familiar tabular format (this can also be accomplished by typing View(student) directly into your R console).

Re-Formatting Variables

But we’re not quite done yet! A close inspection of the “Attribute Information” on the UCI Repository site tells us that a few variables in our data set were not read in with the most appropriate data class. For example, the attribute description tells us the ‘Medu’ and ‘Fedu’ columns contain numbers 0-4 that actually represent the level of education attained by the student’s mother and father, and are therefore categorical in nature. However, our R object indicates that ‘Medu’ and ‘Fedu’ were read in as integers. We will need to re-format these and a few other similar examples.

As we learned above, in R, factors can be thought of as integer variables where each integer has a label. Conveniently, the attribute descriptions provide us with the labels to go with each integer. All we need to do is tell R which variable(s) to change, and specify the correct labelling for each value. This is demonstrated below:

student$Medu_f <- factor(student$Medu, 
                         labels = c("none", 
                                    "primary education (4th grade)", 
                                    "5th to 9th grade", 
                                    "secondary education", 
                                    "higher education"))

In the code above, we’ve created a new column in our data frame called “Medu_f”, and using the factor() function, we have instructed R to copy the existing “Medu” column and apply the five text labels for each value from 0-4. We can check our work to verify our new factor variable matches the original:

table(student$Medu)
## 
##   0   1   2   3   4 
##   6 143 186 139 175
table(student$Medu_f)
## 
##                          none primary education (4th grade) 
##                             6                           143 
##              5th to 9th grade           secondary education 
##                           186                           139 
##              higher education 
##                           175

Since the counts from the original variable and our new factor variable match exactly, we know we’ve done it correctly.
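
One caveat: when only ‘labels’ is supplied, factor() pairs the labels with the sorted values that actually occur in the data. That works here because every value from 0 to 4 appears in “Medu”. A more defensive sketch spells out the ‘levels’ as well; the vector ‘medu’ below is made up for illustration and deliberately has no 3s:

```r
medu <- c(0, 4, 2, 2, 1)  # made-up values; note that 3 never occurs
edu_labels <- c("none",
                "primary education (4th grade)",
                "5th to 9th grade",
                "secondary education",
                "higher education")

# Naming the levels explicitly ties each label to a specific value
medu_f <- factor(medu, levels = 0:4, labels = edu_labels)

nlevels(medu_f)          # all five categories are kept, even the unused one
## [1] 5
as.character(medu_f[2])  # the value 4 maps to the fifth label
## [1] "higher education"
table(medu, medu_f)      # each original value should pair with exactly one label
```

Without the ‘levels’ argument, this particular call would stop with an error, because four observed values cannot be matched to five labels.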

We can now apply the same logic to the “Fedu”, “traveltime”, and “studytime” variables using the appropriate labelling for each:

student$Fedu_f <- factor(student$Fedu,
                         labels = c("none", 
                                    "primary education (4th grade)", 
                                    "5th to 9th grade", 
                                    "secondary education", 
                                    "higher education"))
student$traveltime_f <- factor(student$traveltime,
                               labels = c("<15 min",
                                          "15 to 30 min",
                                          "30 min to 1 hour",
                                          ">1 hour"))
student$studytime_f <- factor(student$studytime,
                              labels = c("<2 hours",
                                         "2 to 5 hours",
                                         "5 to 10 hours",
                                         ">10 hours"))

We also need to re-format the variables “famrel”, “freetime”, “goout”, “Dalc”, “Walc”, and “health”, but since these all represent Likert-scale responses (“very bad” to “excellent”), we can re-format them as ordered factor variables (there is some disagreement about whether or not this is the recommended course of action; here we’re presenting it for demonstration purposes). To do this, we simply need to add the ordered argument:

student$famrel_f <- factor(student$famrel,
                           labels = c("very bad",
                                      "bad",
                                      "neutral",
                                      "good",
                                      "excellent"),
                           ordered = TRUE)
student$freetime_f <- factor(student$freetime,
                             labels = c("very low",
                                        "low",
                                        "neutral",
                                        "high",
                                        "very high"),
                             ordered = TRUE)
student$goout_f <- factor(student$goout,
                          labels = c("very low",
                                     "low",
                                     "neutral",
                                     "high",
                                     "very high"),
                          ordered = TRUE)
student$Dalc_f <- factor(student$Dalc,
                         labels = c("very low",
                                    "low",
                                    "neutral",
                                    "high",
                                    "very high"),
                         ordered = TRUE)
student$Walc_f <- factor(student$Walc,
                         labels = c("very low",
                                    "low",
                                    "neutral",
                                    "high",
                                    "very high"),
                         ordered = TRUE)
student$health_f <- factor(student$health,
                           labels = c("very bad",
                                      "bad",
                                      "neutral",
                                      "good",
                                      "very good"),
                           ordered = TRUE)
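
Since four of these variables share the same labels, the repetition could also be handled with a loop. The sketch below is one possible approach rather than the only one; it uses a tiny made-up data frame (‘student_demo’, with invented values) so it runs on its own:

```r
# A tiny made-up data frame standing in for the real 'student' data
student_demo <- data.frame(freetime = c(3, 5, 1),
                           goout    = c(2, 2, 4),
                           Dalc     = c(1, 1, 5),
                           Walc     = c(1, 3, 5))

low_high <- c("very low", "low", "neutral", "high", "very high")

# Create an ordered "_f" version of each variable that shares these labels
for (v in c("freetime", "goout", "Dalc", "Walc")) {
  student_demo[[paste0(v, "_f")]] <- factor(student_demo[[v]],
                                            levels = 1:5,
                                            labels = low_high,
                                            ordered = TRUE)
}

# Ordered factors support comparisons against a level
student_demo$freetime_f >= "high"
## [1] FALSE  TRUE FALSE
```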

There you have it – usable, well-formatted data plucked directly from the internet and ready for analysis in R! We have covered a lot, but the fun is just getting started – stay tuned for Part III of our series, where we’ll cover data exploration, manipulation, and visualization!

Below is a sneak peek of what’s to come!

[Figure: Andy Blog Chart]

To learn more about how the BWF Insight team uses analytics tools like R to help fundraisers all around the world, check us out on the web at BWF Insight!

 

 

 
