This post is Part II of our series on working with data in R (you can find Part I here).
Next Steps
In this post, we will continue to learn about data classes in R, such as Vectors, Factors, Matrices and Data Frames. We will also look at a number of ways in which data can be loaded (or “read in”) to R for analysis. Enjoy!
Assigning Objects: Vectors
When we left off on Part I of this series, we had just created a new object by “assigning” values we had specified to it. For example:
vec <- c(2, 4, 6, 8, 10)
The code above creates a new object in our R environment called ‘vec’ which consists of the values 2, 4, 6, 8, and 10. In R, this object is known as a vector. Specifically, it is a numeric vector, since it contains numbers as opposed to characters or other data types.
We can call this object by name within a variety of basic R functions to learn more about it:
class(vec) # What data class does it belong to?
## [1] "numeric"
str(vec) # What is its "structure"?
## num [1:5] 2 4 6 8 10
sum(vec) # What is the sum of its values?
## [1] 30
mean(vec) # What is the mean of its values?
## [1] 6
summary(vec) # Displays basic information, such as a variable's distribution characteristics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       2       4       6       6       8      10
Character Vectors
Of course, vectors don’t need to be numeric. We can also create character vectors using a variation on the code above:
w <- c("a", "b", "c", "d")
class(w)
## [1] "character"
w
## [1] "a" "b" "c" "d"
Lists
We can also create a special kind of vector called a list, which can contain elements from different data classes (see Part I of this series for a review of data classes in R). Below is an example of a list containing numeric and character data:
x <- list(1, 2, 3, 4, "a")
class(x)
## [1] "list"
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] "a"
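To see why a list is needed for mixed data, it can help to compare it with c(), which coerces everything to a single class. The short sketch below uses made-up object names (‘mixed_vec’ and ‘mixed_list’) purely for illustration, and also shows double-bracket indexing for pulling a single element out of a list:

```r
# c() builds an atomic vector, so mixed inputs are coerced to one class
mixed_vec <- c(1, 2, "a")
class(mixed_vec)         # "character": the numbers became "1" and "2"

# list() keeps each element's own class
mixed_list <- list(1, 2, "a")
class(mixed_list[[1]])   # "numeric"
class(mixed_list[[3]])   # "character"

# Double brackets return the element itself; single brackets return a list
mixed_list[[3]]          # "a"
class(mixed_list[3])     # "list"
```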
Factors
Another special kind of vector in R is the factor. Factor variables are used to represent categorical data and can be ordered or unordered in nature. Factors allow R to make important distinctions and treat categorical data (such as Male/Female, Smoker/Non-Smoker, etc) properly in a wide variety of analytical procedures. In R, factors can be thought of as integer variables where each integer has a label.
We can create a factor variable explicitly by using the factor() function, as below:
f <- factor(c("yes", "yes", "no", "yes"))
f
## [1] yes yes no yes
## Levels: no yes
One of the key features that allows R to treat factors as true categorical variables is their levels. You can think of each level as a distinct category. Our sample factor contains two levels: “no” and “yes” (by default, R sorts levels alphabetically, which is why “no” is listed first in the output above). More complicated factor variables can contain dozens – even hundreds – of distinct levels.
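The idea of a factor as “integers with labels” can be checked directly. The sketch below rebuilds the same factor and inspects its levels and underlying integer codes; the second object, ‘f2’, is a hypothetical variation added here to show how the level order can be set by hand:

```r
f <- factor(c("yes", "yes", "no", "yes"))

levels(f)        # the distinct categories: "no" "yes"
nlevels(f)       # how many levels there are: 2
as.integer(f)    # the underlying integer codes: 2 2 1 2
table(f)         # counts per level: no = 1, yes = 3

# Levels can be supplied explicitly to control their order
f2 <- factor(c("yes", "yes", "no", "yes"), levels = c("yes", "no"))
levels(f2)       # "yes" "no"
as.integer(f2)   # 1 1 2 1
```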
The Matrix
Next up in our exploration of data classes in R is the Matrix. Matrices are yet another special type of vector. Here, the key difference is the addition of a dimension attribute. Although this might seem foreign at first, you are probably already very familiar with similar row-and-column data from Microsoft Excel!
Below, we create a matrix object (‘M’) by specifying its values (1 through 6) and its dimensions (‘nrow’ for number of rows and ‘ncol’ for number of columns) inside the matrix() function:
M <- matrix(1:6, nrow = 2, ncol = 3)
M
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
By default, matrices are constructed column-wise – that is, data will be populated down the first column, then down the second column, and so on. We can also build matrices from existing vectors by column- or row-binding them with the cbind() or rbind() functions, as below:
x <- 1:3
y <- 10:12
cbind(x,y)
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x,y)
## [,1] [,2] [,3]
## x 1 2 3
## y 10 11 12
As we can see, the cbind() function forces R to build the matrix column-wise, and the rbind() function builds a matrix row-wise.
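The matrix() function itself can also fill values row by row through its ‘byrow’ argument. The sketch below is a small illustration using a made-up object name (‘M_byrow’), and also shows how square brackets pick out rows, columns, and single cells:

```r
# Same values as 'M', but filled across rows instead of down columns
M_byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
M_byrow         # row 1 holds 1 2 3, row 2 holds 4 5 6

dim(M_byrow)    # 2 3 (rows, then columns)
M_byrow[2, 3]   # the element in row 2, column 3: 6
M_byrow[1, ]    # the whole first row: 1 2 3
M_byrow[, 1]    # the whole first column: 1 4
```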
Attributes
As our R objects get more and more complex, it can also be handy to call the attributes() function to keep track of the various characteristics of the data we’re working with. Attributes in R can include names for named objects, dimensions for things like matrices and data frames (more on those below), class, or the length of the object you’re working with. For example:
attributes(M) # The Attributes of our Matrix
## $dim
## [1] 2 3
attributes(f) # The Attributes of our Factor
## $levels
## [1] "no" "yes"
##
## $class
## [1] "factor"
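A couple of further cases may help round out the picture. The sketch below uses a hypothetical named vector (‘scores’, with made-up values) to show the names attribute, and re-creates the cbind() example to show that column names are stored as an attribute too:

```r
# A named vector carries a "names" attribute
scores <- c(math = 90, reading = 85)   # hypothetical example values
attributes(scores)                     # $names: "math" "reading"

# cbind() keeps the source vector names as column names (dimnames)
x <- 1:3
y <- 10:12
attributes(cbind(x, y))                # $dim: 3 2, plus $dimnames

# A plain vector with no names has no attributes to report
attributes(c(2, 4, 6))                 # NULL
```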
Missing Data
So far, all the examples we’ve looked at have been complete – that is, without any missing values. Unfortunately, data in the real world is rarely so well-behaved. As analysts, we will frequently encounter things like blanks, nulls, or N/A values. R has several special values that represent missing or undefined results. For example:
Some calculations can result in either positive or negative infinity:
5/0
## [1] Inf
-14/0
## [1] -Inf
Other calculations can lead to results that aren’t numbers at all. These are represented in R as NaN (“not a number”):
0/0
## [1] NaN
Inf/Inf
## [1] NaN
More commonly, missing data will be represented as NA. It’s important to remember there is a distinction between NaN and NA – these are not the same thing and are not treated in the same way by R.
We can create a simple vector with a missing (NA) value, and then use the is.na() function to check for any missing values:
y <- c(1, NA, 3, 4, NA)
y
## [1] 1 NA 3 4 NA
is.na(y)
## [1] FALSE TRUE FALSE FALSE TRUE
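The difference between NA and NaN shows up when we test for them. As a small sketch (the vector ‘z’ is made up for illustration), is.na() flags both NA and NaN, while is.nan() flags only NaN:

```r
z <- c(1, NA, NaN, Inf)

is.na(z)        # FALSE  TRUE  TRUE FALSE  (NaN also counts as missing)
is.nan(z)       # FALSE FALSE  TRUE FALSE  (NA is not NaN)
is.finite(z)    # TRUE FALSE FALSE FALSE   (only ordinary numbers pass)

sum(is.na(z))   # how many values are NA or NaN: 2
anyNA(z)        # TRUE, since at least one value is missing
```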
If we want to exclude any missing values, we can use a technique called sub-setting, which lets us specify criteria so that only certain parts or elements of an R object are returned. Sub-setting is typically done inside square brackets. For example:
y[!is.na(y)] # Remember from Part I, the "!" means "NOT", so here we want elements of 'y' that are NOT NA.
## [1] 1 3 4
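Sub-setting works with positions as well as logical tests. The sketch below shows a few common forms on the same vector; note that a logical test on data containing NA carries the NA values through unless we guard against them:

```r
y <- c(1, NA, 3, 4, NA)

y[1:2]                 # by position: 1 NA
y[-1]                  # everything except the first element: NA 3 4 NA
y[y > 2]               # a logical test keeps the NAs: NA 3 4 NA
y[which(y > 2)]        # which() returns only the TRUE positions: 3 4
y[!is.na(y) & y > 2]   # same result with an explicit NA check: 3 4
```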
NOTE: Certain R functions will return NA rather than a usable result if missing values are present, so be certain to check your data for missing values and understand their impact on your work. The mean() function is a classic example of this:
mean(y)
## [1] NA
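Many summary functions, including mean() and sum(), accept an ‘na.rm’ argument that drops missing values before calculating. A minimal sketch using the same vector:

```r
y <- c(1, NA, 3, 4, NA)

mean(y, na.rm = TRUE)   # (1 + 3 + 4) / 3, roughly 2.67
sum(y, na.rm = TRUE)    # 8
mean(y[!is.na(y)])      # the same mean, using sub-setting instead
sum(is.na(y))           # how many values were dropped: 2
```

Whether dropping missing values is appropriate depends on the analysis, so it is worth counting them first, as in the last line.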
Duplicates and Uniqueness
Another common task is to evaluate data for duplicate values, as well as unique / non-unique values. R treats these in different ways, and for good reason.
One possible scenario is wanting to ensure that a data file does not contain any duplicates (such as duplicate ID#s, names, or email addresses). R can check this for us using the duplicated() function. For example, imagine we had the following column of ID#s in a data element ‘D’:
D
## ID
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 1
## [5,] 7
## [6,] 8
## [7,] 9
## [8,] 7
## [9,] 8
## [10,] 10
where the 4th, 8th, and 9th rows represent duplicate records. We can call the duplicated() function to identify these rows for us (we could then use sub-setting to omit them from our data set):
duplicated(D)
## ID
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] TRUE
## [5,] FALSE
## [6,] FALSE
## [7,] FALSE
## [8,] TRUE
## [9,] TRUE
## [10,] FALSE
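The sub-setting step mentioned above might look like the sketch below. It assumes the same ten ID values stored as a plain vector (‘ids’) and as a one-column data frame (‘df’); both names are made up for illustration:

```r
ids <- c(1, 2, 3, 1, 7, 8, 9, 7, 8, 10)   # same values as the ID column of 'D'

which(duplicated(ids))   # positions of the duplicates: 4 8 9
anyDuplicated(ids)       # position of the first duplicate: 4 (0 if none)
ids[!duplicated(ids)]    # keep first occurrences only: 1 2 3 7 8 9 10

# The same idea applied to the rows of a data frame
df <- data.frame(ID = ids)
df_clean <- df[!duplicated(df$ID), , drop = FALSE]
nrow(df_clean)           # 7 rows remain
```

Note that duplicated() marks only the second and later occurrences, so the first record for each ID is kept.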
If, however, we wanted to identify unique values, for example from a list of States so that we had a distinct list, we could use R’s unique() function to display each distinct value only once:
p
## State
## [1,] "AL"
## [2,] "AK"
## [3,] "AZ"
## [4,] "AK"
## [5,] "CA"
## [6,] "AL"
## [7,] "AZ"
unique(p)
## State
## [1,] "AL"
## [2,] "AK"
## [3,] "AZ"
## [4,] "CA"
Reading-in External Data
Ok, this is where things get exciting! Loading data, or “reading in” data as it’s known in R lingo, is one of the most versatile and necessary skills to learn when working with data in R. There is a broad range of R functions for reading in all manner of data (including tables, txt files, Excel spreadsheets, JSON objects, and XML files), so there’s no way we can cover everything here (Google is definitely your friend).
For our simple example, we’ll cover the basics using a data set of Portuguese student performance on language tests (available for you to download and try yourself from the very cool Machine Learning Data Repository at UC Irvine here).
R allows us to seamlessly download data directly from the internet and read it into our R session. To do that, we first need to make sure our Working Directory has been set so that the download ends up in the correct place. For example, something like this:
setwd("C:/Users/MyUsername/Documents/R_practice")
Next, we’ll:
1. Point R to the URL containing our data;
2. Ask R to download the .zip folder containing our data; and
3. Unzip the folder in our current Working Directory
dataset_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"
download.file(dataset_url, "student.zip")
unzip("student.zip")
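If you expect to re-run your script, the three steps above can be wrapped in a small function that skips the download when the file is already present. The sketch below is one possible way to do this; the function name ‘fetch_zip’ and its arguments are hypothetical, not part of base R:

```r
# Hypothetical helper: download a zip file only if it is not already present,
# then extract it and return the names of the files it contained.
fetch_zip <- function(url, destfile, exdir = ".") {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the file in binary mode, which matters on Windows
    download.file(url, destfile, mode = "wb")
  }
  # list = TRUE reports the archive contents without extracting anything
  contents <- unzip(destfile, list = TRUE)
  unzip(destfile, exdir = exdir)
  contents$Name
}

# Example call (needs an internet connection, so it is left commented out):
# fetch_zip("https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip",
#           "student.zip")
```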
Now we’re ready to read in our student performance data using the read.csv() function. This function allows users to specify a wide variety of parameters relating to the file (for more information on this or any R function, type a question mark followed by the function name into your R console: ?read.csv), but since our file is fairly simple, we only need to include a few arguments:
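student <- read.csv("student-por.csv", header = TRUE, sep = ";", stringsAsFactors = TRUE)
In the code above, we’ve pointed R to the csv file we want to read in, told R that our file contains a header row, and that columns are separated by semi-colons. We’ve also set stringsAsFactors = TRUE so that text columns are read in as factors; this was the default behavior before R 4.0.0, and it is what produces the category counts in the summary output shown later in this post. Finally, we’ve named the new object containing our data file “student”.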
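To see what the ‘sep’ argument does without downloading anything, we can write a tiny semicolon-separated file ourselves and read it back. Everything in the sketch below (the temporary file and the objects ‘toy’ and ‘wrong’) is made up for illustration:

```r
# Write a tiny semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("school;sex;age",
             "GP;F;18",
             "MS;M;17"), tmp)

# Reading with the correct separator gives three columns
toy <- read.csv(tmp, header = TRUE, sep = ";")
dim(toy)       # 2 rows, 3 columns
names(toy)     # "school" "sex" "age"

# With the default separator (a comma), each line lands in a single column
wrong <- read.csv(tmp, header = TRUE)
dim(wrong)     # 2 rows, 1 column

unlink(tmp)    # remove the temporary file
```

A data frame with only one column after reading is often a sign that the separator was not what R expected.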
Congratulations, you’ve created your first data frame in R!
Checking Our Work
Now that we have our data loaded in, how do we know whether or not we did it right? There are a few quick ways to check our work so far:
First, we can review the description of our file available from the UCI Repository: https://archive.ics.uci.edu/ml/datasets/Student+Performance#. We can see that our file should contain 649 Instances (records) and 33 Attributes (columns). Further, the names and descriptions of each column as well as the data each column should contain are listed in the “Attribute Information” section of that page.
We can verify that our R object “student” contains the same information by looking in RStudio’s Global Environment pane and making sure our object is listed with “649 obs. of 33 variables” (“obs.” meaning observations). We can also click on the small dropdown arrow next to the object name to display the list of variables. Here, you should see the variable names “school” through “G3” listed along with the data class of each column and a few sample data points.
We can also get a good snapshot of our data by calling the summary() function on either the entire data set:
summary(student)
## school sex age address famsize Pstatus
## GP:423 F:383 Min. :15.00 R:197 GT3:457 A: 80
## MS:226 M:266 1st Qu.:16.00 U:452 LE3:192 T:569
## Median :17.00
## Mean :16.74
## 3rd Qu.:18.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home :135 at_home : 42
## 1st Qu.:2.000 1st Qu.:1.000 health : 48 health : 23
## Median :2.000 Median :2.000 other :258 other :367
## Mean :2.515 Mean :2.307 services:136 services:181
## 3rd Qu.:4.000 3rd Qu.:3.000 teacher : 72 teacher : 36
## Max. :4.000 Max. :4.000
## reason guardian traveltime studytime
## course :285 father:153 Min. :1.000 Min. :1.000
## home :149 mother:455 1st Qu.:1.000 1st Qu.:1.000
## other : 72 other : 41 Median :1.000 Median :2.000
## reputation:143 Mean :1.569 Mean :1.931
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000
## failures schoolsup famsup paid activities nursery
## Min. :0.0000 no :581 no :251 no :610 no :334 no :128
## 1st Qu.:0.0000 yes: 68 yes:398 yes: 39 yes:315 yes:521
## Median :0.0000
## Mean :0.2219
## 3rd Qu.:0.0000
## Max. :3.0000
## higher internet romantic famrel freetime
## no : 69 no :151 no :410 Min. :1.000 Min. :1.00
## yes:580 yes:498 yes:239 1st Qu.:4.000 1st Qu.:3.00
## Median :4.000 Median :3.00
## Mean :3.931 Mean :3.18
## 3rd Qu.:5.000 3rd Qu.:4.00
## Max. :5.000 Max. :5.00
## goout Dalc Walc health
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
or on a single variable by specifying the name of the object and variable, separated by the dollar sign:
summary(student$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   15.00   16.00   17.00   16.74   18.00   22.00
If you prefer to “see” your data as you would in Microsoft Excel, RStudio offers the ability to do so: simply click on the object name “student” in the Global Environment pane to open a new tab which displays your data in a familiar tabular format (this can also be accomplished by typing View(student) directly into your R console).
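The same checks can be made from the console, which is handy outside of RStudio. The sketch below runs them on a small made-up data frame (‘toy’) so that it works on its own; on the real data, dim(student) should report the 649 rows and 33 columns described on the UCI page:

```r
# A toy stand-in for 'student' with made-up values
toy <- data.frame(school = c("GP", "GP", "MS"),
                  age    = c(18L, 17L, 16L),
                  G3     = c(11L, 14L, 10L))

dim(toy)             # number of rows and columns: 3 3
names(toy)           # column names: "school" "age" "G3"
str(toy)             # class and first few values of each column
head(toy, 2)         # the first two rows
sapply(toy, class)   # the data class of every column
```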
Re-Formatting Variables
But we’re not quite done yet! A close inspection of the “Attribute Information” on the UCI Repository site tells us that a few variables in our data set were not formatted exactly as they should have been. For example, the attribute description tells us the ‘Medu’ and ‘Fedu’ columns contain numbers 0-4 that should actually represent the level of education attained by the student’s mother and father, and are therefore categorical in nature. However, our R object indicates that ‘Medu’ and ‘Fedu’ were read in as Integers. We will need to re-format these and a few other similar examples.
As we learned above, in R, factors can be thought of as integer variables where each integer has a label. Conveniently, the attribute descriptions provide us with the labels to go with each integer. All we need to do is tell R which variable(s) to change, and specify the correct labelling for each value. This is demonstrated below:
student$Medu_f <- factor(student$Medu,
                         labels = c("none",
                                    "primary education (4th grade)",
                                    "5th to 9th grade",
                                    "secondary education",
                                    "higher education"))
In the code above, we’ve created a new column in our data frame called “Medu_f”, and using the factor() function, we have instructed R to copy the existing “Medu” column and apply the five text labels for each value from 0-4. We can check our work to verify our new factor variable matches the original:
table(student$Medu)
##
## 0 1 2 3 4
## 6 143 186 139 175
table(student$Medu_f)
##
## none primary education (4th grade)
## 6 143
## 5th to 9th grade secondary education
## 186 139
## higher education
## 175
Since the counts from the original variable and our new factor variable match exactly, we know we’ve done it correctly.
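One caveat worth knowing: supplying only ‘labels’ works here because every value from 0 to 4 actually occurs in the column. If a value were absent, the labels would no longer line up. The sketch below uses made-up toy data (‘medu_toy’) to show the problem, and how naming the ‘levels’ explicitly avoids it:

```r
edu_labels <- c("none",
                "primary education (4th grade)",
                "5th to 9th grade",
                "secondary education",
                "higher education")

# Toy data in which the value 0 never occurs
medu_toy <- c(1, 2, 4, 3, 2)

# Without 'levels', R finds only four distinct values and rejects five labels
res <- tryCatch(factor(medu_toy, labels = edu_labels),
                error = function(e) "error")
res                  # "error"

# Naming the levels explicitly keeps the value-to-label mapping fixed
medu_toy_f <- factor(medu_toy, levels = 0:4, labels = edu_labels)
table(medu_toy_f)    # "none" is listed with a count of 0
```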
We can now apply the same logic to the “Fedu”, “traveltime”, and “studytime” variables using the appropriate labelling for each:
student$Fedu_f <- factor(student$Fedu,
                         labels = c("none",
                                    "primary education (4th grade)",
                                    "5th to 9th grade",
                                    "secondary education",
                                    "higher education"))
student$traveltime_f <- factor(student$traveltime,
                               labels = c("<15 min",
                                          "15 to 30 min",
                                          "30 min to 1 hour",
                                          ">1 hour"))
student$studytime_f <- factor(student$studytime,
                              labels = c("<2 hours",
                                         "2 to 5 hours",
                                         "5 to 10 hours",
                                         ">10 hours"))
We also need to re-format the variables “famrel”, “freetime”, “goout”, “Dalc”, “Walc”, and “health”, but since these all represent Likert-scale responses (“very bad” to “excellent”), we can re-format them as ordered factor variables (there is some disagreement about whether or not this is the recommended course of action; here we’re presenting it for demonstration purposes). To do this, we simply need to add the ordered argument:
student$famrel_f <- factor(student$famrel,
                           labels = c("very bad",
                                      "bad",
                                      "neutral",
                                      "good",
                                      "excellent"),
                           ordered = TRUE)
student$freetime_f <- factor(student$freetime,
                             labels = c("very low",
                                        "low",
                                        "neutral",
                                        "high",
                                        "very high"),
                             ordered = TRUE)
student$goout_f <- factor(student$goout,
                          labels = c("very low",
                                     "low",
                                     "neutral",
                                     "high",
                                     "very high"),
                          ordered = TRUE)
student$Dalc_f <- factor(student$Dalc,
                         labels = c("very low",
                                    "low",
                                    "neutral",
                                    "high",
                                    "very high"),
                         ordered = TRUE)
student$Walc_f <- factor(student$Walc,
                         labels = c("very low",
                                    "low",
                                    "neutral",
                                    "high",
                                    "very high"),
                         ordered = TRUE)
student$health_f <- factor(student$health,
                           labels = c("very bad",
                                      "bad",
                                      "neutral",
                                      "good",
                                      "very good"),
                           ordered = TRUE)
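What does an ordered factor buy us? Mainly, R knows that the levels have a rank, so comparisons like “greater than” become meaningful. The sketch below uses made-up responses (‘goout_toy’ and the data frame ‘toy’) rather than the real data, and also shows one possible way to apply the same labels to several columns with a loop instead of repeating the code:

```r
likert <- c("very low", "low", "neutral", "high", "very high")
goout_toy <- c(1, 5, 3, 2)   # made-up responses on a 1 to 5 scale

goout_toy_f <- factor(goout_toy, levels = 1:5, labels = likert, ordered = TRUE)
is.ordered(goout_toy_f)      # TRUE
goout_toy_f > "low"          # FALSE TRUE TRUE FALSE
max(goout_toy_f)             # very high

# Applying the same labels to several columns in a loop
toy <- data.frame(goout = c(1, 5, 3), Dalc = c(2, 2, 4))
for (v in c("goout", "Dalc")) {
  toy[[paste0(v, "_f")]] <- factor(toy[[v]], levels = 1:5,
                                   labels = likert, ordered = TRUE)
}
names(toy)                   # "goout" "Dalc" "goout_f" "Dalc_f"
```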
There you have it – usable, well-formatted data plucked directly from the internet and ready for analysis in R! We have covered a lot, but the fun is just getting started – stay tuned for Part III of our series, where we’ll cover data exploration, manipulation, and visualization!
Below is a sneak peek of what’s to come!
To learn more about how the BWF Insight team uses analytics tools like R to help fundraisers all around the world, check us out on the web at BWF Insight!