Marketing Yogi: An R primer for Session 3 and beyond

Hi all,

Time to gently introduce R into MKTR. In Session 3, I attempt to demonstrate sampling related concepts using R. This does not require that you as students do anything other than watch and listen, discuss and learn. However, I would encourage you to try the same thing I demo in class at home on R. All you have to do is copy-paste the Rcode below (in the grey shaded boxes) onto the R console. Since there is no grading or assignment pressure involved, consider this a gentle intro to R. Later, when the graded homework assignments happen, then also, expect the same thing - I will put out R code here on the blog and you have the option of using it to solve your homeworks directly.

1. To download and install R:
Go to the following link
CRAN project: R download links
download the installer, and follow its instructions.

After that, I expect that in < 20 minutes (on a good net connection), you will have top-class computational firepower ready on your machines.

Open R and, well, look around. The GUI won't look like much probably, but appearances can be deceiving (as we'll see later in the course in Session 6 - Qualitative research and Session 7 - Experimentation).

2. Basic Input/Output in R
Any dataset you want to read in R, store in a standalone .txt or .csv file anywhere on your computer. Then simply copy paste the following code:

mydata = read.table(file.choose(), header = TRUE)
dim(mydata); summary(mydata); mydata[1:5,]

Note: The dataset should have column headers, else put "header = FALSE" above. We have both read in the dataset (named 'mydata') into R and run a quick-check on it to see its summary characteristics.

3. Read in Session 3 Dataset into R

For session 3, to demonstrate sampling concepts on R, I collected data from students in the MKTR class of 2011 on their self-reported height (in cms) and weight (in kgs). This data is available for copying from this google spreadsheet. Please copy the data onto a .txt file on your computer and use the code in step 2 to read it into R.

4. Run your first set of analyses on R - histograms for what the data look like.

# What are the data like? #
attach(mydata)
par(mfcol=c(2,1))# makes plots on a single page in a 2 x 2 pattern
for (i in 1:ncol(mydata)) # starts loop for i from 1 to 4
{hist(mydata[,i], breaks=30, main=dimnames(mydata)[[2]][i], col="gray")
abline(v=mean(mydata[,i]), lwd=3, lty=2, col="Red")}

Just copy-paste the above code into the R console. Should run peacefully.

5. Set sample size (k) and see by how much sample based estimates differ from true values.

# Randomly sample 10 values & estimate mean height, weight.
k = 10
ht = sample(height,k); ht
wt = sample(weight,k); wt
mean(ht); mean(wt)
mean(height); mean(weight)
error.ht = mean(height)-mean(ht); error.ht
error.wt = mean(weight)-mean(wt); error.wt

6. Now set a higher sample size, say k=40 (instead of 10) and rerun the above code. The errors come down, usually (but not always).

# Randomly sample 10 values & estimate mean height, weight.
k = 40
ht = sample(height,k); ht
wt = sample(weight,k); wt
mean(ht); mean(wt)
mean(height); mean(weight)
error.ht = mean(height)-mean(ht); error.ht
error.wt = mean(weight)-mean(wt); error.wt

7. Plot and see how closely sample distribution approximates population distribution.

par(mfrow=c(2,2))
hist(mydata[,1], breaks=30, main ="Population Height", xlim=c(140,200), col="gray")
abline(v=mean(mydata[,1]), lwd=3, lty=2, col="Red")
hist(mydata[,2], breaks=30, main="Population Weight", xlim=c(40,100), col="gray")
abline(v=mean(mydata[,2]), lwd=3, lty=2, col="Red")
hist(ht, breaks=20, main="Sample size k=40", xlim=c(140,200), col="beige")
abline(v=mean(ht), lwd=3, lty=2, col="Red")
hist(wt, breaks=20, main="Sample size k=40", xlim=c(40,100), col="beige")
abline(v=mean(wt), lwd=3, lty=2, col="Red")

By the way, if you want to save the above (or any other) plot from R, just right-click, copy as metafile and then paste onto your PPT or word doc etc.

8. Time to deep-dive into Sampling Distributions now. Ask yourself, what happens if we repeat the sampling process 1000 times? We get 1000 means from 1000 samples. Now, what do the 1000 means look like as a histogram?

outp = matrix(0,nrow=1000,ncol=2)
k = 10;# set sample size
for (i in 1:1000){ outp[i,1]=mean(sample(height,k))
outp[i,2]=mean(sample(weight,k))}
par(mfrow=c(2,2))
for (i in 1:ncol(outp)){
hist(outp[,i], breaks=10,main=c( "sample size=",k),
xlab=dimnames(mydata)[[2]][i],xlim=range(mydata[,i]))}

9. Now repeat the above code but with k=40 (instead of k=10) and see. The plots are superimposed in a 2x2 pattern. Do you see fall or rise in the std error? The standard error will roughly half, following the square root rule.

outp = matrix(0,nrow=1000,ncol=2)
k = 40;# set sample size
for (i in 1:1000){ outp[i,1]=mean(sample(height,k))
outp[i,2]=mean(sample(weight,k))}
#par(mfrow=c(2,2))
for (i in 1:ncol(outp)){
hist(outp[,i], breaks=10,main=c( "sample size=",k),
xlab=dimnames(mydata)[[2]][i],xlim=range(mydata[,i]))}

If all went well and you were able to replicate what I got in class in Session 3, well then, congratulations, I guess. You've run R peacefully for MKTR.

I'd strongly recommend folks try R to the max extent feasible for this course. Pls correspond with me using this blog for any R related queries, trouble-shoots etc.

Sudhir

P.S.
Here's code for demo-ing CLT on a-priori non-normal data: one from exponential distribution and the other from a log-normal one.

x1 = rexp(10000) # exponential distbn
x2 = exp(rnorm(10000)) # lognormal distbn
outp = matrix(0,nrow=1000,ncol=2)
k = 40;# set sample size
par(mfrow=c(2,2))

hist(x1, breaks=40)
hist(x2, breaks=40)

for (i in 1:1000){ outp[i,1]=mean(sample(x1,k))}
hist(outp[,1], breaks=10,main=c( "sample size=",k))

for (i in 1:1000){ outp[i,2]=mean(sample(x2,k))}
hist(outp[,2], breaks=10,main=c( "sample size=",k))

I also tried the flipped classroom thing and have putup two youtube vids on how to read in and write out data from R. Here are the links:
5 Steps to Read data into R
4 Steps to Save data from R