Thursday, October 25, 2012

Session 4 R code and Homework Data

Update:

Pls see this Session 4 HW guidance blog-post before submitting your HW.

I have re-formatted parts of this blog-post. Inline images have been added for each tool's graphical output. Pls let me know if you're having any trouble viewing the images. Use the comments box below for fastest response.

Hi all,

I hope you had the chance to run the R demo of session 3 at home just to see first-hand R's ease-of-use in MKTR.

Before jumping into the R code for Session 3, first, a small admin step - you need to know how to download and install packages in R. Say, you need the package 'MASS' installed. Go to the Packages Menu in the Menu bar, click on 'Install Package(s)...', a window will open asking which server to download the package from. Be patriotic and choose 'India'.

A second window will open listing all the packages in R (at present, this list grows every month) in alphabetical order. Click on thepackage you want and sit back. R will automatically download and install the package for you. Might take a minute or two at most.

All the data for the below examples are stored as tables in this google spreadsheet.

OK, let's get started with the R codes then.

1. Simple Data Visualization using biplots. We use USArrests data (inbuilt R dataset) to see how it can be visualized in 2 dimensions. Just copy-paste the code below onto the R console [Hit 'enter' after the last line].

pc.cr <-princomp(USArrests, cor = TRUE)# scale data
summary(pc.cr)
biplot(pc.cr)
This is what the plot should look like. Click on image for larger view.

2. Code for making Joint Space maps I use as my example the OfficeStar dataset that I also demo in class from MEXL's built-in examples database. The code is generic and can be applied to any dataset in the right format you want to make joint space maps from. To facilitate comparison, I use as input format in R the same tables that you would otherwise use in MEXL

For data input, read in only the cells with numbers, no headers. The headers will need to be read in separately.

Step 2a: Read in the attribute table into 'mydata'. Only number cells, no headers.

# Read in the 5x4 mean attributes table #
mydata = read.table(file.choose())
mydata = t(mydata)#transposing to ease analysis
mydata #view the table read

Step 2b: Read the preferences table into 'pref'. Only the number cells, no headers.

# Read in preferences table#
pref=read.table(file.choose())
dim(pref) #check table dimensions
pref[1:10,]#view first 10 rows

Step 2c: Now read in table headers for 'mydata'. The below code is specific to the current example. If you're working on another dataset, pls edit the headers below (on a notepad) and then copy-paste into R.

# Read in headers separately #
attribnames = c("Large choice","Low prices","Service quality","Product Quality","Convenience")
brdnames = c("Office star", "Paper & co.","Office Eqpmt", "supermkt")
rownames(mydata) = brdnames
colnames(mydata) = attribnames
mydata # view table with headers

Data reading is done. Can start analysis now. Finally.

Step 2d: Plot the Joint space maps. The following code is general and can be used directly for any joint space mapping you may want to do once you have correctly read-in the data.

par(pty="s")#square plotting region
fit = prcomp(mydata, scale.=TRUE)
plot(fit$rotation[,1:2], type="n",xlim=c(-1.5,1.5), ylim=c(-1.5,1.5),main="Joint Space map - Home-brew on R")
abline(h=0); abline(v=0)
for (i1 in 1:nrow(fit$rotation)){
arrows(0,0, x1=fit$rotation[i1,1]*fit$sdev[1],y1=fit$rotation[i1,2]*fit$sdev[2], col="blue", lwd=1.5);
text(x=fit$rotation[i1,1]*fit$sdev[1],y=fit$rotation[i1,2]*fit$sdev[2], labels=attribnames[i1],col="blue", cex=1.1)}

# make co-ords within (-1,1) frame #
fit1=fit
fit1$x[,1]=fit$x[,1]/apply(abs(fit$x),2,sum)[1]
fit1$x[,2]=fit$x[,2]/apply(abs(fit$x),2,sum)[2]
points(x=fit1$x[,1], y=fit1$x[,2], pch=19, col="red")
text(x=fit1$x[,1], y=fit1$x[,2], labels=brdnames,col="black", cex=1.1)

# --- add preferences to map ---#
k1 = 2; #scale-down factor
pref=data.matrix(pref)# make data compatible
pref1 = pref %*% fit1$x[,1:2]
for (i1 in 1:nrow(pref1)){segments(0,0, x1=pref1[i1,1]/k1,y1=pref1[i1,2]/k1, col="maroon2", lwd=1.25)}
# voila, we're done! #
This is what you should get. Click on image for larger size. Again, just ensure data input happened correctly. Rest follows through peacefully.

3. Code for Joint Space maps on your Session 2 Homework dataset (term 4 courses).

For data input, read in only the cells with numbers, no headers. The headers will need to be read in separately. Following steps mirror steps 2a, 2b and 2c above.
Step 3a: Read in the attribute table into 'mydata'. Only number cells, no headers.

# Read in the 4x4 mean attributes table #
mydata = read.table(file.choose())
mydata = t(mydata)#transposing to ease analysis
mydata #view the table read

Step 3b: Read the preferences table into 'pref'. Only the number cells, no headers.

# Read in preferences table#
pref=read.table(file.choose())
dim(pref) #check table dimensions
pref[1:10,]#view first 10 rows

Step 3c: Now read in table headers for 'mydata'. The below code is specific to the current example. If you're working on another dataset, pls edit the headers below (on a notepad) and then copy-paste into R.

# Read in headers separately #
attribnames = c("Conceptual & theoretical value","Practical relevance","Interest sustained","Difficulty level")
brdnames = c("GSB", "INVA","MGTO", "SAIT")
rownames(mydata) = brdnames
colnames(mydata) = attribnames
mydata # view table with headers

The rest of the analysis uses exactly the same block of code as in step 2d after the data input part.

4. MDS code

MDS or Multi-dimensional scaling is the way we analyze overall-similarity (OS) data. Recall that you rated your impression of overall similarity among 9 car brands in a series of paired comparisons. We use that data for MDS here. The input is required in a particular format and there's some data cleaning and aggregation required. Luckily, my RA Shri Ankit Anand was around and a big help. Pls find the data for MDS input in the above google spreadsheet starting at cell P1.

First, we do metric MDS. Metric MDS uses metric data as its input. Since you rated similarity-dissimilarity on a 1-7 interval scale, metric MDS will work just fine. Here's the code for metric MDS, just copy-paste onto the R console.

Read in data (with the headers).

# read in the dissimilarities matrix #
mydata=read.table(file.choose(),header=TRUE)
d = as.dist(mydata)
# Classical MDS into k dimensions#
fit <- cmdscale(d,eig=TRUE, k=2)
fit # view results

# plot solution #
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", type="p",pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)
This is my graphical MDS output. Click on image for larger size.

Suppose the similarity-dissimilarity judgments were in yes/no terms rather than in metric ratings. Then metric MDS becomes dicey to use as it relies on interval sclaing assumptions. The more robust (but somewhat less efficient) nonmetric MDS then becomes the way to go.

Nonmetric MDS is just as easy to run. However, make sure you have the MASS package installed before running it.

library(MASS)
d <- as.dist(mydata)
fit <- isoMDS(d, k=2)
fit # view results

# plot solution $
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Nonmetric MDS", type="p", pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)
Click on image for larger size.

5. Factor analysis code

We will use exploratory (or 'common') factor analysis first on the toothpaste survey dataset. This dataset can be found starting cell P22 in the google spreadsheet mentioned above. Need to install package 'nFactors' (R is case sensitive, always) for running the scree plot.

# read in the data #
mydata=read.table(file.choose(),header=TRUE)
mydata[1:5,]#view first 5 rows

# determine optimal no. of factors#
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)

On the scree plot that appears, The green horizontal line represents the Eigenvalue=1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That is the optimal no. of factors. Here, it is 2. The plot looks intimidating as it is, hence, pls do not bother with any other color-coded information given - blue, black or green. Just stick to the instructions above.

k1=2 # set optimal no. of factors
If the optimal no. of factors changes when you use a new dataset, simply change the value of k1 in the line above. Copy paste the line onto a notepad, change it to 'k1=6' or whatever you get as optimal and paste onto R console. Rest of the code runs as-is.

# extracting k1 factors #
# with varimax rotation #
fit <- factanal(mydata, k1, scores="Bartlett",rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)

# plot factor 1 by factor 2 #
load <- fit$loadings[,1:2]
par(col="black")#black lines in plots
plot(load,type="p",pch=19,col="red") # set up plot
abline(h=0);abline(v=0)#draw axes
text(load,labels=names(mydata),cex=1,pos=1)
# view & save factor scores #
fit$scores[1:4,]#view factor scores

#save factor scores (if needed)
write.table(fit$scores, file.choose())
Click on image for larger size.

In case you are wondering how the variable loading onto factors looks like in R (after factor rotation), here is the relevant R console snapshot.

Clearly, the top 3 variables load onto factor 1 and the bottom 3 onto factor 2. That is all for now. See you soon.

Sudhir

P.S.
This survey based exercise we had done in class. I've sent you its data in an excel file. For practice sake, pls run it through a joint space map and see if you get the results I did (pasted below).
This one is using Female students' inputs

And this one is using inputs from the male students:

7 comments:

  1. Had a question.

    In the slide "The Perceptual Map from HW Data" from Session 4, in order to understand the relative perceptions of GSB and SAIT along the "Conceptual and Theoretical Value gained" dimension, can we draw perpendiculars from GSB and SAIT to the extension of the line "Conceptual and Theoretical value gained" to the other side of the Origin?

    ReplyDelete
  2. Hi Samik,

    Yes. Each dimension extends to both sides of the origin. In the direction of the arrow, it increases and in the opposite direction behind the origin it decreases.

    The origin can also be considered a standardized "mean" value and brands behind it are "below the mean".

    Hope that helped.

    Sudhir

    ReplyDelete
  3. I am looking at the code for the Factor analysis and had a question -

    "On the scree plot that appears, THe green horizontal line represents the Eigenvalue=1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That is the optimal no. of factors. Here, it is 2.
    k1=2 # set optimal no. of factors

    # extracting k1 factors #
    # with varimax rotation #
    fit <- factanal(mydata, k1, scores="Bartlett",rotation="varimax")
    print(fit, digits=2, cutoff=.3, sort=TRUE)

    # plot factor 1 by factor 2 #
    load <- fit$loadings[,1:2]
    par(col="black")#black lines in plots
    plot(load,type="p",pch=19,col="red") # set up plot
    abline(h=0);abline(v=0)#draw axes
    text(load,labels=names(mydata),cex=1,pos=1)
    # view & save factor scores #
    fit$scores[1:4,]#view factor scores

    #save factor scores (if needed)
    write.table(fit$scores, file.choose())"

    In this code piece I realise I need to change the value of the no. of factors based on the graph I get. But do I need to make adjustments for Eigenvalue also and how? Also should I make changes to the value of cut off?

    Thanks

    ReplyDelete
  4. Hi Shakun,

    Just change the value of k1 in:
    k1=2 # set optimal no. of factors

    Change it to
    k1=6 # set optimal no. of factors

    Rest of the code remains unchanged.

    Let me know if there're any more queries.

    Sudhir

    ReplyDelete
  5. Dear Professor,
    In Factor analysis code, is there any specific reason why you chose to read 5 rows? What happens if we take all the rows? Is there any impact of rows on the Scree Chart?

    # read in the data #
    mydata=read.table(file.choose(),header=TRUE)
    mydata[1:5,]#view first 5 rows

    Thanks!
    Best Regards
    Abhishek

    ReplyDelete
  6. Hi Abhishek,

    The line:
    mydata[1:5,]#view first 5 rows

    only 'views' the first 5 rows. The entire data is in R and is used as is for analysis purposes. But for convenience, just to see what the data look like, its good practice to have R display a few lines of each dataset every now and then.

    Hope that clarifies.

    Sudhir

    ReplyDelete

Constructive feedback appreciated. Please try to be civil, as far as feasible. Thanks.