Friday, September 19, 2014

Session 5 Homework

Update:

Sorry folks, I typed in the wrong URL. Those looking for the session 8 and 9 pre-reads, pls go here.

The following are pre-reads for session 7:

1. AI meets the C-Suite (McKinsey Quarterly)

2. Track Customer Attitudes to Predict Their Behaviors (HBR)

The course-pack reading for session 7 is optional; these two are mandatory.

Sudhir

-----------------------------

Hi all,

You might want to look up the list of available shiny apps, listed in the session 5 updates blog post here, before we start.

The homework has three parts; only two of the three need to be done and submitted: part 1 (mandatory) plus one of the two options for part 2.

Homework part 1: (group submission, mandatory)

Choose any non-obscure product or service on Flipkart or Amazon (or any other review aggregation source).

Your R.O.s (research objectives) are: (1) Find the top few things people like about the product.

(2) Find the top few things people dislike about the product.

(3) Suggest a (re-)positioning strategy for the product based on the above.

Pull 100+ reviews of the product.

Note: A Flipkart shiny app is already available. Just follow the instructions on its first page.

We're working on an Amazon shiny app as well. Watch this space for updates.

Update: Turns out Amazon pages are now dynamically generated (they were regular static pages till last year), so no Amazon shiny app will be happening.

Text-analyze the corpus for insights.

Not everything we can do is up on shiny. It would help massively if at least one member per group runs the classwork R code successfully on their machine.
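To give a sense of what that involves, here's a minimal sketch of a basic review analysis in R using the tm package (a sketch only, not the classwork code itself; the file-picking step and the frequency cutoff are illustrative):

library(tm)

reviews = readLines(file.choose()) # pick your saved reviews file, one review per line

corp = VCorpus(VectorSource(reviews)) # build a corpus from the review vector
corp = tm_map(corp, content_transformer(tolower)) # lowercase everything
corp = tm_map(corp, removePunctuation) # drop punctuation
corp = tm_map(corp, removeNumbers) # drop numbers
corp = tm_map(corp, removeWords, stopwords("english")) # drop common English stopwords
corp = tm_map(corp, stripWhitespace) # squeeze extra whitespace

dtm = DocumentTermMatrix(corp) # documents x terms matrix
findFreqTerms(dtm, lowfreq = 25) # terms occurring 25+ times across reviews

freq = sort(colSums(as.matrix(dtm)), decreasing = T) # term frequencies, sorted
head(freq, 20) # candidate likes/dislikes to dig into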

Homework part 2: (Individual submission - option 1)

Use tm.plugin.webmining to pull data from a news aggregator such as Google News (as in the example below). Pick any product/firm/brand/celebrity that has been in the news lately.

Pull the last 100+ news articles wherein this entity was mentioned in the article title.

Recall the classroom example wherein we did this for Zara:

install.packages("tm") # if using for the first time

install.packages("tm.plugin.webmining")

library(tm)
library(tm.plugin.webmining)

# Note: Run below on base R, not RStudio

zara <- WebCorpus(GoogleNewsSource("Zara"))

x1 = zara # save the corpus in a local file

x1 = unlist(lapply(x1, content)) # strip relevant content from x1

x1 = gsub("\n", "", x1) # remove newline chars

x1[1:5] # view content

write.table(x1, file.choose(), row.names=F, col.names=F) # save file as 'zara_news.txt'

Replace 'zara' above with whatever entity you chose.

Alternately, try running this shiny app for Google News pulls. It's not very stable, but will do for now.

Text-analyze the corpus for sentiment.

Note: Do you see how the corpus thus obtained can potentially help you mine, measure and score some notion of "PR buzz" for the entity?
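For instance, one crude buzz measure (a sketch, not the classwork method; assumes the qdap package and the x1 news vector saved above):

library(qdap)

pol = polarity(x1) # sentiment polarity, one row per article, in pol$all
buzz = mean(pol$all$polarity, na.rm=T) * length(x1) # avg sentiment scaled by volume of coverage
buzz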

Your task: ID the two most positive and two most negative articles.

In a PPT slide or two, write what you found about the reasons for positive and negative sentiment.

Update: Pls insert the following lines of code after you run the older code for sentiment analysis.

This is to obtain the most positive and negative documents.

###############
head(pol$all[order(pol$all[,3], decreasing=T),]) # top positive polarity documents
head(pol$all[order(pol$all[,3], decreasing=F),]) # top negative polarity documents
###############
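Once you have the row positions, you can pull up the article texts themselves for reading (assuming, as above, that pol came from qdap's polarity() run on the x1 text vector, one row per article):

top_pos = order(pol$all[,3], decreasing=T)[1:2] # rows of the 2 most positive articles
top_neg = order(pol$all[,3], decreasing=F)[1:2] # rows of the 2 most negative articles

x1[top_pos] # read the most positive articles
x1[top_neg] # read the most negative articles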

Homework part 2: (Individual submission - option 2)

Alternately, instead of HW part 2 option 1 above, you could do the following.

Take any long (as in 10+ pages) soft copy article that you know and have read.

Use the textsplit shiny app to split it into uniform length parts (of say 25-50 words each).

Now, text-analyze the split document for topics using the shiny app for topic mining (or locally; see the sketch below).

In a PPT, paste the wordclouds for each topic and write your interpretation of what that topic means (a few descriptive words is all).
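If the shiny apps act up, both the splitting and the topic mining can be run locally; a minimal sketch using the tm and topicmodels packages (the apps' internals may differ; the 40-word chunk size and k = 4 topics are illustrative):

library(tm)
library(topicmodels)

txt = paste(readLines(file.choose()), collapse=" ") # read the article's text file
words = unlist(strsplit(txt, "\\s+")) # break into individual words
chunks = tapply(words, ceiling(seq_along(words)/40), paste, collapse=" ") # ~40-word chunks

corp = VCorpus(VectorSource(chunks)) # corpus over the chunks
corp = tm_map(corp, content_transformer(tolower))
corp = tm_map(corp, removePunctuation)
corp = tm_map(corp, removeWords, stopwords("english"))

dtm = DocumentTermMatrix(corp)
dtm = dtm[rowSums(as.matrix(dtm)) > 0, ] # drop chunks that emptied out after cleaning

lda = LDA(dtm, k=4) # fit a 4-topic LDA model
terms(lda, 10) # top 10 terms per topic, for interpretation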

Deliverables and Deadlines:

The deadline for this session's HWs is a week from now: next Friday (26-Sept), midnight.

Drop boxes will be up for session 5 HW part 1 and HW part 2 separately.

For both homework parts, pls submit a zipped folder containing (a) the text dataset you used, and (b) the PPT you made.

Pls remember to write your (group) name and PGID on the title slide. Name the PPT as name_HWnumber.pptx

Added later: The PPT should be <10 slides in length. Feel free to add more slides in an annexure, if required.

The HWs are all HCC level 0. Feel free to take any help from anybody as required.

Any queries etc, contact me.

Ciao.

Sudhir

17 comments:

  1. Dear Sir,

    Could you kindly explain how the document frequency is computed? As per my understanding, it is the number of times a word occurs per 100 words in the document. Kindly correct me if I am wrong.

    Regards,
    Nithya

  2. Hi Nithya,

    Are you referring to the TFIDF weighting scheme? Well, in the classroom example, my corpus had 100 docs, hence I divided the term freq TF by 100. Else, we divide by the no. of docs in the corpus.

    In any case, there exist many schemes to compute TFIDF, and we can always come up with our own, besides.

    So, for now, don't worry about it and use R's internal tfidf scheme. Hope that helps.
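    (FYI, invoking R's internal scheme looks like the below; a sketch assuming 'corp' is your tm corpus:)

    dtm_tfidf = DocumentTermMatrix(corp, control = list(weighting = weightTfIdf)) # tm's built-in TFIDF weighting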

    Sudhir

  3. Hello Prof,

    While executing the command
    zara <- WebCorpus(GoogleNewsSource("Zara"))

    I get the following error:
    Error in function (type, msg, asError = TRUE) : couldn't connect to host

    How can I fix this?

    Replies
    1. Hi Anand,

      Use base R and not RStudio. Also, I have updated the code for newline characters in the post above. Check now and see.

      Sudhir

    2. Hi Sir,

      I am facing this error while running base R.

    3. Try this:

      https://wordcloud.shinyapps.io/googlenews/

  4. Hello Prof. In homework part 2 option 1, you wrote "Your task: ID the two most positive and two most negative articles." How do we ID articles? Did you mean topics?

    Replies
    1. Hi Sharath,

      No, it would be articles. Imagine the articles pulled are documents and you have terms as columns. Upon sentiment analysis (like we did for the Iron Man reviews), you get polarities for each document. Hope that helps.

      Sudhir

  5. Hi Professor,

    I tried running the R code as well as the shiny app for Google News pulls. Both seem to be timing out while trying to establish the connection.
    Please help!

    Error as seen in R:

    Error in function (type, msg, asError = TRUE) : connect() timed out!

    Thanks
    Sonam

    Replies
    1. Hi Sonam,

      Would be good to attend the R tutorial today and pose the Q there. Aashish Pandey, who built the shiny app, will be conducting the tutorial.

      Sudhir

  6. Sir, in the shiny app for text analysis, I'm getting the following error: NA indices not allowed

    Request your help

    Replies
    1. Hi Aditi,

      Pls reach out to Aashish Pandey for shiny queries. Did you attend today's R tutorial? In any case, I shall postpone the session 5 HW deadline by 24 hrs; too many folks are facing issues with the text analytics pieces.

      Sudhir

  7. Hi Professor,

    While topic mining, should we go with the number of topics suggested by the log Bayes factor, or can we input the number of topics ourselves? I see that the Narendra Modi I-Day speech had 6 topics.

    Regards,
    Rohit

    Replies
    1. You can override the machine. Topic interpretability should be our main concern; model fit etc. can come later, I guess.

      Sudhir

  8. Hi Professor,

    The shiny app for basic text analysis appears to be down... been trying to access it for quite some time. Please help!

    Thanks
    Sonam

  9. This comment has been removed by the author.

