Florida International University R Worksheet
Description
As with all submissions, this needs to be submitted as an HTML file, and not as a .RMD file.
Part 1 – Decision Trees
Use the Titanic dataset for these questions. Data is from: https://www.kaggle.com/c/titanic/data?select=train.csv. Metadata is available below:
Data Dictionary
Variable | Definition | Key
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton
Question #1
Produce one super awesome visual with this dataset. Explain what this visual shows in 1-2 sentences. Your visual must include a caption and subtitle, in addition to the standard labels.
Question #2
Create a model that can be used to predict the survival of a passenger based on attributes. This is a classification activity. 0 = Did not survive, 1 = Survived
Convert your 0 and 1 to a factor data type to facilitate the creation of a classification model. You probably want to change this to say “Survived” and “Did Not Survive” so that it is easier to read.
Drop the Name, PassengerID, and Ticket columns before building the model (otherwise your model creation will either crash your computer or produce something that doesn’t make sense).
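The two preparation steps above can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the data frame is a made-up miniature that borrows the column names of the Kaggle train.csv file, and the object names titanic and drop_cols are hypothetical.

```r
# Hypothetical miniature of the Kaggle train.csv layout (all values are made up).
titanic <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 0, 1, 0),
  Pclass      = c(3, 1, 3, 1, 2, 3),
  Name        = c("A", "B", "C", "D", "E", "F"),
  Sex         = c("male", "female", "female", "male", "female", "male"),
  Age         = c(22, 38, 26, 35, 27, 54),
  Ticket      = c("T1", "T2", "T3", "T4", "T5", "T6"),
  stringsAsFactors = FALSE
)

# Recode 0/1 as a labelled factor so a model treats it as a class, not a number.
titanic$Survived <- factor(
  titanic$Survived,
  levels = c(0, 1),
  labels = c("Did Not Survive", "Survived")
)

# Drop identifier-like columns (nearly every row has a unique value).
drop_cols <- c("Name", "PassengerId", "Ticket")
titanic <- titanic[, !(names(titanic) %in% drop_cols)]

str(titanic)
```

With the real file, the data.frame() call would be replaced by reading the CSV; the recoding and column-dropping lines would stay the same as long as the column names match.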
Question #3
Produce a confusion matrix. What does it tell you? 3-5 sentences.
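As a rough illustration of what a confusion matrix contains, the sketch below cross-tabulates predicted against actual labels with base R's table(). The label vectors are invented for the example and do not come from any fitted model; the names lvls, actual, predicted, and cm are hypothetical.

```r
# Hypothetical actual and predicted labels (made up for illustration).
lvls <- c("Did Not Survive", "Survived")
actual <- factor(
  c("Survived", "Did Not Survive", "Survived", "Did Not Survive",
    "Did Not Survive", "Survived", "Did Not Survive", "Survived"),
  levels = lvls
)
predicted <- factor(
  c("Survived", "Did Not Survive", "Did Not Survive", "Did Not Survive",
    "Survived", "Survived", "Did Not Survive", "Survived"),
  levels = lvls
)

# Rows are the model's predictions, columns are the true labels.
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy is the share of cases that fall on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

The off-diagonal cells are the two kinds of mistakes (predicted survival for someone who did not survive, and the reverse), which is usually the part worth discussing in the write-up.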
Part 2 – Logistic Regression
Use the same Titanic dataset from the decision trees part.
1. Build a logistic regression model to predict survival for the Titanic dataset.
2. Based on your logistic regression model, which variables do you think are most important for survival?
3. Produce a confusion matrix and explain your findings related to your model. 2-3 sentences (but it’s always OK if you go over)
Part 3 – Clustering
Data = Mall_Customers.csv
Metadata = https://www.kaggle.com/shwetabh123/mall-customers
Question #1
Conduct basic exploratory data analysis with the Mall_Customers.csv data set. Create 3 graphs of your choosing. For each, provide a 1-2 sentence summary of what you see.
Question #2
Create clusters that look at both the annual income and spending score (your clustering should only look at these two columns).
Create an elbow plot and write a brief interpretation of 2-3 sentences for it. The explanation should have to do with why you chose a certain value of k.
Make a recommendation for the number of clusters that should be used for this data set.
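One common way to build an elbow plot is to run k-means for a range of k values and plot the total within-cluster sum of squares. The sketch below does this with base R's kmeans() on simulated data; the data frame customers and its columns income and score are hypothetical stand-ins for the two columns of interest, not the real file.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Hypothetical stand-in for the two columns of interest (values are simulated).
customers <- data.frame(
  income = c(rnorm(30, 20, 3), rnorm(30, 60, 3), rnorm(30, 100, 3)),
  score  = c(rnorm(30, 80, 4), rnorm(30, 50, 4), rnorm(30, 20, 4))
)

# Total within-cluster sum of squares for k = 1..8.
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(customers, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing WSS by much.
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")
```

Because the simulated data were generated as three groups, the curve should flatten after k = 3 here; on the real data the bend may be less sharp, which is why the written interpretation matters.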
Question #3
Plot your best k-means model as a scatter plot with the centroids displayed. Refer to the notes that I provided on this to see how to do it with ggplot2.
Question #4
Write 2-3 sentences explaining what can be done with this new insight if you were in charge of the marketing and sales operation of the mall.
Part 4: 0.10 points of extra credit (Principal Component Analysis)
Data: pokemon-3.csv (download: pokemon-3-1.csv)
Scenario: you’ve been hired by GameFreak, the makers of the Pokemon games, and they want you to simplify the stats for Pokemon so that younger customers do not need to worry about things like “Attack”, “Speed”, “Special Defense”, and “Special Attack”.
GameFreak wants to turn those 5 columns into one stat that captures the majority of the patterns of those 5 columns. As a Business Analyst, you can help them with this using PCA, a dimension reduction technique.
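A minimal sketch of the idea, using base R's prcomp() on simulated numbers: the first principal component is taken as the single combined stat. Everything about the data here is an assumption, including the column names (the real file's names may differ, and Defense as the fifth column is a guess) and the object names stats and pca.

```r
set.seed(1)

# Hypothetical stat columns (simulated; real column names may differ,
# and "Defense" as the fifth column is an assumption).
n <- 50
base <- rnorm(n, 70, 15)  # shared "overall strength" driving all five stats
stats <- data.frame(
  Attack  = base + rnorm(n, 0, 5),
  Defense = base + rnorm(n, 0, 5),
  Speed   = base + rnorm(n, 0, 5),
  Sp_Atk  = base + rnorm(n, 0, 5),
  Sp_Def  = base + rnorm(n, 0, 5)
)

# Center and scale so that no single stat dominates because of its units.
pca <- prcomp(stats, center = TRUE, scale. = TRUE)

# Proportion of variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores on the first component serve as the single combined stat.
stats$combined <- pca$x[, 1]
head(stats)
```

The sign of a principal component is arbitrary, so a higher combined score may correspond to either stronger or weaker Pokemon; checking the loadings in pca$rotation shows which direction applies.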