If a question asks you to briefly explain the findings, explain them in a few sentences from a managerial standpoint (e.g. Results show that …. Based on the results, hotel management is advised to do ….).
Please download amusement.csv to solve the questions below. In this dataset, respondents report their levels of satisfaction with different aspects of their experience at an amusement park, and their overall satisfaction with the park. You will also see how distant the respondents’ locations are, whether they visited the park over the weekend or weekday, and their number of kids as additional variables.
Relationship between continuous variables
Q1. Please examine the nature of the dataset. In your examination, make sure that you report: (i) the number of rows (respondents) and number of columns (variables), (ii) the data type of each variable, (iii) whether there are any missing values in the dataset, and (iv) summary statistics of the dataset (after handling the missing values, if there are any). Based on the results from (i-iv), discuss the properties of the dataset.
Q2. Run a scatter plot matrix for all the variables in the data (note: first load the “car” library to be able to run the scatterplot matrix). Check the distribution of each variable. List the variables that are normally distributed and those that are not. Transform the ones that are not normally distributed. [Note: you can use any transformation method you want. Make sure that each variable is normally distributed after the transformation, and provide evidence that it is.]
Q3. Run the correlation matrix between all the variables (except ‘weekend’, which is a categorical variable). First, install and load the corrplot library to be able to plot the correlation matrix. Also, report the correlation values between the variables. Interpret the findings.
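The checks in (i)-(iv) can be sketched as follows. This is only a minimal illustration on a small made-up data frame, not the actual amusement.csv; the column names and values are assumptions chosen for the example.

```r
# Hypothetical stand-in for the survey data (names and values are made up).
# With the real file you would start from: park <- read.csv("amusement.csv")
park <- data.frame(
  weekend  = c("yes", "no", "yes", "no", "no"),
  distance = c(12.5, 3.2, NA, 48.0, 7.9),
  rides    = c(87, 90, 78, 85, 92),
  overall  = c(47, 65, 61, 37, 68),
  stringsAsFactors = TRUE
)

dim(park)                          # (i) number of rows and columns
str(park)                          # (ii) data type of each variable
n_missing <- colSums(is.na(park))  # (iii) missing values per column
n_missing
park_clean <- na.omit(park)        # one simple option: drop incomplete rows
summary(park_clean)                # (iv) summary statistics
```

Dropping incomplete rows with na.omit() is only one way to handle missing values; whether it is appropriate depends on how many rows are affected.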
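As a rough sketch of what "evidence of normality after the transformation" could look like, the example below log-transforms a simulated right-skewed variable and compares it before and after. The variable is simulated, the log transform is just one possible choice, and the Shapiro-Wilk test is one of several ways to check normality.

```r
set.seed(42)
# Simulated right-skewed variable (log-normal); stands in for a skewed column
distance <- rlnorm(200, meanlog = 3, sdlog = 1)

# Log transform; only valid for strictly positive values
log_distance <- log(distance)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
p_raw <- shapiro.test(distance)$p.value
p_log <- shapiro.test(log_distance)$p.value
c(raw = p_raw, log = p_log)

# Visual evidence: histograms before/after and a normal Q-Q plot
par(mfrow = c(1, 3))
hist(distance, main = "Raw")
hist(log_distance, main = "Log-transformed")
qqnorm(log_distance)
qqline(log_distance)
par(mfrow = c(1, 1))
```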
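A minimal sketch of computing and plotting a correlation matrix, assuming simulated ratings rather than the real data (the variable names and the relationships between them are made up). The plot is drawn only if the corrplot package is installed.

```r
set.seed(1)
# Simulated satisfaction ratings; names and relationships are illustrative only
rides   <- rnorm(100, mean = 80, sd = 5)
games   <- 0.5 * rides + rnorm(100, mean = 40, sd = 5)
overall <- 0.6 * rides + 0.3 * games + rnorm(100, sd = 5)
sat <- data.frame(rides, games, overall)

# Pearson correlations; "complete.obs" drops rows that contain missing values
cor_mat <- cor(sat, use = "complete.obs")
round(cor_mat, 2)

# Visualize the matrix if corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "ellipse")
}
```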
Q4. Run the scatter plot for the variables: “overall” satisfaction and “rides” satisfaction. Add a regression line to evaluate the strength of the relationship between these variables (note: you are not running a regression analysis, instead just adding a regression line to evaluate the direction and strength of the relationship). Please discuss the findings.
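One way to add a fitted line to a scatter plot in base R is sketched below. The data are simulated, so the slope and correlation shown here say nothing about the real dataset.

```r
set.seed(7)
# Simulated ratings with a built-in positive relationship (assumption)
rides   <- rnorm(150, mean = 80, sd = 6)
overall <- 10 + 0.5 * rides + rnorm(150, sd = 5)

plot(rides, overall,
     xlab = "Satisfaction with rides",
     ylab = "Overall satisfaction")

# lm() is used here only to obtain the line to draw, not for inference
fit <- lm(overall ~ rides)
abline(fit, col = "blue", lwd = 2)

# The correlation coefficient summarizes direction and strength
r <- cor(rides, overall)
r
```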
Comparing groups via Visualization
Please download hotelsat.csv. This dataset was collected from a hotel's customers through a survey. The customers provided their opinions and feelings about several aspects and offerings of the hotel, and rated their quality. Hotel management would like to use the insights from the findings to improve its offerings for the right customers.
Q5. Please examine the nature of the dataset. In your examination, make sure that you report: (i) the number of rows (respondents) and number of columns (variables), (ii) the data type of each variable, (iii) whether there are any missing values in the dataset, and (iv) summary statistics of the dataset (after handling the missing values, if there are any).
Q6. Management would like to know whether there is any pattern between how much each customer spends per night (avgRoomSpendPerNight) and overall satisfaction (satOverall) per segment (eliteStatus) as well as per visit purpose (visitPurpose). Please visually explore whether you observe any pattern. Please briefly interpret the findings.
Q7. Management would like to know whether there is any pattern between how customers rate the city (satCity) and overall satisfaction (satOverall) per segment (eliteStatus) as well as per visit purpose (visitPurpose). Please visually explore whether you observe any pattern. Please briefly interpret the findings.
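For Q6 and Q7, one possible way to look at a relationship "per segment and per visit purpose" is a conditioned scatter plot from the lattice package. The sketch below uses simulated data; the factor levels (e.g. "Gold", "Business") and the rating scale are assumptions and may not match hotelsat.csv.

```r
library(lattice)

set.seed(3)
n <- 120
# Simulated data; levels and scales are assumptions for illustration
hotel <- data.frame(
  eliteStatus  = factor(sample(c("NoStatus", "Silver", "Gold"), n, replace = TRUE)),
  visitPurpose = factor(sample(c("Business", "Leisure"), n, replace = TRUE)),
  avgRoomSpendPerNight = rlnorm(n, meanlog = 5, sdlog = 0.4),
  satOverall   = sample(1:7, n, replace = TRUE)
)

# One panel per combination of eliteStatus and visitPurpose
p <- xyplot(satOverall ~ avgRoomSpendPerNight | eliteStatus + visitPurpose,
            data = hotel,
            xlab = "Average room spend per night",
            ylab = "Overall satisfaction")
print(p)
```

Replacing avgRoomSpendPerNight with satCity in the formula gives the corresponding plot for Q7.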
Q8. Management would like to know how much each segment (eliteStatus) spends on average per night (avgRoomSpendPerNight). Please find the mean value for each segment and use a suitable visualization tool to show the mean values. Please briefly interpret the findings.
Q9. Management would like to know how far, on average, customers travel to the hotel (distanceTraveled) for each purpose of visit (visitPurpose). Please find the mean distance for each purpose of visit and use a suitable visualization tool to show the mean values. Please briefly interpret the findings.
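Group means such as those in Q8 and Q9 can be computed with aggregate() and shown as a bar chart. The sketch below uses six made-up rows; the spend values and segment labels are assumptions.

```r
# Tiny made-up example; values and segment labels are illustrative only
hotel <- data.frame(
  eliteStatus = c("Gold", "Silver", "Gold", "NoStatus", "Silver", "NoStatus"),
  avgRoomSpendPerNight = c(200, 150, 220, 100, 130, 120)
)

# Mean spend per segment (one row per group)
seg_means <- aggregate(avgRoomSpendPerNight ~ eliteStatus, data = hotel, FUN = mean)
seg_means

# Bar chart of the group means
barplot(seg_means$avgRoomSpendPerNight,
        names.arg = seg_means$eliteStatus,
        xlab = "Elite status",
        ylab = "Mean room spend per night")
```

A bar chart is one reasonable choice for displaying a handful of group means; the same pattern applies to distanceTraveled by visitPurpose.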
Group mean differences (t-test and Anova)
Q10. (a) Visualize the distributions of the variables in the hotel satisfaction data via scatterplotMatrix. Looking at the distributions, do you see a need to transform any variables? If so, which ones? Please transform them, store the results as new variables, and add them to the dataset. [Note: if a variable needed to be transformed, please use its log-transformed version wherever it appears in the questions in this block.] (b) Please run the correlations among the variables (except the character variables). Do you see any problem? Please explain.
Q11. Hotel management wants to know whether people who come from closer locations spend more/less money per night on room (avgRoomSpendPerNight) and on food (avgFoodSpendPerNight), and whether they stay longer/shorter (nightsStayed). They use the median distanceTraveled as the cut-off point. People whose distanceTraveled is higher than the median are coded as coming from distant locations, and people whose distanceTraveled is lower than the median are coded as coming from close locations. The question is: are there statistically significant differences between people who come from a closer distance (less than the median value) and people who come from a longer distance (more than the median value) when it comes to (a) avgRoomSpendPerNight, (b) avgFoodSpendPerNight, and (c) nightsStayed? Please briefly explain. [Hint: you need to convert distanceTraveled into a logical data type; see slide 18 of Workshop #8 to get an idea of how.]
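The median split and group comparison described in Q11 could be sketched as below. The data are simulated, and the Welch two-sample t-test (the default of t.test()) is one reasonable choice of test, not necessarily the one shown in the workshop slides.

```r
set.seed(11)
n <- 200
# Simulated data; the distributions are assumptions for illustration
hotel <- data.frame(
  distanceTraveled = rlnorm(n, meanlog = 5, sdlog = 1),
  nightsStayed     = rpois(n, lambda = 3) + 1
)

# Logical flag: TRUE = farther than the median, FALSE = otherwise
# (values exactly equal to the median, if any, fall into the FALSE group here)
hotel$distant <- hotel$distanceTraveled > median(hotel$distanceTraveled)
table(hotel$distant)

# Welch two-sample t-test comparing nights stayed between the two groups
tt <- t.test(nightsStayed ~ distant, data = hotel)
tt
```

The same formula pattern applies with avgRoomSpendPerNight or avgFoodSpendPerNight on the left-hand side.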
Q12.
(a) Hotel management would like to know whether there are meaningful differences among the eliteStatus customers regarding how much money they spend on room per night (avgRoomSpendPerNight). Please show both the statistical results (whether the differences are significant) and a visualization of the group confidence intervals. What do you see? Please briefly explain.
(b) Please do the same thing as in Q12 (a), but this time for money spent on food (avgFoodSpendPerNight). What do you see? Please briefly explain.
(c) Please do the same thing as in Q12 (a) and (b), but this time for nightsStayed. What do you see? Keeping Q12 (a) and (b) in mind, what do the results tell you? What do you advise management to do?
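A minimal base-R sketch of the two pieces Q12 asks for: an ANOVA for the overall group difference and a plot of per-group confidence intervals. The data are simulated with built-in group differences, and fitting a no-intercept model to obtain group-mean intervals is just one possible approach.

```r
set.seed(12)
# Simulated spend with different group means (assumption for illustration)
hotel <- data.frame(
  eliteStatus = factor(rep(c("NoStatus", "Silver", "Gold"), each = 40)),
  avgRoomSpendPerNight = c(rnorm(40, 100, 20), rnorm(40, 120, 20), rnorm(40, 150, 20))
)

# One-way ANOVA: is there any difference among the group means?
fit <- aov(avgRoomSpendPerNight ~ eliteStatus, data = hotel)
summary(fit)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Without an intercept, each coefficient is a group mean, so confint()
# returns a 95% interval per group (based on the pooled residual variance)
fit_means <- lm(avgRoomSpendPerNight ~ 0 + eliteStatus, data = hotel)
means <- coef(fit_means)
ci <- confint(fit_means)

# Plot group means with their confidence intervals
x <- seq_along(means)
plot(x, means, ylim = range(ci), xaxt = "n", pch = 19,
     xlab = "Elite status", ylab = "Mean room spend per night")
axis(1, at = x, labels = levels(hotel$eliteStatus))
arrows(x, ci[, 1], x, ci[, 2], angle = 90, code = 3, length = 0.05)
```

Intervals that do not overlap suggest the corresponding groups differ, although a formal pairwise comparison (e.g. TukeyHSD(fit)) is the more careful check.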
Please download the gapminder dataset by following the steps below:
> library(dplyr)
> install.packages('dslabs')
> library(dslabs)
> data(gapminder)
Q13. (a) Please run str() to check the variables and their data types. Are there any missing values in the integer/numerical variables? (b) Run normality checks on $ infant_mortality, $ life_expectancy, and $ fertility. Are they normally distributed? If not, please transform these variables so that their distributions are closer to normal.
Q14.
(a) Please look at the relationship between $ life_expectancy and $ fertility by using an xyplot. Please also add an abline (regression line). What do you see?
(b) Repeat (a), this time conditioning on continent, and check the relationship between $ life_expectancy and $ fertility per continent. What do you see now?
(c) Are there any statistically significant differences in fertility among continents? [PS: make sure you use the transformed version of the fertility variable if it is not normally distributed, because statistical significance tests are sensitive to departures from normality.] Please show both the statistical results and a plot.
(d) Are there any statistically significant differences in life expectancy among continents? [PS: make sure you use the transformed version of the life expectancy variable if it is not normally distributed, because statistical significance tests are sensitive to departures from normality.] Please show both the statistical results and a plot.
When you bring everything together, what do you see?
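For the xyplot parts of Q14, the sketch below shows one way to draw a scatter plot with a fitted line in lattice, first overall and then per continent. It uses simulated data with a made-up negative relationship rather than the gapminder data, so it illustrates the plotting pattern only, not the expected findings.

```r
library(lattice)

set.seed(14)
n <- 150
# Simulated stand-in for gapminder; the relationship below is an assumption
gm <- data.frame(
  continent = factor(sample(c("Africa", "Asia", "Europe"), n, replace = TRUE)),
  fertility = runif(n, min = 1, max = 7)
)
gm$life_expectancy <- 85 - 5 * gm$fertility + rnorm(n, sd = 4)

# type = c("p", "r") draws the points plus a least-squares regression line
p_all <- xyplot(life_expectancy ~ fertility, data = gm, type = c("p", "r"))

# Conditioning on continent gives one panel (and one line) per continent
p_by <- xyplot(life_expectancy ~ fertility | continent, data = gm,
               type = c("p", "r"))

print(p_all)
print(p_by)
```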