---
title: "genderreport"
author: "Jennifer Young"
date: "7/31/2021"
output: pdf_document
---
## Introduction ##
One important decision parents make is naming their children. In this study, we look at
popular names and gender-neutral names. A soon-to-be parent researching this
decision may want to consider data on a name to see how neutral it is considered to be.
Choosing a name that is given almost equally to both sexes can be
a goal for some parents. We consider several names that have been labeled gender neutral,
examine how they have been used for both biological sexes historically, and
use a model that predicts whether a name is considered male or female based on
its use in the US. The babynames and ssa datasets were used for the analysis, and three models (logistic regression, Random Forest, and Naive Bayes) were used to analyze the data.
```{r}
# Point R at a CRAN mirror so install.packages() works non-interactively
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://cran.r-project.org"
  options(repos = r)
})
install.packages("remotes") # if necessary
remotes::install_github("lmullen/gender")
install.packages("rTool") # or install through RStudio
install.packages("plyr", repos = "http://cran.us.r-project.org")
install.packages("babynames")
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("magrittr")
install.packages("devtools")
install.packages("tidyverse")
install.packages("caret")
install.packages("e1071")
install.packages("randomForest")
```
I had to install psych, naivebayes, gender, randomForest, tinytex and genderdata through RStudio instead.
```{r}
# plyr is attached first so that dplyr's verbs (summarize, mutate, ...)
# are not masked by the plyr functions of the same name
library(plyr)
library(tidyverse)
library(caret)
library(naivebayes)
library(psych)
library(gender)
library(tibble)
library(devtools)
library(babynames)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(magrittr)
library(e1071)
library(tinytex)
data(babynames)
head(babynames)
tail(babynames)
```
## Methods ##
Data visualization was used to look at specific names that baby name websites often describe as gender neutral. For each name, we graph its use for male and female babies to see how it has been used for either sex in a historical context.
The data are drawn from Social Security Administration records. A sample of names was taken from websites that list gender-neutral names, the kind of sites a prospective parent could find through a Google search.
From the analysis of each name, 7 names were chosen that seemed the most neutral based on the male and female trend lines in the charts.
Logistic regression, Random Forest and Naive Bayes were used to build models that classify names as male, female, or gender neutral.
# Finding out how many people were given name X in year X (sample) #
```{r}
entered_name <- "Charlie"
entered_year <- 2017
# Total babies (both sexes) given the entered name in the entered year
result <- babynames %>%
  filter(name == entered_name) %>%
  filter(year == entered_year) %>%
  dplyr::summarize(count = sum(n))
result
```
# Number of male and female names in dataset #
```{r}
# Split by sex, then count the distinct names within each group
babynames %$%
  split(., sex) %>%
  lapply(. %$% length(unique(name)))
```
# Gender Neutral Names by Sex from 1880-2017 #
Each chart shows the popularity of a name for both biological
sexes between 1880 and 2017. The names tested are a sample taken from a few popular
websites that list gender-neutral names, as that is likely
where expectant parents would look. Some examples are:
https://www.popsugar.com/family/Gender-Neutral-Baby-Names-34485564
https://www.mother.ly/child/top-50-gender-neutral-baby-names-youll-obsess-over-
The name Kaelin seems to be used by both sexes but has fallen in popularity.
```{r}
babynames %>%
  filter(name == "Kaelin") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Kaelin, by Sex")
```
Charlie is a diminutive of Charles and was traditionally given to males.
However, it has grown in popularity for both sexes.
```{r}
babynames %>%
  filter(name == "Charlie") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Charlie, by Sex")
```
Shane is a name that was traditionally given to males but has decreased in popularity.
```{r}
babynames %>%
  filter(name == "Shane") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Shane, by Sex")
```
Quinn is a name that has been used for both sexes but has grown in popularity for females.
```{r}
babynames %>%
  filter(name == "Quinn") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Quinn, by Sex")
```
Morgan is a name that has historically been used for both sexes but rose sharply among
females about 20 years ago. Its use for females has since fallen to meet male usage.
```{r}
babynames %>%
  filter(name == "Morgan") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Morgan, by Sex")
```
Finley has grown in usage for both sexes, but more so for females.
```{r}
babynames %>%
  filter(name == "Finley") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Finley, by Sex")
```
Leslie is a name that was historically used for both sexes, although its use for males
has decreased over the last 60 years. It was popular for females in the second half
of the last century. It has fallen in popularity overall.
```{r}
babynames %>%
  filter(name == "Leslie") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Leslie, by Sex")
```
Jessie is a name that was historically used for both sexes and has fallen in popularity.
```{r}
babynames %>%
  filter(name == "Jessie") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Jessie, by Sex")
```
Sidney is a name that was historically used for both sexes and has fallen in popularity
for both.
```{r}
babynames %>%
  filter(name == "Sidney") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Sidney, by Sex")
```
Skyler is a name that was historically used for both sexes and has risen in popularity
in the last two decades.
```{r}
babynames %>%
  filter(name == "Skyler") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Skyler, by Sex")
```
Clarke is a name that was historically used for males but has increased in
popularity for females.
```{r}
babynames %>%
  filter(name == "Clarke") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Clarke, by Sex")
```
Jackie is a name that was historically used for both sexes and has fallen in popularity for both.
```{r}
babynames %>%
  filter(name == "Jackie") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Jackie, by Sex")
```
Nicky is a name that was historically used for both sexes and has fallen in popularity for both.
```{r}
babynames %>%
  filter(name == "Nicky") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Nicky, by Sex")
```
Ashley is a name that has generally been given to females. The male character Ashley Wilkes in Gone With the Wind was an anomaly.
```{r}
babynames %>%
  filter(name == "Ashley") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Ashley, by Sex")
```
Oakley is the closest to gender neutral of the names in this analysis and is extremely popular.
```{r}
babynames %>%
  filter(name == "Oakley") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Oakley, by Sex")
```
Frankie is a name that was historically used for both sexes and is rising in popularity for females.
```{r}
babynames %>%
  filter(name == "Frankie") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Frankie, by Sex")
```
Justice is a name that has been used for both sexes and is a newer name compared
to many others.
```{r}
babynames %>%
  filter(name == "Justice") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Justice, by Sex")
```
Royal is a name that was historically used for males but has risen among females in the past decade.
```{r}
babynames %>%
  filter(name == "Royal") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Royal, by Sex")
```
What name has been the most popular over time for males? For females?
```{r}
# Most popular name per sex, ranked by median yearly proportion
babynames %>%
  group_by(sex, name) %>%
  dplyr::summarize(median_prop = median(prop)) %>%
  top_n(1, median_prop)

# Count the distinct names recorded in a given year
namesinyear <- function(myyear) {
  yearnames <- babynames %>%
    filter(year == myyear) %>%
    distinct(name)
  nrow(yearnames)
}

library(reshape2)
namescount <- c()
for (year in 1880:2017) {
  namescount <- c(namescount, namesinyear(year))
}
namescount <- as.data.frame(namescount)
# Label each row with its actual year (row names would only give 1, 2, 3, ...)
namescount$year <- 1880:2017
namescount <- melt(namescount, id.vars = "year")
```
This is the number of distinct names given each year in the US (1880-2017). The number is rising, which means later years contribute more names to the analysis.
```{r}
ggplot(namescount, aes(x = year, y = value, group = "variable")) +
  geom_line(alpha = 0.4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle(label = "Number of names in a given year") +
  geom_smooth(method = "loess")
```
We can look at the popular names and see how gender neutral they appear.
```{r}
babynames %>%
  filter(name == "Ava") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Ava, by Sex")
babynames %>%
  filter(name == "Liam") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Liam, by Sex")
babynames %>%
  filter(name == "Noah") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Noah, by Sex")
babynames %>%
  filter(name == "Olivia") %>%
  ggplot(aes(x = year, y = n)) +
  geom_line(aes(color = sex)) +
  labs(x = "Year", y = "Number Born", title = "Olivia, by Sex")
```
The most popular names in 2017 are not considered gender neutral. A parent who is concerned about gender neutrality would be unlikely to choose these names.
# Prediction of gender by name #
I used the gender package's "ssa" method (United States, 1930 to 2012), which is drawn from Social Security Administration data. The names are the sample taken from gender-neutral name websites and graphed earlier.
From that earlier analysis, I chose the 7 names that seemed the most neutral based on the male and female trend lines in the charts.
```{r}
head(gender)
ssa_names <- c("Charlie", "Royal", "Morgan", "Skyler",
               "Frankie", "Oakley", "Justice")
ssa_years <- c(rep(c(2009, 2012), 3), 2012)
ssa_df <- tibble(first_names = ssa_names,
                 last_names = LETTERS[1:7],
                 years = ssa_years,
                 min_years = ssa_years - 3,
                 max_years = ssa_years + 3)
ssa_df
```
This data frame pairs each first name with a year and adds columns
for a minimum and maximum year to give a possible birth-year range, since birth dates are not always exact. We pass it to the gender_df() function along with the method we wish to use and the names of the columns that contain the names and the birth years. The result is a tibble of predictions.
```{r}
results <- gender_df(ssa_df, name_col = "first_names", year_col = "years",
                     method = "ssa")
results
```
```{r}
# Attach the predictions to the original rows
ssa_df %>%
  left_join(results, by = c("first_names" = "name", "years" = "year_min"))
```
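Next, we use gender_df() to predict gender over a range of years by passing it the columns
that hold the minimum and maximum years for each name. The same predictions can also be built row by row, or year by year, by calling gender() directly.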
```{r}
# Predict over the min/max year range for each name
gender_df(ssa_df, name_col = "first_names",
          year_col = c("min_years", "max_years"), method = "ssa")
# Row by row: one gender() call per distinct name/year pair
ssa_df %>%
  distinct(first_names, years) %>%
  rowwise() %>%
  do(results = gender(.$first_names, years = .$years, method = "ssa")) %>%
  do(bind_rows(.$results))
# Year by year: one gender() call per distinct year
ssa_df %>%
  distinct(first_names, years) %>%
  group_by(years) %>%
  do(results = gender(.$first_names, years = .$years[1], method = "ssa")) %>%
  do(bind_rows(.$results))
```
# Logistic Regression Model #
```{r}
neutral_names <- babynames %>%
  select(-prop) %>%
  # filter only names between years 1930 and 2012
  filter(year >= 1930, year <= 2012) %>%
  # get the number of female and male babies for each name per year
  spread(key = sex, value = n, fill = 0) %>%
  # calculate the measure of gender neutrality
  mutate(prop_F = 100 * F / (F + M), se = (50 - prop_F)^2) %>%
  group_by(name) %>%
  # per name, find the total number of babies and the measure of gender neutrality
  dplyr::summarise(n = n(), female = sum(F), male = sum(M), total = sum(F + M),
                   mse = mean(se)) %>%
  # take only names that occur in every year and more than 9000 times in total
  filter(n == 83, total > 9000) %>%
  # sort by gender neutrality
  arrange(mse) %>%
  # get only the top 10
  head(10)
neutral_names
```
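The neutrality measure in the chunk above can be checked by hand on a small example. The following is a minimal sketch in base R, assuming two hypothetical names (`NameA`, `NameB`) with made-up yearly counts; none of these figures come from the SSA data. It computes the yearly female percentage, its squared distance from an even 50/50 split, and the per-name average.

```r
# Toy yearly counts for two hypothetical names (not real SSA figures).
toy <- data.frame(
  name   = rep(c("NameA", "NameB"), each = 3),
  year   = rep(2010:2012, times = 2),
  female = c(50, 55, 45, 95, 90, 98),
  male   = c(50, 45, 55,  5, 10,  2)
)

# Percentage of babies with the name who were female, per year.
toy$prop_F <- 100 * toy$female / (toy$female + toy$male)

# Squared distance from a perfect 50/50 split, per year.
toy$se <- (50 - toy$prop_F)^2

# Average over years: smaller values mean closer to gender neutral.
mse_by_name <- tapply(toy$se, toy$name, mean)
print(mse_by_name)
```

The measure ranges from 0 (an even split in every year) to 2500 (a name given to one sex only), which is worth keeping in mind when choosing a cutoff for "neutral".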
# Random Forest Classification #
```{r}
library(randomForest)
neutral_names <- babynames %>%
  select(-prop) %>%
  # Filter only names between years 1930 and 2012
  filter(year >= 1930, year <= 2012) %>%
  # Get the number of female and male babies for each name per year
  spread(key = sex, value = n, fill = 0) %>%
  # Calculate the measure of gender neutrality
  mutate(prop_F = 100 * F / (F + M), se = (50 - prop_F)^2) %>%
  group_by(name) %>%
  # Find the total number of babies and measure of gender neutrality per name
  dplyr::summarise(n = n(), female = sum(F), male = sum(M), total = sum(F + M),
                   mse = mean(se)) %>%
  # Take only names that occur in every year and more than 9000 times in total
  filter(n == 83, total > 9000) %>%
  # Sort by gender neutrality
  arrange(mse) %>%
  # Add a variable to flag gender-neutral names. Assumes an mse <= 2000
  mutate(isNeutral = ifelse(mse <= 2000, 1, 0))
neutral_names$isNeutral <- as.factor(neutral_names$isNeutral)
# 70/30 split into training and validation sets
set.seed(100)
train <- sample(nrow(neutral_names), 0.7 * nrow(neutral_names), replace = FALSE)
TrainSet <- neutral_names[train, ]
ValidSet <- neutral_names[-train, ]
summary(TrainSet)
summary(ValidSet)
model1 <- randomForest(isNeutral ~ ., data = TrainSet, importance = TRUE)
model1
# Predict on the training data
predTrain <- predict(model1, TrainSet, type = "class")
caret::confusionMatrix(predTrain, TrainSet$isNeutral)
```
Training data accuracy is 100%, which indicates that all the values were classified correctly.
Next, we predict on the validation data.
```{r}
predTest <- predict(model1, ValidSet, type = "class")
caret::confusionMatrix(predTest, ValidSet$isNeutral)
```
Validation data accuracy is 100%, which indicates that all the values were classified correctly.
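One thing to keep in mind when reading these accuracies: in the chunk above, `isNeutral` is defined as `mse <= 2000`, and `mse` is also one of the predictors passed to the model through `isNeutral ~ .`. The sketch below, on simulated `mse` values (an assumption, not the report's data), shows that a single learned cut point on `mse` already recovers such a label almost perfectly on held-out rows. The names `toy`, `train_set`, `valid_set` and `cut_point` are hypothetical.

```r
set.seed(1)

# Simulated mse values spread over the measure's possible range (0 to 2500).
toy <- data.frame(mse = runif(500, min = 0, max = 2500))

# Label defined the same way as in the report: 1 when mse <= 2000, else 0.
toy$isNeutral <- ifelse(toy$mse <= 2000, 1, 0)

# 70/30 split into training and validation rows.
idx <- sample(nrow(toy), 0.7 * nrow(toy))
train_set <- toy[idx, ]
valid_set <- toy[-idx, ]

# "Learn" one cut point: midway between the largest mse labelled 1
# and the smallest mse labelled 0 in the training rows.
cut_point <- mean(c(max(train_set$mse[train_set$isNeutral == 1]),
                    min(train_set$mse[train_set$isNeutral == 0])))

# Apply the cut point to the held-out rows and measure accuracy in percent.
valid_pred <- ifelse(valid_set$mse <= cut_point, 1, 0)
valid_accuracy <- mean(valid_pred == valid_set$isNeutral) * 100
print(cut_point)
print(valid_accuracy)
```

A stricter test of the classifiers would leave `mse` (and the columns it is computed from) out of the predictors, so the model has to work from other information.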
# Naive Bayes Classification #
We now compare Random Forest model 1 with a Naive Bayes model, starting with Naive Bayes predictions on the training data.
```{r}
model <- naive_bayes(isNeutral ~ ., data = TrainSet, usekernel = TRUE)
model
plot(model)
# Posterior class probabilities for the training data
p <- predict(model, TrainSet, type = "prob")
head(cbind(p, TrainSet))
```
Confusion matrix for the training data, followed by the misclassification error and the model accuracy:
```{r}
p1 <- predict(model, TrainSet)
(tab1 <- table(p1, TrainSet$isNeutral))
# Percentage of off-diagonal (misclassified) cases
miscalc <- (1 - sum(diag(tab1)) / sum(tab1)) * 100
accuracy <- (100 - miscalc)
accuracy
```
The model has an accuracy of 99.90357% on the training data for the correct classification of gender-neutral names.
# Results #
Logistic regression can be used to predict gender from a name, and Random Forest classification and Naive Bayes can be used to determine whether a name is gender neutral, with close to 100% and over 99% accuracy, respectively. These methods are effective in determining whether a name is considered gender neutral based on its historical usage between genders. The results indicate that these classification methods are highly accurate.
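As an illustration of the logistic-regression idea, here is a minimal sketch using base R's `glm()`. It is not the model fitted in this report: the yearly counts are hypothetical, and modelling the female share of one name as a function of year is an assumption made for the example.

```r
# Hypothetical yearly counts for one name (illustrative, not SSA data).
toy <- data.frame(
  years_since_1990 = 0:9,
  female = c(20, 25, 30, 38, 45, 55, 60, 70, 78, 85),
  male   = c(80, 75, 70, 62, 55, 45, 40, 30, 22, 15)
)

# Binomial GLM: log-odds that a baby with this name is female,
# modelled as a linear function of the year.
fit <- glm(cbind(female, male) ~ years_since_1990,
           family = binomial, data = toy)

# Estimated probability that a baby given the name in 1995 is female.
p_female_1995 <- predict(fit,
                         newdata = data.frame(years_since_1990 = 5),
                         type = "response")

# Classify the name as "female" when the estimated probability exceeds 0.5.
predicted_label <- ifelse(p_female_1995 > 0.5, "female", "male")
print(coef(fit))
print(p_female_1995)
```

An estimated probability close to 0.5 would suggest the name is used about equally for both sexes in that year.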
# Conclusion #
The results give, for each name, the proportion of each biological sex given that name and a prediction of whether the name is generally considered male or female. Using this data, a prospective parent can consider how a name is viewed regarding gender neutrality, based on statistical data from the SSA dataset. A limitation of the dataset is that it only has data up to 2017 and is not current.