Decision Tree

What is a Decision Tree?

A decision tree is a supervised learning model that predicts a target variable by repeatedly splitting the data on the feature values that best separate the classes. Each internal node tests a feature, each branch is an outcome of that test, and each leaf is a predicted class, which makes the model easy to visualize and interpret.

Decision Tree in R

Preparing the Data: The Youth Risk Behavior Surveillance System (YRBSS) data is used for this analysis. The data contains qualitative and quantitative variables, including age, gender, race, sexual orientation, whether one carries a weapon, whether one gets into physical fights, feelings of sadness, whether one has considered attempting suicide, whether one has made a plan to attempt suicide, and more.

Figure 1: Youth Risk Behavior Surveillance System (YRBSS) Data Used for Decision Tree
Image

Link to dataset: dataset.
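As a rough sketch of this preparation step, the variables above might be pulled into one analysis table as follows. The file name and column names here are hypothetical stand-ins for the actual cleaned data, which is described on the data-cleaning page.

```python
import pandas as pd

# Load the cleaned YRBSS extract (hypothetical file and column names).
yrbss = pd.read_csv("yrbss_cleaned.csv")

# Keep the demographic and risk-behavior variables described above.
cols = [
    "age", "gender", "race", "sexual_orientation",
    "carried_weapon", "physical_fight", "felt_sad",
    "considered_suicide", "planned_suicide",
]
yrbss = yrbss[cols].dropna()
print(yrbss.head())
```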


Creating Decision Trees

Once the final cleaned dataset was created, the decision trees could be built. The code used to create the decision trees can be found here: Code for creating decision trees.

Figure 2a: Confusion Matrix 1

Image

Figure 2b: Decision Tree 1

Image

Using the variable "considered attempting suicide" as the target, this decision tree looked at how racial, age, and sexual identities interact with whether one has considered suicide. Based on the decision tree above, we can see that a large portion (89%) of those who identified as heterosexual or "not sure" had not considered attempting suicide. When we look at youth who identified as bisexual, gay, or lesbian, race was an important factor in whether they had considered attempting suicide. Specifically, those who are Black and/or Hispanic were more likely to have considered attempting suicide.

Figure 3a: Confusion Matrix 2

Image

Figure 3b: Decision Tree 2

Image

Figure 4a: Confusion Matrix 3

Image

Figure 4b: Decision Tree 3

Image

Also using the variable "considered attempting suicide" as the target, this decision tree looked at how drug and alcohol use and race interact with whether one has considered suicide. The decision tree above shows that whether someone reports feelings of sadness is a strong indicator of whether they have considered attempting suicide. Among those who did not report sadness, the use of hallucinogenic drugs is an important factor in whether one has considered attempting suicide. Among those who have not used hallucinogenic drugs, marijuana use is an important indicator. This decision tree provides some evidence for my hypothesis that drug and alcohol use is an important indicator of suicide risk.

Performing Decision Trees in Python

Data from the News API was used to create the decision trees.
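As a brief sketch of this gathering step, articles can be pulled from the News API as shown below. The query topics and the API key are placeholders, not necessarily those used in this project.

```python
import requests

# Query the News API (newsapi.org) for articles on each topic.
API_KEY = "YOUR_NEWS_API_KEY"  # placeholder
topics = ["immigrant youth", "black youth", "lgbtq youth"]  # illustrative topics

rows = []
for topic in topics:
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": topic, "language": "en", "apiKey": API_KEY},
    )
    for article in resp.json().get("articles", []):
        # Keep the article text along with the topic it was retrieved under.
        rows.append({"topic": topic, "text": article.get("description") or ""})
```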

Figure 5: Text Data Used for Decision Trees
Image

Figure 6a: Confusion Matrix 1
Image
Figure 6b: Decision Tree 1
Image

The first decision tree was created by adjusting two parameters of the sklearn.tree.DecisionTreeClassifier function: criterion and splitter. Specifically, criterion was set to "entropy" and splitter was set to "best", so each node is split on the feature that yields the best information gain. We can estimate the quality of a split by its entropy, where lower entropy indicates a purer node. This model chose the word "immigrant youth" as its root node. It was only 54.2% accurate, yet that was the highest accuracy of the three models run.
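A minimal sketch of this setup is below, assuming the text was turned into word counts with a CountVectorizer; the sample texts and labels are placeholders, since only the criterion and splitter settings are stated above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder corpus: one News API article per row, with its topic label.
texts = ["article text one", "article text two", "article text three", "article text four"]
labels = ["immigrant youth", "black youth", "immigrant youth", "black youth"]

# Turn the raw text into a document-term matrix of word counts.
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)

# Tree 1: at each node, use the split with the best information gain (entropy).
tree1 = DecisionTreeClassifier(criterion="entropy", splitter="best")
tree1.fit(X_train, y_train)

pred = tree1.predict(X_test)
print(accuracy_score(y_test, pred))    # 54.2% for the model described above
print(confusion_matrix(y_test, pred))  # compare Figure 6a
```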

Figure 7a: Confusion Matrix 2
Image
Figure 7b: Decision Tree 2
Image

The second decision tree was created the same way, but with criterion set to "gini" and splitter set to "best", so each node is split on the feature with the lowest Gini impurity. As with entropy, a lower Gini impurity indicates a purer node. This model chose the word "black youth" as its root node. It had only 4.16% accuracy, the lowest of the three models run.
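Continuing the placeholder sketch above (same train/test split), the second model changes only the criterion:

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Tree 2: at each node, use the split with the lowest Gini impurity.
# (X_train, y_train, X_test, y_test come from the sketch after Figure 6b.)
tree2 = DecisionTreeClassifier(criterion="gini", splitter="best")
tree2.fit(X_train, y_train)
print(accuracy_score(y_test, tree2.predict(X_test)))  # 4.16% for the model above
```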

Figure 8a: Confusion Matrix 3
Image
Figure 8b: Decision Tree 3
Image
The third decision tree again used criterion "gini", but with splitter set to "random", so each node is split on the best of a set of randomly drawn candidate splits rather than on an exhaustive search over all splits. This model chose the word "lgbtq youth" as its root node and was only 34.4% accurate.
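The third variant, continuing the same placeholder sketch, draws its candidate splits at random:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Tree 3: Gini impurity, but each node uses the best of randomly drawn
# candidate splits instead of an exhaustive search over thresholds.
tree3 = DecisionTreeClassifier(criterion="gini", splitter="random", random_state=0)
tree3.fit(X_train, y_train)
print(accuracy_score(y_test, tree3.predict(X_test)))  # 34.4% for the model above

# Render the fitted tree, as in Figure 8b.
plot_tree(tree3, filled=True)
plt.show()
```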

Conclusion

Quote of the day: "If you torture data long enough, it will tell you whatever you want to hear."