Decision Tree

What is a Decision Tree?

A decision tree is a supervised learning model that predicts a target variable by repeatedly splitting the data on the feature values that best separate the classes. Each internal node tests a feature, each branch is an outcome of that test, and each leaf is a predicted class, which makes the model easy to visualize and interpret.

Decision Tree in R

Preparing the Data: The Youth Risk Behavior Surveillance System (YRBSS) data is used for this analysis. The data contains qualitative and quantitative variables, including age, gender, race, sexual orientation, whether one carries a weapon, whether one gets into physical fights, feelings of sadness, whether one has considered attempting suicide, whether one has made a plan to attempt suicide, and more.

Figure 1: Youth Risk Behavior Surveillance System (YRBSS) Data Used for Decision Tree
Image

Link to dataset: dataset.
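As a rough sketch of this preparation step, the variables above might be pulled into one analysis table as follows. The file name and column names here are hypothetical stand-ins for the actual cleaned data, which is described on the data-cleaning page.

```python
import pandas as pd

# Load the cleaned YRBSS extract (hypothetical file and column names).
yrbss = pd.read_csv("yrbss_cleaned.csv")

# Keep the demographic and risk-behavior variables described above.
cols = [
    "age", "gender", "race", "sexual_orientation",
    "carried_weapon", "physical_fight", "felt_sad",
    "considered_suicide", "planned_suicide",
]
yrbss = yrbss[cols].dropna()
print(yrbss.head())
```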


Creating Decision Trees

Once the final cleaned dataset was created, the decision trees could be built. The code used to create the decision trees can be found here: Code for creating decision trees.

Figure 2a: Confusion Matrix 1

Image

Figure 2b: Decision Tree 1

Image

Using the variable "considered attempting suicide" as the target, this decision tree looked at how racial, age, and sexual identities interact with whether one has considered suicide. Based on the decision tree above, we can see that a large portion (89%) of those who identified as heterosexual or "not sure" had not considered attempting suicide. When we look at youth who identified as bisexual, gay, or lesbian, race was an important factor in whether they had considered attempting suicide. Specifically, those who are Black and/or Hispanic were more likely to have considered attempting suicide.

Figure 3a: Confusion Matrix 2

Image

Figure 3b: Decision Tree 2

Image

Figure 4a: Confusion Matrix 3

Image

Figure 4b: Decision Tree 3

Image

Also using the variable "considered attempting suicide" as the target, this decision tree looked at how drug and alcohol use and race interact with whether one has considered suicide. The decision tree above shows that whether someone reports feelings of sadness is a strong indicator of whether they have considered attempting suicide. Among those who did not report sadness, the use of hallucinogenic drugs is an important factor in whether one has considered attempting suicide. Among those who have not used hallucinogenic drugs, marijuana use is an important indicator. This decision tree provides some evidence for my hypothesis that drug and alcohol use is an important indicator of suicide risk.

Performing Decision Trees in Python

Data from the News API was used to create the decision trees.
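As a brief sketch of this gathering step, articles can be pulled from the News API as shown below. The query topics and the API key are placeholders, not necessarily those used in this project.

```python
import requests

# Query the News API (newsapi.org) for articles on each topic.
API_KEY = "YOUR_NEWS_API_KEY"  # placeholder
topics = ["immigrant youth", "black youth", "lgbtq youth"]  # illustrative topics

rows = []
for topic in topics:
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": topic, "language": "en", "apiKey": API_KEY},
    )
    for article in resp.json().get("articles", []):
        # Keep the article text along with the topic it was retrieved under.
        rows.append({"topic": topic, "text": article.get("description") or ""})
```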

Figure 5: Text Data Used for Decision Trees
Image

Figure 6a: Confusion Matrix 1
Image
Figure 6b: Decision Tree 1
Image

The first decision tree was created by adjusting two parameters of the sklearn.tree.DecisionTreeClassifier function: criterion and splitter. Specifically, criterion was set to "entropy" and splitter was set to "best", so each node is split on the feature that yields the best information gain. We can estimate the quality of a split by its entropy, where lower entropy indicates a purer node. This model chose the word "immigrant youth" as its root node. It was only 54.2% accurate, yet that was the highest accuracy of the three models run.
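A minimal sketch of this setup is below, assuming the text was turned into word counts with a CountVectorizer; the sample texts and labels are placeholders, since only the criterion and splitter settings are stated above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder corpus: one News API article per row, with its topic label.
texts = ["article text one", "article text two", "article text three", "article text four"]
labels = ["immigrant youth", "black youth", "immigrant youth", "black youth"]

# Turn the raw text into a document-term matrix of word counts.
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)

# Tree 1: at each node, use the split with the best information gain (entropy).
tree1 = DecisionTreeClassifier(criterion="entropy", splitter="best")
tree1.fit(X_train, y_train)

pred = tree1.predict(X_test)
print(accuracy_score(y_test, pred))    # 54.2% for the model described above
print(confusion_matrix(y_test, pred))  # compare Figure 6a
```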

Figure 7a: Confusion Matrix 2
Image
Figure 7b: Decision Tree 2
Image

The second decision tree was created the same way, but with criterion set to "gini" and splitter set to "best", so each node is split on the feature with the lowest Gini impurity. As with entropy, a lower Gini impurity indicates a purer node. This model chose the word "black youth" as its root node. It had only 4.16% accuracy, the lowest of the three models run.
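Continuing the placeholder sketch above (same train/test split), the second model changes only the criterion:

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Tree 2: at each node, use the split with the lowest Gini impurity.
# (X_train, y_train, X_test, y_test come from the sketch after Figure 6b.)
tree2 = DecisionTreeClassifier(criterion="gini", splitter="best")
tree2.fit(X_train, y_train)
print(accuracy_score(y_test, tree2.predict(X_test)))  # 4.16% for the model above
```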

Figure 8a: Confusion Matrix 3
Image
Figure 8b: Decision Tree 3
Image
The third decision tree again used criterion "gini", but with splitter set to "random", so each node is split on the best of a set of randomly drawn candidate splits rather than on an exhaustive search over all splits. This model chose the word "lgbtq youth" as its root node and was only 34.4% accurate.
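The third variant, continuing the same placeholder sketch, draws its candidate splits at random:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Tree 3: Gini impurity, but each node uses the best of randomly drawn
# candidate splits instead of an exhaustive search over thresholds.
tree3 = DecisionTreeClassifier(criterion="gini", splitter="random", random_state=0)
tree3.fit(X_train, y_train)
print(accuracy_score(y_test, tree3.predict(X_test)))  # 34.4% for the model above

# Render the fitted tree, as in Figure 8b.
plot_tree(tree3, filled=True)
plt.show()
```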

Conclusion

Quote of the day: "If you torture data long enough, it will tell you whatever you want to hear."