
Association Rule Mining and Networking

What is Association Rule Mining?

Association Rule Mining (ARM) is a machine learning method used to discover relationships between variables in a transaction dataset. Each row is a single observation made up of categorical variables, and ARM evaluates which of those variables are most likely to occur together. In this analysis, ARM is performed with the Apriori algorithm, which reveals general trends in the dataset based on three measures: support, confidence, and lift. The analysis looks for relationships between categorical variables based on survey responses from 10,000 individuals. In particular, it asks which variables are associated with suicide attempts among teens.

Association Rule Mining in R

Preparing Transaction Data: The YRBSS data is used for this analysis. The data contains demographic variables, including gender, race, and sexual orientation. It also has categorical variables that record whether a respondent carries a weapon, gets in physical fights, has felt sad, has considered attempting suicide, has made plans to attempt suicide, and more.
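As a rough illustration of what "transaction" format means here, the sketch below turns survey rows into sets of "column=value" items. It is plain Python rather than the R code used for this analysis, and the column names and values are hypothetical stand-ins, not the actual YRBSS fields.

```python
# Hypothetical survey rows; the column names and values are illustrative only.
rows = [
    {"gender": "female", "carried_weapon": "no", "felt_sad": "yes"},
    {"gender": "male", "carried_weapon": "yes", "felt_sad": None},
]

def row_to_transaction(row):
    """Turn one survey row into a set of 'column=value' items, skipping missing answers."""
    return {f"{col}={val}" for col, val in row.items() if val is not None}

# Each transaction is the set of categorical items observed for one respondent.
transactions = [row_to_transaction(r) for r in rows]

for t in transactions:
    print(sorted(t))
```

Encoding each answer as "column=value" keeps items from different questions distinct, so a "yes" to one question is never confused with a "yes" to another.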

Figure 1: Youth Risk Behavior Surveillance System (YRBSS) Data Used for ARM and Networking
Image

Link to dataset: transaction dataset.


Performing Association Rule Mining

Once the final "transaction" dataset was created, ARM could be performed. To mine the rules, the following parameters were specified for the Apriori algorithm: minimum support = 0.001, minimum confidence = 0.01, maximum rule length = 10, and minimum rule length = 2. With these parameters, the algorithm found 4,886 rules in the dataset. The code used to perform ARM can be found here: Code for performing ARM.
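To make the role of these four thresholds concrete, here is a minimal sketch in plain Python. It is not the R code linked above, and it uses brute-force enumeration instead of Apriori's level-wise pruning, so it only suits tiny inputs. The function name `mine_rules`, the toy transactions, and the thresholds in the usage lines are all assumptions for illustration. Rules are restricted to a single item on the right-hand side, and rule length counts items on both sides.

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence, minlen=2, maxlen=10):
    """Brute-force rule mining: enumerate itemsets, keep those meeting the thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    # Rule length = number of items on the left-hand side plus the right-hand side.
    for size in range(minlen, min(maxlen, len(items)) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            supp = support(itemset)
            if supp == 0 or supp < min_support:
                continue
            # Try each item as the single right-hand-side item.
            for rhs in combo:
                lhs = itemset - {rhs}
                conf = supp / support(lhs)
                if conf >= min_confidence:
                    rules.append((tuple(sorted(lhs)), rhs, supp, conf))
    return rules

# Hypothetical toy transactions, not the survey data.
toy = [{"sad", "fight", "alcohol"}, {"sad", "fight"}, {"sad"}, {"fight"}, {"alcohol"}]
rules = mine_rules(toy, min_support=0.3, min_confidence=0.6)
for lhs, rhs, supp, conf in rules:
    print(lhs, "->", rhs, round(supp, 2), round(conf, 2))
```

Very low thresholds such as support = 0.001 and confidence = 0.01 let almost every co-occurring itemset through, which is consistent with the analysis producing several thousand rules.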

Figure 2: Top Rules for Confidence
Image

Confidence shows how often item B occurs in the transactions that contain item A. It is calculated by dividing the number of transactions containing both A and B by the number of transactions containing A. Confidence(A → B) = Freq(A, B) / Freq(A)
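The confidence formula can be checked by hand on a small example. The sketch below is plain Python on hypothetical toy transactions (not the survey data); the helper names `freq` and `confidence` are assumptions for illustration.

```python
# Hypothetical toy transactions; item names are illustrative only.
transactions = [
    {"sad", "fight", "alcohol"},
    {"sad", "fight"},
    {"sad"},
    {"fight"},
    {"alcohol"},
]

def freq(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def confidence(a, b):
    """Confidence(A -> B) = Freq(A, B) / Freq(A)."""
    return freq(a | b) / freq(a)

# "sad" appears in 3 transactions, and 2 of those also contain "fight".
conf_sad_fight = confidence({"sad"}, {"fight"})
print(round(conf_sad_fight, 3))
```

A confidence of 1 would mean that every transaction containing A also contains B. Here only two of the three transactions with "sad" contain "fight", so the value is 2/3.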

The top rules here have a confidence of 1, which means that in this dataset the right-hand side appears in every transaction that contains the left-hand side. Some of the most interesting findings include a strong association between seriously considering suicide, alcohol consumption, and frequently getting in physical fights. This finding could help identify behavior patterns that are associated with self-harm.

Figure 3: Top Rules for Support
Image

Support looks at how frequently an item or set of items appears in a dataset. It is calculated by dividing the number of transactions that contain the item by the total number of transactions, T. Supp(A) = Freq(A) / T
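As a quick worked example of the support formula, the following plain-Python sketch computes support for a single item and for a pair of items. The toy transactions and the helper name `support` are hypothetical and are not taken from the analysis code.

```python
# Hypothetical toy transactions; item names are illustrative only.
transactions = [
    {"sad", "fight", "alcohol"},
    {"sad", "fight"},
    {"sad"},
    {"fight"},
    {"alcohol"},
]

def support(items):
    """Supp(A) = Freq(A) / T, where T is the total number of transactions."""
    freq = sum(1 for t in transactions if items <= t)
    return freq / len(transactions)

supp_sad = support({"sad"})                 # 3 of 5 transactions
supp_sad_fight = support({"sad", "fight"})  # 2 of 5 transactions
print(supp_sad, supp_sad_fight)
```

Because support only counts how common an itemset is, answers that most respondents give tend to dominate the top-support rules regardless of how strongly they are related.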

The most interesting finding from the top rules for support shown in Figure 3 is how frequently respondents who don't report using hallucinogenic drugs also answered "none" regarding thoughts and plans of suicide. The same was true of those who never missed school because of safety concerns and those who never planned to attempt suicide.

Figure 4: Top Rules for Lift
Image

Lift measures the strength of a rule by comparing how often A and B occur together with how often they would be expected to occur together if they were independent. We can calculate lift using the following equation. Lift = Supp(A, B) / (Supp(A) * Supp(B))

The lift values of the top rules are quite large. As shown in Figure 4, many of the items in the dataset were subcategories of other items, so if one occurred, the other also had to appear.
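The effect of nested items on lift can be reproduced on a toy example. In the plain-Python sketch below, the hypothetical item "subcategory" never appears without "category", and both are rare. The data and helper names are assumptions for illustration, not the survey data.

```python
# Hypothetical toy data: "subcategory" never appears without "category".
transactions = [{"category", "subcategory"}] + [{"other"} for _ in range(9)]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t) / len(transactions)

def lift(a, b):
    """Lift(A -> B) = Supp(A, B) / (Supp(A) * Supp(B))."""
    return support(a | b) / (support(a) * support(b))

nested_lift = lift({"category"}, {"subcategory"})
print(round(nested_lift, 2))
```

When two items always occur together, Supp(A, B) = Supp(A) = Supp(B), so lift reduces to 1 / Supp(A). Here that is 1 / 0.1 = 10, and rarer nested items would give even larger values. A lift near 1 would instead indicate that the items co-occur about as often as expected by chance.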

Visualizing the Networks

Network visualization is used to show complex relationships between many variables. It can display both undirected and directed graph structures. Variables are drawn as round nodes, and the lines between them show their relationships. A vivid display of network nodes can highlight non-trivial data discrepancies that might otherwise be overlooked.
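One way to picture how rules become a network is to treat each rule as directed edges from its left-hand-side items to its right-hand-side item. The sketch below is plain Python and only builds the node and edge structure; it does not draw anything and is not the R code used for the figures. The rules and lift values in it are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rules as (left-hand side, right-hand side, lift); values are invented.
rules = [
    (("sad",), "fight", 1.11),
    (("fight",), "sad", 1.11),
    (("sad", "fight"), "alcohol", 1.25),
]

# Directed graph: graph[source][target] = edge weight (the largest lift seen).
graph = defaultdict(dict)
for lhs, rhs, lift in rules:
    for item in lhs:
        graph[item][rhs] = max(lift, graph[item].get(rhs, 0.0))

# Every item that appears on either side of a rule becomes a node.
nodes = set(graph) | {target for targets in graph.values() for target in targets}

for source, targets in graph.items():
    for target, weight in targets.items():
        print(f"{source} -> {target} (lift {weight})")
```

Using lift as the edge weight is one possible choice. Confidence or support could be used instead, depending on which aspect of the rules the plot should emphasize.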

R code to create the following network visualizations.

Figure 5:
Image

Figure 6:
Image

Figure 7:
Image

Figure 8: Interactive Network Viz

Conclusion

This analysis used association rule mining to discover relationships between variables in the YRBSS transaction dataset. The Apriori algorithm was run in R with the following parameters: minimum support = 0.001, minimum confidence = 0.01, maximum rule length = 10, and minimum rule length = 2. The analysis provided the top 15 rules for confidence, support, and lift. The results provided interesting insights. Specifically, the top rules for confidence and lift provided me with the most insight into the patterns most associated with the likelihood of someone considering self-harm. For example, confidence rules 9 through 14 tell us that consuming alcohol daily, use of marijuana, feelings of sadness, and getting in physical fights are associated with self-harm. These findings seem very intuitive and align with the ongoing conversation regarding mental health.

Quote of the day: "If you torture data long enough, it will tell you whatever you want to hear."