Can We Prevent Drug Use in Youth with Data?

Sabrina Lxn
9 min read · Feb 28, 2021

When you see someone addicted to drugs, have you ever wondered when it all began? How did an innocent child grow up to become addicted to drugs? Because drugs affect the brain’s association with pleasure, the brain craves the euphoria it gets from the drug, which makes an addiction very difficult to break. A recovered person can also relapse when facing tough problems in life, making it a lifelong battle. For many, drug abuse started in their teens. If only we could identify these people at an early stage and provide help and intervention, they might be able to escape addiction. Therefore, for my final project at the Metis Data Science Bootcamp, I imagined myself as a data scientist working for Partnership to End Addiction in the US (https://drugfree.org/). I hope to use data on youth behaviour to discover the factors that lead to drug use and to prevent these youths from becoming addicted to drugs.

Data collection

I found data on youth behaviour in the Youth Risk Behavior Survey (YRBS) conducted by the US Centers for Disease Control and Prevention (CDC). If you are interested in this project too, you can find the data source here: https://www.cdc.gov/healthyyouth/data/yrbs/data.htm

Since 1991, the CDC has conducted a biennial survey asking youths about health-risk and health-protective behaviours such as smoking, drinking, drug use, diet, and physical activity. The survey consists of 99 multiple-choice questions answered anonymously by high school students. For this project, we will look at the latest survey, conducted in 2019.

For a better understanding of our data, here are some sample survey questions:

Question with Categorical answer

Question with Ordinal answers

More details can be found in the 2019 YRBS Data User’s Guide.

But with so many questions in the survey, which one should I choose as the target for my prediction?

Identifying The Target

I began by researching how drug addiction starts and discovered that marijuana, also known as ‘weed’, has been regarded as the “gateway drug”. It is often the first drug people try, which can then lead to the use of more dangerous drugs such as cocaine and heroin.

Although the majority of people who use marijuana do not go on to use other “harder” substances, and it has been legalized in some states, using weed can have negative effects on a young, developing brain, resulting in difficulty focusing and poor memory, which affect the ability to learn and cope in school.

Question 46 of the 2019 YRBS survey

Therefore, I framed this as a binary classification problem: predict 0 if the student has never tried weed and 1 if the student has tried weed.

I used Qn 46 as my target and labelled the answers as follows:

‘A’ as 0 — Not Tried Weed

All other answers (‘B’ to ‘G’) as 1 — Tried Weed.
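The labelling rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `label_tried_weed` is hypothetical, and it assumes the raw Qn 46 answers are stored as the letters 'A' to 'G', with unanswered questions left as `None` or blank.

```python
# Hypothetical helper: map a raw Qn 46 answer letter to the binary target.
def label_tried_weed(answer):
    """Return 0 for 'A' (never tried), 1 for 'B' to 'G', None if unanswered."""
    if answer is None:
        return None
    answer = answer.strip().upper()
    if answer == "A":
        return 0
    if answer in set("BCDEFG"):
        return 1
    return None  # blank or unexpected value: leave the target missing


answers = ["A", "C", "G", None, "a"]
labels = [label_tried_weed(a) for a in answers]
print(labels)  # [0, 1, 1, None, 0]
```

Rows whose target comes back as `None` would need to be dropped before modelling, since a missing target cannot be used for training or scoring.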

Although it is illegal for anyone below 21 years old in the US to use marijuana, a striking 37.7% of high school students have tried weed.

Let’s see if we can discover any relationship by looking at the students’ ages and school grades.

Qn1 — How old are you?

1) 12 years old or younger
2) 13 years old
3) 14 years old
4) 15 years old
5) 16 years old
6) 17 years old
7) 18 years old or older

The proportion of students who have tried weed increases as they grow older. By the age of 18, more than half of the students have already tried weed.
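For readers who want to reproduce this kind of breakdown, here is a minimal sketch of computing the share of students who tried weed within each answer group. The function name and the toy records are made up for illustration and are not YRBS figures; the same helper would work for the grades question as well.

```python
from collections import defaultdict


def proportion_tried_by_group(rows):
    """rows: iterable of (group, tried) pairs, where tried is 0 or 1."""
    counts = defaultdict(lambda: [0, 0])  # group -> [number tried, total]
    for group, tried in rows:
        counts[group][0] += tried
        counts[group][1] += 1
    return {g: t / n for g, (t, n) in sorted(counts.items())}


# Toy records only (not YRBS data): (Qn1 age code, target).
toy = [(4, 0), (4, 0), (4, 1), (4, 0), (7, 1), (7, 1), (7, 0), (7, 1)]
shares = proportion_tried_by_group(toy)
print(shares)  # {4: 0.25, 7: 0.75}
```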

Qn89 — During the past 12 months, how would you describe your grades in school?

1) Mostly A’s
2) Mostly B’s
3) Mostly C’s
4) Mostly D’s
5) Mostly F’s
6) None of these grades
7) Not sure

The proportion of students who have tried weed is higher among those with lower grades. Could this be due to how weed affects the user’s learning ability?

Data Preparation

To prepare the data for modelling, I went through all the questions and the user guide and decided which questions are categorical and which are numerical. I also had to watch for questions that could leak the target into the predictive model, and to drop questions that are very highly correlated with the target.

I noticed that about 30% of the students did not answer all 99 questions. These values are Missing at Random (MAR): the questions were either intentionally left unanswered by students or not included in the survey in some schools. It would be a waste of useful information to drop rows with missing cells. Since birds of a feather flock together, I decided to use the k-Nearest Neighbours (KNN) imputer to predict how students would have answered these questions, based on how similar (neighbouring) students answered them.

The KNN imputer identifies the neighbouring points through a measure of distance and estimates each missing value from the observed values of those neighbouring observations.

I used LogisticRegression for modelling with repeated 5-fold cross-validation. To decide on the number of nearest neighbours (k), I tried different values of k and measured the resulting accuracy against our target.
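To make the idea concrete, here is a small sketch using scikit-learn's `KNNImputer` on a toy matrix. The numbers are invented for illustration and are not survey data, and `n_neighbors=2` is chosen only because the toy matrix is tiny.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy ordinal answers (not YRBS data); np.nan marks an unanswered question.
X = np.array([
    [1.0, 2.0, 1.0],
    [1.0, 2.0, np.nan],  # this student skipped the third question
    [2.0, 2.0, 2.0],
    [5.0, 6.0, 7.0],
])

# Distances are computed on the questions both students answered;
# the gap is filled with the mean answer of the 2 closest students.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest students answered 1.0 and 2.0, so the gap becomes 1.5.
print(X_filled[1])
```

Note that the filled value is an average, so it does not have to be one of the original answer codes. This is why imputed survey answers can end up with decimal places.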

Code: https://gist.github.com/sablxn/fb5e4c6e04a5c45a4b0909bc2611bbac

From the box plot, k = 40 gives an accuracy of over 99.8%.
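A search over k of this kind might look roughly like the sketch below. It is not the code from the project: it runs on synthetic data from `make_classification` with about 10% of cells blanked out, and the candidate k values and the 3 repeats are assumptions. Putting the imputer inside a `Pipeline` means it is refitted on each training fold, so the held-out fold does not influence the imputation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the survey matrix, with ~10% of cells blanked out.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# Repeated 5-fold cross-validation (3 repeats is an assumed setting).
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

scores_by_k = {}
for k in (1, 5, 15, 40):
    model = Pipeline([
        ("impute", KNNImputer(n_neighbors=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    scores_by_k[k] = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

for k, scores in scores_by_k.items():
    print(f"k={k:>2}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The per-k score arrays in `scores_by_k` are what a box plot would be drawn from, one box per value of k.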

After imputing the missing data, I got values with many decimal places. For the categorical data, I rounded them to the nearest whole number and then created dummy variables. For the ordinal data, I rounded the values to 1 decimal place. I then merged these data back together to get the cleaned data, ready for modelling.

Snippet of the cleaned data: 13,246 rows × 137 columns
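As a rough sketch of this post-imputation clean-up, assuming the imputed values sit in a pandas DataFrame: the column names `q_categorical` and `q_ordinal` are hypothetical, and the values are toy numbers rather than survey data.

```python
import pandas as pd

# Toy imputed values (not YRBS data); column names are hypothetical.
imputed = pd.DataFrame({
    "q_categorical": [1.0, 2.4, 2.6, 3.0],  # nominal answer codes
    "q_ordinal": [1.0, 2.333, 4.0, 4.72],   # ordered answer codes
})

# Categorical: snap to the nearest valid code, then one-hot encode.
cat_codes = imputed["q_categorical"].round().astype(int)
dummies = pd.get_dummies(cat_codes, prefix="q_categorical")

# Ordinal: the ordering is meaningful, so keep the number, rounded to 1 decimal.
ordinal = imputed[["q_ordinal"]].round(1)

# Merge the two parts back into one modelling table.
cleaned = pd.concat([ordinal, dummies], axis=1)
print(cleaned)
```

One-hot encoding is what makes the column count grow: each categorical question turns into one column per answer code.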

Feature Selection

I observed that students who have other risky behaviours, such as drinking and smoking, are more likely to have tried weed and other drugs. However, these behaviours are very personal information because they are illegal at the students’ age, and students will only reveal them in an anonymous survey. Realistically, no student (or anyone) would be willing to share such sensitive information with the school. Therefore, I only used as features those questions that students are more open to answering, covering non-sensitive information such as grades and daily habits.

With this clean, non-sensitive data, I began feature selection, trying 3 methods:
1) Correlation
2) Lasso Regression
3) Random Forest
Then I split the data into train and test sets and scaled them.
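The three selection methods can be sketched with scikit-learn as follows. This is one possible arrangement on synthetic data, not the project's actual code: it splits and scales first so that the test set does not influence the selection (the original ordering may differ), and the test size, `top_n`, Lasso `alpha` and forest size are all assumed values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned survey features.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

# Split, then fit the scaler on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

top_n = 5  # assumed number of features to keep

# 1) Correlation: rank features by absolute correlation with the target.
corr = np.array([abs(np.corrcoef(X_train_s[:, j], y_train)[0, 1])
                 for j in range(X_train_s.shape[1])])
by_corr = np.argsort(corr)[::-1][:top_n]

# 2) Lasso: the L1 penalty shrinks weak coefficients to exactly zero.
lasso = Lasso(alpha=0.05).fit(X_train_s, y_train)
by_lasso = np.flatnonzero(lasso.coef_)

# 3) Random forest: rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train_s, y_train)
by_forest = np.argsort(forest.feature_importances_)[::-1][:top_n]

print("correlation:", sorted(by_corr.tolist()))
print("lasso:", by_lasso.tolist())
print("random forest:", sorted(by_forest.tolist()))
```

The three methods will not always agree, which is a reason to compare models trained on each selected subset rather than trusting a single ranking.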

I first modelled with all the features and used those results as a benchmark to compare against the feature-selected data sets. I then tried different models using scikit-learn and tuned the better-scoring ones with randomized search CV.
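For illustration, tuning a random forest with scikit-learn's `RandomizedSearchCV` might look like the sketch below. The search space, the number of iterations, the 3-fold setting and the F1 scoring are assumptions made for this example, and the data is synthetic.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the selected survey features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Assumed search space: distributions are sampled, lists are picked from.
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of random parameter combinations to try
    cv=3,
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
# score() uses the same metric as the search, here F1 on the held-out set.
print(f"test F1: {search.score(X_test, y_test):.3f}")
```

Randomized search samples a fixed number of combinations instead of trying every one, which keeps the tuning time manageable when several models are being compared.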

Evaluation

I chose the random forest classifier as the best model because it had the highest accuracy and F1 score of all the models. It gives a precision of 0.7, which means that 70% of the students predicted to have tried weed actually have; as shown in the confusion matrix, 499 students were correctly predicted.

Interestingly, the list of features most highly correlated with the target revealed some unexpected behaviours among students with a higher tendency to try weed.

I came up with some possible reasons for these behaviours and how they might link to trying weed.
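As a quick check of what a precision of 0.7 means, precision is the number of true positives divided by all predicted positives. The sketch below reads the 499 as true positives, which is an assumption, and works backwards from the rounded precision. The false-positive count is not given in the text, so the figure used here is only what the rounded numbers imply.

```python
def precision(true_positives, false_positives):
    """Share of predicted positives that are actually positive."""
    return true_positives / (true_positives + false_positives)


def implied_predicted_positives(true_positives, precision_value):
    """Work backwards from precision to the total number predicted positive."""
    return true_positives / precision_value


tp = 499        # correctly predicted students, read here as true positives
reported = 0.7  # precision quoted in the text (rounded)

predicted_pos = implied_predicted_positives(tp, reported)
print(round(predicted_pos))  # 713

# Implied false positives under these assumptions, and a round-trip check.
implied_fp = round(predicted_pos) - tp
print(implied_fp, round(precision(tp, implied_fp), 2))  # 214 0.7
```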

1. Risk Appetite: Students who do not wear a seat belt have a higher tendency to try weed, which could indicate that these students are bigger risk-takers.

2. Family Care: A high proportion of students who eat breakfast every day have not tried weed. I wonder if this could be an indication of family care: students whose family members prepare breakfast for them every day may receive more care at home.

Qn77 — During the past 7 days, on how many days did you eat breakfast?

1) 0 days
2) 1 day
3) 2 days
4) 3 days
5) 4 days
6) 5 days
7) 6 days
8) 7 days

3. Self-care: Simple habits such as using sunscreen regularly can show that a person takes good care of themselves and hence would stay away from harmful behaviour such as drug use. Another form of self-care is wanting to look good: students who go for indoor tanning were observed to have a lower tendency to try weed.

Qn97 — When you are outside for more than one hour on a sunny day, how often do you wear sunscreen with an SPF of 15 or higher?

1) Never
2) Rarely
3) Sometimes
4) Most of the time
5) Always

Now that we have identified these behaviours related to drug-taking, we can use them to create a survey that predicts whether a student is likely to experiment with weed. For students classified as likely to try weed, we can provide early intervention through counselling, guiding them on how to handle stress and solve problems.

1. Risk: Educate youth on the risks and dangers of drugs.
2. Family Care: Encourage parents to show more concern for their children.
3. Self-Care: Teach students to take good care of themselves and that they are responsible for their own actions.

More ways of support can be found in this informative handbook on preventive drug education for parents by the Singapore Central Narcotics Bureau (CNB).

Everyone has a part to play in shaping our younger generation so they do not become victims of drug abuse. If you suspect or are worried that a child might be struggling with substance use or addiction, please seek support from the following organizations:

US — https://drugfree.org/get-support-now/

Singapore — https://www.cnb.gov.sg or CNB hotline at 1800 325 6666

Thank you for reading. I would also like to thank my lecturer, Han Wei, for his guidance and encouragement throughout the Bootcamp.

Below you may find links to my GitHub and other resources that I referred to for this project.

Github

https://www.cdc.gov/healthyyouth/data/yrbs/data.htm

NIDA. “Is marijuana a gateway drug?” National Institute on Drug Abuse, 8 Apr. 2020, https://www.drugabuse.gov/publications/research-reports/marijuana/marijuana-gateway-drug. Accessed 16 Nov. 2020.

NIDA. “How does marijuana use affect school, work, and social life?” National Institute on Drug Abuse, 11 Jun. 2020, https://www.drugabuse.gov/publications/research-reports/marijuana/how-does-marijuana-use-affect-school-work-social-life. Accessed 16 Nov. 2020.

Brownlee, J. (2020, August 17). KNN Imputation for Missing Values in Machine Learning. Retrieved 16 Nov. 2020, from https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/

https://www.cnb.gov.sg/docs/default-source/default-document-library/preventive-drug-education-handbook-for-parents-(english).pdf

Sabrina Lxn

A Budding Data Analyst who hopes to use data in meaningful ways to solve humanitarian issues one day