You MUST work within your assigned teams.
Each MIDS team MUST collaborate using GitHub. This is not compulsory for non-MIDSters; however, I recommend that they also collaborate using GitHub. A blank repository has been created for this project. Follow this link: https://classroom.github.com/a/LSi0yemR to gain access. You should already know how to clone the repository locally once you gain access. The first student to accept the invitation within each team will be responsible for creating the team name. All other members of the team will be able to join the team once the first person has added the team name. You should only join your pre-assigned team. Feel free to create other folders within the repository as needed, but you must push your final reports and presentation files to the corresponding folders already created for you.
Each team will have 6 minutes to present their findings in class. Feel free to get creative with the presentations; fun animations are welcome!
Each team MUST turn in only one report, with the team members’ names and their designations (checker, coordinator, presenter, programmer, and writer) at the top of the report.
.pdf
.kable
, xtable
, stargazer
, etc.
All team members must complete a very short written evaluation, briefly describing the effort put forth by other team members.
Your analyses MUST address the two sets of questions directly. The report should be written so that there is a section which clearly answers the first set of questions for Part I and another section which clearly answers the second set of questions for Part II. If you prefer, you can write two 5-page reports, one to address each set of questions. Be sure to also include the following in your report:
In this case study, we consider data provided by StreetRx. From their website,
StreetRx (streetrx.com) is a web-based citizen reporting tool enabling real-time collection of street price data on diverted pharmaceutical substances. Based on principles of crowdsourcing for public health surveillance, the site allows users to anonymously report prices they paid or heard were paid for diverted prescription drugs. User-generated data offers intelligence into an otherwise opaque black market, providing a novel data set for public health surveillance, specifically for controlled substances.
Prescription opioid diversion and abuse are major public health issues, and street prices provide an indicator of drug availability, demand, and abuse potential. Such data, however, can be difficult to collect and crowdsourcing can provide an effective solution in an era of Internet-based social networks. Data derived from StreetRx generates valuable insights for pharmacoepidemiological research, health-policy analysis, pharmacy-economic modeling, and in assisting epidemiologists and policymakers in understanding the effects of product formulations and pricing structures on the diversion of prescription drugs.
StreetRx operates under strong partnership with the Researched Abuse, Diversion, and Addiction-Related Surveillance System (RADARS), a surveillance system that collects product- and geographically-specific data on abuse, misuse, and diversion of prescription drugs. The site was launched in the United States in November 2010. Since then, there have been over 300,000 reports of diverted drug prices. StreetRx has expanded into Australia, Canada, France, Germany, Italy, Spain, and the United Kingdom.
This data is NOT to be shared outside of class and specifically, NOT to be shared beyond this case study.
The data can be found on Sakai. The file `streetrx.RData` contains the actual data, and the instructions plus data dictionary (also given below) can be found in the file `StreetRx Data Dictionary and Instructions_1q19.docx`.
Each team has been assigned a different drug or drug family for investigation as given below. Subset the data and only focus on the drug or drug family assigned to your team.
Team | Drug |
---|---|
Group 1 | Methadone |
Group 2 | Codeine |
Group 3 | Morphine |
Group 4 | Oxymorphone |
Group 5 | Diazepam |
Group 6 | Lorazepam |
Group 7 | Tramadol |
Group 8 | Hydrocodone |
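As a rough sketch of the subsetting step, the code below filters a toy data frame on `api_temp`. The object name `streetrx` and the lowercase spelling of the drug names are assumptions; check what `streetrx.RData` actually loads and how the drug names are spelled before filtering.

```r
# Toy stand-in for the StreetRx data. The real object name inside
# streetrx.RData is not stated here, so `streetrx` is an assumption.
streetrx <- data.frame(
  api_temp = c("methadone", "codeine", "methadone", "tramadol", NA),
  ppm      = c(0.50, 0.20, 0.75, 0.10, 0.30),
  state    = c("North Carolina", "Ohio", "Texas", "Ohio", "Texas"),
  stringsAsFactors = FALSE
)

# Inspect the spellings actually present before filtering
table(streetrx$api_temp, useNA = "ifany")

# Keep only the rows for the assigned drug (case-insensitive match).
# which() drops rows where api_temp is missing instead of returning NA rows.
my_drug   <- "methadone"
keep      <- which(tolower(streetrx$api_temp) == my_drug)
team_data <- droplevels(streetrx[keep, ])

nrow(team_data)
```

With the real data, the toy data frame would be replaced by a call such as `load("streetrx.RData")`, followed by the same filter.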
Variables provided to us by StreetRx are given below. You will only use a subset of the variables as indicated below.
Variable | Description |
---|---|
ppm | Price per mg – outcome of interest |
yq_pdate | Year and quarter drug was purchased (format YYYYQ, so a purchase in March 2019 would be coded 20191) – DO NOT USE |
price_date | Date of the reported purchase (MM/DD/YY) (a finer-grained time variable than yq_pdate) – DO NOT USE |
city | manually entered by user and not required – DO NOT USE |
state | manually entered by user and not required |
country | manually entered by user and not required – only has one unique value anyway, so discard! |
USA_region | based on state and coded as northeast, midwest, west, south, or other/unknown |
source | source of information; allows users to report purchases they did not personally make |
api_temp | active ingredient of drug of interest – use it to subset the data to the drug assigned to your team above |
form_temp | this variable reports the formulation of the drug (e.g., pill, patch, suppository) |
mgstr | dosage strength in mg of the units purchased (so ppm*mgstr is the total price paid per unit) |
bulk_purchase | indicator for purchase of 10+ units at once |
primary_reason | data collection for this variable began in the 4th quarter of 2016. Values include: 0 = Reporter did not answer the question; 1 = To treat a medical condition (ADHD, excessive sleepiness, etc.); 2 = To help me perform better at work, school, or other task; 3 = To prevent or treat withdrawal; 4 = For enjoyment/to get high; 5 = To resell; 6 = Other reason; 7 = Don’t know; 8 = Prefer not to answer; 9 = To self-treat my pain; 10 = To treat a medical condition other than pain; 11 = To come down; 12 = To treat a medical condition (anxiety, difficulty sleeping, etc.) – DO NOT USE |
Use a multi-level model to investigate factors related to the price per mg of your drug, accounting for potential clustering by location and exploring heterogeneity in pricing by location.
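One possible starting point is a random-intercept model with state as the grouping variable, fit with `lmer` from the `lme4` package. The sketch below uses simulated data; modeling the log of `ppm`, the choice of predictors, and the 0/1 coding of `bulk_purchase` are all assumptions made for illustration, not requirements.

```r
library(lme4)

set.seed(1)
# Simulated stand-in data; the real data come from streetrx.RData
n_states <- 10
toy <- data.frame(
  state         = factor(rep(paste0("S", 1:n_states), each = 30)),
  mgstr         = sample(c(5, 10, 20, 40), 300, replace = TRUE),
  bulk_purchase = factor(sample(c("0", "1"), 300, replace = TRUE))
)
state_effect <- rnorm(n_states, sd = 0.3)
toy$log_ppm <- -0.5 - 0.02 * toy$mgstr -
  0.2 * (toy$bulk_purchase == "1") +
  state_effect[as.integer(toy$state)] +
  rnorm(300, sd = 0.5)

# A random intercept for state accounts for clustering by location
fit <- lmer(log_ppm ~ mgstr + bulk_purchase + (1 | state), data = toy)
summary(fit)

# Estimated state-level deviations from the overall intercept,
# one way to look at heterogeneity in pricing by location
ranef(fit)$state
```

Random slopes, for example `(1 + bulk_purchase | state)`, would be a natural extension if the effect of a predictor is expected to vary by location.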
As part of your analysis, explore how the factors provided are, or are not, associated with pricing per milligram. One challenge with StreetRx data is that they are entered by users, so do bear in mind that exploratory data analysis will be important in terms of identifying unreasonable observations, given that website users may not always be truthful (e.g., I could go on the website now and say I paid a million dollars for one Xanax on the island of Aitutaki, and that would be reflected in the database).
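A minimal sketch of one way to flag unreasonable prices during exploratory analysis is given below. The toy values and the cutoff of 100 times the median are arbitrary illustrations, not a recommended rule; any threshold you use should be justified in your report.

```r
# Toy prices per mg, including a missing value, a zero, and an absurd entry
toy <- data.frame(
  ppm   = c(0.4, 0.6, 0.5, 1e6, 0, 0.7, NA),
  mgstr = c(10, 10, 20, 1, 10, 5, 10)
)

summary(toy$ppm)

# Flag missing, non-positive, or extremely large prices.
# The factor of 100 is an arbitrary choice for illustration only.
med <- median(toy$ppm, na.rm = TRUE)
toy$suspect <- is.na(toy$ppm) | toy$ppm <= 0 | toy$ppm > 100 * med

# Look at the flagged rows before deciding whether to drop them
toy[toy$suspect, ]
clean <- toy[!toy$suspect, ]
```

Plotting the distribution of `ppm`, for instance with `hist(log(clean$ppm))`, is a useful complement to a rule like this.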
The North Carolina State Board of Elections (NCSBE) is the agency charged with the administration of the elections process and campaign finance disclosure and compliance. Among other things, they provide voter registration and turnout data online (https://www.ncsbe.gov/index.html, https://www.ncsbe.gov/results-data). Using the NC voter files for the general elections in November 2020, you will attempt to identify/estimate how different groups voted in the 2020 elections, at least out of those who registered. Here’s an interesting read on turnout rates for NC in 2016: https://democracync.org/wp-content/uploads/2017/05/WhoVoted2016.pdf (you might consider creating a similar graphic to the one on page 4).
The data for this part of the project can be found on Sakai. The file `voter_stats_20201103.txt` contains information about the aggregate counts of registered voters by the demographic variables; the data dictionary can be found in the file `DataDictionaryForVoterStats.txt`. The file `history_stats_20201103.txt` contains information about the aggregate counts of voters who actually voted by the demographic variables.
You will only work with a subset of the overall data. Take a random sample of 25 counties out of all the counties in both datasets. You should indicate the counties you sampled in your final report. You will need to merge the two files `voter_stats_20201103.txt` and `history_stats_20201103.txt` by the common variables for the counties you care about. Take a look at the set of `join` functions in the `dplyr` package in R (https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join) or the `merge` function in base R. I recommend the functions in `dplyr`. You may choose to merge the datasets before or after selecting the samples you want, but be careful if you decide to do the latter.
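The sampling step could look something like the sketch below, which uses toy data frames in place of the two files. The column name `county_desc` is an assumption here; confirm the county variable name against the data dictionary. Setting a seed makes the sample reproducible, so the counties listed in your report match the ones in your analysis.

```r
set.seed(2020)  # fix the seed so the same 25 counties are drawn each time

# Toy stand-ins for the two files; the real ones would be read from disk
voters <- data.frame(
  county_desc  = rep(paste0("COUNTY_", 1:100), each = 2),
  total_voters = 10
)
history <- data.frame(
  county_desc  = rep(paste0("COUNTY_", 1:100), each = 3),
  total_voters = 3
)

# Sample only from counties that appear in both datasets
common  <- intersect(unique(voters$county_desc), unique(history$county_desc))
sampled <- sample(common, 25)

# Restrict both datasets to the same sampled counties
voters_s  <- voters[voters$county_desc %in% sampled, ]
history_s <- history[history$county_desc %in% sampled, ]

sort(sampled)  # the list of counties to report
```

Filtering both datasets with the same `sampled` vector before merging avoids ending up with different counties in the two files.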
Unfortunately, the data dictionary from the NCSBE does not explain the exact difference between the variables `party_cd` and `voted_party_cd` in the `history_stats_20201103.txt` file (if you are able to find documentation on the difference, do let me know). However, I suspect that the voted party code encodes the information about people who changed their party affiliation as of the registration deadline, whereas the first party code is everyone’s original affiliation. Voters are allowed to change their party affiliation in NC, so that lines up. The two variables are thus very similar, and only a small percentage of the rows in the `history_stats_20201103.txt` file have different values for the two variables. I would suggest using the voted party code (`voted_party_cd`) for the `history_stats_20201103.txt` dataset.
You should discard the following variables before merging: `election_date`, `stats_type`, and `update_date`. Also, you cannot merge by or use the `voting_method` and `voting_method_desc` variables in your analysis, because that information is only available in the `history_stats_20201103.txt` data and not the other dataset. That means you should not use those two variables when merging.
Before discarding the variables, however, you need to aggregate to make sure that you are merging correctly. As a simple example, suppose 4 males voted in person and 3 males voted by mail; you need to aggregate out the method of voting so that you have 7 males in total. This is because we are unable to separate people who voted by different voting methods in the `voter_stats_20201103.txt` file we want to merge with. The simplest way is to use the `aggregate` function in R. As an example, the code:
aggregated_data <- aggregate(Data$total_voters,
list(Age=Data$age,Party=Data$party_cd),sum)
will sum all voters within each combination of age group and party. You can also use the `dplyr` package to aggregate in the same way if you prefer that.
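For those who prefer `dplyr`, a rough equivalent of the aggregation is sketched below on a toy data frame. The toy values and the `voting_method` labels are made up for illustration; the column names `age`, `party_cd`, and `total_voters` follow the `aggregate` example.

```r
library(dplyr)

# Toy data: counts split by voting method within each age/party group
Data <- data.frame(
  age           = c("18-25", "18-25", "26-40", "26-40"),
  party_cd      = c("DEM", "DEM", "REP", "REP"),
  voting_method = c("IN-PERSON", "MAIL", "IN-PERSON", "MAIL"),
  total_voters  = c(4, 3, 5, 2)
)

# Sum over voting method by grouping on the variables to keep
aggregated_data <- Data %>%
  group_by(age, party_cd) %>%
  summarise(total_voters = sum(total_voters), .groups = "drop")

aggregated_data
```

In the real data, `group_by` would list every variable you intend to merge on, so that the voting method is summed out within each combination.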
Once you have this clean data for the `history_stats_20201103.txt` file, you should then go ahead and grab the information on total registered voters from `voter_stats_20201103.txt` by merging on all variables in `history_stats_20201103.txt` except `total_voters`.
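A minimal sketch of this merge with `dplyr` is shown below, using toy data frames. The names `history_agg`, `voter_agg`, `registered`, and `voted`, as well as the columns `county_desc` and `sex_code`, are illustrative assumptions. Treating groups with no match in the history data as zero voters is also an assumption you should check against your own data.

```r
library(dplyr)

# Toy aggregated data: who voted, and who was registered
history_agg <- data.frame(
  county_desc  = c("A", "A", "B"),
  sex_code     = c("F", "M", "F"),
  total_voters = c(7, 5, 3)
)
voter_agg <- data.frame(
  county_desc  = c("A", "A", "B", "B"),
  sex_code     = c("F", "M", "F", "M"),
  total_voters = c(10, 8, 6, 4)
)

# Merge on every variable in the history data except the count itself
by_vars <- setdiff(names(history_agg), "total_voters")

merged <- voter_agg %>%
  rename(registered = total_voters) %>%
  left_join(rename(history_agg, voted = total_voters), by = by_vars)

# Groups with registered voters but no row in the history data
# are treated as having zero voters (an assumption)
merged$voted[is.na(merged$voted)] <- 0
merged$turnout <- merged$voted / merged$registered

merged
```

Because the history file uses `voted_party_cd` while the other file has `party_cd`, the party column would need to be renamed in one of the data frames before a merge like this can match on it.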
Your job is to use a hierarchical model to answer the following questions of interest.
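As one possible form for such a model, the sketch below fits a binomial model with a random intercept for county using `glmer` from `lme4`, on simulated counts. The variable names (`registered`, `voted`, `county_desc`, `party_cd`, `sex_code`), the predictors, and the random-effects structure are assumptions for illustration; your own model should be driven by the questions of interest.

```r
library(lme4)

set.seed(42)
# Simulated aggregate counts for 25 counties by party and sex
toy <- expand.grid(
  county_desc = paste0("C", 1:25),
  party_cd    = c("DEM", "REP", "UNA"),
  sex_code    = c("F", "M"),
  stringsAsFactors = TRUE
)
toy$registered <- sample(200:2000, nrow(toy), replace = TRUE)

county_eff <- rnorm(25, sd = 0.3)
eta <- 0.8 +
  0.2 * (toy$party_cd == "REP") -
  0.3 * (toy$party_cd == "UNA") -
  0.1 * (toy$sex_code == "M") +
  county_eff[as.integer(toy$county_desc)]
toy$voted <- rbinom(nrow(toy), toy$registered, plogis(eta))

# Aggregated binomial response: (number who voted, number who did not)
fit <- glmer(
  cbind(voted, registered - voted) ~ party_cd + sex_code + (1 | county_desc),
  family = binomial, data = toy
)
summary(fit)
```

The `cbind(successes, failures)` response lets the model work directly with the aggregated counts, so there is no need to expand the data to one row per voter.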
For Part II, basic model assessment and model validation are sufficient!
40 points: 20 points for each part.