WQU Unit dw projects #1

@azamkhan06

Description

Drug abuse is a source of human and monetary costs in health care. A first step in identifying practitioners that enable drug abuse is to look for practices where commonly abused drugs are prescribed unusually often. Let's try to find practices that prescribe an unusually high amount of opioids. The opioids we'll look for are given in the list below.

import math
import gzip
import pandas as pd
from static_grader import grader

# Chemical lookup table. on_bad_lines='skip' (pandas >= 1.3) replaces
# error_bad_lines=False, which was removed in pandas 2.0.
chem = pd.read_csv('./dw-data/chem.csv.gz', compression='gzip', header=0, sep=',', quotechar='"', on_bad_lines='skip')
chem.head()
chem.columns

# Sample of prescriptions.
with gzip.open('./dw-data/201701scripts_sample.csv.gz', 'rb') as f:
    scripts = pd.read_csv(f)

# Practice codes, names and addresses.
with gzip.open('./dw-data/practices.csv.gz', 'rb') as f:
    practices = pd.read_csv(f)

practices.columns = ['code', 'name', 'addr_1', 'addr_2', 'borough', 'village', 'post_code']
# Keep one name per practice code: the first one when names are sorted alphabetically.
practices = practices[['code', 'name']].sort_values(by=['name'], ascending=True)
practices = practices[~practices.duplicated(['code'])]
opioids = ['morphine', 'oxycodone', 'methadone', 'fentanyl', 'pethidine', 'buprenorphine', 'propoxyphene', 'codeine']

# Select chemicals whose name contains any opioid name (case-insensitive).
# na=False treats missing names as non-matches instead of raising an error,
# and filtering into a new frame avoids adding a helper column to chem.
check = '|'.join(opioids)
chem_df1 = chem[chem['NAME'].str.contains(check, case=False, na=False)]
chem_sub = list(chem_df1['CHEM SUB'])

# Flag each prescription: 1 if its BNF code is an opioid chemical, else 0.
scripts['opioid'] = scripts['bnf_code'].isin(chem_sub).astype(int)
std_devn = scripts.opioid.std()
overall_rate = scripts.opioid.mean()

scripts = scripts.merge(practices, left_on='practice', right_on='code')
scripts['cnt'] = 0

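# Per practice: opioid rate (mean of the 0/1 flag) and number of prescriptions.
# The groupby result already has one row per (practice, name), so the earlier
# drop_duplicates() call, whose return value was discarded, is not needed.
opioids_per_practice = scripts.groupby(['practice', 'name'], as_index=False).agg({'opioid': 'mean', 'cnt': 'count'})

# Difference between each practice's rate and the overall rate.
opioids_per_practice['opioid'] = opioids_per_practice['opioid'] - overall_rate

# Standard error and z-score for each practice.
opioids_per_practice['std_err'] = std_devn / opioids_per_practice['cnt'] ** 0.5
opioids_per_practice['z_score'] = opioids_per_practice['opioid'] / opioids_per_practice['std_err']

result = opioids_per_practice[['name', 'z_score', 'cnt']]
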
result.sort_values(by = 'z_score', ascending = False, inplace = True)
anomalies = [(k[1], k[2], k[3]) for k in result.itertuples()][:100]

:52: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
result.sort_values(by = 'z_score', ascending = False, inplace = True)
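
The warning appears because `result` was created by selecting columns from another DataFrame and was then sorted with `inplace = True`. One possible way to avoid it, sketched here on a small made-up frame (the practice codes, names and numbers are hypothetical), is to assign the sorted output instead of sorting in place:

```python
import pandas as pd

# Hypothetical stand-in for opioids_per_practice; all values are made up.
opioids_per_practice = pd.DataFrame({
    'practice': ['A1', 'B2', 'C3'],
    'name': ['ALPHA SURGERY', 'BETA PRACTICE', 'GAMMA CLINIC'],
    'z_score': [0.5, 2.0, -1.0],
    'cnt': [10, 20, 30],
})

# sort_values returns a new DataFrame, so nothing is modified in place
# on a column selection and pandas has no reason to warn.
result = (
    opioids_per_practice[['name', 'z_score', 'cnt']]
    .sort_values(by='z_score', ascending=False)
)
print(result['name'].tolist())
```

Calling `.copy()` on the column selection before an in-place sort should have the same effect; assigning the sorted frame is simply shorter.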
opioids = ['morphine', 'oxycodone', 'methadone', 'fentanyl', 'pethidine', 'buprenorphine', 'propoxyphene', 'codeine']
These are generic names for drugs, not brand names. Generic drug names can be found using the 'bnf_code' field in scripts along with the chem table. Use the list of opioids provided above along with these fields to make a new field in the scripts data that flags whether the row corresponds to an opioid prescription.


Now for each practice calculate the proportion of its prescriptions containing opioids.

Hint: Consider the following list: [0, 1, 1, 0, 0, 0]. What proportion of the entries are 1s? What is the mean value?
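
The hint can be checked directly: for a list of 0s and 1s, the proportion of 1s and the mean are the same number. The sketch below also applies the idea per group with pandas; `toy_scripts` and its values are made up for illustration.

```python
import pandas as pd

flags = [0, 1, 1, 0, 0, 0]
proportion_of_ones = flags.count(1) / len(flags)
mean_value = sum(flags) / len(flags)
print(proportion_of_ones == mean_value)

# Hypothetical prescriptions: practice 'A' wrote four, practice 'B' wrote two.
toy_scripts = pd.DataFrame({
    'practice': ['A', 'A', 'A', 'A', 'B', 'B'],
    'opioid':   [0, 1, 1, 0, 0, 0],
})

# The mean of the 0/1 flag within each practice is that practice's opioid proportion.
rate_per_practice = toy_scripts.groupby('practice')['opioid'].mean()
print(rate_per_practice)
```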

opioids_per_practice = ...
How do these proportions compare to the overall opioid prescription rate? Subtract off the proportion of all prescriptions that are opioids from each practice's proportion.

relative_opioids_per_practice = ...
Now that we know the difference between each practice's opioid prescription rate and the overall rate, we can identify which practices prescribe opioids at above average or below average rates. However, are the differences from the overall rate important or just random deviations? In other words, are the differences from the overall rate big or small?

To answer this question we have to quantify the difference we would typically expect between a given practice's opioid prescription rate and the overall rate. This quantity is called the standard error, and is related to the standard deviation, $\sigma$. The standard error in this case is

$$\frac{\sigma}{\sqrt{n}}$$

where $n$ is the number of prescriptions each practice made. Calculate the standard error for each practice. Then divide relative_opioids_per_practice by the standard errors. We'll call the final result opioid_scores.

standard_error_per_practice = ...
opioid_scores = ...
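
As a rough illustration of these steps, the sketch below runs them on a tiny made-up table. The data in `toy_scripts` and the helper name `n_per_practice` are hypothetical, and the standard deviation uses the pandas default (sample standard deviation), which may or may not be what the grader assumes.

```python
import pandas as pd

# Hypothetical data: practice 'A' has 4 prescriptions (2 opioids),
# practice 'B' has 16 prescriptions (1 opioid).
toy_scripts = pd.DataFrame({
    'practice': ['A'] * 4 + ['B'] * 16,
    'opioid': [1, 1, 0, 0] + [1] + [0] * 15,
})

overall_rate = toy_scripts['opioid'].mean()   # proportion across all prescriptions
sigma = toy_scripts['opioid'].std()           # standard deviation across all prescriptions

grouped = toy_scripts.groupby('practice')['opioid']
opioids_per_practice = grouped.mean()         # proportion per practice
n_per_practice = grouped.count()              # number of prescriptions per practice

relative_opioids_per_practice = opioids_per_practice - overall_rate
standard_error_per_practice = sigma / n_per_practice ** 0.5
opioid_scores = relative_opioids_per_practice / standard_error_per_practice
print(opioid_scores)
```

Because the standard error shrinks as the number of prescriptions grows, the same difference in rate counts for more at a large practice than at a small one.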
The quantity we have calculated in opioid_scores is called a z-score:

$$\frac{\bar{X} - \mu}{\sqrt{\sigma^2 / n}}$$

Here $\bar{X}$ corresponds with the proportion for each practice, $\mu$ corresponds with the proportion across all practices, $\sigma^2$ corresponds with the variance of the proportion across all practices, and $n$ is the number of prescriptions made by each practice. Notice $\bar{X}$ and $n$ will be different for each practice, while $\mu$ and $\sigma$ are determined across all prescriptions, and so are the same for every z-score. The z-score is a useful statistical tool used for hypothesis testing, finding outliers, and comparing data about different types of objects or events.
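
A worked example with made-up numbers may help fix the scale of the statistic. Suppose, hypothetically, that 4% of all prescriptions are opioids and that one practice wrote 100 prescriptions, 10% of them opioids. Assuming the standard deviation of a 0/1 flag is taken as $\sqrt{\mu(1-\mu)}$ (the population form, which differs slightly from a sample standard deviation):

```latex
\sigma = \sqrt{0.04 \times 0.96} \approx 0.196,
\qquad
\frac{\bar{X} - \mu}{\sqrt{\sigma^2 / n}}
  = \frac{0.10 - 0.04}{0.196 / \sqrt{100}}
  \approx 3.06
```

Under these assumptions, the practice's rate sits about three standard errors above the overall rate.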

Now that we've calculated this statistic, take the 100 practices with the largest z-score. Return your result as a list of tuples in the form (practice_code, practice_name, z-score, number_of_scripts). Sort your tuples by z-score in descending order. Note that some practice codes will correspond with multiple names. In this case, use the first match when sorting names alphabetically.

unique_practices = ...
anomalies = [("NATIONAL ENHANCED SERVICE", 11.6958178629, 7)] * 100
grader.score.dw__script_anomalies(anomalies)
Your solution did not match the expected type: 100 * (string, string, number, number)

Specifically, solution[0][1] did not match {'type': 'string'}:
    11.6958178629
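
The message says each tuple should have four elements, (string, string, number, number), whereas the placeholder has three and puts a number in the second position. A minimal sketch of building tuples in that shape from a scored table follows; the frame `scored` and its values are made up, and the column names mirror the ones used in the code earlier in this issue.

```python
import pandas as pd

# Hypothetical scored practices; codes, names and numbers are made up.
scored = pd.DataFrame({
    'practice': ['A1', 'B2', 'C3'],
    'name': ['ALPHA SURGERY', 'BETA PRACTICE', 'GAMMA CLINIC'],
    'z_score': [0.5, 2.0, -1.0],
    'cnt': [10, 20, 30],
})

# Sort by z-score, keep at most 100 rows, and emit plain Python types
# in the order (practice_code, practice_name, z-score, number_of_scripts).
top = scored.sort_values(by='z_score', ascending=False).head(100)
columns = ['practice', 'name', 'z_score', 'cnt']
anomalies = [
    (str(code), str(name), float(z), int(n))
    for code, name, z, n in top[columns].itertuples(index=False, name=None)
]
print(anomalies[0])
```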
