import pandas as pd
      import matplotlib.pyplot as plt
      import seaborn as sns
      %matplotlib inline
      %autosave 30

Autosaving every 30 seconds


              
                #Load dataset from the CSV file.
      df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')


              
                df.head(2)


              
                'Rows: {}'.format(df.shape[0]), 'Columns: {}'.format(df.shape[1])

('Rows: 110527', 'Columns: 14')


              
                #DataFrame DataTypes
      df.info()

<class 'pandas.core.frame.DataFrame'>
      RangeIndex: 110527 entries, 0 to 110526
      Data columns (total 14 columns):
       #   Column          Non-Null Count   Dtype  
      ---  ------          --------------   -----  
       0   PatientId       110527 non-null  float64
       1   AppointmentID   110527 non-null  int64  
       2   Gender          110527 non-null  object 
       3   ScheduledDay    110527 non-null  object 
       4   AppointmentDay  110527 non-null  object 
       5   Age             110527 non-null  int64  
       6   Neighbourhood   110527 non-null  object 
       7   Scholarship     110527 non-null  int64  
       8   Hipertension    110527 non-null  int64  
       9   Diabetes        110527 non-null  int64  
       10  Alcoholism      110527 non-null  int64  
       11  Handcap         110527 non-null  int64  
       12  SMS_received    110527 non-null  int64  
       13  No-show         110527 non-null  object 
      dtypes: float64(1), int64(8), object(5)
      memory usage: 11.8+ MB


              
                'Number of duplicated rows in the dataset is: {}'.format(df.duplicated().sum())

'Number of duplicated rows in the dataset is: 0'


              
                # Check for null values in every column 
      df.isnull().sum()

PatientId         0
      AppointmentID     0
      Gender            0
      ScheduledDay      0
      AppointmentDay    0
      Age               0
      Neighbourhood     0
      Scholarship       0
      Hipertension      0
      Diabetes          0
      Alcoholism        0
      Handcap           0
      SMS_received      0
      No-show           0
      dtype: int64


              
                df.head(2)


              
                df.info()

<class 'pandas.core.frame.DataFrame'>
      RangeIndex: 110527 entries, 0 to 110526
      Data columns (total 14 columns):
       #   Column          Non-Null Count   Dtype  
      ---  ------          --------------   -----  
       0   PatientId       110527 non-null  float64
       1   AppointmentID   110527 non-null  int64  
       2   Gender          110527 non-null  object 
       3   ScheduledDay    110527 non-null  object 
       4   AppointmentDay  110527 non-null  object 
       5   Age             110527 non-null  int64  
       6   Neighbourhood   110527 non-null  object 
       7   Scholarship     110527 non-null  int64  
       8   Hipertension    110527 non-null  int64  
       9   Diabetes        110527 non-null  int64  
       10  Alcoholism      110527 non-null  int64  
       11  Handcap         110527 non-null  int64  
       12  SMS_received    110527 non-null  int64  
       13  No-show         110527 non-null  object 
      dtypes: float64(1), int64(8), object(5)
      memory usage: 11.8+ MB


              
                df.rename(columns = {'PatientId': 'PatientID'}, inplace=True)


              
                #Confirm changes
      df.columns[0]

'PatientID'


              
                df.rename(columns = {'Hipertension': 'Hypertension'}, inplace=True)


              
                'Hypertension' in df.columns

True


              
                df.rename(columns={'No-show':'No_show'}, inplace=True)


              
                #confirm the changes
      'No_show' in df.columns

True


              
                df.head(1)


              
                # Column Details 
      df['PatientID'].info()

<class 'pandas.core.series.Series'>
      RangeIndex: 110527 entries, 0 to 110526
      Series name: PatientID
      Non-Null Count   Dtype  
      --------------   -----  
      110527 non-null  float64
      dtypes: float64(1)
      memory usage: 863.6 KB


              
                #Convert the column into a String using a simple Lambda Function 
      df['PatientID'] = df['PatientID'].apply(lambda x: str(x).split('.')[0])


              
                df['PatientID'].info()

<class 'pandas.core.series.Series'>
      RangeIndex: 110527 entries, 0 to 110526
      Series name: PatientID
      Non-Null Count   Dtype 
      --------------   ----- 
      110527 non-null  object
      dtypes: object(1)
      memory usage: 863.6+ KB


              
                df.head(5)['PatientID']

0     29872499824296
      1    558997776694438
      2      4262962299951
      3       867951213174
      4      8841186448183
      Name: PatientID, dtype: object
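
The lambda above relies on `str()` printing the float in plain decimal form. A minimal alternative sketch, assuming every ID is a whole number, casts to a 64-bit integer first and then to a string. The `sample` frame and its values are hypothetical stand-ins for the real column.

```python
import pandas as pd

# Hypothetical sample mimicking float-encoded PatientID values
sample = pd.DataFrame({'PatientID': [2.987250e+13, 5.589978e+14, 8.67951213174e+11]})

# Cast to int64 first so that neither a '.0' suffix nor scientific
# notation can appear in the string form, then cast to str
sample['PatientID'] = sample['PatientID'].astype('int64').astype(str)

print(sample['PatientID'].tolist())
```

Note that the integer cast truncates any fractional part, so it is only appropriate when the IDs are known to be whole numbers.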


              
                # 1. Split the ScheduledDay string on 'T'
      # 2. Use indexing to take the second element of the resulting list (the time part)
      # 3. Use slicing to drop the last character (the trailing 'Z')
      
      df['ScheduledDayTime'] = pd.to_datetime(df['ScheduledDay'].str.split('T').str[1].str[:-1],format='%H:%M:%S').dt.time


              
                #Confirm creation of new column
      'ScheduledDayTime' in df.columns

True


              
                # The date part is hyphen-separated (e.g. 2016-04-29), so the format uses '-' rather than '/'
      df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'].str.split('T').str[0],format='%Y-%m-%d')


              
                df['ScheduledDay']

0        2016-04-29
      1        2016-04-29
      2        2016-04-29
      3        2016-04-29
      4        2016-04-29
                  ...    
      110522   2016-05-03
      110523   2016-05-03
      110524   2016-04-27
      110525   2016-04-27
      110526   2016-04-27
      Name: ScheduledDay, Length: 110527, dtype: datetime64[ns]
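
The date and time parts can also be obtained without string splitting. The sketch below parses the full timestamp once and derives both pieces from it. The two sample strings are taken from the first rows of the dataset; the variable names `raw`, `parsed`, `dates` and `times` are illustrative only.

```python
import pandas as pd

# Two ScheduledDay values in the dataset's ISO 8601 format
raw = pd.Series(['2016-04-29T18:38:08Z', '2016-04-29T16:08:27Z'])

# Parse the whole timestamp as UTC, then drop the timezone so the
# result is a plain datetime64 column
parsed = pd.to_datetime(raw, utc=True).dt.tz_localize(None)

dates = parsed.dt.normalize()  # midnight of each day, still datetime64
times = parsed.dt.time         # datetime.time objects

print(dates.tolist())
print(times.tolist())
```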


              
                df['AppointmentDayTime'] = pd.to_datetime(df['AppointmentDay'].str.split('T').str[1].str[:-1],format='%H:%M:%S').dt.time


              
                df.AppointmentDayTime.value_counts()

00:00:00    110527
      Name: AppointmentDayTime, dtype: int64


              
                #Every AppointmentDayTime value is 00:00:00, so the column carries no information; drop it
      df.drop(columns='AppointmentDayTime', inplace=True)


              
                # The date part is hyphen-separated (e.g. 2016-04-29), so the format uses '-' rather than '/'
      df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'].str.split('T').str[0],format='%Y-%m-%d')


              
                df['AppointmentDay']

0        2016-04-29
      1        2016-04-29
      2        2016-04-29
      3        2016-04-29
      4        2016-04-29
                  ...    
      110522   2016-06-07
      110523   2016-06-07
      110524   2016-06-07
      110525   2016-06-07
      110526   2016-06-07
      Name: AppointmentDay, Length: 110527, dtype: datetime64[ns]


              
                df.head(2)


              
                df.duplicated().sum()

0


              
                df.to_csv('clean_noshowappointments_2016.csv', index=False)


              
                df_16 = pd.read_csv('clean_noshowappointments_2016.csv')
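
A CSV file stores no dtype information, so re-reading it infers the types again: digit-only PatientID values come back as integers and the date columns as plain strings. A minimal sketch of restoring them at load time is shown below. It uses an in-memory buffer and a two-row hypothetical frame instead of the real file.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the cleaned dataframe
clean = pd.DataFrame({
    'PatientID': ['29872499824296', '558997776694438'],
    'ScheduledDay': pd.to_datetime(['2016-04-29', '2016-04-29']),
})

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

# dtype keeps the IDs as strings; parse_dates restores the datetime column
reloaded = pd.read_csv(buffer, dtype={'PatientID': str}, parse_dates=['ScheduledDay'])

print(reloaded.dtypes)
```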


              
                patients = df_16['PatientID'].nunique()


              
                female = df_16[df_16['Gender'] == 'F']['PatientID'].nunique()


              
                male = df_16[df_16['Gender'] == 'M']['PatientID'].nunique()


              
                'The patient counts are as follows; Total: {},  Female: {}, Male: {}'.format(patients,female, male)

'The patient counts are as follows; Total: 62299,  Female: 40046, Male: 22253'


              
                male + female == patients

True


              
                'Female: {}%,  Male: {}%'.format(round((female/patients)*100), round((male/patients)*100))

'Female: 64%,  Male: 36%'
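
The same percentages can be computed in one step. The sketch below, on a hypothetical `visits` frame, keeps one row per patient and lets `value_counts(normalize=True)` compute the shares.

```python
import pandas as pd

# Hypothetical appointments: patient 'a' appears twice
visits = pd.DataFrame({
    'PatientID': ['a', 'a', 'b', 'c', 'd'],
    'Gender':    ['F', 'F', 'F', 'M', 'F'],
})

# Keep one row per patient so repeat visits are not double counted
unique_patients = visits.drop_duplicates(subset='PatientID')

# normalize=True returns fractions; scale to percentages and round
share = unique_patients['Gender'].value_counts(normalize=True).mul(100).round()

print(share)
```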


              
                #use groupby to specify the column to group on
      #select the PatientID column
      #count unique values using the nunique method
      #use the plot accessor to specify the type of chart
      
      df_16.groupby('Gender')['PatientID'].nunique().plot.bar(title='Gender',color=['brown','green'],rot=1,alpha=0.9,ylabel='Distinct Count',width=0.8, figsize=(5,7) );


              
                #filter the data frame to rows where the No_show column is 'No'
      #group the resulting data frame by gender
      #select the PatientID column
      #count the number of unique IDs
      
      attended_appointment = df_16[df_16['No_show']=='No'].groupby('Gender')['PatientID'].nunique()


              
                attended_appointment

Gender
      F    34961
      M    19193
      Name: PatientID, dtype: int64


              
                attended_appointment.plot(kind='bar',color=['brown','green'],alpha=0.9, width=0.8,title='Gender',figsize=(5,7), ylabel='No. of patients',rot=1 );


              
                'The number of women ({}) who attended their appointments is higher than that of men ({})'.format(attended_appointment.F, attended_appointment.M)

'The number of women (34961) who attended their appointments is higher than that of men (19193)'


              
                df_16.head(1)


              
                hyper_status = df_16.groupby(['Hypertension', 'No_show'])['AppointmentID'].count()


              
                hyper_status

Hypertension  No_show
      0             No         70179
                    Yes        18547
      1             No         18029
                    Yes         3772
      Name: AppointmentID, dtype: int64


              
                #Appointments attended by patients without hypertension: Hypertension == 0 and No_show == 'No'
      no_hyper_attended_app = hyper_status.loc[(0, 'No')]


              
                #Appointments skipped by patients without hypertension: Hypertension == 0 and No_show == 'Yes'
      no_hyper_skipped_app = hyper_status.loc[(0, 'Yes')]


              
                #Appointment attendance rate for patients without hypertension
      no_hyper_rate = round((no_hyper_attended_app / (no_hyper_attended_app + no_hyper_skipped_app)) *100)


              
                "Patients without Hypertension have a {}% appointment attendance rate".format(no_hyper_rate)

'Patients without Hypertension have a 79% appointment attendance rate'


              
                #Appointments attended by patients with hypertension: Hypertension == 1 and No_show == 'No'
      hyper_attended_app = hyper_status.loc[(1, 'No')]


              
                #Appointments skipped by patients with hypertension: Hypertension == 1 and No_show == 'Yes'
      hyper_skipped_app = hyper_status.loc[(1, 'Yes')]


              
                hyper_rate = round(hyper_attended_app/(hyper_skipped_app+hyper_attended_app)*100)


              
                "Patients WITH Hypertension have a {}% appointment attendance rate".format(hyper_rate)

'Patients WITH Hypertension have a 83% appointment attendance rate'
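
Both rates can also be derived in one pass: the mean of a boolean "attended" flag within each group is that group's attendance rate. The sketch below runs on a small hypothetical `appts` frame rather than the real data.

```python
import pandas as pd

# Hypothetical appointments: four without hypertension, two with
appts = pd.DataFrame({
    'Hypertension': [0, 0, 0, 0, 1, 1],
    'No_show':      ['No', 'No', 'No', 'Yes', 'No', 'Yes'],
})

# True where the patient showed up (No_show == 'No')
attended = appts['No_show'].eq('No')

# The mean of a boolean series is the share of True values per group
rate = attended.groupby(appts['Hypertension']).mean().mul(100).round()

print(rate)
```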


              
                df_avg_age = df_16.groupby(['Gender', 'No_show'])['Age'].mean()


              
                plt.figure(figsize = (8,8))
      
      chartbar = sns.barplot(data=df_16, x='Gender',palette=['brown', 'green'],hue='No_show', y='Age')
      chartbar.set(title = 'Average age per Gender per No_show Category')
      
      # add a data label to every bar in the chart
      for bar in chartbar.containers:
          chartbar.bar_label(bar)


              
                #filter the data frame to pick appointment attendees only
      #select only the Age column from the data frame
      #use the describe method to return summary statistics of the values
      df_16[df_16['No_show']=='No']['Age'].describe()

count    88208.000000
      mean        37.790064
      std         23.338878
      min         -1.000000
      25%         18.000000
      50%         38.000000
      75%         56.000000
      max        115.000000
      Name: Age, dtype: float64


              
                df_16.drop(df_16.index[df_16['Age'] < 0],inplace=True)


              
                df_16[df_16['Age'] < 0]


              
                #RE-RUN describe to get new age percentiles for attendees
      df_16[df_16['No_show']=='No']['Age'].describe()

count    88207.000000
      mean        37.790504
      std         23.338645
      min          0.000000
      25%         18.000000
      50%         38.000000
      75%         56.000000
      max        115.000000
      Name: Age, dtype: float64


              
                #create the list of bin edges from the attendees' age summary (min, quartiles, max)
      age_summary = df_16[df_16['No_show']=='No']['Age'].describe()
      bins = [age_summary['min'], age_summary['25%'], age_summary['50%'], age_summary['75%'], age_summary['max']]


              
                #create a list to hold the names
      bin_names = ['young', 'middle_aged', 'old', 'very_old']


              
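                #add a new column age_groups to the df_16
      #pd.cut intervals are right-closed by default, so rows with Age equal to the lowest edge (0)
      #fall outside the first bin and get NaN unless include_lowest=True is passed
      df_16['age_groups'] = pd.cut(df_16['Age'], bins, labels=bin_names)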


              
                #the .columns attribute holds the column labels of the dataframe; confirm the new column exists
      'age_groups' in df_16.columns

True


              
                age_group_counts = df_16[df_16['No_show']=='No'].groupby('age_groups')['AppointmentID'].count()


              
                age_group_counts

age_groups
      young          19619
      middle_aged    22145
      old            21714
      very_old       21829
      Name: AppointmentID, dtype: int64


              
                # Plot the count of attended appointments for each age group
      age_group_counts.plot.bar(x='age_groups', color=['green', 'brown', 'blue','red'],figsize=(8,6),title='Attended appointments per age group',width=0.99, alpha=0.85,ylabel='Count of appointments', rot=45);
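
The counts above show how many attended appointments fall in each age group, but not the share of appointments attended. A minimal sketch of a per-group attendance rate follows. It assumes the bin edges reported by `describe()` above (0, 18, 38, 56, 115) and uses a small hypothetical `appts` frame. Unlike the cell above, it passes `include_lowest=True` so that age 0 lands in the first bin.

```python
import pandas as pd

# Hypothetical appointments spanning all four age bins
appts = pd.DataFrame({
    'Age':     [0, 10, 18, 30, 38, 50, 56, 80],
    'No_show': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No'],
})

bins = [0, 18, 38, 56, 115]  # min, quartiles and max from describe()
labels = ['young', 'middle_aged', 'old', 'very_old']

# include_lowest=True closes the first interval on the left, so Age == 0 is kept
appts['age_groups'] = pd.cut(appts['Age'], bins=bins, labels=labels, include_lowest=True)

# Share of attended appointments (No_show == 'No') within each age group
attended = appts['No_show'].eq('No')
rate = attended.groupby(appts['age_groups'], observed=True).mean().mul(100).round()

print(rate)
```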

Project: Investigating No-Show Appointments in Brazil in 2016

Table of Contents

Introduction

Data Dictionary

Questions to investigate

Data Wrangling

General Properties

Let's get the general outlook of the dataframe

No. of rows and columns in the dataframe

Checking for any duplicated entries in the rows

Checking for any rows with NULL/NaN entries

Printing the dataset for visual inspection

Inspect column data types

Data Cleaning

Issues with the Data

1. Column names

2. Data Types

1. Data Cleaning - Column Names

a. Rename the PatientId column

b. Rename the Hipertension column

c. Rename the No-show column

d. Confirm that column headers are clean

2. Data Cleaning - Converting data to correct data types

a. Converting PatientID column from float to str

Confirming PatientID's new data type

b. Extracting Time from ScheduledDay string and creating a new column (ScheduledDayTime) to insert the new values

c. Extracting Date from ScheduledDay column

d. Extracting Time from AppointmentDay string to a new column AppointmentDayTime

e. Extracting Date from AppointmentDay column

Let's visualize our dataframe and confirm that it is clean and ready for analysis

Exploratory Data Analysis

Research Question 1. What is the gender distribution of the patients?

Research Question 2. What is the gender distribution of patients who showed up to their appointments?

Research Question 3. Do patients with pre-existing medical conditions (hypertension) adhere to their appointments better than other patients?

Research Question 4. What are the average ages for patients based on their gender for the two no-show categories?

Research Question 5. Do medical appointment attendance rates improve as patients grow older?

Limitations

Conclusions

References

	PatientId	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Scholarship	Hipertension	Diabetes	Alcoholism	Handcap	SMS_received	No-show
0	2.987250e+13	5642903	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	0	1	0	0	0	0	No
1	5.589978e+14	5642503	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	0	0	0	0	No

	PatientID	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Scholarship	Hypertension	Diabetes	Alcoholism	Handcap	SMS_received	No_show	ScheduledDayTime
0	29872499824296	5642903	F	2016-04-29	2016-04-29	62	JARDIM DA PENHA	0	1	0	0	0	0	No	18:38:08
1	558997776694438	5642503	M	2016-04-29	2016-04-29	56	JARDIM DA PENHA	0	0	0	0	0	0	No	16:08:27