Predictors of Teacher Retention
This project is a Python adaptation (with tweaks and extensions) of my Empirical Data Analysis team's project. We use the School and Staffing Survey (SASS) and Teacher Follow-Up Survey (TFS) data files from 1999 to 2000 to understand predictors of teacher retention. Data is prepared, transformed, and then analyzed using a logistic regression model.
About the Data
The data utilized in this analysis are sourced from the School and Staffing Survey (SASS) and Teacher Follow-Up Survey (TFS) results for the academic year 1999-2000. Specifically, the analysis draws from the public school and public teacher files within the SASS, as well as the current teacher and former teacher files within the TFS. These comprehensive datasets provide valuable insights into various aspects of the education system during the specified period.
For access to the original data files, they can be found on the National Center for Education Statistics (NCES) website at the following link: https://nces.ed.gov/surveys/sass/dataprod9901.asp.
Reading in the Data
For reference, here is a table describing the data files used.
Survey | File Name |
---|---|
SASS (Public School) | sass_99_00_s3a_v1_0.sas7bdat |
SASS (Public Teacher) | sass_99_00_s4a_v1_0.sas7bdat |
TFS (Former Teacher) | sass_99_00_t2_v1_0.sas7bdat |
TFS (Current Teacher) | sass_99_00_t3_v1_0.sas7bdat |
For an unknown reason, the data files are only available in ASCII, SAS, and SPSS formats. As such, pyreadstat
and its read_sas7bdat()
function are helpful to access the data due to SAS being proprietary (not free).
While pandas
does have a read_sas()
function, it is not as robust as read_sas7bdat()
found in pyreadstat
.
import pyreadstat
import pandas as pd
import numpy as np
# Public school (SASS)
sass_school_public, meta_sass_school_public = pyreadstat.read_file_multiprocessing(
pyreadstat.read_sas7bdat,
'sass_99_00_s3a_v1_0.sas7bdat',
)
# Public teacher (SASS)
sass_teacher_public, meta_sass_teacher_public = pyreadstat.read_file_multiprocessing(
pyreadstat.read_sas7bdat,
'sass_99_00_s4a_v1_0.sas7bdat',
)
# Former teacher (TFS)
tfs_teacher_former, meta_tfs_teacher_former = pyreadstat.read_file_multiprocessing(
pyreadstat.read_sas7bdat,
'sass_99_00_t2_v1_0.sas7bdat',
)
# Current teacher (TFS)
tfs_teacher_current, meta_tfs_teacher_current = pyreadstat.read_file_multiprocessing(
pyreadstat.read_sas7bdat,
'sass_99_00_t3_v1_0.sas7bdat',
)
Preparing the Data
Replacing -8
and -9
With NaN
Depending on the reason, missing values are originally coded with -8
and -9
. Because the distinction makes no seemingly practical difference, we replace them with np.NaN
. This prevents those values from unknowingly messing with results.
# Public school (SASS)
sass_school_public.replace([-8, -9], np.NaN, inplace = True)
# Public teacher (SASS)
sass_teacher_public.replace([-8, -9], np.NaN, inplace = True)
# Former teacher (TFS)
tfs_teacher_former.replace([-8, -9], np.NaN, inplace = True)
# Current teacher (TFS)
tfs_teacher_current.replace([-8, -9], np.NaN, inplace = True)
Dropping Imputation Flag and SURVEY
Columns
While nice to have, the imputation flag columns are not necessary to operations. As such, we drop them to help improve performance in later operations. Additionally, the SURVEY
column serves little purpose, since all data files will eventually be merged.
# Public school (SASS)
sass_school_public.drop(
list(sass_school_public.filter(regex = 'F_')) + ['SURVEY'],
axis = 1,
inplace = True
)
# Public teacher (SASS)
sass_teacher_public.drop(
list(sass_teacher_public.filter(regex = 'F_')) + ['SURVEY'],
axis = 1,
inplace = True
)
# Former teacher (TFS)
tfs_teacher_former.drop(
list(tfs_teacher_former.filter(regex = 'F_')) + ['SURVEY'],
axis = 1,
inplace = True
)
# Current teacher (TFS)
tfs_teacher_current.drop(
list(tfs_teacher_current.filter(regex = 'F_')) + ['SURVEY'],
axis = 1,
inplace = True
)
Combining the TFS Data Files
To accurately assess later if a teacher has stayed at the same school, relocated to another school, or left teaching altogether, we must combine the two data files. We use the concat()
function to combine the two data frames.
tfs_teacher_all = pd.concat([tfs_teacher_current, tfs_teacher_former])
Merging SASS Data Files
Combining sass_teacher_public
and sass_school_public
provides us with a more comprehensive understanding of factors that might have led to a teacher leaving teaching. Together, they allow for examination of both individual-level factors and school-level factors.
We can describe this merge as a joining sass_school_public
onto sass_teacher_public
using SCHCNTL
as the key. Due to the fact that many teachers can belong to one school, our validation reflects that (m:1
). Lastly, we attach the suffix _SCHOOL
to any overlapping columns from sass_school_public
. We will use this when imputing values later.
sass_teacherSchool_public = pd.merge(
sass_teacher_public,
sass_school_public,
on = 'SCHCNTL',
how = 'left',
validate = 'm:1',
suffixes = (None, '_SCHOOL')
)
Merging sass_teacherSchool_public
and tfs_teacher_all
Our last merge will isolate teachers that have taken the TFS, thus allowing the identification of stayers, movers, and leavers.
We can describe this merge as joining sass_school_public
and sass_teacher_public
using CNTLNUM
as the key. Given that each record in sass_teacherSchool_public
corresponds to exactly one record in tfs_teacher_all
, our validation reflects that one-to-one relationship (1:1
). Finally, similar to the prior merge, we attach the suffix _TFS
to any overlapping columns from tfs_teacher_all
.
public_teacher = pd.merge(
sass_teacherSchool_public,
tfs_teacher_all,
on = 'CNTLNUM',
how = 'inner',
validate = '1:1',
suffixes = (None, '_TFS')
)
Imputing Values and Pruning Columns
The final step in data preparation is identifying opportunities for imputation. Here, imputation will be derived from overlapping columns identified in the prior merges, i.e., columns with either a _SCHOOL
or _TFS
suffix. The following operation identifies those columns.
overlapping_columns = public_teacher.filter(regex='_TFS$').columns.append(
public_teacher.filter(regex='_SCHOOL$').columns
).sort_values()
overlapping_columns
Index(['AGE_T_TFS', 'ASSIGN_TFS', 'COLBARG_TFS', 'ENRK12UG_TFS',
'MINENR_SCHOOL', 'MINTCH_SCHOOL', 'NEWTCH_TFS', 'NUMTCH_SCHOOL',
'PGMTYPE_SCHOOL', 'PGMTYPE_TFS', 'RACETH_T_TFS', 'REGION_SCHOOL',
'REGION_TFS', 'SCHCNTL_TFS', 'SCHLEVEL_SCHOOL', 'SCHLEVEL_TFS',
'SECTOR_SCHOOL', 'SECTOR_TFS', 'STU_TCH_SCHOOL', 'TOTEXPER_TFS',
'URBANIC_SCHOOL', 'URBANIC_TFS', 'YRTCHSC_TFS'],
dtype='object')
Now that overlapping columns have been identified, we should compare them against the original column they overlap. Save for missing values, the overlapping column’s value should be identical to the original column’s value for each observation. Thus, the following block:
- Initializes an empty dictionary
discrepant_columns
. - Loops over every overlapping column to:
- Determine the name of the original column (
original_column
). - Isolate rows with discrepancies where null values in the overlapping column are disregarded, storing them in
discrepancies
. - Count the number of discrepant rows (
discrepancies_count
) and locate where they are (discrepancies_index
). - If there are discrepancies (
discrepancies_count > 0
), add an item to the dictionary, where the:- Key is a tuple containing the names of the original and overlapping columns (
(original_column, overlapping_column)
). - Value is the position of the discrepant rows (
discrepancies_index
).
- Key is a tuple containing the names of the original and overlapping columns (
- Determine the name of the original column (
- Cleans up the temporary variables.
import re
discrepant_columns = {}
for overlapping_column in overlapping_columns:
original_column = re.split(r'_TFS|_SCHOOL', overlapping_column)[0]
discrepancies = public_teacher.loc[
(public_teacher[original_column] != public_teacher[overlapping_column]) &
(public_teacher[overlapping_column].notnull()),
overlapping_column
]
discrepancies_count, discrepancies_index = discrepancies.count(), discrepancies.index
if discrepancies_count > 0:
discrepant_columns[(original_column, overlapping_column)] = discrepancies_index
del overlapping_column, original_column, discrepancies, discrepancies_count, discrepancies_index
discrepant_columns
{}
The dictionary discrepant_columns
ends up being empty, meaning either the columns are identical or any discrepancies arose as a result of non-null values inside the overlapping columns. As such, there is no need for imputation.
Transforming the Data
Before any analysis can be done, we must transform the data. Below are two tables, one for teacher-level variables and another for school-level variables.
Adding a New Data Frame
To prevent overwriting or erasure, we create a new data frame (final_data
) using CNTLNUM
and SCHCNTL
.
final_data = pd.DataFrame(public_teacher[['CNTLNUM', 'SCHCNTL']])
Creating a Custom where_binary()
Function
While it seems logical to go ahead and use np.where()
for some of our transformations, it’s important to remember that it also evaluates null values. This could drastically alter the results of any regressions where NaN
was not preserved.
Thus, we create the where_binary()
function. It accepts a column and a condition (or rather a column of binary values). It first checks for NaN
values and assigns np.NaN
when true. When false, the condition is evaluated in a nested condition, assigning 1
if true and 0
if false.
def where_binary(column: pd.Series, condition: pd.Series) -> np.ndarray:
return np.where(
pd.isnull(column), np.NaN,
np.where(condition, 1, 0)
)
Teacher-Level Variables
The teacher variables provide insights into individual teacher characteristics and experiences within the educational system.
Variable | Source | Description |
---|---|---|
teacher_left | STATUS | Indicates whether the teacher has left their teaching position. |
teacher_age_50_or_above | AGE_T | Indicates whether the teacher is 50 years of age or older. |
teacher_experience_years | TOTEXPER | Represents the total number of years the teacher has been teaching. |
teacher_age_50_or_above*teacher_experience_years | Interaction term between teacher age 50 or above and teacher experience years. | |
teacher_management_perception | MGMNT_T | Reflects the teacher’s perception of the management at their school. |
teacher_female | T0356 | Indicates whether the teacher is female. |
teacher_race_ethnicity_AIAN_non_Hispanic | RACETH_T | Indicates whether the teacher is American Indian or Alaska Native, non-Hispanic. |
teacher_race_ethnicity_API_non_Hispanic | RACETH_T | Indicates whether the teacher is Asian or Pacific Islander, non-Hispanic. |
teacher_race_ethnicity_Black_non_Hispanic | RACETH_T | Indicates whether the teacher is Black, non-Hispanic. |
teacher_race_ethnicity_Hispanic | RACETH_T | Indicates whether the teacher is Hispanic, regardless of race. |
teacher_paid_for_summer_school | T0341 | Indicates whether the teacher has received pay for summer school. |
teacher_salary_below_25000 | T0347 | Indicates whether the teacher’s salary is below $25,001. |
teacher_salary_above_45000 | T0347 | Indicates whether the teacher’s salary is $45,001 or above. |
teacher_union_member | T0355 | Indicates whether the teacher is a member of a union. |
teacher_left
To derive teacher_left
from STATUS
, we utilize a conditional statement. If the value in the STATUS
column equals 'L'
, 1
is assigned to teacher_left
, indicating that the teacher has left. Otherwise, if the condition is not met, 0
is assigned, indicating that the teacher has not left.
final_data['teacher_left'] = where_binary(
public_teacher['STATUS'],
public_teacher['STATUS'] == 'L'
)
teacher_age_50_or_above
By using a conditional, we assign to teacher_age_50_or_above
1
if AGE_T
is 4
(“50 years or older”) and 0
otherwise, save for missing values.
final_data['teacher_age_50_or_above'] = where_binary(
public_teacher['AGE_T'],
public_teacher['AGE_T'] == 4
)
teacher_experience_years
Here, teacher_experience_years
is simply a copy of TOTEXPER
. No transformation is needed.
final_data['teacher_experience_years'] = public_teacher['TOTEXPER']
teacher_age_50_or_above*teacher_experience_years
This is simply an interaction term, thus we multiply teacher_age_50_or_above
and teacher_experience_years
.
final_data['teacher_age_50_or_above*teacher_experience_years'] = final_data['teacher_age_50_or_above'] * final_data['teacher_experience_years']
teacher_management_perception
Similarly, teacher_management_perception
is a copy of MGMNT_T
, needing no transformation.
final_data['teacher_management_perception'] = public_teacher['MGMNT_T']
teacher_female
To get teacher_female
, we subtract one from T0356
, as the binary category uses 1
for male and 2
for female.
final_data['teacher_female'] = public_teacher['T0356'] - 1
teacher_race_ethnicity_AIAN_non_Hispanic
, teacher_race_ethnicity_API_non_Hispanic
, teacher_race_ethnicity_Black_non_Hispanic
, and teacher_race_ethnicity_Hispanic
Because of the hassle using categorical variables in regressions, we separate RACETH_T
into four variables. The category 4
(White, non-Hispanic) is used as the base case, meaning it is not included as a column/variable.
final_data['teacher_race_ethnicity_AIAN_non_Hispanic'] = where_binary(
public_teacher['RACETH_T'],
public_teacher['RACETH_T'] == 1
)
final_data['teacher_race_ethnicity_API_non_Hispanic'] = where_binary(
public_teacher['RACETH_T'],
public_teacher['RACETH_T'] == 2
)
final_data['teacher_race_ethnicity_Black_non_Hispanic'] = where_binary(
public_teacher['RACETH_T'],
public_teacher['RACETH_T'] == 3
)
final_data['teacher_race_ethnicity_Hispanic'] = where_binary(
public_teacher['RACETH_T'],
public_teacher['RACETH_T'] == 5
)
teacher_paid_for_summer_school
Because the desired base case is a teacher that hasn’t been paid for teaching summer school, and the code for yes is 1
, another conditional must be used to derive teacher_paid_for_summer_school
from T0341
.
final_data['teacher_paid_for_summer_school'] = where_binary(
public_teacher['T0341'],
public_teacher['T0341'] == 1
)
teacher_salary_below_25000
and teacher_salary_above_45000
Rather than encoding each marginal category for T0347
, we took the highest and lowest categories and created teacher_salary_below_25000
and teacher_salary_above_45000
using conditionals. The codes for “Less than $25,001” and “$45,001 or more” inside T0347
are 1
and 5
, respectively.
final_data['teacher_salary_below_25000'] = where_binary(
public_teacher['T0347'],
public_teacher['T0347'] == 1
)
final_data['teacher_salary_above_45000'] = where_binary(
public_teacher['T0347'],
public_teacher['T0347'] == 5
)
teacher_union_member
We derive teacher_union_member
from T0355
, using a teacher that is not a union member as the base case. Like teacher_paid_for_summer_school
, 1
is yes, so we must use another conditional.
final_data['teacher_union_member'] = where_binary(
public_teacher['T0355'],
public_teacher['T0355'] == 1
)
School-Level Variables
The school variables offer key details about the educational institutions themselves.
Variable | Source | Description |
---|---|---|
school_has_secondary_grades | SCHLEVEL | Indicates whether the school has any grades within the range of 7 to 12. |
school_minority_enrollment_above_20_percent | MINENR | Indicates whether the school’s minority enrollment is above 20%. |
school_conducts_drug_sweeps | S0200 | Indicates whether the school conducts drug sweeps. |
school_has_daily_police_security | S0201 | Indicates whether the school has daily police or security presence. |
school_region_Northeast | REGION | Indicates whether the school is located in the Northeast region. |
school_region_Midwest | REGION | Indicates whether the school is located in the Midwest region. |
school_region_West | REGION | Indicates whether the school is located in the West region. |
school_has_secondary_grades
To derive school_has_secondary_grades
from SCHLEVEL
, we use another conditional, where 1
is returned when SCHLEVEL
is 2
or higher.
Looking at SCHLEVEL
, we see that category 1
(“school has any of grades K–6 and none of grades 9–12”) precludes schools having any of grades 9–12, and category 2
(“school has any of grades 7-12 and none of grades K-6”) precludes schools having any of grades K–6. Thus, we can safely assume that category 3
(“all other cases”) includes schools having at least any of grades 7–12 in addition to any of grades K–6. We overlook that distinction though, instead focusing on schools that have at least any grades within the range of 7 to 12.
final_data['school_has_secondary_grades'] = where_binary(
public_teacher['SCHLEVEL'],
public_teacher['SCHLEVEL'] >= 2
)
school_minority_enrollment_above_20_percent
To get school_minority_enrollment_above_20_percent
, we subtract one from MINENR_S
, as the binary category uses 1
for “fewer than 20% minority students” and 2
for “20% or more minority students.”
final_data['school_minority_enrollment_above_20_percent'] = public_teacher['MINENR_S'] - 1
school_conducts_drug_sweeps
We want to use schools that aren’t currently conducting drug sweeps as the base case, so we use a conditional to assign 1
when S0200
is 1
and 0
otherwise.
final_data['school_conducts_drug_sweeps'] = where_binary(
public_teacher['S0200'],
public_teacher['S0200'] == 1
)
school_has_daily_police_security
Similar to school_conducts_drug_sweeps
want to use schools that don’t have daily police/security presence as the base case, so we use a conditional to assign 1
when S0201
is 1
and 0
otherwise.
final_data['school_has_daily_police_security'] = where_binary(
public_teacher['S0201'],
public_teacher['S0201'] == 1
)
school_region_Northeast
, school_region_Midwest
, and school_region_West
Similar to how we handled race/ethnicity, we separate REGION
into distinct variables, where South is the base case. Unlike before though, REGION
uses a string to represent its numbers, so we use, for example, '1'
instead of 1
.
final_data['school_region_Northeast'] = where_binary(
public_teacher['REGION'],
public_teacher['REGION'] == '1'
)
final_data['school_region_Midwest'] = where_binary(
public_teacher['REGION'],
public_teacher['REGION'] == '2'
)
final_data['school_region_West'] = where_binary(
public_teacher['REGION'],
public_teacher['REGION'] == '4'
)
Simple Logistic Regression
Equation
We start with a simple logistic regression model like this:
Performing the Regression
To perform this regression, we use the logit()
function from statsmodels.formula.api
. Due to the number of variables, we use Python’s f-strings to help prevent misspellings or forgetting of variables. We assign to y_var
our dependent variable teacher_left
. x_vars
is assigned all columns in final_data
except for y_var
, CNTLNUM
, and SCHCNTL
. The latter two are control numbers for teachers and schools, respectively. Finally, we assign to the missing
parameter 'drop'
to drop all the observations with missing values.
import statsmodels.formula.api as sm
y_var = 'teacher_left'
x_vars = final_data.columns.drop([y_var, 'CNTLNUM', 'SCHCNTL'])
logit_model = sm.logit(
f'{y_var} ~ {" + ".join(x_vars)}',
data = final_data,
missing = 'drop'
).fit()
Optimization terminated successfully.
Current function value: 0.558194
Iterations 6
ROC Curve
Here, we use sklearn
and matplotlib
to generate and plot the ROC curve.
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
false_positive_rate, true_positive_rate, thresholds = roc_curve(
final_data.dropna()[y_var],
logit_model.predict()
)
roc_auc = auc(false_positive_rate, true_positive_rate)
plt.figure()
plt.plot(
false_positive_rate,
true_positive_rate,
color = 'green',
lw = 2,
label = 'ROC curve (area = %0.2f)' % roc_auc
)
plt.plot([0, 1], [0, 1], color = 'navy', lw = 2, linestyle = '--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc ='lower right')
plt.show()
Results and Inferences
Results
We now use the summary()
method on logit_model
to display the output from fitting the model. We also use an alpha of 0.05 to evaluate significance.
logit_model.summary()
Logit Regression Results
==============================================================================
Dep. Variable: teacher_left No. Observations: 3828
Model: Logit Df Residuals: 3807
Method: MLE Df Model: 20
Date: Date Pseudo R-squ.: 0.1722
Time: Time Log-Likelihood: -2136.8
converged: True LL-Null: -2581.4
Covariance Type: nonrobust LLR p-value: 1.525e-175
====================================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------------------
Intercept -1.3920 0.232 -5.997 0.000 -1.847 -0.937
teacher_age_50_or_above -0.1389 0.202 -0.688 0.492 -0.535 0.257
teacher_experience_years 0.0404 0.007 6.164 0.000 0.028 0.053
teacher_age_50_or_above:teacher_experience_years 0.0447 0.009 4.737 0.000 0.026 0.063
teacher_management_perception -0.0104 0.010 -1.015 0.310 -0.031 0.010
teacher_female 0.0389 0.086 0.455 0.649 -0.129 0.207
teacher_race_ethnicity_AIAN_non_Hispanic 0.3976 0.266 1.492 0.136 -0.125 0.920
teacher_race_ethnicity_API_non_Hispanic 0.2846 0.271 1.050 0.294 -0.247 0.816
teacher_race_ethnicity_Black_non_Hispanic 0.0132 0.165 0.080 0.936 -0.309 0.336
teacher_race_ethnicity_Hispanic -0.4089 0.187 -2.181 0.029 -0.776 -0.042
teacher_paid_for_summer_school -0.4059 0.113 -3.608 0.000 -0.626 -0.185
teacher_salary_below_25000 0.2948 0.105 2.808 0.005 0.089 0.501
teacher_salary_above_45000 -0.2320 0.116 -2.007 0.045 -0.459 -0.005
teacher_union_member -0.3861 0.089 -4.354 0.000 -0.560 -0.212
school_has_secondary_grades 1.1125 0.093 12.022 0.000 0.931 1.294
school_minority_enrollment_above_20_percent 0.0528 0.087 0.610 0.542 -0.117 0.222
school_conducts_drug_sweeps 0.0486 0.091 0.536 0.592 -0.129 0.226
school_has_daily_police_security -0.2612 0.089 -2.949 0.003 -0.435 -0.088
school_region_Northeast -0.1278 0.136 -0.939 0.348 -0.395 0.139
school_region_Midwest -0.1006 0.108 -0.933 0.351 -0.312 0.111
school_region_West 0.3898 0.104 3.757 0.000 0.186 0.593
====================================================================================================================
teacher_age_50_or_above
We might be tempted to say something regarding this variable; however, given its p-value (0.492) and consequent inclusion of zero in the 95% confidence interval, it is not statistically significant. Thus, we cannot say if a teacher being age fifty or above has any impact on a teacher leaving teaching.
teacher_experience_years
Looking at the effect of the number of years of teaching experience a given teacher has, we can say that, holding all else constant, the more years of experience a teacher has, the more likely they are to leave teaching. We base this off of the p-value (<0.001).
This interpretation aligns with our supposition, though it’s important to note that we can’t explicitly say why this is. Is it because a teacher’s patience has worn thin or because they realized teaching isn’t what they wanted to do?
teacher_age_50_or_above*teacher_experience_years
The interaction term for a teacher being age fifty or above and teaching experience in years shows a similar result. The p-value (<0.001) indicates that it is statistically significant. Thus, holding all else constant, the more years of experience a teacher age fifty or above has, they more likely they are to leave teaching.
Additionally, as this is an interaction term, the impact of years of experience on a teacher leaving is comparatively higher for those age fifty or above.
We repeat our note: we cannot say why this is. Here, it could be that these teachers are that much closer to becoming eligible to retire.
teacher_management_perception
Given the p-value (0.310), we cannot say that a teacher’s perception of management at their school influences a teacher leaving teaching.
teacher_female
The same rings true here based on the p-value (0.649). We cannot say, one way or another, that a teacher being female has any impact on that teacher leaving teaching.
teacher_race_ethnicity_AIAN_non_Hispanic
, teacher_race_ethnicity_API_non_Hispanic
, teacher_race_ethnicity_Black_non_Hispanic
, and teacher_race_ethnicity_Hispanic
As a reminder, the base case for this group of variables is a White, non-Hispanic teacher. Compared to that base case, only a teacher that is Hispanic (regardless of race) is statistically significant based on p-value (0.029). Thus, holding all else constant, a teacher that is Hispanic (regardless of race) is less likely to leave teaching than that of a White, non-Hispanic teacher.
teacher_paid_for_summer_school
Based on the p-value (<0.001), this variable is statistically significant, so we can say that, all else being equal, a teacher paid for teaching summer school is less likely to leave teaching.
We found this to be in opposition to what we initially thought. While we cannot explicitly say why this is, we can provide a few suppositions using this as a proxy for determining if a teacher had taught summer school:
- Teachers that teach summer school are more experienced and thus less likely to leave in general.
- Teachers that teach summer school are more “dedicated” to teaching and thus more likely to continue teaching despite adverse climate, job conditions, etc.
teacher_salary_below_25000
and teacher_salary_above_45000
Conversely, we found the results here aligned with expectations. The p-value (0.005) for a teacher with a salary below $25,001, and the p-value (0.045) for a teacher with a salary above $45,001 indicate that both variables are statistically significant. As such, we can say that a teacher with a salary below $25,001 is more likely to leave teaching, while a teacher with a salary above $45,001 is less likely to leave teaching, holding all else constant. We should note that these are in comparison to teachers with a salary between those two figures, not with each other.
teacher_union_member
Based on the p-value (<0.001), this variable is statistically significant. Holding all else constant, we can say that a teacher being a member of a union is less likely to leave teaching.
As before, we cannot offer an explicit explanation, but we can create a supposition using simplified economic concepts. In the absence of a union, the relationship between a school district and teachers can be described as a monopsony—a market with one buyer (the district) and many sellers (individual teachers). Here, the district has substantial buying power, allowing it to almost unilaterally set wages, terms, etc.
To combat this, the sellers may effectively merge into a monopoly (union), consolidating selling power, often referred to as bargaining power for unions. This can enable negotiable wages, terms, etc. This new relationship between one buyer (the district) and one seller (the newfound union) would be roughly described as a bilateral monopoly. Analysis of this market structure often requires use of game theory or other methods for studying bargaining dynamics.
school_has_secondary_grades
As we expected, teachers who teach at schools with grades seven through twelve are, holding all else constant, more likely to leave teaching. This variable is also statistically significant based on the p-value (<0.001).
We reasoned that this may be a result of potentially increased stress resulting from teaching older students, though we can’t say for sure.
school_minority_enrollment_above_20_percent
As this variable has a p-value of 0.542, it is not statistically significant. Thus, we cannot say a school with minority enrollment above 20% influences a teacher leaving teaching.
school_conducts_drug_sweeps
Likewise, with a p-value of 0.592, this variable is also not statistically significant, meaning we cannot say a school conducting drug sweeps has an impact on a teacher leaving teaching.
school_has_daily_police_security
With a p-value of 0.003, this variable is statistically significant, meaning can say, holding all else equal, that the daily presence of police/security in a school reduces the likelihood of a teacher leaving teaching.
This was opposite what we initially thought. We supposed that daily presence of police/security would have indicated an unsafe school or otherwise made teachers feel unsafe. Rather, we possibly see the converse.
school_region_Northeast
, school_region_Midwest
, and school_region_West
As a reminder, the base case for this group of variables is school with region South. Compared to that base case, only a school with region West is statistically significant based on p-value (<0.001). Thus, holding all else constant, a teacher in a West region school is less likely to leave teaching than that of a teacher in a South region school.