Name: Matt Berseth
Location: Jacksonville FL
Competition: Practice Fusion Diabetes Classification
To extract features, I created a script that let me define features by specifying individual diagnosis codes, groups of diagnosis codes, or a fuzzy match on the diagnosis description. I have provided more details on this in section 2.
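To illustrate those three definition styles, here is a hypothetical sketch; the record layout and helper names are invented for the example and are not the actual script:

```python
import re

# Invented diagnosis records for illustration only.
diagnoses = [
    {"icd9": "401.1", "description": "Essential hypertension"},
    {"icd9": "272.4", "description": "Hyperlipidemia, unspecified"},
    {"icd9": "250.00", "description": "Diabetes mellitus without complication"},
]

def has_code(records, code):
    """Feature from a single ICD9 code: exact match."""
    return any(r["icd9"] == code for r in records)

def has_code_in_group(records, prefixes):
    """Feature from a group of ICD9 codes, e.g. 401-405 for hypertension."""
    return any(r["icd9"].startswith(p) for r in records for p in prefixes)

def has_description(records, pattern):
    """Feature from a fuzzy match on the diagnosis description."""
    return any(re.search(pattern, r["description"], re.IGNORECASE) for r in records)

features = {
    "HasHyperlipidemia": has_code(diagnoses, "272.4"),
    "HasHypertension": has_code_in_group(diagnoses, [str(c) for c in range(401, 406)]),
    "MentionsDiabetes": has_description(diagnoses, r"diabetes"),
}
```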
I used boosted trees as my base learners. I blended my best learners linearly and used the coefficients to weight the contribution of each individual learner.
The primary data source that I used was Wikipedia's page on type 2 diabetes [1]. I went through this page and mapped each of the complicating conditions to its ICD9 codes / groups. In addition to Wikipedia, I also leveraged HCUP [2] and cross-walked some of the ICD9 codes to their ICD10 groupings [3].
In total, I extracted nearly 400 features from the training data. These features can be categorized into the following buckets:
Patient features: {Age, Gender, BMI, blood pressure, etc ...}
Diagnosis features: {Most frequent ICD9 code; has an occurrence of a specific code such as hyperlipidemia, hypertension, or any of the other ICD9 codes that could be considered complicating conditions.}
Generated feature names followed patterns such as: _01, ICD9_02 ... ICD9_07_01 ... MedICD9_04_280_289_240_246_249_259_01 .. NDC_04_DrugProfile_ClusterNumber_DrugProfile_Cluster1Prob ... DMX_DrugProfile_Cluster5Prob_DxLevelClusters_ClusterNumber_DxLevelClusters_Cluster1Prob ... DMX_DxLevelClusters_Cluster5Prob

From these feature groups, I selected features by doing the following:
My base learners were boosted trees. I used cross-validation on the training data to select the model parameters. My best submissions used somewhere between 275 and 375 trees, a learning rate of 0.1, a maximum depth of 3, and a minimum of 20-30 cases in each leaf node.
I trained different learners on different subsets of the above features. These base learners were blended linearly, using the regression coefficients to weight each learner's contribution to the overall prediction.
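A minimal sketch of that blending step on synthetic data (LinearRegression stands in for whatever fit produced the coefficients; the sizes and noise levels are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
y = rng.randint(0, 2, size=200).astype(float)

# Stand-ins for three base learners' predicted probabilities, with
# increasing amounts of noise around the true label.
preds = np.column_stack(
    [np.clip(y + rng.normal(0, s, size=200), 0, 1) for s in (0.2, 0.3, 0.5)]
)

# Fit a linear model on the base learners' predictions; its coefficients
# act as the per-learner weights in the blended prediction.
blender = LinearRegression().fit(preds, y)
weights = blender.coef_           # per-learner contribution weights
blended = blender.predict(preds)  # coefficient-weighted blend
```

Because the blend is fit against the target, the less noisy learners end up carrying more weight, and the combined prediction is at least as good in-sample as any single learner.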
I also used clustering models to segment the patients to develop a drug profile, patient profile (age, weight, bmi, etc ...) and a diagnosis profile. These models were also used to generate outlier features (i.e. given the cluster model, how likely is the case data).
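The cluster models themselves were built outside Python (in the Analysis Services project described below), but the outlier idea can be sketched with scikit-learn's GaussianMixture on invented profile data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Hypothetical patient profile: age, weight (kg), BMI.
profiles = np.column_stack([
    rng.normal(55, 10, size=300),
    rng.normal(85, 15, size=300),
    rng.normal(29, 5, size=300),
])

# Fit a 5-cluster mixture model to the profiles.
gm = GaussianMixture(n_components=5, random_state=0).fit(profiles)

cluster_number = gm.predict(profiles)       # analogous to a ClusterNumber feature
cluster_probs = gm.predict_proba(profiles)  # analogous to Cluster1Prob .. Cluster5Prob
outlier_score = gm.score_samples(profiles)  # log-likelihood: low values flag outliers
```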
The external data source I used was primarily Wikipedia [1]. I also included features from HCUP [2] and the ICD10 cross-walk [3].
I used a combination of .Net/C#, SQL Server and Python to create my solution. I stored the data in a SQL Server database and used a utility console application to generate the features from the training tables. After the features were generated, I exported the training and test case tables to flat files and used these as input into the Python scripts that contained the actual models.
The source tree I used has the following structure
/source
/attic
/external_data
/last_weekend_notes
/net
/notes
/python
/R
/sql
The attic folder holds prototype code. external_data keeps track of my data sources. last_weekend_notes is from my last push at finalizing my model: I had started to look at the gbm R package and compare it to the Python packages I was using, and the R folder contains some of this work as well. notes holds general notes I took along the way, mostly regarding feature selection.
The sql, net and python folders are the most important directories. sql contains a number of scripts that I used to clean the data. It also contains the Analysis Services project I used to create the cluster models.
The net folder contains a visual studio project and corresponding C# code for generating the features. The solution, ml_practice_fusion, contains 3 projects:
ml_practice_fusion
ml_practice_fusion.etl
ml_practice_fusion.etl.features
ml_practice_fusion contains common code like connection strings and data access logic. ml_practice_fusion.etl is the driver application and entry point. It also contains some basic file parsing and generation logic.
Finally, the majority of the code lives in the last project, ml_practice_fusion.etl.features. This project has a class file for each feature; the class is responsible for adding the column to the test and training case tables and populating it with the proper value.
For example, some of my most important features were groups of ICD9 codes, matched either on a pattern of the actual code or on the diagnosis description. For these feature types, I created a base class that handles the generic processing, with each derived class specifying the actual expression to match on.
Concretely, here is the class definition for my HasHypertension feature:
public class DxFeature_HasHypertension : DxFeature
{
    public DxFeature_HasHypertension()
        : base()
    {
        this.ColumnName = "HasHypertension";
        this.LikeExpressions.Add("40[1-5]%");
    }
}
As you can see, this class is just metadata, defining the column name and the LIKE expression to use. The majority of the features are generated following a similar pattern.
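The LIKE expression is SQL Server pattern syntax: "40[1-5]%" matches any ICD9 code beginning 401 through 405. A small hypothetical Python helper, shown only to make the matching rule concrete (not part of the project):

```python
import re

def sql_like_to_regex(like):
    """Translate a simple SQL LIKE pattern into an anchored regex.

    Only the constructs used here are handled: '%' (any run of
    characters), '_' (any single character), and character classes
    like '[1-5]', which pass through unchanged.
    """
    out = ""
    for ch in like:
        if ch == "%":
            out += ".*"
        elif ch == "_":
            out += "."
        elif ch in "[]-" or ch.isalnum():
            out += ch
        else:
            out += re.escape(ch)
    return "^" + out + "$"

hypertension = re.compile(sql_like_to_regex("40[1-5]%"))
```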
All of the models are created using the Python scikit-learn [4] package. Like the .Net project, I have a couple of files for common things: io.py contains logic to read/write CSV files, score.py implements the log loss function, and features.py contains different listings of features.
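For reference, the competition's log loss metric can be sketched as follows (a minimal re-implementation, not the score.py source, with the usual clipping to keep the logarithm finite):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binomial log loss; predictions are clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```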
Individual models, the base learners I used as input into my blending/stacking routine, were trained using model.py. model.py contains an eval function that is passed a set of features and a tree classifier from the scikit-learn ensemble module. I used this common function to train and cross-validate GradientBoostingClassifier, ExtraTreesClassifier and RandomForestClassifier. The run.py file was used to set up the model and features and then call the model.eval function. Here is an example:
import numpy as np
import model
import features
from sklearn.ensemble import GradientBoostingClassifier as gb
f = features.submission21
loss = model.eval(f, gb(n_estimators=300, learn_rate=.1, max_depth=3, random_state=1))
print 'Avg Loss:', np.mean(loss), np.std(loss)
In this example, I am using the features I used for my 21st submission, passing this and the GradientBoostingClassifier to my model.eval function. model.eval returns the log loss for each fold of the cross validation.
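model.eval itself lives in the source tree; a rough reconstruction of what such a function does, using current scikit-learn names (learning_rate rather than the older learn_rate) and synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

def eval_model(features, labels, clf, n_folds=5):
    """Fit the classifier on each CV training fold; return the log loss
    of its predicted probabilities on each held-out fold."""
    losses = []
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    for train_idx, test_idx in kf.split(features):
        clf.fit(features[train_idx], labels[train_idx])
        probs = clf.predict_proba(features[test_idx])[:, 1]
        losses.append(log_loss(labels[test_idx], probs, labels=[0, 1]))
    return losses

# Synthetic stand-in for the exported feature table.
X, y = make_classification(n_samples=300, n_features=20, random_state=1)
clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 max_depth=3, random_state=1)
loss = eval_model(X, y, clf)
```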
Blended models are trained and cross-validated using stack.py. Like model.eval, stack has a similar interface, except that it takes a list of models and their feature sets. Here is an example:
items = [
{
'clf': gb(n_estimators=400, learn_rate=.09, max_depth=3, max_features=10, min_samples_leaf=30, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=400, max_depth=3, max_features=10, min_samples_leaf=30, learn_rate=.09, random_state=1),
'features': features.next_submission_v2
},
{
'clf': gb(n_estimators=400, max_depth=3, max_features=10, min_samples_leaf=30, learn_rate=.09, random_state=1),
'features': features.beat_blind_ape
}
]
eval(items)
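stack.py is likewise in the source tree; a rough sketch of driving an items list like the one above, with column-index feature subsets on synthetic data, and a plain average standing in for the coefficient-weighted blend:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier as gb
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_features=20, random_state=1)

# Hypothetical items list: each entry pairs a classifier with the column
# indices of its feature subset.
items = [
    {"clf": gb(n_estimators=40, max_depth=3, random_state=1),
     "features": list(range(0, 10))},
    {"clf": gb(n_estimators=40, max_depth=3, random_state=1),
     "features": list(range(5, 20))},
]

base_preds = []
for item in items:
    cols = item["features"]
    # Out-of-fold probabilities for this base learner / feature subset.
    p = cross_val_predict(item["clf"], X[:, cols], y, cv=5,
                          method="predict_proba")[:, 1]
    base_preds.append(p)

blended = np.mean(base_preds, axis=0)
```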
grid_search.py is used to tune the model parameters. The top of this file includes custom classifiers that allow plugging the log loss into the evaluation/scoring. I used ga.py to search the feature space, attempting to identify groups of features that worked well together.
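With current scikit-learn, the built-in "neg_log_loss" scoring string achieves what those custom classifiers did; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# "neg_log_loss" makes GridSearchCV maximize the negated log loss,
# i.e. pick the parameters that minimize log loss.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid={"n_estimators": [25, 50], "max_depth": [2, 3]},
    scoring="neg_log_loss",
    cv=3,
)
grid.fit(X, y)
best_params = grid.best_params_
```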
submit.py contains the logic for submitting a blended model.
The training and test data are in the root of the solution directory: dbo.training_Features.csv and dbo.test_Features.csv. To recreate the solution, do the following:
I used the following 5 submissions as my final submissions:
submission 41
items = [
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v2
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v1
},
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v1
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=400, learn_rate=.09, max_depth=3, max_features=10, min_samples_leaf=30, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=400, max_depth=3, max_features=10, min_samples_leaf=30, learn_rate=.09, random_state=1),
'features': features.next_submission_v2
},
{
'clf': gb(n_estimators=400, max_depth=3, max_features=10, min_samples_leaf=30, learn_rate=.09, random_state=1),
'features': features.beat_blind_ape
},
{
'clf': gb(n_estimators=400, max_depth=3, max_features=10, min_samples_leaf=30, learn_rate=.09, random_state=1),
'features': features.next_submission_v1
},
]
submission 38
items = [
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v2
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v1
},
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v1
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.beat_blind_ape
},
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.beat_blind_ape
}
]
submission 36 / 37
items = [
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v2
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v1
},
{
'clf': gb(n_estimators=300, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.next_submission_v1
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
}
]
submission 34
items = [
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=20, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=275, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=325, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
},
{
'clf': gb(n_estimators=375, max_depth=3, min_samples_leaf=30, min_samples_split=10, learn_rate=.1, random_state=1),
'features': features.submission21
}
]
To generate the above five submissions, open submit.py and replace its items list with the items list from the submission above. Some submissions were trained against dbo.training_Features_no_outliers.csv, so you will need to update the training file path before you generate the scored file.

Trust your cross-validation scores and use the public leaderboard as a measurement of competitiveness. When I selected my final models for submission, I picked the models with the lowest cross-validation scores, not the ones with the lowest public leaderboard scores. This worked well for me in this competition; using this formula, I ended up picking my five best models.
I did not generate any interesting figures or plots as part of my analysis.
[1] http://en.wikipedia.org/wiki/Diabetes_mellitus_type_2
[2] http://www.ahrq.gov/data/hcup/
[3] http://www.ama-assn.org/ama1/pub/upload/mm/399/crosswalking-between-icd-9-and-icd-10.pdf
[4] http://scikit-learn.org/stable/