Machine Learning Steps
These are the common, well-known steps in a machine learning project:
Define a problem
e.g. identify the iris species from the sepal and petal length and width
Practice Fusion Diabetes Classification
Prepare Data
Installing the Python and SciPy platform.
Loading the dataset.
import pandas
dataset = pandas.read_csv(url, names=names)  # load data from a CSV file
dataset = pandas.read_sql_query(query, conn)  # load data from a database table
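A minimal sketch of the CSV loading step. The source does not define `url` or `names`, so both are assumptions here: a few iris-style rows are kept in an in-memory string and read through `io.StringIO` instead of a URL, and the column names are illustrative.

```python
import io

import pandas

# Assumption: a tiny in-memory stand-in for the CSV file; the rows are illustrative only.
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Assumption: column names for a header-less file (four measurements plus the class label).
names = ["sepal-length", "sepal-width", "petal-length", "petal-width", "class"]

# read_csv accepts a path, a URL, or a file-like object; names= supplies the column headers.
dataset = pandas.read_csv(io.StringIO(csv_text), names=names)
print(dataset)
```

Passing a real file path or URL in place of the `io.StringIO` object should work the same way, provided the file has no header row.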
Summarizing the dataset.
Dimensions of the dataset.
print(dataset.shape)  # print the dimensions of the dataset (rows, columns)
4. Peek at the data itself.
print(dataset.head(20))  # print the first 20 rows of the dataset
Statistical summary of all attributes.
print(dataset.describe())  # includes the count, mean, min and max values as well as some percentiles
Breakdown of the data by the class variable.
print(dataset.groupby('class').size())  # group the rows by class and count each group
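The four summary calls above can be tried together on a small table. This is only a sketch: the hand-made `DataFrame` below is an assumption standing in for the loaded dataset, with two numeric columns and a `class` column.

```python
import pandas

# Assumption: a small hand-made table standing in for the loaded dataset.
dataset = pandas.DataFrame({
    "petal-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

print(dataset.shape)     # (rows, columns)
print(dataset.head(3))   # first 3 rows

summary = dataset.describe()  # by default only the numeric columns are summarized
print(summary)

class_counts = dataset.groupby("class").size()  # number of rows per class
print(class_counts)
```

Note that `describe()` skips the text `class` column by default, which is why the per-class breakdown is done separately with `groupby`.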
5. Evaluate algorithms
Visualizing the dataset.
import matplotlib.pyplot as plt
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)  # box-and-whisker plots
dataset.hist()  # plot histograms
plt.show()
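A sketch of the same two plots that runs without a display. The `Agg` backend, the hand-made four-column table, and writing the image to an in-memory buffer instead of calling `plt.show()` are all assumptions made so the example is self-contained.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is opened
import matplotlib.pyplot as plt
import pandas

# Assumption: a small hand-made table with four numeric columns, to match layout=(2, 2).
dataset = pandas.DataFrame({
    "sepal-length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal-width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal-length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# One box-and-whisker plot per column, arranged in a 2x2 grid with independent axes.
dataset.plot(kind="box", subplots=True, layout=(2, 2), sharex=False, sharey=False)

# One histogram per column, drawn in a second figure.
dataset.hist()

# Save the current figure (the histograms) to memory rather than showing it.
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
```

In an interactive session, `plt.show()` would display both figures instead.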
6. Evaluating some algorithms.
6.1 Creating a validation dataset
from sklearn import model_selection
array = dataset.values
X = array[:, 0:4]  # the four measurement columns
Y = array[:, 4]  # the class label column
validation_size = 0.20  # hold out 20% of the data for validation and train on the remaining 80%
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
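To see what the split produces, here is a small sketch. It assumes the copy of the iris data bundled with scikit-learn (`datasets.load_iris()`) in place of the CSV file, so `X` and `Y` come from that object rather than from `dataset.values`.

```python
from sklearn import datasets, model_selection

# Assumption: the iris copy bundled with scikit-learn stands in for the CSV data.
iris = datasets.load_iris()
X = iris.data    # 150 rows, 4 measurement columns
Y = iris.target  # 150 class labels

validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)
```

With 150 rows and a 20% hold-out, 120 rows go to training and 30 to validation. Fixing `random_state` makes the split repeatable between runs.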
6.2 Test harness
seed = 7
scoring = 'accuracy'  # test options and evaluation metric used to compare the models
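The harness in this walkthrough is 10-fold cross-validation scored by accuracy. The sketch below shows only how the folds are cut; the placeholder array of 120 rows is an assumption matching an 80% split of a 150-row dataset.

```python
import numpy as np
from sklearn import model_selection

# Assumption: a placeholder with 120 rows, standing in for the training data.
X_train = np.zeros((120, 4))

seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

# Each of the 10 rounds trains on 9 folds and tests on the remaining one.
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X_train)]
print(fold_sizes[0])  # (108, 12)
```

Every row is used for testing exactly once, and the accuracy reported for a model is the mean over the 10 rounds.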
6.3 Build Models
Evaluate 6 different algorithms:
Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Gaussian Naive Bayes (NB)
Support Vector Machines (SVM)
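from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
result = []
names = []
for name, model in models:
    # 10-fold cross-validation; recent scikit-learn versions reject random_state unless shuffle=True
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    result.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())  # mean accuracy (standard deviation)
    print(msg)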
7. Improve results
8. Present results
Making some predictions
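The source does not show how the predictions are made, so the following is only a sketch under stated assumptions: the iris copy bundled with scikit-learn replaces the CSV data, and `KNeighborsClassifier` is picked purely for illustration (any model from the list in 6.3 could be used instead).

```python
from sklearn import datasets, model_selection
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the iris copy bundled with scikit-learn stands in for the CSV data.
iris = datasets.load_iris()
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=7)

# Assumption: KNN is an illustrative choice, not a result taken from the source.
model = KNeighborsClassifier()
model.fit(X_train, Y_train)                  # train on the 80% split
predictions = model.predict(X_validation)    # predict the held-out 20%

accuracy = accuracy_score(Y_validation, predictions)
matrix = confusion_matrix(Y_validation, predictions)
print(accuracy)
print(matrix)
```

The accuracy is the fraction of validation rows predicted correctly, and the confusion matrix shows which classes were mistaken for which.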