Machine learning steps - Python programming

Machine Learning Steps

These are the common, well-known steps in a machine learning project:

Define a problem
e.g. identify the type of iris flower from its sepal and petal length and width
Practice Fusion Diabetes Classification

Prepare Data
Installing Python and the SciPy platform.
Loading the dataset.

    import pandas
    dataset = pandas.read_csv(url, names=names)    # load the data from a CSV file
    dataset = pandas.read_sql_query(query, conn)   # or load the data from a database table
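As a minimal, self-contained sketch of the CSV case, the snippet below reads a few iris-style rows from an in-memory string instead of a real `url`. The column names and the sample rows are assumptions made for illustration; the text above does not define `url` or `names`.

```python
import io
import pandas

# Hypothetical column names and sample rows, for illustration only.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# read_csv accepts a file path, a URL, or any file-like object.
# Passing names= supplies the header, so the first line is read as data.
dataset = pandas.read_csv(io.StringIO(csv_text), names=names)
print(dataset)
```

Replacing `io.StringIO(csv_text)` with a path or URL loads a real file in the same way.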

Summarizing the dataset.

Dimensions of the dataset.

    print(dataset.shape)   # print the dimensions of the dataset as (rows, columns)

4. Peek at the data itself.

    print(dataset.head(20))  # print the first 20 rows of the dataset

Statistical summary of all attributes.

    print(dataset.describe())  # count, mean, min and max values, as well as some percentiles

Breakdown of the data by the class variable.

    print(dataset.groupby('class').size())  # group the rows by class and count each group
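The summary calls above can be tried on a tiny stand-in table. This is only a sketch: the six rows and the column names below are assumptions for illustration, not the real dataset.

```python
import pandas

# Hypothetical six-row sample standing in for the real dataset.
dataset = pandas.DataFrame({
    'sepal-length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'petal-length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-versicolor', 'Iris-versicolor',
              'Iris-virginica', 'Iris-virginica'],
})

print(dataset.shape)       # (rows, columns)
print(dataset.head(3))     # first 3 rows
print(dataset.describe())  # summary statistics for the numeric columns

class_counts = dataset.groupby('class').size()  # number of rows per class
print(class_counts)
```

A strongly uneven `class_counts` would be an early hint that accuracy alone may be a misleading metric later on.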

5. Evaluate algorithms
Visualizing the dataset.

    import matplotlib.pyplot as plt
    dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)  # box-and-whisker plots
    dataset.hist()  # plot histograms
    plt.show()
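A runnable variant of the plotting step might look like the sketch below. It assumes a small made-up numeric table, a non-interactive matplotlib backend, and a temporary output directory, and it saves the figures to PNG files instead of opening a window; none of these choices come from the text above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas

# Hypothetical numeric sample standing in for the real dataset.
dataset = pandas.DataFrame({
    'sepal-length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal-width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal-length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal-width':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

out_dir = tempfile.mkdtemp()  # hypothetical output location

# One box-and-whisker plot per column, arranged in a 2x2 grid.
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
box_path = os.path.join(out_dir, 'boxplots.png')
plt.savefig(box_path)

# One histogram per column, drawn on a new figure.
dataset.hist()
hist_path = os.path.join(out_dir, 'histograms.png')
plt.savefig(hist_path)

plt.close('all')
```

In an interactive session, the backend line and the `savefig` calls can be dropped in favour of a single `plt.show()`.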

6. Evaluating some algorithms.
6.1 Creating a validation dataset

    from sklearn import model_selection

    array = dataset.values
    X = array[:, 0:4]   # first four columns: the input features
    Y = array[:, 4]     # fifth column: the class label
    validation_size = 0.20  # hold out 20% as the validation dataset; train the model on the other 80%
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
        X, Y, test_size=validation_size, random_state=seed)
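To see the split sizes concretely, here is a small sketch that uses the iris data bundled with scikit-learn. Using `datasets.load_iris()` in place of the CSV loaded earlier is an assumption made so that the example runs on its own.

```python
from sklearn import datasets, model_selection

# Assumption: scikit-learn's bundled iris data stands in for the loaded CSV.
iris = datasets.load_iris()
X = iris.data    # 150 rows x 4 measurement columns
Y = iris.target  # 150 class labels

validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

# 80% of 150 rows go to training, 20% to validation.
print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)
```

Fixing `random_state` makes the split reproducible, so later model comparisons are run on the same rows.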

6.2 Test harness

    seed = 7
    scoring = 'accuracy'  # test options and the evaluation metric used to evaluate models

6.3 Build Models
Evaluate 6 different algorithms:
Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Gaussian Naive Bayes (NB)
Support Vector Machines (SVM)

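    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))

    result = []
    names = []

    for name, model in models:
        # 10-fold cross-validation; random_state only takes effect with shuffle=True
        # (newer scikit-learn versions raise an error without it)
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        result.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())  # mean accuracy (standard deviation)
        print(msg)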
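Putting steps 6.1 to 6.3 together, a self-contained sketch of the comparison could look like this. It assumes scikit-learn's bundled iris data in place of the loaded CSV, and it sets `max_iter=1000` on the logistic regression only to avoid convergence warnings; both are illustrative choices rather than part of the steps above.

```python
from sklearn import datasets, model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

seed = 7
scoring = 'accuracy'

# Assumption: bundled iris data stands in for the dataset loaded earlier.
iris = datasets.load_iris()
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=seed)

models = [
    ('LR', LogisticRegression(max_iter=1000)),  # max_iter raised as an assumption
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier(random_state=seed)),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
]

results = {}
for name, model in models:
    # Each model is scored on the same 10 shuffled folds of the training data.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring=scoring)
    results[name] = cv_results
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
```

Each printed line shows a model's mean accuracy across the folds, followed by the standard deviation in parentheses. The validation split is left untouched here so it can be used for a final check.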

7. Improve results

8. Present results
Making some predictions
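One way to make predictions is to fit a single model on the training split and score it on the held-out validation split. The sketch below assumes scikit-learn's bundled iris data and picks KNN purely as an example; the text does not say which model to use for this step.

```python
from sklearn import datasets, model_selection
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data stands in for the dataset loaded earlier.
iris = datasets.load_iris()
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=7)

# Assumption: KNN is used only as an example model.
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

accuracy = accuracy_score(Y_validation, predictions)
print(accuracy)                                            # fraction of correct predictions
print(confusion_matrix(Y_validation, predictions))         # rows: true class, columns: predicted class
print(classification_report(Y_validation, predictions))    # precision, recall and F1 per class
```

Because the validation rows were never seen during training or cross-validation, this accuracy is a fairer estimate of how the model would do on new data.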