Machine learning steps - Python programming

Machine Learning Steps

These are the common, well-known steps in a machine learning project:

  1. Define a problem
    e.g. identify the iris flower species from sepal and petal length and width
    e.g. Practice Fusion Diabetes Classification

  2. Prepare Data
    Installing Python and the SciPy platform.
    Loading the dataset.

        import pandas
        # url, names, query and conn are placeholders for your own data source
        dataset = pandas.read_csv(url, names=names)      # load the data from a CSV file
        dataset = pandas.read_sql_query(query, conn)     # or load the data from a database table
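As a minimal sketch of both loading routes, the example below reads a few inline rows instead of a real file and then round-trips them through an in-memory SQLite database. The column names, the sample rows, and the table name `iris` are assumptions made for illustration, not part of the original dataset description.

```python
import io
import sqlite3

import pandas

# Hypothetical column names and sample rows standing in for a real CSV file.
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# read_csv accepts a path, a URL, or a file-like object.
# names= supplies the column names because the data has no header row.
dataset = pandas.read_csv(io.StringIO(csv_text), names=names)
print(dataset.shape)

# Store the same rows in an in-memory SQLite table, then load them back with a query.
conn = sqlite3.connect(':memory:')
dataset.to_sql('iris', conn, index=False)
from_db = pandas.read_sql_query('SELECT * FROM iris', conn)
conn.close()
print(from_db.shape)
```

In a real project, the `io.StringIO` object would be replaced by the file path or URL, and the SQLite connection by a connection to the actual database.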
  3. Summarizing the dataset.

Dimensions of the dataset.

    print(dataset.shape)   # (number of rows, number of columns)

4. Peek at the data itself.

    print(dataset.head(20))   # first 20 rows of the dataset

Statistical summary of all attributes.

    print(dataset.describe())   # count, mean, std, min, max and quartiles of each numeric attribute

Breakdown of the data by the class variable.

    print(dataset.groupby('class').size())   # number of rows in each class
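The summary calls above can be tried on a small hand-made table. This is only a sketch: the five rows and the column names are invented for illustration and are not taken from a real dataset.

```python
import pandas

# A tiny made-up dataset: two numeric attributes and a class label.
dataset = pandas.DataFrame({
    'petal-length': [1.4, 1.3, 4.7, 4.5, 6.0],
    'petal-width': [0.2, 0.2, 1.4, 1.5, 2.5],
    'class': ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica'],
})

print(dataset.shape)        # dimensions as (rows, columns)
print(dataset.head(3))      # first three rows
print(dataset.describe())   # statistics for the numeric columns only

# Count how many rows belong to each class.
class_counts = dataset.groupby('class').size()
print(class_counts)
```

A very uneven class breakdown at this stage is worth noting, because plain accuracy can be misleading on imbalanced data.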

5. Visualizing the dataset.

    import matplotlib.pyplot as plt
    dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)   # box and whisker plots
    dataset.hist()   # histograms
    plt.show()
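The plots can also be produced without a display, for example on a server. The sketch below assumes matplotlib is installed, selects its non-interactive `Agg` backend, and writes the box plots to an in-memory PNG buffer. The four-column table is made up for illustration.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas

# Made-up numeric data with four attributes, one subplot per attribute.
dataset = pandas.DataFrame({
    'sepal-length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal-width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal-length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal-width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Box and whisker plots, one per attribute, arranged on a 2x2 grid.
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
box_figure = plt.gcf()

# Histograms of every numeric attribute (drawn on a new figure).
dataset.hist()

# Save the box plot figure as PNG bytes instead of showing it.
buffer = io.BytesIO()
box_figure.savefig(buffer, format='png')
```

In an interactive session, drop the `matplotlib.use('Agg')` line and call `plt.show()` instead of saving.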

6. Evaluating some algorithms.
6.1 Creating a validation dataset

    from sklearn import model_selection
    array = dataset.values
    X = array[:,0:4]   # first four columns: the input attributes
    Y = array[:,4]     # fifth column: the class label
    validation_size = 0.20   # train on 80% of the data, hold back 20% for validation
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
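To see the split in action, here is a sketch that uses the iris data bundled with scikit-learn as a stand-in for `dataset` (an assumption, since the notes load the data from a CSV file or a database). The features and labels are stacked into one array so that the same column slicing applies.

```python
import numpy as np
from sklearn import model_selection
from sklearn.datasets import load_iris

# Stand-in for dataset.values: four feature columns followed by the class column.
iris = load_iris()
array = np.column_stack([iris.data, iris.target])

X = array[:, 0:4]   # columns 0 to 3 are the input attributes
Y = array[:, 4]     # column 4 is the class label

validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)

print(X_train.shape, X_validation.shape)   # (120, 4) (30, 4)
```

Fixing `random_state` makes the split reproducible, so every algorithm is later compared on exactly the same training rows.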

6.2 Test harness

    seed = 7
    scoring = 'accuracy'   # evaluation metric used to compare the models

6.3 Build Models
Evaluate six different algorithms:
Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Gaussian Naive Bayes (NB)
Support Vector Machines (SVM)

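    models = []   # fill with (name, estimator) pairs, e.g. ('LR', LogisticRegression())

    results = []
    names = []

    for name, model in models:
        # 10-fold cross-validation; recent scikit-learn versions raise an error
        # if random_state is set without shuffle=True
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)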
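Putting the test harness and the six algorithms together, a complete run could look like the sketch below. It assumes the iris data bundled with scikit-learn and default hyperparameters for every estimator, apart from a higher `max_iter` for logistic regression so that it converges and a fixed `random_state` for the decision tree. These choices are illustrative, not prescribed by the notes.

```python
from sklearn import model_selection
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

seed = 7
scoring = 'accuracy'

# Assumed data source: the iris dataset shipped with scikit-learn.
X, Y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=seed)

# One (name, estimator) pair per algorithm in the list above.
models = [
    ('LR', LogisticRegression(max_iter=1000)),
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier(random_state=seed)),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
]

results = {}
for name, model in models:
    # Each model is scored on the same ten shuffled folds of the training data.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring=scoring)
    results[name] = cv_results
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
```

The printed line for each model shows the mean accuracy over the folds, followed by the standard deviation in parentheses. The validation set is not touched at this stage.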

7. Improve results
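One common way to improve results is to tune a model's hyperparameters with cross-validation. The sketch below is an assumption about what this step might involve, since the notes do not name a technique. It searches over the number of neighbors for KNN with scikit-learn's `GridSearchCV`, using the bundled iris data and an arbitrary grid of candidate values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed data source: the iris dataset shipped with scikit-learn.
X, Y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7)

# Candidate values are illustrative; any reasonable range could be used.
param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}

# Every candidate is scored with 10-fold cross-validation on the training set only.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
search.fit(X_train, Y_train)

print(search.best_params_)
print(search.best_score_)
```

Other options at this step include feature scaling, feature selection, and ensemble methods.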

8. Present results
Making some predictions.
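A sketch of the prediction step: fit one model on the training set, predict the held-out validation set, and report the scores. KNN and the bundled iris data are assumptions here. In practice you would use whichever model scored best during cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed data source: the iris dataset shipped with scikit-learn.
X, Y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7)

# Train on the 80% split, then predict the 20% the model has never seen.
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

accuracy = accuracy_score(Y_validation, predictions)
print(accuracy)
print(confusion_matrix(Y_validation, predictions))        # rows: true class, columns: predicted class
print(classification_report(Y_validation, predictions))   # precision, recall and F1 per class
```

The validation accuracy is an independent check on the cross-validation estimate. A large gap between the two can point to overfitting.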

