Preprocessing data for training with tensorflow

in utopian-io •  7 years ago  (edited)

Introduction

What Will I Learn?

In this tutorial we will download the Special Database 19, then sort
the images and rename them. After that, we use OpenCV to read the images as numpy arrays, scale them down to 32x32
and convert them to greyscale. Finally, we save these arrays to disk.

Requirements

  • Python 3
  • OpenCV for Python (I downloaded it from here)

Difficulty

  • Intermediate

References

All the files and scripts presented are available on my GitHub Page here.
This tutorial is part of a series. I will explain in detail how I achieved the handwriting recognition engine with
tensorflow. This is the first part, so look out for more!

So, let's get started!

1. Downloading the data

In this example, I will use the Special Database 19 published by
the National Institute of Standards and Technology. It contains over 800,000 pre-classified
images of handwritten letters and digits. It differentiates between 47 classes: all uppercase letters, all digits and a
few lowercase letters. I downloaded the by_merge.zip file and saved it in my project folder.

2. Preparing the data for conversion

The database contains over 800,000 images. That's a bit much for my purpose, because the more images we have, the longer
the training process will take later. About 100,000 images should be enough. To make working with the files easier, I
wrote a Python script to
copy every 8th image into a single folder and rename it class_[class]_index_[index].png, for example
class_25_index_3743.png.

def get_class(path):
    # Windows-style path: the class folder is the part after the first backslash
    return path.split("\\")[1]

Simply get the class from a file path. E.g.:

>>> get_class(r"./by_merge\4e\hsf_3\hsf_3_00002.png")
'4e'

The get_class() function has to be modified if you change by_merge_dir, because the path might look different. The same applies on Linux or macOS, where paths are separated by "/" instead of "\".
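
If the backslash split is a problem on your system, the class can also be taken relative to the root folder. This is only a sketch: get_class_portable and its root parameter are hypothetical names that are not part of the original script.

```python
import os


def get_class_portable(subdir, root):
    """Return the first folder name of subdir below root (hypothetical helper)."""
    # relpath strips the root and normalises the separators for the current OS
    relative = os.path.relpath(subdir, root)
    return relative.split(os.sep)[0]


example = os.path.join("by_merge", "4e", "hsf_3")
print(get_class_portable(example, "by_merge"))
```

Because the path is built with os.path.join and split with os.sep, the same call works with both "\" and "/" separators.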

import os
from shutil import copyfile

by_merge_dir = "./by_merge"
output_dir = "./data/"
index = 0
class_index = -1
counter = 0
n_copied = 0
classes = []

by_merge_dir and output_dir are self-explanatory.
index is a variable to keep track of the number of images in a class. This is used to guarantee unique file names.
class_index is the class that is currently being processed. Every time we start traversing a new class folder, this
variable will be increased.
counter is used to keep track of the number of files traversed.
n_copied is used to keep track of the number of copied files.
classes is a list of all the folder names. If a new one is encountered class_index is increased.

for subdir, dirs, files in os.walk(by_merge_dir):
    for file in files:

This loops through all files including files in subdirectories.

if get_class(subdir) not in classes:
    classes.append(get_class(subdir))
    class_index += 1
    index = 0

If we have not seen this class yet, add it to the list of classes and increase the class index by 1. We also
reset the index to 0, so that the first image of every class has the index 0.

if counter % 8 == 0 and file.endswith(".png"):

Everything after this only runs if counter is divisible by 8 and we are dealing with a png file, so we
essentially only copy every 8th file.

copyfile(os.path.join(subdir, file),
         os.path.join(output_dir, "class_{}_index_{}.png".format(class_index, index)))

Copyfile syntax: copyfile(src, dst)
The source path is just constructed by joining the subdirectory and the file name.
The destination path is constructed by joining the output directory and the string "class_[class_index]_index_[index].png"
This may not be the fastest possible way to copy a file, but it works for our needs.

print("Copied " + os.path.join(subdir, file) + " to "
      + os.path.join(output_dir, "class_{}_index_{}.png".format(class_index, index)))
index += 1
n_copied += 1

Log that we copied a file, then increase the index and the number of copied files.

counter += 1

Lastly, increase the counter.

print("Total images: " + str(n_copied))

When the script is done, print out how many images we've copied.

And that's it. Here's the
full script. This script could take some time to complete, so be patient.
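
For reference, the steps above can be put together roughly as follows. This is a sketch rather than the exact script: copy_every_nth and its parameters are hypothetical names, the class is taken relative to the root folder instead of splitting on "\", and the folders are sorted so that the class numbering is the same on every run.

```python
import os
from shutil import copyfile


def copy_every_nth(src_root, dst_dir, step=8):
    """Copy every step-th file below src_root into dst_dir if it is a png."""
    os.makedirs(dst_dir, exist_ok=True)
    classes = []   # class folder names seen so far
    index = 0      # running index within the current class
    counter = 0    # number of files traversed
    n_copied = 0   # number of files copied
    for subdir, dirs, files in os.walk(src_root):
        dirs.sort()  # visit the folders in a fixed order
        for file in sorted(files):
            # The first folder below src_root is the class name, e.g. "4e"
            cls = os.path.relpath(subdir, src_root).split(os.sep)[0]
            if cls not in classes:
                classes.append(cls)
                index = 0
            if counter % step == 0 and file.endswith(".png"):
                name = "class_{}_index_{}.png".format(classes.index(cls), index)
                copyfile(os.path.join(subdir, file), os.path.join(dst_dir, name))
                index += 1
                n_copied += 1
            counter += 1
    return n_copied, classes
```

Calling copy_every_nth("./by_merge", "./data/") would then return the number of copied images together with the list of class folders, which is handy for mapping a class index back to its folder name later.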

3. Converting the data to numpy arrays

We will now convert the images to numpy arrays. First we will read each single image as an array with the shape [32,32,1].
We will then put these arrays into one array with the shape [101784, 32, 32, 1]. For this I used
data_handler.py.

def get_2d_array(im_path):
    im_color = cv2.imread(im_path)

First I use cv2 to get the image as a numpy array. The images have 128x128 pixels, so cv2.imread will return a numpy
array with the shape [128,128,3]: [rows, columns, color_channels].

im_color = cv2.resize(im_color, (32,32))

I then resize the image to 32x32, because for handwriting 32x32 is enough.

im = np.zeros(shape=(32,32,1))

cv2.imread returns a color image, but we want greyscale, so we'll have to take the average of all color channels
and turn it into an array with the shape [32,32,1].

for i, x in enumerate(im_color):  # Fill the array
    for n, y in enumerate(x):  # Note: We cannot use cv2.cvtColor(im_color, cv2.COLOR_BGR2GRAY), because
        im[i][n][0] = (int(y[0]) + int(y[1]) + int(y[2])) / 3

Fill the greyscale array with the averages of the 3 color channels.
for i, x in enumerate(im_color): i is the index of the current row and x is the row itself. In the inner loop, n is
the column index and y is the pixel with its 3 color values.
im[i][n][0] = (int(y[0]) + int(y[1]) + int(y[2])) / 3: Put the average of the 3 color channels into the final array.
The values are converted to int first, because they are 8-bit integers and their sum would overflow otherwise.
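
The same averaging can also be written without the two Python loops. This is a sketch of an alternative, not what data_handler.py does: to_grey is a hypothetical name, and the input is assumed to be an array with the shape [height, width, 3] like the one cv2.imread returns.

```python
import numpy as np


def to_grey(im_color):
    """Average the color channels of an [h, w, 3] array into an [h, w, 1] array."""
    # Convert to float first so that the 8-bit values cannot overflow
    return im_color.astype(np.float64).mean(axis=2, keepdims=True)


# Small stand-in for an image: every pixel has the channel values (10, 20, 30)
fake_image = np.zeros((32, 32, 3), dtype=np.uint8)
fake_image[:, :] = (10, 20, 30)
grey = to_grey(fake_image)
print(grey.shape)
```

keepdims=True keeps the last axis, so the result has the shape [32,32,1] that the rest of the tutorial expects.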

return im 

That's all for the get_2d_array function.

def get_label(name):
    return int(name.split("_")[1])

Just a helper function to get the class from a filename. E.g.:

>>> get_label("class_10_index_3454.jpeg")
10

n_labels = 47
n_images = 101784
path = "./data"

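n_labels is the number of classes, n_images is the number of images in the data folder, and path is the folder we
copied the images to in step 2.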
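
n_images has to match the number of files in the folder exactly, otherwise the loop further down either leaves empty rows or runs past the end of the array. A small sketch of how both numbers could be derived from the folder instead of being typed in by hand (list_images is a hypothetical helper, and it assumes the class_[class]_index_[index].png naming from step 2):

```python
import os


def get_label(name):
    return int(name.split("_")[1])


def list_images(path):
    """Return the sorted names of all png files in path (hypothetical helper)."""
    return sorted(f for f in os.listdir(path) if f.endswith(".png"))


def count_images_and_labels(path):
    """Return (number of images, highest class index + 1) for a data folder."""
    files = list_images(path)
    n_labels = max(get_label(f) for f in files) + 1 if files else 0
    return len(files), n_labels
```

Here n_labels is simply the highest class index plus one, which only gives 47 if every class ended up with at least one image.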

images = np.zeros(shape=(n_images, 32, 32, 1))
labels = np.zeros(shape=(n_images, n_labels))

These are the arrays that will contain our database. images will contain the image data and labels the classes.
labels will be a one-hot-encoded array, thus having as many columns as we have classes. Since the classes are counted
from 0, the label for class 3 will look like this: [0, 0, 0, 1, 0, .., 0].
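
To make the one-hot encoding more concrete, here is a tiny example with three made-up labels (the values in class_ids are just for illustration):

```python
import numpy as np

n_labels = 47
class_ids = np.array([3, 0, 46])  # made-up class indices for three images

# One row per image, one column per class
one_hot = np.zeros(shape=(len(class_ids), n_labels))
# Set the column given by the class index to 1 in each row
one_hot[np.arange(len(class_ids)), class_ids] = 1

print(one_hot[0][:5])
```

Every row contains exactly one 1, and np.argmax(one_hot, axis=1) gives back the original class indices.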

for i, file in enumerate(os.listdir(path)):
    label = get_label(file)

This should be self-explanatory. enumerate was explained further up.

image = get_2d_array(os.path.join(path, file))

Get the single image. Shape: [32,32,1]

images[i] = image
labels[i, label] = 1

Add the single image to the array of all images. Also one-hot-encode the label into the labels array by setting the
corresponding index to 1.

print(str(i / n_images * 100) + "% done")

Log how far we are.

np.save("nist_labels_32x32", labels)
np.save("nist_images_32x32", images)

When we are done, save the arrays.
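
np.save adds the .npy extension on its own, so the files end up as nist_labels_32x32.npy and nist_images_32x32.npy. The sketch below saves and reloads two small stand-in arrays. Note that np.zeros creates float64 arrays by default, so the full [101784, 32, 32, 1] array takes about 834 MB. Using float32 as shown here is my own suggestion to halve that, not something the original script does.

```python
import os
import tempfile

import numpy as np

# Small stand-ins for the real arrays (4 images instead of 101784)
images = np.zeros(shape=(4, 32, 32, 1), dtype=np.float32)
labels = np.zeros(shape=(4, 47), dtype=np.float32)
labels[np.arange(4), [0, 1, 2, 3]] = 1

out_dir = tempfile.mkdtemp()  # temporary folder, only for this example
np.save(os.path.join(out_dir, "nist_images_32x32"), images)
np.save(os.path.join(out_dir, "nist_labels_32x32"), labels)

# Loading needs the full file name including .npy
loaded_images = np.load(os.path.join(out_dir, "nist_images_32x32.npy"))
loaded_labels = np.load(os.path.join(out_dir, "nist_labels_32x32.npy"))
print(loaded_images.shape, loaded_labels.shape)
```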

And that's it. Here's the
full script. And, again, this could take a lot of time to complete, so be patient.

Recap

In this tutorial we downloaded the Special Database 19, then sorted
the images and renamed them. After that, we used OpenCV to read the images as a numpy array and scaled them down to 32x32
and converted them to greyscale. We then saved these arrays to disk.

Curriculum

Thank you for reading my tutorial! I hope you liked it. If you have any recommendations for future tutorials please
leave a comment. I'll upvote any constructive criticism.



Posted on Utopian.io - Rewarding Open Source Contributors
