PHP Machine Learning Diary: Preparing Random Phrases with Linux Commands

in utopian-io •  6 years ago  (edited)


What Will I Learn?

In almost any data science project you need to find, clean and prepare data. We are doing some digging on the web in order to prepare two perfectly formed CSV files (eng.csv and fra.csv) according to our requirements. These files must contain random phrases in English and French for further processing by PHP scripts.

Requirements

  • Basic concepts of machine learning
  • A few Linux commands
  • Some PHP
  • Be a little patient

Difficulty

  • Intermediate

Tutorial Contents

Who said that you cannot do machine learning with PHP?

I am learning about it, and today I'm sharing with you a useful tip to curate a dataset consisting of random phrases written in any imaginable language.

This may interest you if you want to train a machine learning model for pattern recognition purposes, text classification, language detection, and so on. The list could go on.

The reason behind today's tutorial is to help you get familiar with a few basic concepts while I learn about the topic myself. I want to share my learn-by-doing process with the world!

And I am excited, because I've already learned that finding and curating data by hand is an important part of the job.

Remember: First things first, in almost any data science project you need to find, clean and prepare data. The more tools you can master for this purpose, the better.

A Long Time Ago...

Let me start by giving some context. My story began a few months ago, when after working as a web developer for a while I just thought, "Why don't I write a chess engine in PHP?"

My question might sound a bit naive given that the vast majority of data scientists use Python for their projects. However, PHP web devs may well want to do some machine learning with PHP-ML, which is still under development, by the way (currently at version 0.6.2).

Then, I did some research and found out that my chess engine could rely on a multilayer perceptron (MLP) classifier, in a way similar to the one described in the paper entitled Learning to Evaluate Chess Positions with Deep Neural Networks and Limited Lookahead.

Here is a conclusion:

The results show how relatively simple Multilayer Perceptrons (MLPs) outperform Convolutional Neural Networks (CNNs) in all the experiments that we have performed.

My ultimate goal would be to normalize a bunch of PGN Chess board positions to train the MLP model -- if I am correct, this can be achieved by just transforming the output of the status() method into a format that the MLP classifier can understand.

Mmm

I scratched my head a little harder, only to conclude that MLPs are still a bit too much for my current machine learning skill set, and that mastering them will take some time.

Here is what has to happen: I first need to digest all this tacit knowledge by taking baby steps. I am being patient.

A cool thing about machine learning algorithms is that they can be approached as if they were black boxes, meaning that you don't actually need a mathematics background to use them. Just be curious and try experiments by yourself.

The Naive Bayes algorithm -- which can be used for language identification -- is definitely easier to start with than an MLP.

So let's forget about chess for now. Take it easy. Listen to some classical music for brain power!

Preparing and Cleaning Data with Linux Commands

Suppose we're working on a new exciting data science project on language detection and we're using a Naive Bayes classifier. Going back to the issue of preparing data for machine learning and AI, the things to do now are:

  • Collect quality data
  • Adapt the collected data to our requirements

Regarding the collection of data, there's Tatoeba:

Tatoeba is a collection of sentences and translations. It's collaborative, open, free and even addictive.

Just download this huge TSV file (355.8 MB), which contains millions of phrases written in a wide range of languages.

Here is what Tatoeba's sentences.csv file looks like:

1   cmn 我們試試看!
2   cmn 我该去睡觉了。
3   cmn 你在干什麼啊?
...
5630    rus Тем не менее, обратное также верно.
5631    rus Мы видим вещи не такими, какие они есть, а такими, каковы мы сами.
5632    rus Мир - это клетка для безумных.
...
5994    eng Maria has long hair.
5995    fra Maria a les cheveux longs.
5996    jpn あしたは、来なくていいよ。
...
7088863 hun Tom épp most mondta nekünk, hogy kirúgták.
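Before filtering anything, it can help to check how many sentences each language contributes. The following is only a minimal sketch, assuming the three-column, tab-separated layout shown above; the count_languages helper and the sample.tsv stand-in file are hypothetical names used for illustration, and the third sample line is made up:

```shell
#!/bin/bash
# Hypothetical helper: count the lines per language code in a
# Tatoeba-style file (id <TAB> lang <TAB> sentence), most frequent first.
count_languages() {
    cut -f2 "$1" | sort | uniq -c | sort -rn
}

# Tiny stand-in for sentences.csv so the sketch runs on its own.
# The third line is invented for the example.
printf '5994\teng\tMaria has long hair.\n' > sample.tsv
printf '5995\tfra\tMaria a les cheveux longs.\n' >> sample.tsv
printf '1\teng\tYes, of course.\n' >> sample.tsv

# Prints one "count lang" pair per line.
count_languages sample.tsv
```

Running the same helper against the real sentences.csv gives a rough idea of how many phrases are available per language.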


The problem is that we'd want a tidy, concise, perfectly formed CSV file like the following one containing random sentences in English only.

eng,What do you want for Christmas?
eng,There are pictures on alternate pages of the book.
eng,This language is perfectly clear to me when written but absolutely incomprehensible when spoken.
...
eng,In my opinion a well-designed website shouldn't require horizontal scrolling.

No worries, let's create a bash shell script with some Linux shell commands:

Command   Description
shuf      Writes a random permutation of the input lines to standard output.
awk       A pattern scanning and processing language.
tr        Translates, squeezes, and/or deletes characters from standard input, writing to standard output.
cut       Prints selected parts of lines from each file to standard output.
rm        Removes (unlinks) file(s).

Here is the bash shell script:

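#!/bin/bash
# Pick 5000 random lines (in any language) from the Tatoeba dump.
shuf -n 5000 sentences.csv > lang_sample.tsv
# Keep only the lines whose second tab-separated field is "eng".
awk -F'\t' '$2=="eng"' lang_sample.tsv > eng_sample.tsv
# Delete the commas inside sentences, then turn the tab separators into commas.
tr -d ',' < eng_sample.tsv | tr '\t' ',' > eng_sample.csv
# Drop the first column (the sentence ID), leaving "eng,sentence".
cut -d, -f1 --complement < eng_sample.csv > eng.csv
# Remove the intermediate files.
rm eng_sample.csv eng_sample.tsv lang_sample.tsv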

With the help of Linux pipes the commands above can be merged into one:

shuf -n 5000 sentences.csv | awk '$2=="eng" {print}' | tr -d \, | tr "\\t" "," | cut -d, -f1 --complement > eng.csv
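To see what each stage of the pipeline does, here is a small worked example. It is a sketch only: it leaves out shuf so that the output is repeatable, it uses a hypothetical demo.tsv file, and the third input line is made up to show what happens to a comma. Note that --complement is a GNU coreutils option of cut.

```shell
#!/bin/bash
# Three Tatoeba-style lines (id <TAB> lang <TAB> sentence).
# The third one is invented so that the input contains a comma.
printf '5994\teng\tMaria has long hair.\n' > demo.tsv
printf '5995\tfra\tMaria a les cheveux longs.\n' >> demo.tsv
printf '1\teng\tYes, of course.\n' >> demo.tsv

# Stage 1: awk keeps the lines whose second field is "eng".
# Stage 2: tr -d deletes the commas inside the sentences.
# Stage 3: tr turns the tab separators into commas (id,lang,sentence).
# Stage 4: cut drops the first field, the sentence ID.
awk '$2=="eng" {print}' demo.tsv | tr -d ',' | tr '\t' ',' | cut -d, -f1 --complement
# eng,Maria has long hair.
# eng,Yes of course.
```

Deleting the commas first is what makes it safe to use the comma as the new field separator afterwards.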

Cool! We can easily fetch any other bunch of random phrases, in French for example:

shuf -n 5000 sentences.csv | awk '$2=="fra" {print}' | tr -d \, | tr "\\t" "," | cut -d, -f1 --complement > fra.csv
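One thing to keep in mind: because shuf runs before awk, the 5000 lines are drawn from all languages and only the matching ones survive, so each output file will usually hold far fewer than 5000 phrases. If a fixed number of phrases per language is needed, one option is to filter first and sample afterwards. The sample_lang function below is a hypothetical helper, not part of the original script, and it assumes that sentences.csv sits in the current directory:

```shell
#!/bin/bash
# Hypothetical helper: write up to <count> random sentences of one
# language to <lang-code>.csv, in the same "lang,sentence" format.
# Usage: sample_lang <lang-code> <count> <tatoeba-file>
sample_lang() {
    local lang="$1" count="$2" src="$3"
    awk -F'\t' -v lang="$lang" '$2 == lang' "$src" \
        | shuf -n "$count" \
        | tr -d ',' \
        | tr '\t' ',' \
        | cut -d, -f1 --complement > "${lang}.csv"
}

# Build both files, but only if the Tatoeba dump is present.
if [ -f sentences.csv ]; then
    for lang in eng fra; do
        sample_lang "$lang" 5000 sentences.csv
    done
fi
```

The trade-off is that awk now has to scan the whole dump for every language instead of a 5000-line sample.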

The brand new clean CSVs are a good thing. We are keeping things simple. Now it is a piece of cake for our PHP scripts to process the information:

...
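// Open the CSV file prepared above (eng.csv or fra.csv).
$file = fopen($this->filepath, 'r');
// Each row is "label,phrase": column 0 is the language code, column 1 the sentence.
while (($line = fgetcsv($file)) !== false) {
    $this->labels[] = $line[0];
    $this->samples[] = $line[1];
}
fclose($file);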
...
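Since the PHP loop reads $line[0] and $line[1] without further checks, it may be worth verifying the files before handing them over. The sketch below assumes the "label,phrase" format produced earlier; check_csv is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every line of a CSV file has exactly
# two comma-separated fields and the first field equals the expected label.
# Usage: check_csv <file> <label>
check_csv() {
    awk -F, -v label="$2" 'NF != 2 || $1 != label { bad++ } END { exit (bad > 0) }' "$1"
}

# Check eng.csv only if it has already been generated.
if [ -f eng.csv ]; then
    if check_csv eng.csv eng; then
        echo "eng.csv looks fine"
    else
        echo "eng.csv has malformed rows"
    fi
fi
```

This check does not look at double quotes, which fgetcsv treats as the default field enclosure, so sentences that contain them may still need some attention.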

That's all for now. I hope you liked today’s post. Thank you for reading and sharing your human thoughts.

Conclusion

Machine learning in PHP can be done with PHP-ML, which is still under development (currently at version 0.6.2). Web developers can use a range of algorithms in their PHP projects: Apriori, SVC, KNearestNeighbors, NaiveBayes, LeastSquares, MLPClassifier, among others.

A cool thing about machine learning algorithms is that they can be approached as if they were black boxes, meaning that you don't actually need a mathematics background to use them.

Just be curious and a little patient in the beginning.

The first thing to do in almost any data science project is to find, clean and prepare the data. And that is what we did today. We prepared a couple of concise, perfectly formed CSV files (eng.csv and fra.csv) for our purposes, containing random phrases in English and French for further processing by PHP scripts.

