Babylon, a New Machine Learning Repo for Language Detection

in utopian-io •  6 years ago  (edited)

Repository

https://github.com/programarivm/babylon


New Project

Babylon is a language detector for PHP that is being implemented with machine learning techniques.

The repo is a bit unusual. It is valuable in the sense that it uses PHP-ML, a machine learning library for PHP that is still under development (version 0.6.2 at the time of writing). That is not very common, given that Python is the de facto ecosystem for data science these days.

However, PHP web developers may well want to use machine learning in their projects too. It is high time to start creating PHP projects with machine learning features in order to give unique feedback to the open source community.

Technology Stack

  • PHP

New Features

I created the Babylon project a few hours ago. Feature/data preparation #3 is a recent feature that was merged just yesterday. This one is about cleaning and preparing the data.

Remember: First things first, in almost any data science project you need to find, clean and prepare data.

The cli/prepare.php command reads the files stored in the dataset/input folder and generates the dataset/output.csv file.

Idea: Copy and paste a few public domain ebooks into dataset/input to get a rough idea of the most frequent words in a language. That is what I initially did. Click here to see how it looks right now.

Here is what the data preparation command looks like:

php cli/prepare.php 
This will create a CSV with the most frequent words in all of the files in the dataset/input folder.
The operation may take a few seconds to be completed.
Do you want to proceed? (Y/N): y
OK! The most frequent words in deu.txt were transformed into CSV format...
OK! The most frequent words in eng.txt were transformed into CSV format...
OK! The most frequent words in fra.txt were transformed into CSV format...
OK! The most frequent words in por.txt were transformed into CSV format...
OK! The most frequent words in spa.txt were transformed into CSV format...
The output.csv file has been updated.

To read the newly generated CSV file:

cat dataset/output.csv 
deu,der die und das den ein dem ist auf in er so hat zachenhesselhans sich im mit sie wie zu ich einer noch an da aber nicht nit einen aus auch über eine wenn von wir fei hanstonl vom denn sagt es jetzt daß schon was fanele haben sein sind um wieder am du helari weil wald ihm wird einmal is muß kann wawrl geht hab mir als des peterl nur für zum wind s zwei doch ihr franzl durch viel schnee hans alte vor einem nach nun gar sonnenwirbel waldland kein immer mehr oder kommt will unter denkt dann 
eng,the and of to a i he in that was his it had you with as at which for my is have him be there on this me said upon but from we all they not no her were by one so been them when are man up an would out what who or some if into could down will their holmes she over your do then more little has now before very two about time other come any way face came asked our how see through answered eyes well hand than however found should hope ferrier us these long 
fra,de la à et le les je un en nous des du il que ne une mon pas dans se ce au me qui oncle plus mais sur par cette ces avec pour a son dun sa est ses bien si professeur sans tout comme hans dune peu ma mes notre était donc deux sous elle lui dont où nos avait non cest encore ou on heures fut mer cela fait même y moi quil là cependant tu pendant pieds quelques aux tête rien oui après terre alors leur ainsi ni être cet car quand quelque puis dit radeau axel 
por,de a e que o da os do em um as uma com não se por na dos no das para como á ao mais era sua é sem ella eu mas lhe nos ou elle já nas seu seus aos luiza toda pela num numa tudo dum grande tinha sobre casa me entre pelo quando olhos rogério então onde ás tão foi mesmo annos velho muito vida todos ia nem ainda noite suas duma meu tempo quem essa pouco cada disse minha tinham vez até mulher bem quasi estava ter depois dia cabeça mãos esse voz dois duas homem 
spa,de la que y el en a se los no un las su con del por era una lo había le más como al pero don para sus él es sin si ella todo vetusta o muy estaba ni aquella doña ya aquel ana tenía ser poco esta porque cuando magistral todos sobre podía eran siempre mucho menos ojos este habían entre bien señor yo tan vez hasta sí regenta después decía allí esto mesía casa otro anita nada dos sabía aquellos ozores así quería hombre algo qué tal mujer eso años iba antes mundo todas usted aunque pues también 

As you can see, the CSV file prepared by Babylon contains the most frequent words in German, English, French, Portuguese and Spanish. The model is to be trained with this information.
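Before training, each row has to be split into a language label and a list of words. Here is a minimal sketch of how that could be done, assuming the `label,word word word` layout shown above. The `parseRows()` helper is hypothetical and not part of Babylon.

```php
<?php
declare(strict_types=1);

/**
 * Splits rows such as "eng,the and of to" into labels and word lists.
 * Hypothetical helper for illustration; it is not part of Babylon.
 *
 * @param string[] $rows
 * @return array [labels, samples]
 */
function parseRows(array $rows): array
{
    $labels = [];
    $samples = [];
    foreach ($rows as $row) {
        $row = trim($row);
        // Skip blank or malformed rows that have no label separator.
        if (strpos($row, ',') === false) {
            continue;
        }
        // Split on the first comma only: label on the left, words on the right.
        [$label, $words] = explode(',', $row, 2);
        $labels[] = $label;
        // PREG_SPLIT_NO_EMPTY drops empty strings caused by extra spaces.
        $samples[] = preg_split('/\s+/', $words, -1, PREG_SPLIT_NO_EMPTY);
    }

    return [$labels, $samples];
}

// Two shortened rows in the same layout as dataset/output.csv.
[$labels, $samples] = parseRows([
    'eng,the and of to a ',
    'spa,de la que y el ',
]);
```

In a real run the rows would come from something like `file('dataset/output.csv')` rather than a hard-coded array.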


How Did I Implement It?

The key idea lives in the babylon/src/File/TxtStats.php class, more specifically in these two methods:

// babylon/src/File/TxtStats.php
...
/**
 * The n most frequent words in the text.
 *
 * @param int $n
 * @return array
 * @throws \InvalidArgumentException
 */
public function freqWords(int $n): array
{
    if ($n <= 0) {
        throw new \InvalidArgumentException(
            "The number of words $n must be a positive number."
        );
    }
    $this->readWords();
    $this->freq = array_count_values($this->words);
    arsort($this->freq);

    return array_slice($this->freq, 0, $n);
}

/**
 * Reads the words from the text storing them into $this->words.
 */
private function readWords(): void
{
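    if ($file = fopen($this->filepath, 'r')) {
        // fgets() returns false at the end of the file, which ends the loop.
        while (($line = fgets($file)) !== false) {
            $line = mb_strtolower($line);
            // Strip ASCII punctuation (straight quotes included) and curly double quotes.
            $line = preg_replace('/[[:punct:]]/', '', $line);
            $line = preg_replace('/(“|”)/', '', $line);
            // Turn the curly apostrophe into a straight one.
            $line = preg_replace('/’/', "'", $line);
            // Drop digits and collapse runs of whitespace into a single space.
            $line = preg_replace('/[0-9]+/', '', $line);
            $line = preg_replace('!\s+!', ' ', $line);
            $exploded = explode(' ', $line);
            $this->words = array_merge($this->words, $exploded);
        }
        fclose($file);
    }

    // Remove the empty entries left over by explode().
    $this->words = array_map('trim', array_filter($this->words));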
}
...
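To play with the counting idea outside the class, here is a standalone sketch built on the same three calls: array_count_values(), arsort() and array_slice(). The `topWords()` function is a hypothetical name, and its cleanup step is simplified compared to readWords(): it assumes valid UTF-8 input and keeps letters only.

```php
<?php
declare(strict_types=1);

/**
 * Returns the $n most frequent words in $text as word => count.
 * Hypothetical helper for illustration; it is not part of Babylon.
 */
function topWords(string $text, int $n): array
{
    if ($n <= 0) {
        throw new \InvalidArgumentException(
            "The number of words $n must be a positive number."
        );
    }
    $text = mb_strtolower($text);
    // Keep letters and whitespace only; \p{L} needs the u flag and valid UTF-8.
    $text = preg_replace('/[^\p{L}\s]+/u', '', $text);
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Count occurrences, then sort by count in descending order keeping the keys.
    $freq = array_count_values($words);
    arsort($freq);

    return array_slice($freq, 0, $n, true);
}

$top = topWords('The cat and the dog. The end!', 2);
```

Here the first entry of `$top` is `'the' => 3`. Which of the words that appear only once comes second depends on how the PHP version in use orders ties.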

On a side note, Feature/iso 8859 training #5 shows how I started to train and test a naive Bayes model. This is meant to be an iterative process.

Here are a few todos already:

  • Review the sample data in dataset/input/iso-8859
  • Compare the results of the naive Bayes model with the results of a support vector model
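To give a rough idea of what a naive Bayes model does with frequent words, here is a toy sketch in plain PHP. It is a multinomial naive Bayes with add-one smoothing and a uniform prior, written only for illustration. The `TinyNaiveBayes` class is hypothetical. It is neither the code in the Babylon repo nor the classifier shipped with PHP-ML.

```php
<?php
declare(strict_types=1);

/**
 * Toy multinomial naive Bayes over words with add-one smoothing.
 * Hypothetical class for illustration; this is not PHP-ML's classifier.
 */
final class TinyNaiveBayes
{
    /** @var array word counts per label: label => [word => count] */
    private $counts = [];
    /** @var array total number of training words per label */
    private $totals = [];
    /** @var array vocabulary seen during training: word => true */
    private $vocab = [];

    public function train(string $label, array $words): void
    {
        foreach ($words as $word) {
            $this->counts[$label][$word] = ($this->counts[$label][$word] ?? 0) + 1;
            $this->totals[$label] = ($this->totals[$label] ?? 0) + 1;
            $this->vocab[$word] = true;
        }
    }

    public function predict(array $words): string
    {
        $best = '';
        $bestScore = -INF;
        $vocabSize = count($this->vocab);
        foreach ($this->totals as $label => $total) {
            // Uniform prior assumed, so only the word likelihoods are summed.
            $score = 0.0;
            foreach ($words as $word) {
                $count = $this->counts[$label][$word] ?? 0;
                // Add-one smoothing keeps unseen words from zeroing the score.
                $score += log(($count + 1) / ($total + $vocabSize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = (string) $label;
            }
        }

        return $best;
    }
}

$nb = new TinyNaiveBayes();
// The ten most frequent words per language, taken from the CSV above.
$nb->train('eng', ['the', 'and', 'of', 'to', 'a', 'i', 'he', 'in', 'that', 'was']);
$nb->train('spa', ['de', 'la', 'que', 'y', 'el', 'en', 'a', 'se', 'los', 'no']);

$english = $nb->predict(['the', 'house', 'was', 'in', 'that', 'street']);
$spanish = $nb->predict(['la', 'casa', 'que', 'no', 'se', 've']);
```

With so little training data the model only reacts to the handful of frequent words it has seen, which is why a larger dataset and a comparison against another classifier are on the todo list.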

What Brought Me to Babylon?

For further information please read PHP Machine Learning Diary: Preparing Random Phrases with Linux Commands. In that post I explained that at some point in the future I'd like to write a PHP chess AI with PHP-ML's Multilayer Perceptron Classifier, which will require some research and time.

Babylon is my first machine learning project with PHP-ML.

Roadmap

  • Train the model
  • Test the model
  • Write documentation
  • Provide feedback to PHP-ML

Stay curious!

GitHub Account

https://github.com/programarivm


Looks sharp.

Just a reminder for future utopian submissions, even though we love new projects and their constant updates and progress, we expect to see the first announcement post in a stable version. And for the updates, we favor major updates instead of micro updates.



Okay. Thanks again @emrebeyler for the review.


Hi @programarivm, I'm @checky ! While checking the mentions made in this post I noticed that @throws doesn't exist on Steem. Did you mean to write @thow ?


You should disable your checks on code blocks.

Thanks @emrebeyler! How could I disable the checks? :)

Ahh, I was directing this suggestion at the owner of the @checky bot. :)
