JavaScript Pills - 2. Scraping an email provider

in coding •  6 years ago  (edited)

Let's say your email provider doesn't offer a simple way to export your emails from its website.
How can we scrape the data we need, for example the information a specific sender emailed us over the past few months?

You could use HTTP libraries that handle both plain and HttpOnly cookies, but a simpler solution is to let the browser (e.g. Chrome, Tor, Firefox) do the work for you, managing its own requests and tabs.

First let's see how this is possible with a brief example.

Open the web console while you are on Google (for security reasons, a script can only read the content of a tab it opened if that tab is on the same origin as the current page) and enter the following instructions:

const newTab = open('https://www.google.com/search?q=steemit');
setTimeout(() => {
   // Give the page some time to load
   const stats = newTab.document.getElementById('resultStats');
   newTab.close();
   // The element may be missing if the page is slow or its markup differs
   const gooStats = stats ? stats.innerText : '(not found)';
   console.log('Google stats for my search:', gooStats.split('About ')[1] || gooStats);
}, 2000);

That's right! window.open allows you to open a new tab, access its context however you want and close it when you're done with it. Pretty nice!
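
The fixed 2-second delay in the example is a guess: too short on a slow connection, wasted time on a fast one. A possible alternative is to poll until the thing you need shows up. The snippet below is only a sketch, and waitFor is a hypothetical helper name, not a browser API:

```javascript
// Hypothetical helper: resolves with the first truthy value returned by
// `probe`, or rejects once `timeout` milliseconds have passed.
const waitFor = (probe, { interval = 100, timeout = 5000 } = {}) =>
  new Promise((resolve, reject) => {
    const startedAt = Date.now();
    const tick = () => {
      let value;
      try {
        value = probe();
      } catch (e) {
        value = null; // e.g. the tab's document is not readable yet
      }
      if (value) return resolve(value);
      if (Date.now() - startedAt >= timeout) {
        return reject(new Error(`Timed out after ${timeout} ms`));
      }
      setTimeout(tick, interval);
    };
    tick();
  });

// Browser usage (same-origin tab only), reusing the element id from above:
// const tab = open('https://www.google.com/search?q=steemit');
// waitFor(() => tab.document.getElementById('resultStats'))
//   .then((el) => console.log(el.innerText))
//   .catch((e) => console.log('Gave up:', e.message))
//   .finally(() => tab.close());
```

The probe runs inside a try/catch so that a tab that is not ready yet counts as "try again" instead of crashing the loop.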



Let's now see with a bit of (bad :p) JS code how we can scrape the emails we talked about in the intro.

const TARGET_SENDER = '[email protected]'
const SESSION = 'SESSION_ID_HERE';
var openedWindows = {}; // our tab identifiers will be stored here
var err = 0; // count of the errors, we'll allow only a certain amount
var idx = 12000; // Number of emails in the mailbox, you can easily find this using the Network tab in the Developer Tools
localStorage.setItem('FAILED', '');
localStorage.setItem('FAILURE_CAUSES', '')

var extract = (idx) => {
   openedWindows['tab' + idx] =
      window.open(`https://PROVIDER_URL_HERE/email?operation=get&id=${idx}&folder=myFolder&session=${SESSION}`);
   setTimeout(() => { // let the new tab load..
     try {
       console.log('------ Extracting data and then closing tab for id ', idx)
       // textContent returns the raw text; innerHTML would HTML-escape characters such as & and <
       const scraped = openedWindows['tab' + idx]
         .document.getElementsByTagName('pre')[0].textContent;
       const json = JSON.parse(scraped).data;
       if(json.sender === TARGET_SENDER) {
         const msg = json.message.content; // or just a part of it
         localStorage.setItem(idx, msg);
         console.log('DONE - stored in LocalStorage');
       }
     } catch(e) {
       err++;
       let failed = localStorage.getItem('FAILED');
       failed += `${idx},`;
       localStorage.setItem('FAILED', failed);
       let causes = localStorage.getItem('FAILURE_CAUSES');
       if(causes.indexOf(''+e) == -1) {
         causes += `${e},`;
         localStorage.setItem('FAILURE_CAUSES', causes);
       }
     } finally {
       openedWindows['tab' + idx].close();
       delete openedWindows['tab' + idx];
     }
   }, 3000);
}

const timer = setInterval(() => {
   if (err > 10) {
      clearInterval(timer); // too many errors: stop opening new tabs
      alert(`More than 10 errors, see FAILED in localStorage. Current id: ${idx}`);
      return;
   }
   if (idx >= 0) {
      extract(idx--);
   } else {
      clearInterval(timer); // every id has been requested
      // download localStorage to local FS (see my other article)
   }
}, 4000);
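
The parsing step buried inside extract can also be pulled out into a small pure function, which makes it easy to try without opening any tab. This is just a sketch: extractMessage is a hypothetical name and the sample payload is made up, it only mirrors the shape the script above expects (data.sender and data.message.content).

```javascript
// Hypothetical helper: returns the message content when the payload comes
// from `targetSender`, or null otherwise. Throws on malformed input, so the
// caller can record the id as failed.
const extractMessage = (rawText, targetSender) => {
  const { data } = JSON.parse(rawText);
  if (!data || data.sender !== targetSender) return null;
  return data.message.content;
};

// Made-up payload with the same shape the script above reads.
const sample = JSON.stringify({
  data: { sender: 'alice@example.com', message: { content: 'Hello!' } },
});

console.log(extractMessage(sample, 'alice@example.com')); // Hello!
console.log(extractMessage(sample, 'bob@example.com'));   // null
```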


Since windows on the same origin share the same localStorage, we can poll the status of the requests from another window with:

let succCount = 0, failCount = 0, percent = 0;
let oldS, oldF;
let currTime = new Date().getTime();
const report = () => {
   oldS = succCount;
   oldF = failCount;
   succCount = Object.keys(localStorage).length - AMOUNT_OF_PREEXISTING_KEYS; // Some pre-existing keys may be stored by the email provider (count FAILED and FAILURE_CAUSES too)
   failCount = localStorage.getItem('FAILED').split(',').length - 1;
   // e.g. 2.3 %
   percent = Math.round(failCount / (succCount + failCount) * 100 * 10) / 10;
   // e.g. 60.5 seconds
   var diffTime = Math.round(((new Date().getTime() - currTime) / 1000) * 10) / 10;
   console.log(`-------------- ${new Date()}
         Executed ${(succCount + failCount) - (oldS + oldF)} requests in ${diffTime} seconds.
         - TOT SUCCEEDED: ${succCount}
         - TOT FAILED: ${failCount} (${percent ? percent : 0}%)
         Causes: ${localStorage.getItem('FAILURE_CAUSES')}`
   );
   currTime = new Date().getTime();
}
setInterval(() => report(), 60000);
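
The counting part of report can be written as a pure function too. The sketch below assumes that only the scraper writes purely numeric keys (the email ids), so no pre-existing key count is needed. That is an assumption about your provider, and summarize is a hypothetical name.

```javascript
// Hypothetical helper. `keys` is what Object.keys(localStorage) returns;
// `failedCsv` is the comma-terminated id list stored under 'FAILED'.
const summarize = (keys, failedCsv) => {
  // The scraper stores each email under its numeric id, so count only those
  // (assumption: the provider does not use purely numeric keys itself).
  const succeeded = keys.filter((k) => /^\d+$/.test(k)).length;
  const failed = (failedCsv || '').split(',').filter(Boolean).length;
  const total = succeeded + failed;
  // Percentage with one decimal; 0 when nothing has run yet (avoids NaN).
  const failedPercent = total ? Math.round((failed / total) * 1000) / 10 : 0;
  return { succeeded, failed, failedPercent };
};

console.log(summarize(['FAILED', 'FAILURE_CAUSES', '11999', '11998'], '12000,'));
// { succeeded: 2, failed: 1, failedPercent: 33.3 }
```

In the browser you would call it as summarize(Object.keys(localStorage), localStorage.getItem('FAILED')).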

To stop the scripts when the whole process is complete, or in case of errors, call clearInterval with the id that setInterval returned when it was executed, e.g. clearInterval(13383);
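
Rather than copying the id from the console by hand, you can keep it in a variable. A minimal sketch (startPolling is a hypothetical helper, not something the browser provides):

```javascript
// Hypothetical helper: runs `task` every `everyMs` milliseconds and
// returns a function that stops it.
const startPolling = (task, everyMs) => {
  const id = setInterval(task, everyMs);
  return () => clearInterval(id);
};

// Example: tick every 10 ms, then stop after the third tick.
let ticks = 0;
const stop = startPolling(() => {
  ticks += 1;
  if (ticks === 3) stop();
}, 10);
```

With the report function above it would become const stopReport = startPolling(report, 60000); and later stopReport();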

That's all Folks! Thanks for reading my article! :D

OTHER JAVASCRIPT ARTICLES:


Taking a nap
Scraping an email provider
Download files programmatically + XSS
Take screenshots programmatically
