PDFx - Extract references and metadata from PDF documents

in coding •  8 years ago 


While reading through many scientific papers and their references recently, I thought it would be great to be able to download all the references at once. This inspired me to write a little tool to do just that. It is now done and released under the Apache open source license:

https://github.com/metachris/pdfx

Features

  • Extract references and metadata from a given PDF
  • Detects PDF, URL, arXiv and DOI references
  • Fast, parallel download of all referenced PDFs
  • Find broken hyperlinks (using the -c flag)
  • Output as text or JSON (using the -j flag)
  • Extract the PDF text (using the --text flag)
  • Use as command-line tool or Python package
  • Compatible with Python 2 and 3
  • Works with local and online PDFs
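
To give a rough idea of what detecting URL, PDF, arXiv and DOI references in extracted text can look like, here is a minimal sketch in plain Python. It is not PDFx's actual implementation: the regular expressions, the `find_references` helper and the sample text are all assumptions made up for illustration, and the real tool may well detect and group references differently.

```python
import json
import re

# Illustrative patterns only (assumptions, not the patterns PDFx uses).
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)
ARXIV_RE = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)
DOI_RE = re.compile(r"\bdoi:\s*(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE)


def find_references(text):
    """Hypothetical helper: group the references found in `text` by type."""
    # Strip punctuation that trails a link at the end of a sentence.
    urls = {u.rstrip(".,;") for u in URL_RE.findall(text)}
    # In this sketch, links ending in .pdf count as PDF references,
    # and every other link counts as a plain URL reference.
    pdfs = {u for u in urls if u.lower().endswith(".pdf")}
    return {
        "pdf": sorted(pdfs),
        "url": sorted(urls - pdfs),
        "arxiv": sorted(set(ARXIV_RE.findall(text))),
        "doi": sorted({d.rstrip(".,;") for d in DOI_RE.findall(text)}),
    }


if __name__ == "__main__":
    sample = (
        "See https://example.com/paper.pdf, arXiv:1601.00001v2 "
        "and doi:10.1000/xyz123. Also http://example.org/page."
    )
    # Print the grouped references as JSON.
    print(json.dumps(find_references(sample), indent=2))
```

Grouping the matches by type keeps the result easy to print as text or dump as JSON, and the `pdf` list is the natural input for a download step.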

