While reading many scientific papers and their references recently, I thought it would be great to be able to download all the references at once… This inspired me to write a little tool to do just that, and now it’s done and released under the Apache open source license:
https://github.com/metachris/pdfx
Features
- Extract references and metadata from a given PDF
- Detect PDF, URL, arXiv and DOI references
- Fast, parallel download of all referenced PDFs
- Find broken hyperlinks (using the -c flag)
- Output as text or JSON (using the -j flag)
- Extract the PDF text (using the --text flag)
- Use as command-line tool or Python package
- Compatible with Python 2 and 3
- Works with local and online PDFs
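
To give a rough idea of what detecting PDF, URL, arXiv and DOI references in extracted text can look like, here is a minimal Python sketch. It is not taken from the pdfx source: the regular expressions, the `classify_references` function and the sample text are all assumptions made up for illustration, and a real tool would need to handle many more edge cases (line breaks inside links, DOIs without a `doi:` prefix, and so on).

```python
import re

# Illustrative patterns only (assumptions, not pdfx's actual rules).
ARXIV_RE = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)
DOI_RE = re.compile(r"\bdoi:\s*(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE)
URL_RE = re.compile(r"https?://[^\s\"<>]+", re.IGNORECASE)

# Punctuation that often trails a link at the end of a sentence.
TRAILING = ".,;)"


def classify_references(text):
    """Return a dict mapping reference type to a sorted list of unique matches."""
    refs = {"arxiv": set(), "doi": set(), "pdf": set(), "url": set()}

    for match in ARXIV_RE.finditer(text):
        refs["arxiv"].add(match.group(1))

    for match in DOI_RE.finditer(text):
        refs["doi"].add(match.group(1).rstrip(TRAILING))

    for match in URL_RE.finditer(text):
        url = match.group(0).rstrip(TRAILING)
        # Treat links ending in .pdf as direct PDF references.
        kind = "pdf" if url.lower().endswith(".pdf") else "url"
        refs[kind].add(url)

    return {kind: sorted(found) for kind, found in refs.items()}


if __name__ == "__main__":
    sample = (
        "See arXiv:1601.00001v2 and doi:10.1000/xyz123. "
        "Slides: https://example.com/talk.pdf, "
        "code at https://github.com/metachris/pdfx."
    )
    for kind, found in classify_references(sample).items():
        print(kind, found)
```

Collecting matches in sets removes duplicates, which helps when the same link appears several times in a paper, and keeping direct PDF links in their own bucket makes it easy to hand just those to a downloader.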