Command line tool to quickly inspect Parquet files

in utopian-io •  7 years ago  (edited)

Apache Parquet is a columnar storage format commonly used in the Hadoop ecosystem. If you work in the Big Data space, you probably work with Parquet files. Unlike commonly used data storage formats such as CSV and JSON, Parquet doesn't have the tools needed to quickly preview and inspect files. I often had to write Spark or Python code just to do very simple debugging.

To solve this problem, I created a CLI tool aptly named parquet-cli (the command is parq). It is released on PyPI and can be conveniently installed using pip: pip install parquet-cli

Initial features

It currently supports a basic but very useful feature set for working with Parquet files:

  • view file metadata
  • get schema information
  • get total count of rows in a file
  • get top N records (head)
  • get bottom N records (tail)

It only works with a single file as of now. However, I am planning to add support for directories. This means you could give the path to a partitioned directory and parq would work the same way as it does for a single file.

I wanted this tool to be very easy to install, so I deliberately kept dependencies to a minimum. For example, I really like click, but it has many third party dependencies, so I decided to use the built-in argparse library for CLI parsing. The only hard dependencies are Apache Arrow (for reading Parquet files) and pandas (for manipulating them). Both are part of the Python data stack and are well maintained.
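
To give an idea of what argparse-only CLI parsing looks like, here is a small sketch of a parser for a tool of this kind. The flag names (--schema, --count, --head, --tail) are assumptions chosen to mirror the feature list, not necessarily the exact options parq exposes, and build_parser is a hypothetical helper.

```python
import argparse


def build_parser():
    """Build a standard-library-only argument parser for a Parquet inspection tool.

    Flag names are illustrative assumptions, not parq's documented interface.
    """
    parser = argparse.ArgumentParser(
        prog="parq",
        description="Quickly inspect a Parquet file",
    )
    parser.add_argument("file", help="path to a single Parquet file")
    parser.add_argument("--schema", action="store_true", help="print schema information")
    parser.add_argument("--count", action="store_true", help="print total number of rows")
    parser.add_argument("--head", type=int, metavar="N", help="show the first N records")
    parser.add_argument("--tail", type=int, metavar="N", help="show the last N records")
    return parser
```

With this, build_parser().parse_args(["data.parquet", "--head", "5"]) yields a namespace where file is "data.parquet" and head is 5, with no third party package involved.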

Example usage

This initial feature set covers what I need. If you have any suggestions or find any bugs, you can open a ticket on GitHub. Needless to say, code contributions are very welcome too.



Posted on Utopian.io - Rewarding Open Source Contributors



Hi, your contribution was rejected because I found out that this already exists, so I think your project may be a little bit redundant. The only difference currently seems to be the --tail command. Since I don't know anything about Parquet, I asked others about it and they agreed it wasn't unique enough to be accepted. If you continue working on this project, please highlight what makes it unique in future contributions.

Also, have you ever heard of Click? It's made by Armin Ronacher, the same guy who made Flask, and it's amazing! I'd definitely recommend using that over argparse when creating a CLI.



[utopian-moderator]

Fair enough. A colleague sent me that link after I released this tool. I understand where you are coming from.
I think it's more of a convenience thing. I don't think anyone using the Python data stack would want to install and build a big Java code base just to check some file contents.