Apache Parquet is a columnar storage format commonly used in the Hadoop ecosystem. If you work in the Big Data space, you probably work with Parquet files. Unlike common data formats such as CSV or JSON, Parquet lacks simple tools for quickly previewing and inspecting files. I often had to write Spark or Python code just to do very basic debugging.
To solve this problem, I created a CLI tool aptly named parquet-cli (invoked as parq). It is released on PyPI and can be conveniently installed with pip: pip install parquet-cli
Initial features
It currently supports a basic but very useful feature set for working with Parquet files:
- view file metadata
- get schema information
- get total count of rows in a file
- get top N records (head)
- get bottom N records (tail)
It only works with a single file as of now, but I am planning to add support for directories. That means you will be able to point parq at a partitioned directory and it should work much the same way as it does for a single file.
I wanted this tool to be very easy to install, so I deliberately kept dependencies minimal. For example, I really like click, but it pulls in several third-party dependencies, so I decided to use the built-in argparse library for CLI parsing. The only hard dependencies are Apache Arrow (for reading Parquet files) and pandas (for manipulating them); both are part of the standard Python data stack and are well maintained.
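To give a sense of the argparse approach, here is a simplified, hypothetical skeleton of a parq-style interface (flag names here are illustrative, not necessarily parq's exact options):

```python
import argparse

def build_parser():
    # Hypothetical, simplified version of a parq-style CLI
    # built on the standard-library argparse module.
    parser = argparse.ArgumentParser(
        prog="parq", description="Inspect Parquet files")
    parser.add_argument("file", help="path to a Parquet file")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--schema", action="store_true",
                       help="show schema information")
    group.add_argument("--count", action="store_true",
                       help="show total row count")
    group.add_argument("--head", type=int, metavar="N",
                       help="show first N records")
    group.add_argument("--tail", type=int, metavar="N",
                       help="show last N records")
    return parser

args = build_parser().parse_args(["data.parquet", "--head", "10"])
```

Because argparse ships with Python, a CLI like this adds no install-time dependencies at all.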
This initial feature set covers what I personally need. If you have any suggestions or find any bugs, please open a ticket on GitHub. Needless to say, code contributions are very welcome too.
Posted on Utopian.io - Rewarding Open Source Contributors