Categories: Apache Cloudera Drill Hadoop Parquet Spark

John Snow Labs now delivers all datasets in #Apache #Parquet format. The new format drastically accelerates queries on common benchmarks. It also reduces disk space, bandwidth, and CPU usage. It is available alongside the existing CSV and JSON formats on all subscriptions.

Apache Parquet is an efficient, general-purpose columnar file format. It is self-describing and language-independent, and it supports multiple compression algorithms, partitioning for big data sets, and nested data structures. John Snow Labs is the first to deliver a data repository in Parquet format in the healthcare space, which is experiencing fast-growing adoption of big data analytics technologies.

Parquet was designed for Apache #Hadoop and has been adopted by Apache #Spark, #Cloudera #Impala, #Hive, #Presto and Apache #Drill. The majority of big data analytics platforms now recommend it as the most efficient, highest-performing data format. Here are recent publicly available benchmarks:
