How can I access data in Parquet format from HDFS?

We have a large dataset stored in Parquet files on a Hadoop file system (HDFS) and would like to use a MATLAB datastore to analyse it. Unfortunately, I couldn't find any reports of anybody having done this yet.
Does MathWorks provide a native way to access Parquet data? Could one use fileDatastore or a custom MATLAB datastore instead? Is there a template for that?

Accepted Answer

Hitesh Kumar Dasika on 20 Dec 2018
MathWorks has added support for Parquet files. It is available at the following link.

5 Comments

André Horn on 27 Dec 2018
This is really good news right before Christmas. Thanks to MathWorks for its efforts. We will definitely check this out.
Following the instructions, it is possible to read and write Parquet files, which is really good news. But creating a parquetDatastore is currently not supported on Windows, right? Are there any plans to extend datastore support to Windows systems?
Windows support is less complete because of the state of HDFS support on Windows.
This hasn’t been tested much, but if instead of writing
ds = parquetDatastore('*.parquet')
you write
ds = bigdata.parquet.ParquetDatastore('*.parquet')
you’re basically good to go.
Indeed, I'm able to access Parquet files hosted on a remote Hadoop Linux cluster from MATLAB on a local Windows PC.
It worked for me with the following steps:
1. I set up a local Hadoop installation on Windows according to
https://github.com/MuhammadBilalYar/Hadoop-On-Window/wiki/Step-by-step-Hadoop-2.8.0-installation-on-Window-10
2. log4j.properties must be copied from \hadoop-X.X.X\etc\hadoop\ to \matlab-parquet-master\Software\MATLAB\lib\jar\
3. The HADOOP_HOME environment variable should then point to the local Hadoop home directory instead of to Winutils.exe
4. I removed the check for Unix-style filenames in \matlab-parquet-master\Software\MATLAB\+bigdata\+parquet\Reader.m
5. The OS check must be removed from \matlab-parquet-master\Software\MATLAB\functions\parquetDatastore.m
or, as proposed above, you can use \matlab-parquet-master\Software\MATLAB\+bigdata\+parquet\ParquetDatastore.m directly
After these steps I could initialize a datastore from a remote Hadoop URL, for example:
ds=bigdata.parquet.ParquetDatastore('hdfs://server:port/dir','IncludeSubfolders',true)
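The file-related steps above can be checked before the datastore is created. The following is only a minimal sketch: checkParquetSetup is a hypothetical helper (it is not part of the matlab-parquet package), and the folder layout it looks for is taken from the paths quoted in the steps. It does not verify that the source edits in steps 4 and 5 were made.

```matlab
function status = checkParquetSetup(hadoopHome, packageRoot)
% checkParquetSetup  Hypothetical pre-flight check for the Windows steps above.
%   hadoopHome  - folder that HADOOP_HOME should point to (step 3)
%   packageRoot - folder where matlab-parquet-master was unpacked (assumption)

% Step 3: HADOOP_HOME must name the Hadoop home directory, not an executable.
status.hadoopHomeIsFolder = ~isempty(hadoopHome) && exist(hadoopHome, 'dir') == 7;

% Step 2: log4j.properties should sit next to the package's jar files.
log4jFile = fullfile(packageRoot, 'Software', 'MATLAB', 'lib', 'jar', ...
    'log4j.properties');
status.log4jCopied = exist(log4jFile, 'file') == 2;

% Alternative to step 5: the datastore class that can be used directly.
classFile = fullfile(packageRoot, 'Software', 'MATLAB', '+bigdata', ...
    '+parquet', 'ParquetDatastore.m');
status.datastoreClassFound = exist(classFile, 'file') == 2;

% Overall result: true only if every check passed.
status.ok = status.hadoopHomeIsFolder && status.log4jCopied && ...
    status.datastoreClassFound;
end
```

A call such as status = checkParquetSetup(getenv('HADOOP_HOME'), 'C:\matlab-parquet-master') would then show which step still needs attention; the package location in that call is an assumed example path.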
Hatem Helal on 10 Apr 2019
R2019a adds support for working with Parquet files; see this answer and let us know if you have any further feedback.


More Answers (2)

Hatem Helal on 10 Apr 2019
MATLAB R2019a adds support for reading and writing Apache Parquet files (doc). Here are the relevant release notes:
1. Import and export column-oriented data from Parquet files in MATLAB. Parquet is a columnar storage format that supports efficient compression and encoding schemes. To work with the Parquet file format, use these functions.
2. The write function now supports writing tall arrays to Parquet files. To write a tall array, set the FileType parameter to 'parquet', for example:
write('C:\myData',tX,'FileType','parquet')
3. Read a collection of Parquet files into the MATLAB workspace using parquetDatastore.
For more information on the Parquet file format, see https://parquet.apache.org/.
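As a rough illustration of the R2019a features listed above, here is a small round trip on local files. It is a sketch, not taken from the release notes: the scratch folder, file name, and table contents are made up for the example, and it assumes R2019a or later.

```matlab
% Round trip with the built-in Parquet functions (assumes R2019a or later).
folder = tempname;                         % scratch folder, example location only
mkdir(folder);
file = fullfile(folder, 'sample.parquet'); % hypothetical file name

% Small example table; the variable names are arbitrary.
T = table([1; 2; 3], [0.5; 1.5; 2.5], 'VariableNames', {'id', 'value'});

parquetwrite(file, T);        % export the table to a Parquet file
info = parquetinfo(file);     % read file metadata without loading the data
disp(info.VariableNames)

T2 = parquetread(file);       % import the whole file as a table

% Treat all Parquet files in the folder as one collection and read it in chunks.
ds = parquetDatastore(folder);
numRows = 0;
while hasdata(ds)
    chunk = read(ds);         % each chunk is a table
    numRows = numRows + height(chunk);
end
fprintf('Read %d rows from %d file(s).\n', numRows, numel(ds.Files));
```

For data on a cluster, the same datastore call would presumably take an hdfs:// location like the one shown earlier in this thread; whether that works depends on the release and the local Hadoop configuration, so check the parquetDatastore documentation for your version.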
Hitesh Kumar Dasika on 24 Sep 2018
Currently, there is no support for Apache Arrow or Parquet files in MATLAB.

3 Comments

André Horn on 24 Sep 2018
Edited: André Horn on 24 Sep 2018
OK, but does MathWorks have any plans to go in this direction? What if I spent some effort on implementing a bridge? Is there a typical approach? How could I start?
I feel that MathWorks makes a big fuss about comprehensive big data support, but when it comes to professional applications it falls short very quickly. This is rather disappointing.
Thank you for your feedback. We have raised this concern with our developers and they are actively looking at including this feature in our future releases. Unfortunately, there is no workaround in this case for now. Sorry for the trouble.
Hatem Helal on 10 Apr 2019
R2019a adds support for working with Parquet files; see this answer and let us know if you have any further feedback.


Release: R2017a
Asked: 17 Sep 2018
Commented: 10 Apr 2019
