Big data question: how to generate a variable efficiently and aggregate
I have a file with tens of millions of observations and a string identifier, which I load as a datastore:
V1      V2          V3  V4
KLM88   2001-06-30  10  COMPANY1
KLM88   2000-12-31  20  COMPANY1
MNH7C   2001-09-30  23  COMPANY1
MNH7C   2001-06-30  15  COMPANY1
MNH7C   2000-12-31   6  COMPANY1
HG9LB   2000-12-31   2  COMPANY1
I also have a MAT-file with some extra information, which matches the first variable (V1) to a country:
KLM88   COUNTRYA
MNH7C   COUNTRYA
HG9LB   COUNTRYB
The end result I want is my dataset aggregated by country, date, and company:
2001-09-30  23  COMPANY1  COUNTRYA
2001-06-30  25  COMPANY1  COUNTRYA
2000-12-31  26  COMPANY1  COUNTRYA
2000-12-31   2  COMPANY1  COUNTRYB
I know I can do this by reading the data chunk by chunk and assigning the country in a for loop. However, that takes a huge amount of time. Are there any other suggestions for how to do this? I am fairly new to the concepts of tall arrays, mapreduce, etc., so I am not sure how to get what I want more efficiently.
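To make the question concrete, here is a minimal sketch of one possible approach, not necessarily the one from the accepted answer: replace the per-row for loop with a vectorized lookup via `ismember`, then aggregate with `groupsummary`. The sample rows are copied from the tables above. The names `chunk`, `lookup`, `ID`, `Country`, and `agg` are made up for illustration, and it is assumed that rows whose identifier has no match in the lookup table can be dropped.

```matlab
% Stand-in for one chunk of the datastore (in practice: chunk = read(ds)).
ids   = ["KLM88"; "KLM88"; "MNH7C"; "MNH7C"; "MNH7C"; "HG9LB"];
dates = datetime(["2001-06-30"; "2000-12-31"; "2001-09-30"; ...
                  "2001-06-30"; "2000-12-31"; "2000-12-31"], ...
                 'InputFormat', 'yyyy-MM-dd');
vals  = [10; 20; 23; 15; 6; 2];
comp  = repmat("COMPANY1", 6, 1);
chunk = table(ids, dates, vals, comp, ...
              'VariableNames', {'V1', 'V2', 'V3', 'V4'});

% Stand-in for the contents of the MAT-file (hypothetical variable names).
lookup = table(["KLM88"; "MNH7C"; "HG9LB"], ...
               ["COUNTRYA"; "COUNTRYA"; "COUNTRYB"], ...
               'VariableNames', {'ID', 'Country'});

% Vectorized lookup: loc(i) is the row of lookup matching chunk.V1(i).
[found, loc] = ismember(chunk.V1, lookup.ID);
chunk = chunk(found, :);                      % assumption: drop unmatched rows
chunk.Country = lookup.Country(loc(found));   % assign all countries at once

% Sum V3 within each (date, company, country) group.
agg = groupsummary(chunk, {'V2', 'V4', 'Country'}, 'sum', 'V3');
disp(agg)
```

With a real datastore, the same two steps could run once per chunk inside a `while hasdata(ds)` loop. The per-chunk `agg` tables could then be concatenated and passed through `groupsummary` again, summing `sum_V3`, because sums of partial sums give the overall sums. Whether this is fast enough for tens of millions of rows would need to be measured.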