Unusually large size of a variable saved as a .mat file

11 views (last 30 days)
Milad Ghobadi
Milad Ghobadi on 27 Feb 2020
Edited: Simon on 7 Sep 2023
Hello everyone,
Can anybody explain to me why my .mat file gets so large when I save, for example, a variable containing cells?
Here is my case:
Imagine I have a text file that is just 50 MB, and each line in this file looks like this:
1567461735.497778;5051030001001903;800001E1;Sttel_ZLG1;INV;0.093750
1567461735.498107;40403C3C00000000;80000201;Sttel_ZLG;TKK1;24.000000
...
So I read this text file with textscan() and then save the variable C with save(), as below:
fid = fopen('samples.txt','rt');
C = textscan(fid,'%f %s %s %s %s %f','delimiter',';');
fclose(fid);
save('test.mat','C')
the test.mat file will be over 1 GB. How is it possible for it to get so much bigger when the original file is just 50 MB? And is there any solution for this?
I hope you can help me
Thank you
  6 comments
Milad Ghobadi
Milad Ghobadi on 29 Feb 2020
Edited: Milad Ghobadi on 29 Feb 2020
I attached a chunk of the data, and below is a version of my code which I'm sure works; I need to use save() to store my structure.
tic
[fid,msg] = fopen('CANDec(test).txt','rt');
assert(fid>=3,msg)
C = textscan(fid,'%f %s %s %s %s %f','delimiter',';');
fclose(fid);
zt = C{1};   % zeit: POSIX timestamp
hx = C{2};   % hex message payload
tn = C{4};   % telegram name
sn = C{5};   % signal name
dw = C{6};   % decimal value (Wert)
[fn,~,ix] = unique(sn);
can = struct();
for i = 1:max(ix)
    can(i).signal_name = fn{i};
    can(i).telegramm_name = tn{i};
    can(i).decimal_wert = dw(i==ix);
    can(i).zeit = zt(i==ix);
    can(i).hex = hx(i==ix);
end
save('CANDec(day_1_9)_2.mat','can', '-v7.3')
toc
I expect the size of the structure that I saved not to be much, much bigger than my data, but it is.
thank you for your support.
dpb
dpb on 5 Mar 2020
Edited: dpb on 5 Mar 2020
"I'm sure the problem is in textscan and how it read my text..."
Nothing is wrong with textscan itself; it's how you're using it. It returned a 1x6 cell array, one cell for each of the six columns in the text file. Since the data aren't all of the same class, and you didn't use the 'CollectOutput' input parameter to merge like data types into the minimum number of cell arrays, it built one for each.
There are/will be the 977K lines of data as rows in the cell arrays; you can treat them as you wish, as demonstrated below; a table would be one way.
The internal memory use is owing to the choice of data structure you pick; it is the implementation overhead of whichever one you select, but that overhead belongs to those datatype classes and is not the fault of textscan.
It's your responsibility to use the proper functionality within textscan, or to otherwise manipulate the data returned from it, to match what the data itself are, and to pick the most appropriate structure for the problem; MATLAB can't necessarily do that automagically for you. The most efficient is always to use the base numeric classes for absolutely everything possible; that's not always the most convenient programming way, however.
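As a sketch of the 'CollectOutput' point (untested here; the file name is the one from the original question): the option merges runs of consecutive same-class columns, so with this format string it can only collapse the four string columns into one cell array, while the two float columns stay separate because they are not adjacent to each other:

```matlab
% Sketch: 'CollectOutput' merges consecutive columns of the same class.
% With '%f %s %s %s %s %f', columns 2-5 collapse into one N-by-4 cellstr.
fid = fopen('samples.txt','rt');
C = textscan(fid,'%f %s %s %s %s %f','Delimiter',';','CollectOutput',true);
fclose(fid);
% C is now 1x3: { N-by-1 double, N-by-4 cell, N-by-1 double }
```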


Accepted Answer

dpb
dpb on 29 Feb 2020
Edited: dpb on 1 Mar 2020
"I expect the size of this structure ... not to be much much bigger than my data..."
That's perhaps a reasonable expectation on the surface, unfortunately as the famous aphorism has it, "In theory, there is no difference between practice and theory. In practice, there is."
The overhead in the struct is significant in comparison to the raw data the structure contains; for the organization by signal that you have created, you've paid a large price in memory to hold that structure. Perhaps TMW should have found a more efficient implementation for the struct class, but that's what it is, so to use it, that's the price, unfortunately.
I tried one optimization that cut the overhead on the sample case down by over a factor of two; whether you can make use of it in the real application I don't know. I turned the HEX string variable into a categorical one, but only for the categories in each message, NOT the entire file -- doing the latter actually increases the storage significantly, because the categorical datatype carries the information for every level in every instantiation, so when it gets duplicated, that overhead proliferates as well. However, if you can process the data looking at only one set at a time, then you should be able to get by with this.
The only line I changed in your code is:
can(i).hex = categorical(hx(i==ix));
which resulted in the following comparison -- CELLCAN is a copy of the original cell struct saved for the comparison.
>> whos CELLCAN can
  Name         Size      Bytes   Class    Attributes
  CELLCAN      1x149    477380   struct
  can          1x149    210680   struct
>>
>> CELLCAN(1)
ans =
struct with fields:
signal_name: 'AKHZ3S1'
telegramm_name: 'Stat_NEM_1'
decimal_wert: [17×1 double]
zeit: [17×1 double]
hex: {17×1 cell}
>> can(1)
ans =
struct with fields:
signal_name: 'AKHZ3S1'
telegramm_name: 'Stat_NEM_1'
decimal_wert: [17×1 double]
zeit: [17×1 double]
hex: [17×1 categorical]
>>
Another alternative you might consider would be not to build the struct at all, but to do the processing on the fly... whether that's practical depends on just what one is after in the end, I suspect, but I don't have any idea what the next step(s) is(are).
t=table(zt, hx, tn, sn, dw); % make a table of the desired variables
t.hx=categorical(t.hx); % turn those appropriate into categorical
t.tn=categorical(t.tn); % that saves memory in single location
t.sn=categorical(t.sn); % as long as don't duplicate the whole thing
This shows a distinct memory savings over both the struct and the input file as far as saving the data in usable form...
>> whos t CELLCAN
  Name         Size       Bytes   Class    Attributes
  CELLCAN      1x149     477380   struct
  t            2426x5     93910   table
>> t(1:10,:)
ans =
10×5 table
zt hx tn sn dw
__________ ________________ _______________ ____________ _______
1.5675e+09 400100C080000000 Stat_NEM_1 HEAT_ON 0
1.5675e+09 400100C080000000 Stat_NEM_1 COOL_ON 1
1.5675e+09 400100C080000000 Stat_NEM_1 CLIM_ON 0
1.5675e+09 0E000C0014000000 Stat_Bat_BMU_09 BMU_UBAT_09 0.69641
1.5675e+09 0E000C0014000000 Stat_Bat_BMU_9 BMU_IBAT_09 0.17234
1.5675e+09 0E000C0014000000 Stat_Bat_BMU_9 BMU_TKK_09 20
1.5675e+09 0E000C0014000000 Stat_Bat_BMU_9 BMU_ANLU1_09 0
1.5675e+09 0E000C0014000000 Stat_Bat_BMU_9 BMU_ANLU2_09 1
1.5675e+09 7F00A800F4010000 Bat1_Aw_T1 BATIMAXELA1 509
1.5675e+09 7F00A800F4010000 Bat1_Aw_T1 BATIMAXLA1 0
>>
To use this format, which lacks the by-sn organization of your struct, use:
>> [g,ig]=findgroups(t.sn);
>> whos g ig
  Name       Size      Bytes   Class          Attributes
  g          2426x1    19408   double
  ig         149x1     19261   categorical
>>
Here, ig is the same set of 149 unique values of sn, and g is the lookup vector for finding each group in t, as the index variable from unique was before.
splitapply and/or rowfun let you process each group and return whatever results you wish for each as you go. You trade some processing for storage, although you can create a new summary table of length 149 as output, which may be where you're trying to get to anyway...
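For instance, a sketch of such a 149-row summary table (the variable names match the table t built above; the particular statistics chosen are just illustrative, not anything from the original thread):

```matlab
% Sketch: one summary row per signal instead of a struct array.
[g,ig]  = findgroups(t.sn);                       % g: group index per row of t
meanDW  = splitapply(@mean,  t.dw, g);            % mean decimal value per signal
nObs    = splitapply(@numel, t.dw, g);            % observation count per signal
tSummary = table(ig, meanDW, nObs, ...
    'VariableNames', {'signal','mean_dw','n'});   % 149-row result
```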
ADDENDUM:
tCAN=struct2table(can,'AsArray',true); % convert the struct to a table with arrays
results in table organized similarly to the struct...
>> tCAN(1:10,:)
ans =
10×5 table
signal_name telegramm_name decimal_wert zeit hex
___________ _______________ _____________ _____________ __________________
AKHZ3S1 Stat_NEM_1 {17×1 double} {17×1 double} {17×1 categorical}
AKHZ3S2 Stat_NEM_1 {17×1 double} {17×1 double} {17×1 categorical}
AKHZ4S1 Stat_NEM_1 {17×1 double} {17×1 double} {17×1 categorical}
AKHZ4S2 Stat_Bat_BMU_09 {17×1 double} {17×1 double} {17×1 categorical}
BATIMAXELA1 Stat_Bat_BMU_9 {33×1 double} {33×1 double} {33×1 categorical}
BATIMAXELA2 Stat_Bat_BMU_9 {33×1 double} {33×1 double} {33×1 categorical}
BATIMAXLA1 Stat_Bat_BMU_9 {33×1 double} {33×1 double} {33×1 categorical}
BATIMAXLA2 Stat_Bat_BMU_9 {33×1 double} {33×1 double} {33×1 categorical}
BATUMAXELA1 Bat1_Aw_T1 {17×1 double} {17×1 double} {17×1 categorical}
BATUMAXELA2 Bat1_Aw_T1 {17×1 double} {17×1 double} {17×1 categorical}
>> save canDECstruct2table tCAN -v7.3
>> !dir can*.mat
...
02/29/2020 10:08 AM 1,462,856 CANDec(day_1_9)_2.mat
02/29/2020 02:12 PM 1,081,744 canDECcateg.mat
03/01/2020 07:47 AM 802,303 canDECstruct2table.mat
02/29/2020 02:28 PM 195,628 canDECtable.mat
5 File(s) 3,561,171 bytes
0 Dir(s) 819,800,248,320 bytes free
>>
>> whos t tCAN
  Name       Size       Bytes   Class   Attributes
  t          2426x5     93911   table
  tCAN       149x5     196162   table
>>
Although the use of the cell arrays inside the table also doubles the memory, the footprint is still much better than the struct's. But the -v7.3 storage problem comes back with the cell arrays in the table: not quite as bad as the struct, but still 4X the memory.
Guess the moral of the story is there is no free lunch but this one is an expensive dinner!
ADDENDUM DEUX:
This might be a place where the 'RowNames' property is useful -- dunno, again depends on what the next step(s) is(are)
  1 comment
Simon
Simon on 7 Sep 2023
Edited: Simon on 7 Sep 2023
>> The only line I changed in your code is:
>> can(i).hex = categorical(hx(i==ix));
I use the same method to downsize cell arrays. In my case, they contain a mixture of double and string data, but mainly string. The string variables can reasonably be treated as categorical. Though a couple of the string variables generate a large number of categories, the categorical recasting still reduces the storage size significantly.
However, in some processing the categorical variables can be much slower than strings, so I recast them back to string for that job.


More Answers (1)

Walter Roberson
Walter Roberson on 29 Feb 2020
Cell arrays and structs are represented inefficiently in v7.3 files. Although HDF5 does provide a compound data type, it does not provide much in the way of nested data types. The compression is also not as effective in HDF5.
Basically, if you have a large object to save and it is not pure numeric, you will certainly get a much larger v7.3 file.
  1 comment
dpb
dpb on 29 Feb 2020
Edited: dpb on 29 Feb 2020
Yeah, that's disappointing, indeed. Not being able to do anything about that problem, I looked for ways to shrink the storage internally and hoped that would spill over to the external storage as well. It seemed to help at least some...
>> !dir can*.mat
...
02/29/2020 10:08 AM 1,462,856 CANDec(day_1_9)_2.mat
02/29/2020 02:12 PM 1,081,744 canDECcateg.mat
2 File(s) 2,544,600 bytes
0 Dir(s) 819,996,459,008 bytes free
>>
where the second file is the above struct with the categorical variable for the HEX field.
I suppose another memory-saving device might be to use single() for the decimal_wert floating-point field instead of double(); it doesn't look as though it holds high-precision values. The POSIX time requires double, though, so that field can't be converted.
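A sketch of that conversion (using the struct form from the question; whether single precision actually suffices for decimal_wert is for the OP to verify against the real data):

```matlab
% Sketch: halve the numeric footprint of decimal_wert where precision allows.
% zeit stays double: POSIX timestamps near 1.5e9 with sub-second resolution
% exceed the ~7 significant digits that single can represent.
for i = 1:numel(can)
    can(i).decimal_wert = single(can(i).decimal_wert);
end
```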
But the .mat-file storage explosion with struct is really amazing when compared to the table form above... as is, it would appear, the effectiveness of compression without the struct.
>> whos t
  Name      Size      Bytes   Class   Attributes
  t         2426x5    93910   table
>> save canDECtable t -v7.3
>> !dir can*.mat
...
02/29/2020 10:08 AM 1,462,856 CANDec(day_1_9)_2.mat
02/29/2020 02:12 PM 1,081,744 canDECcateg.mat
02/29/2020 02:28 PM 195,628 canDECtable.mat
3 File(s) 2,740,228 bytes
0 Dir(s) 819,996,258,304 bytes free
>>


Version

R2018a
