How to preallocate memory for building this structure, indexing fieldnames?

I have a structure called "Result" in several files and would like to merge all of them into one structure. My difficulty is that the fieldnames right after "Result." are built from strings identifying experiment names, and since these experiment names and their number are unknown at this point, I have to address them by dynamic indexing.
So far this indexing works and it merges my data correctly, but the preallocation of memory is missing:
% START OF A LOOP THROUGH MANY FILES, RETRIEVING THE NEXT ID
NewData = load(id); % the file referenced in id contains a structure called "Result"
casename = fieldnames(NewData.Result);
cases = size(casename,1);
% preallocation of memory could fit in here, in this line
for caseIndex = 1:cases
    Result.(casename{caseIndex}).MyValue = ...
        NewData.Result.(casename{caseIndex}).MyValue;
end
% END OF THE LOOP THROUGH MANY FILES
Now I tried to preallocate memory by the following failing attempt:
Result.(casename{1:cases}).MyValue = zeros(cases,1);
This one also failed:
Result.(casename{[1:cases]}).MyValue = zeros(cases,1);
Do you have any idea what the correct syntax should look like?

2 comments

How many files are you talking about? Are the case names in each file unique, or is there potential overlap of names amongst files? There may be a way to do some meaningful pre-allocation for your proposed struct organization, but are we talking about a Result struct with 100's or 1000's (or more) of field names?
Marco
Marco on 9 Mar 2015
Edited: Marco on 9 Mar 2015
The casenames are unique; there will be about 50 fieldnames in total, just to give a sense of the scale. But it could also be only 20, or up to 100.
In one of my experiment series, I produce about 5 files, each containing the structure with about 10 fieldnames. In another experiment series, I produce about 10 files, each containing the structure with about 5 fieldnames.
Following the advice given by Adam (see his answer), I learned about and have since used the Profiler, and found that 98% of the time my code is busy accessing the HDD in the load(id) line. So my question is clearly no longer targeting performance, but I am still interested in learning how I "could" code the preallocation in a clean way, just to learn how to program such a thing.


Accepted Answer

Stephen23
Stephen23 on 9 Mar 2015
Edited: Stephen23 on 9 Mar 2015
Unlike numeric and character arrays, structures and cell arrays do not (according to the documentation) require completely contiguous memory. It is sufficient to preallocate just the cell array or structure itself; this does not require also preallocating the arrays stored inside it: those can simply be empty, as they are not stored in the same memory block as the structure or cell array itself. You can read more about them here:
It is apparently slower to try to preallocate the data arrays (inside the structure or cell array):
Quoting Jan Simon from the above link: For this reasons it is e.g. useless to "pre-allocate" the elements of a cell array, while pre-allocating the cell itself is strongly recommended. The same also applies to structures.
This topic is also addressed very well by Loren Shure in one of her blogs:
Where she says: Of course it depends on your specifics, but since each field is its own MATLAB array, there is not necessarily a need to initialize them all up front. The key however is to try to not grow either the structure itself or any of its contents incrementally.
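Jan Simon's point can be sketched in a few lines (the sizes here are arbitrary, chosen only for illustration):

```matlab
% Preallocate the cell array itself once (recommended) ...
c = cell(1, 1000);
for k = 1:1000
    % ... but the contents need no separate preallocation: each element
    % is its own array, stored outside the cell's own memory block.
    c{k} = rand(100, 1);
end
```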

5 comments

I am sorry for not understanding, not even after studying the literature you linked me to. Loren states: "The key however is to try to not grow either the struct itself or any of its contents incrementally." But isn't that exactly my case: my main structure grows dynamically by merging more structures into it, because the fieldname "ExperimentNameX" is created dynamically?
For example, if I have initially this structure:
Result.ExperimentName1.MyValue = AnyNumber
Result.ExperimentName2.MyValue = AnyNumber
Result.ExperimentName3.MyValue = AnyNumber
and I want to add to it later on the following fields (outer loop loads more data, inner loop adds the fields):
Result.ExperimentName4.MyValue = AnyNumber
Result.ExperimentName5.MyValue = AnyNumber
and again, the outer loops loads more data, the inner loop adds the newly found fields to the structure:
Result.ExperimentName6.MyValue = AnyNumber
Result.ExperimentName7.MyValue = AnyNumber
Result.ExperimentName8.MyValue = AnyNumber
Result.ExperimentName9.MyValue = AnyNumber
Let me clarify that I of course have more than only the "MyValue" field under Result.ExperimentNameX:
Result.ExperimentName1.MyValueA = AnyNumber
Result.ExperimentName1.MyValueB = AnotherNumber
Result.ExperimentName1.MyValueC = AnyText
Result.ExperimentName1.MyValueD = AnotherText
etc.
I just didn't show this detail in my initial post because I thought it would not be needed to ask about the indexing syntax problem.
If the indexing syntax problem cannot be solved, i.e. if I can't preallocate the structure header memory by something like:
Result.{1:cases}.MyValue = [];
then I probably should think about designing my structure differently, as something like:
Result.ExperimentName = casename{case}
Result.MyValue = AnyNumber
and then later on I would have to search my structure for the index at which a certain casename is stored in ExperimentName(index), and once I have found that index I would use it to correspondingly address MyValueA(index), MyValueB(index), MyValueC(index), etc., to access the set of data collected for a certain experiment.
Is this how I should understand it; is this how I am recommended to program?
You should consider using a non-scalar structure. If the fields are the same for all of your experiments (or are mostly the same), then instead of creating a scalar structure with lots of different levels and fieldnames like this:
Result.Exp1.Value = ...
Result.Exp1.Param = ...
Result.Exp1.Name = ...
Result.Exp2.Value = ...
Result.Exp2.Param = ...
Result.Exp2.Name = ...
you should instead use a non-scalar structure, which would be simpler and much more versatile to access. It would work like this:
Result(1).Value = ...
Result(1).Param = ...
Result(1).Name = ...
Result(2).Value = ...
Result(2).Param = ...
Result(2).Name = ...
In this case you end up with a structure of size 1xN, where N is the number of experiments that you have. This structure has only three fields: Value, Param and Name. It is very convenient to use too, e.g. to access Param for the fourth experiment:
Result(4).Param
or to access all of the Param values in a comma separated list you can use:
Result.Param
which means you could put all of the Param values in a cell array like this:
{Result.Param}
or concatenate all of the Value arrays together like this:
[Result.Value]
which is thus simply equivalent to
[Result(1).Value, Result(2).Value, ...]
There are lots of other useful tools for working with structures or extracting the data in different ways. If you choose to use a non-scalar structure, then you should preallocate the structure itself using struct (or a backwards loop, see below):
s = struct(field,value)
where the docs state: "If value is a cell array, then s is a structure array with the same dimensions as value. Each element of s contains the corresponding element of value." Note that you do not need to preallocate the data inside the structure.
s = struct('Value',cell(1,50))
will create a 1x50 structure, with [] for each Value element. Doing this preallocates the structure itself, but not the data inside it, as my original answer recommends. Another neat way to "preallocate" any array (numeric, cell, etc.) is to loop over it backwards, so that the last element is the first to be created:
for k = 50:-1:1
Result(k).Value = ...
Result(k).Param = ...
Result(k).Name = ...
end
Thus the whole structure is created on the first loop iteration, with exactly the required size. Doing this is likely to be the fastest way of creating a non-scalar structure.
Thanks a lot for all the explanations and effort, Stephen! The thing is that I do not know to which final size (to which number of fields) my structure might grow while gathering more and more data in the loop over all my data files. I thought that, instead of growing my structure field by field, I could at least grow it in blocks of fields, because once a data file is opened I can see how many new fields will be added from that file. That way the structure would not have to go through so many incremental grow operations, but just one size change per data file. I already commented on Adam's answer that at this point I might better conclude that my code is not improvable here and skip any preallocation step, or in general redesign the concept of how I store and access my data, considering (and studying) what you pointed me to.
These are two different issues: the number of fields and the number of experiments. What you are doing now mixes these two concepts together, with the resulting difficulties that you are facing.
Your statements, e.g. "that I do not know to which final size (to which quantity of fields) my structure might grow, gathering more and more data while looping through all my data files" do not actually tell us anything about how your data is organized: does each file correspond to one experiment, or multiple experiments? Do the measured values (fields) change between experiments?
You need to seriously consider using a non-scalar structure, depending on how your data is arranged, and in particular based on this question: Are the fields the same for each experiment?
For example, every experiment might have the following four values:
Results.Temperature = ...
Results.Parameters = ...
Results.Sensor1 = ...
Results.Sensor2 = ...
If they are the same, then a non-scalar structure would be the simplest, fastest and neatest option for storing your data.
Marco
Marco on 9 Mar 2015
Edited: Marco on 10 Mar 2015
There is a saying, which I will try to translate into English: a bad concept can never be patched into the same success that a good concept would give you.
Stephen, you reminded me of this saying, thanks! I am already re-studying my literature and the official docs about structures and cells, and I will also have a look at tables, keeping in mind your recommendation to use the non-scalar structure. My data can be organized as in your last example.
I think I ran into the bad idea of using dynamically generated fieldnames because I didn't see how to later extract a set of data from the structure if I only remember the casename, but do not know at which index this case was placed in the structure. I will especially watch out for a chapter explaining how to search [Result.ExperimentName] for my casename string and, if it is found, how to derive the corresponding index number so I can reach the rest of the data via Result(index). It could work by looping through the structure, but maybe there is some more elegant solution for this.
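That lookup does not need a loop; a sketch using strcmp on the comma-separated list (all names and values here are invented for illustration):

```matlab
% Hypothetical non-scalar structure with an ExperimentName field per element:
Result(1).ExperimentName = 'Exp1';  Result(1).MyValueA = 11;
Result(2).ExperimentName = 'Exp2';  Result(2).MyValueA = 22;
Result(3).ExperimentName = 'Exp3';  Result(3).MyValueA = 33;

% Find the index of a given casename without an explicit loop:
casename = 'Exp2';
idx = find(strcmp({Result.ExperimentName}, casename));

% idx can then be used to access the rest of that experiment's data:
Result(idx).MyValueA   % -> 22
```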
Every day something new on my list. But hey(!), I am progressing little by little, and my tiny program grows bigger and bigger. By the way, as my interest is in image processing, and I have written some code which loops over my test images applying many combinations of different parameters for selecting image enhancement and segmentation algorithms, you are helping me to merge the result files collected in the various runs, so that I can better hunt for the most promising parameter set in the consolidated data. I could do this much quicker by copy & paste to Excel, and for sure the statistical analysis would also be faster for me there, but I also want to learn MATLAB, so I will do it in MATLAB. Fortunately I have absolutely no time pressure, no deadline for this project :-)


More Answers (1)

Adam
Adam on 9 Mar 2015
Why do you need to pre-allocate? Aren't you simply copying values from one struct to another without any dynamic resizing going on of any individual field of the new struct? I don't see that pre-allocating zeros and then over-writing them with the same size of your actual data will gain you anything.

10 comments

I updated my question to explain myself more clearly, and add here, since you are asking: my structure "Result" grows dynamically each time the next file is loaded. In each loaded file a structure variable also called "Result" is found; I load it into "NewData" and thus receive it as "NewData.Result". All the content from "NewData.Result" is then added to the structure variable "Result". This is how I merge all the "Result" structures found in the various files. The merging works; no problems with that part of the algorithm. My structure "Result" grows by the fields found in "NewData.Result" each time the next file is loaded.
From my understanding of your code (admittedly I have only glanced over it) you are dynamically creating a new field of your structure on each iteration of the for loop, not dynamically expanding an array within an existing structure field.
In this case I don't see a need for any pre-allocating as mentioned above. Dynamically created fields don't require presizing when you create the struct (and they can't be since a field can contain anything).
You could try to create the struct upfront with all its fields already containing pre-allocated arrays, but as mentioned this is unnecessary, and slower rather than faster, if you are simply going to copy data over the top of those pre-sized arrays anyway.
Marco
Marco on 9 Mar 2015
Edited: Marco on 9 Mar 2015
Adam said: "From my understanding of your code you are dynamically creating a new field of your structure each iteration of the for loop though, not dynamically expanding an array within and existing structure field."
Yes, exactly this is what I am doing, just as you describe it. And I thought it would be a good idea to tell the structure how many new fields are going to be added before I actually add them.
Adam
Adam on 9 Mar 2015
Edited: Adam on 9 Mar 2015
The problem is though that a field can contain anything so memory cannot be pre-allocated anyway based only on the fact that a field with some name will exist.
Dynamically expanding a structure with new fields is, as far as I am aware, not a problem (though Stephen's answer gives you plenty of more solid content on that, and I too am a little unsure what Loren Shure's quote that you queried means, so I may be wrong).
Does your code have a problem with speed or memory usage or something else that is causing you to feel you need to do this?
> Does your code have a problem with speed or memory usage or something else that is causing you to feel you need to do this?
Well, my still-tiny programs perform more or less acceptably. But I am learning to program, unfortunately without any other MATLAB user or even a teacher around. So I try to always proceed in the proper way, following "good style" and "speed relevant" recommendations as I find them documented or discussed somewhere. But here, in this case, I got stuck. I just couldn't figure out how to index correctly and really would have liked to learn how to do it properly. The construct in the inner loop was quite interesting to become aware of, this thing with the round brackets around casename{caseIndex}:
Result.(casename{caseIndex}).MyValue
I then thought to advance this from using a single {caseIndex} to a vector of indices, like maybe {[1:cases]}, but nothing worked. From the received answers I conclude that it might not be possible, and that the solution is to accept that my code cannot be improved in performance at that point without changing the concept, i.e. without designing my structure differently in general. I think, for readability (which is also important), I will just drop the idea of improving it and keep it as it is, without any preallocation. Nevertheless, I will continue to study Stephen's ideas in more detail; for sure there is something important in them for me to learn. Stephen and all the people he cited will have a reason to present the use of structures their way, to nowhere point to a case like mine, and to actually recommend avoiding incrementally growing a structure where possible.
Adam
Adam on 9 Mar 2015
Edited: Adam on 9 Mar 2015
One valuable piece of advice across any programming language and project is not to try to pre-optimise code.
Obviously where you know one method is faster than another and it does not take any more effort to use the faster version (e.g. vectorised code instead of for loops) then clearly you should do this. Sometimes you may also want to try to speed up a bit of code purely as a learning process. This is also fine, but is still subject to the following advice:
Before you start doing any code optimisation:
  • Determine whether your code actually needs optimising (is it running too slowly or are you just optimising it because you think it can be optimised therefore you assume it should be?)
  • Use the MATLAB Profiler on your code to tell you which parts of the code are the bottlenecks. It is very easy to use compared to profilers for many other languages. There is no point spending ages trying to speed up a piece of code if that piece actually only contributes 1% of your program's total time. Even if you could make it instantaneous, your overall program still won't improve its speed by more than that 1%.
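Using the Profiler is a two-step affair; a minimal sketch (the script name is hypothetical):

```matlab
profile on          % start recording where time is spent
myAnalysisScript    % run the code under investigation (made-up name)
profile viewer      % open the interactive report showing the bottlenecks
```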
One thing I do quite often is to create quick test scripts comparing different ways of achieving the same thing. I wrap up each of these with the timeit function
doc timeit
and then decide which I should use, having already determined if the particular piece of code needs speeding up at all of course.
This can be very useful when you are considering, for example, comparing bsxfun to using reshape or a standard for loop or whether to use arrayfun rather than a for loop. If nothing else it furthers your understanding of these constructs and where they are useful.
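A sketch of such a timeit comparison script (the two candidates are made-up examples; local functions in scripts require a reasonably recent MATLAB):

```matlab
% Compare two ways of computing the same result using timeit.
% Each candidate is wrapped in a parameterless function handle.
x = rand(1, 1e5);

tLoop = timeit(@() sumWithLoop(x));
tVec  = timeit(@() sum(x));
fprintf('loop: %.3g s, vectorised: %.3g s\n', tLoop, tVec);

function s = sumWithLoop(x)
    % Deliberately naive loop version, for comparison only.
    s = 0;
    for k = 1:numel(x)
        s = s + x(k);
    end
end
```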
Instead of nesting structures for each experiment like this:
Result.(casename{caseIndex}).MyValue
why not just create a non-scalar structure like this:
Result(caseIndex).MyValue
which gives you access to use lots of neat inbuilt tools and functions, and would probably be a lot easier!
THANKS TO BOTH OF YOU, REALLY! As I could mark only one answer as the accepted one, although I would have liked to be allowed to accept both, I at least gave you a vote, Adam!
Stephen's answer is the more complete so the right one to accept, but if you gained something useful from my answer too then that is good :)
James Tursa
James Tursa on 9 Mar 2015
Edited: James Tursa on 9 Mar 2015
Some clarification about comments above:
"... Dynamically created fields don't require presizing when you create the struct (and they can't be since a field can contain anything)."
Assuming we are only talking about the field names here (not the field contents themselves): while they don't require pre-allocation, there is a benefit. The amount of benefit depends on the number of fields to be added. Adding field names dynamically (e.g. in a loop) causes MATLAB to re-allocate memory for the field names and add more value addresses iteratively as well ... it is the equivalent of assigning to a cell array index in a loop without pre-allocating the cell array first (cells and structs are stored very similarly internally). Since you are only copying field variable addresses each iteration, the copying overhead isn't likely to be much, but it is extra overhead that could potentially be avoided (if one knows all the field names up front).
"... You could try to create the struct upfront with all its fields already containing pre-allocated arrays, but as mentioned this is un-necessary and slower rather than faster if you are simply going to copy data over the top of those pre-sized arrays anyway."
Yes and no. If one is talking only about creating a struct with the proper field names up front, then pre-allocation does make sense and will be faster ... although the overhead savings could be quite small and negligible depending on the number of fields in question (and in fact the extra code to do this may wipe out the small savings altogether). If one is talking about pre-allocating the field elements themselves with variables (e.g., zeros), then this doesn't typically make sense as the references discuss (they get overwritten downstream anyway so the pre-allocation can be a waste of time and resources).
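A sketch of what "creating the struct with the proper field names up front" could look like when the names only become known at run time (the field names below are invented for illustration); cell2struct creates all fields in one call, leaving each of them empty:

```matlab
% Field names discovered at run time (e.g. from fieldnames() of a loaded file):
casename = {'Exp1'; 'Exp2'; 'Exp3'};

% Create a scalar structure with all field names at once; each field is [],
% so only the field list is preallocated, not the data inside it:
Result = cell2struct(cell(size(casename)), casename, 1);

% The fields already exist, so filling them does not grow the field list:
Result.Exp2 = 42;
```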
DISCLAIMER: I add these comments for clarification only. The fact is I am in agreement with others who have already posted that there are better ways to organize the data for easier and more efficient access (using dynamic field names in code is notoriously slow and limits how you can access and manipulate the data).

