Speed up renamecats/categorical on multiple columns
I have a huge CSV file of about 16 GB with over 9k columns. Each column is initially filled with codes (either integer or string), and I have a code book that gives the code and its meaning for each column. What I'm trying to do is translate the table so that it ends up with readable text instead of codes.
I can use either categorical or renamecats to "translate" the columns, but looping through them takes a substantially long time. I'm wondering whether there is a way to speed this up.
See the example below:
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["123", "234", "345", "456", "567"]', ...
    'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["apple", "banana", "cat", "dog", "elephant"]', ...
    'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
    ["East", "West", "North", "South", "Middle"]', ...
    'VariableNames', {'Code', 'Meaning'});
Vars   = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
    tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
        dictionary.(Vars{iC}).Meaning);
end
Is it possible to avoid this loop, or are there any suggestions to speed this up (considering that I have over 500k rows and 9k columns)?
Thank you!
Answers (1)
Campion Loong on 9 Oct 2020
Hi Peng,
It seems you already have the Dictionary code book, and you already know which set of codes goes with which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).
In this case, why not create the table with categorical to begin with:
tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"],      dictionary.A.Code, dictionary.A.Meaning),...
            categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
            'VariableNames', {'A', 'B'});
This needs no loop, is faster, and is much more readable.
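With thousands of columns the constructor call above cannot be typed by hand, so here is a minimal sketch of one way to generalize it. It still loops over the columns, but it collects the converted columns in a cell array and builds the table once, rather than assigning back into the table on every iteration. Whether that is actually faster on the real data is an assumption that would need timing. The names `codebook`, `cols` and `translated` are made up for this example; `codebook` plays the role of the `dictionary` struct in the question.

```matlab
% Small stand-in data in the same shape as the question's example.
tbl = table(["a1"; "b2"; "c3"], ["123"; "234"; "345"], ...
    'VariableNames', {'A', 'B'});

% Hypothetical code book: one Code/Meaning table per variable name.
codebook.A = table(["a1"; "b2"; "c3"], ["apple"; "banana"; "cat"], ...
    'VariableNames', {'Code', 'Meaning'});
codebook.B = table(["123"; "234"; "345"], ["East"; "West"; "North"], ...
    'VariableNames', {'Code', 'Meaning'});

vars = tbl.Properties.VariableNames;
cols = cell(1, numel(vars));            % one converted column per cell

for k = 1:numel(vars)
    cb = codebook.(vars{k});            % code book for this column
    % Same categorical(data, valueset, catnames) call as in the question.
    cols{k} = categorical(tbl.(vars{k}), cb.Code, cb.Meaning);
end

% Build the translated table in a single constructor call.
translated = table(cols{:}, 'VariableNames', vars);
```

Wrapping the loop in tic/toc for both this version and the original one on a slice of the real file would show whether the single table construction makes a difference in practice.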
3 comments
Campion Loong on 9 Oct 2020
If you have thousands of columns, are you actually reading the data from a file or some other source? I struggle to imagine it being manageable to write the first table call by hand for thousands of columns.
If you are reading or importing, check out ImportOptions. It gives you much more flexibility before actually reading the data in.
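As a rough sketch of the ImportOptions idea (not necessarily what was meant above), the file can be imported with every variable already typed as categorical, and renamecats can then relabel the categories of each column. The tiny CSV, the `codebook` struct and the `present` mask are assumptions for illustration. The mask is there because renamecats expects the old names to be existing categories, so codes that never occur in a column are skipped.

```matlab
% Write a tiny stand-in CSV so the sketch is self-contained.
csvFile = [tempname '.csv'];
writetable(table(["a1"; "b2"; "c3"], [123; 234; 345], ...
    'VariableNames', {'A', 'B'}), csvFile);

% Hypothetical code book: one Code/Meaning table per variable name.
codebook.A = table(["a1"; "b2"; "c3"], ["apple"; "banana"; "cat"], ...
    'VariableNames', {'Code', 'Meaning'});
codebook.B = table(["123"; "234"; "345"], ["East"; "West"; "North"], ...
    'VariableNames', {'Code', 'Meaning'});

% Ask for every variable to be imported as categorical.
opts = detectImportOptions(csvFile);
opts = setvartype(opts, 'categorical');
tbl  = readtable(csvFile, opts);

vars = tbl.Properties.VariableNames;
for k = 1:numel(vars)
    cb = codebook.(vars{k});
    % Keep only the codes that actually occur as categories in this column.
    present = ismember(cb.Code, categories(tbl.(vars{k})));
    tbl.(vars{k}) = renamecats(tbl.(vars{k}), ...
        cb.Code(present), cb.Meaning(present));
end

delete(csvFile);   % clean up the temporary file
```

The loop over columns remains, but each iteration only relabels the categories of an existing categorical instead of converting the raw codes again. I would expect that to be cheaper, though it is worth timing on a subset of the real file.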