how to get rid of NaN column by column

Dear members,
I have a program that allows me to what I set out to do on a single time series. Now that I’m satisfied with that, I’ve got to roll it out on a whole bunch of time series. Say I have an indicator for many countries, where data availability differs, so lots of NaN-there are plenty of members posting answers and questions about getting rid of NaN in case of vectors, or, as far as matrices are concerned, getting rid of entire rows or columns whenever a NaN appears.
For instance:
However, I don’t want to get rid of whole rows or columns, but simply to purge each column independently of its NaN regardless of what happens to the others. How would I go about doing this? Any ideas would be much appreciated.
I started with something like this:
% clearvars
data = xlsread('DMdata.xlsx','B2:AA270','Sheet1');
datacolumns=size(data,2);
for bycolmn =1:length(datacolumns);
And then I got stuck.
This:
data(isnan(data))= []
deletes the NaN and returns my data into one big long vector. Clearly something of this fashion ought to go in the loop above, but I'm not sure how.
Many thanks in advance

6 comentarios

dpb
dpb el 17 de Sept. de 2017
How are these data stored? If in 2D array or one of the class structures TMW has included in Matlab such as table or timeseries you can't have different-length series in different columns; they must be rectangular arrays.
You could save each column as a cell in a cell array but is probably more trouble in using the result for analysis later on than simply using the NaN as missing value indicator as is.
Need to know more details to respond in more detail on specifics...
Fede C 2018 London
Fede C 2018 London el 17 de Sept. de 2017
Hi,
It's a two dimensional array-see attachment.
Cedric
Cedric el 17 de Sept. de 2017
There is no "empty placeholder" in numeric arrays. You cannot remove NaN entries without loosing the structure of the numeric array. Other data structures, such as cell arrays, can store empty arrays, but you cannot compute with cell arrays.
What you need to define before we can answer, is what you want to do with the data.
Hi Cedric,
Thanks for your email. Absolutely. What I am trying to do is to compute the spectral density of each time series (where each series is a country-specific indicator). That is straightforward if all I had is one series. The trouble is that I have about 30 indicators for a number of countries (the excel spreadsheet I uploaded contains one such indicator), and that is why it matters that I find a way to deal with the NaN values. So my code -in the absence of NaN values would be something along these lines:
if true
% clearvars
data = xlsread('DMdata.xlsx','B2:AA270','Sheet1');
for count=1:26;
lhnfbBLS=data(1:248,count);
%smoothing kernel
kr = hamming(13);
kr = kr/sum(kr);
% call spectrum estimation function
[s,F] = DatSpec(lhnfbBLS,kr,1,1024);
%%plot result for periods 4-80
frmn = 1/80; % lowest frequency to plot
frmx = 1/4; % highest frequency to plot
fsz = 16; % font size for axis/legend
tfsz = 16; % font size for title
lnwd = 4; % line width for plots
xtickper = [2, 4, 6, 24, 32, 40, 50, 60, 80, 128];
% vector of xtick mark locations
ntick = numel(xtickper); % number of xtick marks
xticklab = cell(1,ntick); % cell to hold xtick labels
for j = 1:ntick % for each tick location
xticklab{j} = num2str(xtickper(j),'%i'); % label=string value of xtick mark
end
xtickvec = log(xtickper); % get logs of tick marks
tkind = find((1./xtickper>frmn)&(1./xtickper<frmx));
xax = log(min(1./F,realmax)); % get log of periodicities associated with frequencies
xl = [frmn,frmx]; % min and max of x-axis in frequencies
xl = log(1./min(xl([2,1]),realmax)); % min and max of x-axis in log of periodicity
figure(count) % open figure
clf % clear it
h1=plot(xax,s,'-','LineWidth',lnwd,'Color','k'); % plot data spectrum
xlabel('Periodicity','FontName','Times','FontSize',fsz,'Interpreter','latex')
xlim(xl)
yl = ylim;
yl(1) = 0;
ylim(yl)
set(gca,'XTick',xtickvec,'XTickLabel',xticklab,'FontName','Times','FontSize',fsz,'box','on')
end;
end
dpb
dpb el 17 de Sept. de 2017
What is an "indicator" and what is meaning of the variables? The data in the attached spreadsheet shows
>> sum(isfinite(DM))
ans =
Columns 1 through 17
214 203 192 230 249 189 230 230 220 108 100 108 230 249 230 230 182
Columns 18 through 26
230 125 218 164 230 170 185 248 249
>>
so that there are no records of full length of 250 (albeit a few are close). The base methods for spectral estimation in Matlab presume uniform sampling; if the 250 observations are the time series then you'll need other techniques than simply analyzing the finite elements of each column; getting those is the least of your problems.
Fede C 2018 London
Fede C 2018 London el 17 de Sept. de 2017
Sorry, loose talk on my part- I meant economic indicators, i.e. just time series of variables like non-farm business hours, etc. I'm happy with what the periodogram does in the program- there's a function in there (DatSpec) that does the job. However, try as I might, my programming skills aren't good enough to do just that-analysing the finite elements of each column

Iniciar sesión para comentar.

 Respuesta aceptada

Cedric
Cedric el 17 de Sept. de 2017
Editada: Cedric el 18 de Sept. de 2017
I think that the major problem is basic data manipulation, and once you have that right you'll be fine for the rest. Here are a few hints and we'll see how it goes from there.
Loading data : you could use the other outputs of XLSREAD; the first is the numeric part, the second the text, and the third raw data (a mix of text and numbers).
[num,txt,raw] = xlsread( 'DMdata.xls' ) ;
If we look at what we get with that, indexing a 3 by 6 block (so we see the top-left part of the output of XLSREAD):
>> raw(1:3,1:6)
ans =
3×6 cell array
Columns 1 through 4
[ NaN] 'AU UNEMPLOYMENT R…' 'AU UNEMPLOYMENT:R…' 'OE UNEMPLOYMENT R…'
'Q2 1950' 'NA' 'NA' 'NA'
'Q3 1950' 'NA' 'NA' 'NA'
Columns 5 through 6
'BG UNEMPLOYMENT R…' 'CN UNEMPLOYMENT:R…'
'NA' 'NA'
'NA' 'NA'
We see that in cell array raw we have everything and the NA values are strings. Doing the same thing for the num array, we get:
>> num(1:3,1:6)
ans =
NaN NaN NaN NaN 4.8000 NaN
NaN NaN NaN NaN 4.4300 NaN
NaN NaN NaN NaN 4.0700 NaN
Well, that's interesting: it appears that the first row doesn't correspond to Q2 1950 but to Q1 1995, so XLSREAD outputs in fact the smallest block that encompasses all numeric values. Good to know, because if you need to associate a date to each row, you'll have to "sync" with what XLSREAD does if you use its numeric output.
Processing row and column headers : taking a little detour, let's imagine that we need the dates and the column headers. The cell array of all dates is column 1 of cell array raw, elements 2 to the end: raw(2:end,1). If we need the quarter IDs and the years as distinct numbers, many approaches are possible. I propose the following: we can use SSCANF to read/convert numbers from a string; let's see how it works on the first date entry: raw{2,1}:
>> sscanf( raw{2,1}, 'Q%d %d' )
ans =
2
1950
Simple, now we need to see if we can apply this to each element of raw(2:end,1). We could loop, or use CELLFUN as follows (more concise):
>> temp = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;
where
  • @(x) sscanf( x, 'Q%d %d' ) : is an anonymous function of x. It is almost the same as creating an M-File: function numValues = myScan( x ), numValues = sscanf( x, 'Q%d %d' ) ; end and passing its handle to CELLFUN (handles are roughly "function pointers" in MATLAB, they are one way to pass a function to another function), with the advantage that we can declare it inline with no name.
  • raw(2:end,1) : cell array on which CELLFUN must operate. As a reminder, {} indexing accesses the content of a cell, and () indexing is normal block indexing: it returns the (sub-) cell array indexed (and not the content of cells).
  • 'UniformOutput', false : tell CELLFUN to output a cell array of outputs of the anonymous function, which is required because the function outputs an array (e.g. [2; 1950] and not a scalar). You'll have to experiment a little with CELLFUN to see what are its limitations.
Now we can for example concatenate the output and create a vertical numeric array out of it:
>> dates = horzcat( temp{:} ).' ;
dates =
2 1950
3 1950
4 1950
1 1951
...
We can also extract the column headers, which are stored in the first row of raw, elements 2 to the end: raw(1,2:end). This could be useful for creating legends, labels, etc.
>> headers = raw(1,2:end)
headers =
1×26 cell array
'AU UNEMPLOYMENT R…' 'AU UNEMPLOYMENT:R…' 'OE UNEMPLOYMENT R…' ...
This is interesting, because now we have all the dates, all the column headers, and we could work without hard coding boundaries like in your code.
>> nCols = numel( headers ) ;
>> nDates = size( dates, 1 ) ;
which can be used when defining loop boundaries:
for colId = 1 : nCols
...
end
Using dates, we can also build vectors of logicals for extracting relevant data, e.g. for a given period:
>> selectYears = dates(:,2) >= 1951 & dates(:,2) <= 1953
selectYears =
269×1 logical array
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
...
Processing data : we saw that the numeric array num is defined in a way that can be difficult to use, as XLSREAD truncates it to the smallest block that encompasses all numeric values. For this reason I generally prefer working with raw data. Assuming that you just want raw data, we can update the first call to XLSREAD by the way:
[~,~,raw] = xlsread( 'DMdata.xls' ) ;
so we don't have num and txt that we won't use. Working with raw, the data are in the block raw(2:end,2:end). Let's see how it looks at a place with a few data and a few 'NA':
>> raw(19:22,5:7)
ans =
4×3 cell array
'NA' 'NA' 'NA'
'NA' 'NA' 'NA'
'NA' [4.8000] 'NA'
'NA' [4.4300] 'NA'
we see that it contains strings and numbers. We can check it:
>> class( raw{21,6} )
ans =
double
>> class( raw{21,7} )
ans =
char
There are multiple ways to proceed here; I propose to replace all entries of type/class char by NaN, and then to convert the cell array into a numeric array. To detect if a cell content is char we can use ISCHAR. We also saw how to apply a function to all elements of a cell array:
>> temp = raw(2:end,2:end) ;
>> selectChar = cellfun( @ischar, temp ) ;
where
  • @ischar is a shortcut for @(x)ischar(x) (there is a difference but it doesn't matter now).
  • There is no need for the 'UniformOutput' argument, because ISCHAR outputs a scalar.
We can see that selectChar is a 269x26 array of logical
>> whos selectChar
Name Size Bytes Class Attributes
selectChar 269x26 6994 logical
that we can use to index (logical indexing) cell array temp for a replacement:
>> temp(selectChar) = {NaN} ;
Now we see what 'NA' strings were replaced with numeric NaN:
>> temp(18:21,4:6)
ans =
4×3 cell array
[NaN] [ NaN] [NaN]
[NaN] [ NaN] [NaN]
[NaN] [4.8000] [NaN]
[NaN] [4.4300] [NaN]
and we can convert this cell array to a numeric array:
>> data = cell2mat( temp ) ;
If you display data, you will see that it contains all the rows and columns present in your Excel file and that there is no truncation.
Summary: I developed a lot so it is clear, but the whole thing until here can be condensed into:
[~,~,raw] = xlsread( 'DMdata.xls' ) ;
dates = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;
dates = horzcat( dates{:} ).' ;
headers = raw(1,2:end) ;
data = raw(2:end,2:end) ; % Cell array mixed str/values.
data(cellfun( @ischar, data )) = {NaN} ; % Cell array NaN/values
data = cell2mat( data ) ; % Numeric array.
[nDates, nCols] = size( data ) ;
which outputs the following when run from scratch:
>> whos
Name Size Bytes Class Attributes
data 269x26 55952 double
dates 269x2 4304 double
headers 1x26 5050 cell
nCols 1x1 8 double
nDates 1x1 8 double
raw 270x27 871220 cell
This approach has the advantage that it outputs "clean" arrays, whose size do match: dates has 269 rows that correspond to the 269 rows of data, headers has 26 columns that correspond to the 26 columns of data, there was no truncation, and we know the size of everything programmatically (no need for hard coding "weird" boundaries).
Further processing: now we can work with this: first, many functions can work with NaN and/or have a special version and/or special parameters that make them support NaNs. However, if we needed to extract non-NaN values, e.g. from column 3, we could proceed as follows:
select = ~isnan( data(:,3) ) ; % ~ = NOT, can be used for logical indexing
values = data(select,3) ;
or directly
values = data(~isnan( data(:,3) ),3) ;
This can easily be used in a loop over columns:
for colId = 1 : nCols
values = data(~isnan( data(:,colId) ),colId) ;
... do something with values
end
Note that if we needed corresponding dates and/or quarters, we could easily get them as well:
select = ~isnan( data(:,3) ) ;
values = data(select,3) ;
datesForValues = dates(select,:) ;
or again in a loop, using the relevant header for a title:
for colId = 1 : nCols
select = ~isnan( data(:,colId ) ) ;
values = data(select,colId ) ;
datesForValues = dates(select,:) ;
results = myAnalyzer( dates(select,:), values ) ;
plot( ... ) ;
title( headers{colId} ) ;
end
My guess is that if you go over these explanations, understand logical indexing, how to operate on the various types of arrays, etc, your problem will disappear!
Cheers, Cedric

3 comentarios

Fede C 2018 London
Fede C 2018 London el 17 de Sept. de 2017
Amazing. I'm gonna go through it now- thanks ever so much Cedric, I really appreciate that.
Cedric
Cedric el 17 de Sept. de 2017
My pleasure. I edited it 18 minutes ago, so be sure to reload the page ;-)
Cedric
Cedric el 17 de Sept. de 2017
Editada: Cedric el 17 de Sept. de 2017
Corrected a few typos. Updated a bit the code (in the summary) to simplify.

Iniciar sesión para comentar.

Más respuestas (1)

dpb
dpb el 17 de Sept. de 2017
Editada: dpb el 18 de Sept. de 2017
DM=xlsread('DMdata.xlsx)';
for i=1:size(DM,2)
data=isfinite(DM(:,i));
% do whatever with each column here; _data_ holds the non-NaN data for each column in turn
...

Preguntada:

el 16 de Sept. de 2017

Editada:

dpb
el 18 de Sept. de 2017

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by