how to get rid of NaN column by column

Question

0 votos

Dear members,

I have a program that allows me to what I set out to do on a single time series. Now that I’m satisfied with that, I’ve got to roll it out on a whole bunch of time series. Say I have an indicator for many countries, where data availability differs, so lots of NaN-there are plenty of members posting answers and questions about getting rid of NaN in case of vectors, or, as far as matrices are concerned, getting rid of entire rows or columns whenever a NaN appears.

For instance:

https://uk.mathworks.com/matlabcentral/answers/179142-how-can-i-remove-nan-values-from-a-matrix
https://stackoverflow.com/questions/5202680/matlab-how-to-efficiently-remove-nan-elements-from-matrix#
http://uk.mathworks.com/matlabcentral/fileexchange/41941-snip-m-snip-elements-out-of-vectors-matrices

However, I don’t want to get rid of whole rows or columns, but simply to purge each column independently of its NaN regardless of what happens to the others. How would I go about doing this? Any ideas would be much appreciated.

I started with something like this:

 % clearvars
 data = xlsread('DMdata.xlsx','B2:AA270','Sheet1');
 datacolumns=size(data,2);
 for bycolmn =1:length(datacolumns);

And then I got stuck.

This:

data(isnan(data))= []

deletes the NaN and returns my data into one big long vector. Clearly something of this fashion ought to go in the loop above, but I'm not sure how.

Many thanks in advance

6 comentarios
Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

Fede C 2018 London el 17 de Sept. de 2017

Abrir en MATLAB Online

Hi Cedric,

Thanks for your email. Absolutely. What I am trying to do is to compute the spectral density of each time series (where each series is a country-specific indicator). That is straightforward if all I had is one series. The trouble is that I have about 30 indicators for a number of countries (the excel spreadsheet I uploaded contains one such indicator), and that is why it matters that I find a way to deal with the NaN values. So my code -in the absence of NaN values would be something along these lines:

if true
  % clearvars
   data = xlsread('DMdata.xlsx','B2:AA270','Sheet1');
   for count=1:26;
   lhnfbBLS=data(1:248,count);
   %smoothing kernel
   kr = hamming(13);
   kr = kr/sum(kr);
   % call spectrum estimation function
   [s,F] = DatSpec(lhnfbBLS,kr,1,1024);
   %%plot result for periods 4-80
   frmn = 1/80;            % lowest frequency to plot
   frmx = 1/4;             % highest frequency to plot
   fsz = 16;                   % font size for axis/legend
   tfsz = 16;                  % font size for title
   lnwd = 4;                   % line width for plots
   xtickper = [2, 4, 6, 24, 32, 40, 50, 60, 80, 128];   
   % vector of xtick mark locations
   ntick = numel(xtickper);              % number of xtick marks
   xticklab = cell(1,ntick);     % cell to hold xtick labels
   for j = 1:ntick               % for each tick location
   xticklab{j} = num2str(xtickper(j),'%i'); % label=string value of xtick mark
  end
  xtickvec = log(xtickper);    % get logs of tick marks
  tkind = find((1./xtickper>frmn)&(1./xtickper<frmx));
   xax = log(min(1./F,realmax));  % get log of periodicities  associated with frequencies
   xl = [frmn,frmx];                           % min and max of x-axis in frequencies
   xl = log(1./min(xl([2,1]),realmax));        % min and max of x-axis in log of periodicity
   figure(count)                                % open figure
   clf                                         % clear it
    h1=plot(xax,s,'-','LineWidth',lnwd,'Color','k');   % plot     data spectrum
        xlabel('Periodicity','FontName','Times','FontSize',fsz,'Interpreter','latex')
xlim(xl)
yl = ylim;
yl(1) = 0;
ylim(yl)
set(gca,'XTick',xtickvec,'XTickLabel',xticklab,'FontName','Times','FontSize',fsz,'box','on')
end;
  end

dpb el 17 de Sept. de 2017

Abrir en MATLAB Online

What is an "indicator" and what is meaning of the variables? The data in the attached spreadsheet shows

>> sum(isfinite(DM))
ans =
Columns 1 through 17
 214   203   192   230   249   189   230   230   220   108   100   108   230   249   230   230   182
Columns 18 through 26
 230   125   218   164   230   170   185   248   249
>>

so that there are no records of full length of 250 (albeit a few are close). The base methods for spectral estimation in Matlab presume uniform sampling; if the 250 observations are the time series then you'll need other techniques than simply analyzing the finite elements of each column; getting those is the least of your problems.

Fede C 2018 London el 17 de Sept. de 2017

Sorry, loose talk on my part- I meant economic indicators, i.e. just time series of variables like non-farm business hours, etc. I'm happy with what the periodogram does in the program- there's a function in there (DatSpec) that does the job. However, try as I might, my programming skills aren't good enough to do just that-analysing the finite elements of each column

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Follow Question

Answer 1

Cedric el 17 de Sept. de 2017

Editada: Cedric el 18 de Sept. de 2017

Abrir en MATLAB Online

0 votos

I think that the major problem is basic data manipulation, and once you have that right you'll be fine for the rest. Here are a few hints and we'll see how it goes from there.

Loading data : you could use the other outputs of XLSREAD; the first is the numeric part, the second the text, and the third raw data (a mix of text and numbers).

[num,txt,raw] = xlsread( 'DMdata.xls' ) ;

If we look at what we get with that, indexing a 3 by 6 block (so we see the top-left part of the output of XLSREAD):

 >> raw(1:3,1:6)
 ans =
  3×6 cell array
  Columns 1 through 4
    [    NaN]    'AU UNEMPLOYMENT R…'    'AU UNEMPLOYMENT:R…'    'OE UNEMPLOYMENT R…'
    'Q2 1950'    'NA'                     'NA'                     'NA'                 
    'Q3 1950'    'NA'                     'NA'                     'NA'                 
Columns 5 through 6
  'BG UNEMPLOYMENT R…'    'CN UNEMPLOYMENT:R…'
  'NA'                     'NA'                 
  'NA'                     'NA'

We see that in cell array raw we have everything and the NA values are strings. Doing the same thing for the num array, we get:

 >> num(1:3,1:6)
 ans =
       NaN       NaN       NaN       NaN    4.8000       NaN
       NaN       NaN       NaN       NaN    4.4300       NaN
       NaN       NaN       NaN       NaN    4.0700       NaN

Well, that's interesting: it appears that the first row doesn't correspond to Q2 1950 but to Q1 1995, so XLSREAD outputs in fact the smallest block that encompasses all numeric values. Good to know, because if you need to associate a date to each row, you'll have to "sync" with what XLSREAD does if you use its numeric output.

Processing row and column headers : taking a little detour, let's imagine that we need the dates and the column headers. The cell array of all dates is column 1 of cell array raw, elements 2 to the end: raw(2:end,1). If we need the quarter IDs and the years as distinct numbers, many approaches are possible. I propose the following: we can use SSCANF to read/convert numbers from a string; let's see how it works on the first date entry: raw{2,1}:

 >> sscanf( raw{2,1}, 'Q%d %d' )
 ans =
           2
        1950

Simple, now we need to see if we can apply this to each element of raw(2:end,1). We could loop, or use CELLFUN as follows (more concise):

 >> temp = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;

where

@(x) sscanf( x, 'Q%d %d' ) : is an anonymous function of x. It is almost the same as creating an M-File: function numValues = myScan( x ), numValues = sscanf( x, 'Q%d %d' ) ; end and passing its handle to CELLFUN (handles are roughly "function pointers" in MATLAB, they are one way to pass a function to another function), with the advantage that we can declare it inline with no name.
raw(2:end,1) : cell array on which CELLFUN must operate. As a reminder, {} indexing accesses the content of a cell, and () indexing is normal block indexing: it returns the (sub-) cell array indexed (and not the content of cells).
'UniformOutput', false : tell CELLFUN to output a cell array of outputs of the anonymous function, which is required because the function outputs an array (e.g. [2; 1950] and not a scalar). You'll have to experiment a little with CELLFUN to see what are its limitations.

Now we can for example concatenate the output and create a vertical numeric array out of it:

 >> dates = horzcat( temp{:} ).' ;
 dates =
           2        1950
           3        1950
           4        1950
           1        1951
           ...

We can also extract the column headers, which are stored in the first row of raw, elements 2 to the end: raw(1,2:end). This could be useful for creating legends, labels, etc.

 >> headers = raw(1,2:end)
 headers =
  1×26 cell array
    'AU UNEMPLOYMENT R…'    'AU UNEMPLOYMENT:R…'    'OE UNEMPLOYMENT R…'    ...

This is interesting, because now we have all the dates, all the column headers, and we could work without hard coding boundaries like in your code.

 >> nCols  = numel( headers ) ;
 >> nDates = size( dates, 1 ) ;

which can be used when defining loop boundaries:

 for colId = 1 : nCols
     ...
 end

Using dates, we can also build vectors of logicals for extracting relevant data, e.g. for a given period:

 >> selectYears = dates(:,2) >= 1951 & dates(:,2) <= 1953
 selectYears =
  269×1 logical array
   0
   0
   0
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   1
   0
   0
   0
   0
  ...

Processing data : we saw that the numeric array num is defined in a way that can be difficult to use, as XLSREAD truncates it to the smallest block that encompasses all numeric values. For this reason I generally prefer working with raw data. Assuming that you just want raw data, we can update the first call to XLSREAD by the way:

[~,~,raw] = xlsread( 'DMdata.xls' ) ;

so we don't have num and txt that we won't use. Working with raw, the data are in the block raw(2:end,2:end). Let's see how it looks at a place with a few data and a few 'NA':

 >> raw(19:22,5:7)
 ans =
  4×3 cell array
    'NA'    'NA'        'NA'
    'NA'    'NA'        'NA'
    'NA'    [4.8000]    'NA'
    'NA'    [4.4300]    'NA'

we see that it contains strings and numbers. We can check it:

 >> class( raw{21,6} )
 ans =
  double
 >> class( raw{21,7} )
 ans =
  char

There are multiple ways to proceed here; I propose to replace all entries of type/class char by NaN, and then to convert the cell array into a numeric array. To detect if a cell content is char we can use ISCHAR. We also saw how to apply a function to all elements of a cell array:

 >> temp = raw(2:end,2:end) ;
 >> selectChar = cellfun( @ischar, temp ) ;

where

@ischar is a shortcut for @(x)ischar(x) (there is a difference but it doesn't matter now).
There is no need for the 'UniformOutput' argument, because ISCHAR outputs a scalar.

We can see that selectChar is a 269x26 array of logical

 >> whos selectChar
  Name              Size            Bytes  Class      Attributes
  selectChar      269x26             6994  logical

that we can use to index (logical indexing) cell array temp for a replacement:

>> temp(selectChar) = {NaN} ;

Now we see what 'NA' strings were replaced with numeric NaN:

 >> temp(18:21,4:6)
 ans =
  4×3 cell array
    [NaN]    [   NaN]    [NaN]
    [NaN]    [   NaN]    [NaN]
    [NaN]    [4.8000]    [NaN]
    [NaN]    [4.4300]    [NaN]

and we can convert this cell array to a numeric array:

>> data = cell2mat( temp ) ;

If you display data, you will see that it contains all the rows and columns present in your Excel file and that there is no truncation.

Summary: I developed a lot so it is clear, but the whole thing until here can be condensed into:

 [~,~,raw] = xlsread( 'DMdata.xls' ) ;
 dates     = cellfun( @(x) sscanf( x, 'Q%d %d' ), raw(2:end,1), 'UniformOutput', false ) ;
 dates     = horzcat( dates{:} ).' ;
 headers   = raw(1,2:end) ;
 data      = raw(2:end,2:end) ;             % Cell array mixed str/values.
 data(cellfun( @ischar, data )) = {NaN} ;   % Cell array NaN/values
 data      = cell2mat( data ) ;             % Numeric array.
 [nDates, nCols] = size( data ) ;

which outputs the following when run from scratch:

 >> whos
    Name             Size             Bytes  Class     Attributes
    data           269x26             55952  double              
    dates          269x2               4304  double              
    headers          1x26              5050  cell                
    nCols            1x1                  8  double              
    nDates           1x1                  8  double              
    raw            270x27            871220  cell

This approach has the advantage that it outputs "clean" arrays, whose size do match: dates has 269 rows that correspond to the 269 rows of data, headers has 26 columns that correspond to the 26 columns of data, there was no truncation, and we know the size of everything programmatically (no need for hard coding "weird" boundaries).

Further processing: now we can work with this: first, many functions can work with NaN and/or have a special version and/or special parameters that make them support NaNs. However, if we needed to extract non-NaN values, e.g. from column 3, we could proceed as follows:

 select = ~isnan( data(:,3) ) ;   % ~ = NOT, can be used for logical indexing
 values = data(select,3) ;

or directly

values = data(~isnan( data(:,3) ),3) ;

This can easily be used in a loop over columns:

 for colId = 1 : nCols
    values = data(~isnan( data(:,colId) ),colId) ;
    ... do something with values
 end

Note that if we needed corresponding dates and/or quarters, we could easily get them as well:

 select = ~isnan( data(:,3) ) ;
 values = data(select,3) ;
 datesForValues = dates(select,:) ;

or again in a loop, using the relevant header for a title:

 for colId = 1 : nCols
    select = ~isnan( data(:,colId ) ) ;
    values = data(select,colId ) ;
    datesForValues = dates(select,:) ;
    results = myAnalyzer( dates(select,:), values ) ;
    plot( ... ) ;
    title( headers{colId} ) ;
 end

My guess is that if you go over these explanations, understand logical indexing, how to operate on the various types of arrays, etc, your problem will disappear!

Cheers, Cedric

3 comentarios
Mostrar 1 comentario más antiguo Ocultar 1 comentario más antiguo

Cedric el 17 de Sept. de 2017

My pleasure. I edited it 18 minutes ago, so be sure to reload the page ;-)

Cedric el 17 de Sept. de 2017

Editada: Cedric el 17 de Sept. de 2017

Corrected a few typos. Updated a bit the code (in the summary) to simplify.

Iniciar sesión para comentar.

Answer 2

dpb el 17 de Sept. de 2017

Editada: dpb el 18 de Sept. de 2017

Abrir en MATLAB Online

0 votos

DM=xlsread('DMdata.xlsx)';
for i=1:size(DM,2)
  data=isfinite(DM(:,i));
  % do whatever with each column here; _data_ holds the non-NaN data for each column in turn
  ...

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Iniciar sesión para comentar.

how to get rid of NaN column by column

6 comentarios
Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

Respuesta aceptada

3 comentarios
Mostrar 1 comentario más antiguo Ocultar 1 comentario más antiguo

Más respuestas (1)

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Categorías

Etiquetas

Community Treasure Hunt

how to get rid of NaN column by column

6 comentarios Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

Respuesta aceptada

3 comentarios Mostrar 1 comentario más antiguo Ocultar 1 comentario más antiguo

Más respuestas (1)

0 comentarios Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos

Categorías

Etiquetas

Ver también

Community Treasure Hunt

6 comentarios
Mostrar 4 comentarios más antiguos Ocultar 4 comentarios más antiguos

3 comentarios
Mostrar 1 comentario más antiguo Ocultar 1 comentario más antiguo

0 comentarios
Mostrar -2 comentarios más antiguos Ocultar -2 comentarios más antiguos