splitlabels
Syntax
Description
Use this function when you are working on a machine or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.
idxs = splitlabels(___,Name,Value) specifies additional input arguments using name-value arguments. For example, 'UnderlyingDatastoreIndex',3 splits the labels only in the third underlying datastore of a combined datastore.
Examples
Read William Shakespeare's sonnets with the fileread function. Extract all the vowels from the text and convert them to lowercase.
sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';
Count the number of instances of each vowel.
cnts = countlabels(vowels)
cnts=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      4940     18.368 
      e      9028     33.569 
      i      4895     18.201 
      o      5710     21.232 
      u      2321     8.6302 
Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.
spltn = splitlabels(vowels,[500 300]);
for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end
cntsn{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       500       20   
      e       500       20   
      i       500       20   
      o       500       20   
      u       500       20   
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       300       20   
      e       300       20   
      i       300       20   
      o       300       20   
      u       300       20   
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      4140     18.083 
      e      8228      35.94 
      i      4095     17.887 
      o      4910     21.447 
      u      1521     6.6437 
Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. All vowels are represented with the same weight across all three sets.
spltp = splitlabels(vowels,[0.5 0.3]);
for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end
cntsp{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      2470     18.367 
      e      4514     33.566 
      i      2448     18.203 
      o      2855      21.23 
      u      1161     8.6333 
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      1482     18.371 
      e      2708     33.569 
      i      1468     18.198 
      o      1713     21.235 
      u       696     8.6277 
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       988     18.368 
      e      1806     33.575 
      i       979       18.2 
      o      1142     21.231 
      u       464     8.6261 
Read William Shakespeare's sonnets with the fileread function. Remove all nonalphabetic characters from the text and convert to lowercase.
sonnets = fileread("sonnets.txt");
letters = lower(sonnets(regexp(sonnets,"[A-z]")))';
Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.
type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";
T = table(letters,type,'VariableNames',["Letter" "Type"]);
head(T)
    Letter       Type    
    ______    ___________
      t       "consonant"
      h       "consonant"
      e       "vowel"    
      s       "consonant"
      o       "vowel"    
      n       "consonant"
      n       "consonant"
      e       "vowel"    
Display the number of instances of each category.
cnt = countlabels(T,'TableVariable',"Type")
cnt=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    46516    63.365 
    vowel        26894    36.635 
Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.
splt = splitlabels(T,0.6,'TableVariable',"Type");
sixty = countlabels(T(splt{1},:),'TableVariable',"Type")
sixty=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    27910    63.366 
    vowel        16136    36.634 
forty = countlabels(T(splt{2},:),'TableVariable',"Type")
forty=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    18606    63.363 
    vowel        10758    36.637 
Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Exclude the letter y, which sometimes acts as a consonant and sometimes as a vowel. Display the number of instances of each category.
splt = splitlabels(T,0.6,'Exclude',"y");
sixti = countlabels(T(splt{1},:),'TableVariable',"Type")
sixti=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    26719    62.346 
    vowel        16137    37.654 
forti = countlabels(T(splt{2},:),'TableVariable',"Type")
forti=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    17813    62.349 
    vowel        10757    37.651 
Split the table into two sets of the same size. Include only the letters e and s. Randomize the sets.
halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);
cnt = countlabels(T(halves{1},:))
cnt=2×3 table
    Letter    Count    Percent
    ______    _____    _______
      e       4514     64.385 
      s       2497     35.615 
Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as A, 30 as B, and 30 as C. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels. 
dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); ...
    repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);
cnt = countlabels(dsDataset,UnderlyingDatastoreIndex=2)
cnt=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       40        40   
      B       30        30   
      C       30        30   
Split the data set into two sets, one containing 60% of the numbers and the other with the rest.
splitIndices = splitlabels(dsDataset,0.6,UnderlyingDatastoreIndex=2);
dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,UnderlyingDatastoreIndex=2)
cnt1=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       24        40   
      B       18        30   
      C       18        30   
dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,UnderlyingDatastoreIndex=2)
cnt2=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       16        40   
      B       12        30   
      C       12        30   
Input Arguments
lblsrc: Input label source, specified as one of these:
- A categorical vector. 
- A string vector or a cell array of character vectors. 
- A numeric vector or a cell array of numeric scalars. 
- A logical vector or a cell array of logical scalars. 
- A table with variables containing any of the previous data types. 
- A datastore whose readall function returns any of the previous data types.
- A CombinedDatastore object containing an underlying datastore whose readall function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.
lblsrc must contain labels that can be converted to a vector with a discrete set of categories.
Example: lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"]) creates the label source as a ten-sample categorical vector with four categories: A, B, C, and D.
Example: lblsrc = [0 7 2 5 11 17 15 7 7 11] creates the label source as a ten-sample numeric vector.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string | table | cell | categorical
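The label source types listed above can be passed to splitlabels in the same way. The following is a minimal sketch with made-up data; the variable names (lblCat, lblCell, lblTbl) and the table variable names Value and Label are illustrative assumptions, not part of the original examples.

```matlab
% Three label sources holding the same six labels (illustrative data).
lblCat  = categorical(["A" "B" "A" "B" "A" "B"])';   % categorical vector
lblCell = {'A';'B';'A';'B';'A';'B'};                 % cell array of character vectors
lblTbl  = table((1:6)',lblCat,'VariableNames',["Value" "Label"]);

% Put two instances of each category in the first set and the rest in the second.
idxCat  = splitlabels(lblCat,2);
idxCell = splitlabels(lblCell,2);

% For a table, name the variable that holds the labels.
idxTbl  = splitlabels(lblTbl,2,'TableVariable',"Label");

% Number of indices in each of the two sets.
cellfun(@numel,idxCat)
```

Because the labels sit in the second table variable here, the sketch names it explicitly with 'TableVariable' instead of relying on the default.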
p: Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.
- If p is a scalar, splitlabels finds two splitting index sets and returns a two-element cell array in idxs.
  - If p is an integer, the first element of idxs contains a vector of indices pointing to the first p values of each label category. The second element of idxs contains indices pointing to the remaining values of each label category.
  - If p is a value in the range (0, 1) and lblsrc has Ki elements in the ith category, the first element of idxs contains a vector of indices pointing to the first p × Ki values of each label category. The second element of idxs contains the indices of the remaining values of each label category.
- If p is a vector with N elements of the form p1, p2, …, pN, splitlabels finds N + 1 splitting index sets and returns an (N + 1)-element cell array in idxs.
  - If p is a vector of integers, the first element of idxs is a vector of indices pointing to the first p1 values of each label category, the next element of idxs contains the next p2 values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.
  - If p is a vector of fractions and lblsrc has Ki elements of the ith category, the first element of idxs is a vector of indices concatenating the first p1 × Ki values of each category, the next element of idxs contains the next p2 × Ki values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.

Note
- If p contains fractions, then the sum of its elements must not be greater than one.
- If p contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
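The difference between the integer and fractional forms of p can be seen on a small balanced label vector. This is a sketch under assumed data: the label vector lbl and the output names byCount and byFrac are hypothetical, and the sizes are chosen so that every fraction of a category is a whole number.

```matlab
% Illustrative labels: 10 instances each of two categories.
lbl = categorical([repmat("A",10,1); repmat("B",10,1)]);

% Integer scalar: 4 of each category in the first set,
% the remaining 6 of each category in the second set.
byCount = splitlabels(lbl,4);

% Vector of fractions: 50% and 30% of each category,
% with the remaining 20% in a third set.
byFrac = splitlabels(lbl,[0.5 0.3]);

% Inspect the per-category counts of two of the resulting sets.
countlabels(lbl(byCount{1}))
countlabels(lbl(byFrac{3}))
```

A two-element p always yields three index sets, so byFrac has one more element than byCount.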
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    
Example: 'TableVariable',"AreaCode",'Exclude',["617" "508"] specifies that the function splits labels based on telephone area code and excludes numbers from Boston and Natick.
Include: Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.
Exclude: Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.
TableVariable: Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then splitlabels uses the first table variable.
UnderlyingDatastoreIndex: Underlying datastore index, specified as an integer scalar. This argument applies when lblsrc is a CombinedDatastore object. splitlabels splits the labels in the datastore obtained using the UnderlyingDatastores property of lblsrc.
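The name-value arguments can be combined with the 'randomized' option. The sketch below assumes a small made-up table with hypothetical variables Id and AreaCode, and it fixes the random seed with rng so that the randomized split is repeatable.

```matlab
rng("default")  % fix the random seed so the randomized split is repeatable

% Illustrative table: 12 rows, three area codes with 4 rows each.
T = table((1:12)',repmat(["617";"508";"212"],4,1), ...
    'VariableNames',["Id" "AreaCode"]);

% Randomized 50/50 split on area code that leaves out the "212" rows.
idxs = splitlabels(T,0.5,'randomized', ...
    'TableVariable',"AreaCode",'Exclude',"212");

% Count the area codes that ended up in the first set.
countlabels(T(idxs{1},:),'TableVariable',"AreaCode")
```

Excluded categories appear in none of the index sets, so the two sets together cover only the "617" and "508" rows.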
Output Arguments
idxs: Splitting indices, returned as a cell array. Each element is a vector of indices into lblsrc for one of the split sets.
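The returned index vectors can be used directly to slice both the data and the labels. A minimal sketch, assuming hypothetical data X and labels lbl stored in matching order:

```matlab
% Illustrative data: 20 observations with two label categories.
X   = randn(20,1);
lbl = categorical([repmat("A",12,1); repmat("B",8,1)]);

% Training (50%), validation (25%), and testing (remainder) indices.
idxs = splitlabels(lbl,[0.5 0.25]);

XTrain = X(idxs{1});  lblTrain = lbl(idxs{1});
XVal   = X(idxs{2});  lblVal   = lbl(idxs{2});
XTest  = X(idxs{3});  lblTest  = lbl(idxs{3});

% With no categories included or excluded, the three sets
% together should cover every observation exactly once.
allIdx = sort([idxs{1}(:); idxs{2}(:); idxs{3}(:)]);
isequal(allIdx,(1:20)')
```

Indexing the data and the labels with the same element of idxs keeps each observation paired with its label.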
Version History
Introduced in R2021a
See Also
countlabels (Signal Processing Toolbox) | filenames2labels (Signal Processing Toolbox) | folders2labels (Signal Processing Toolbox)