Find indices to split labels according to specified proportions

Since R2021a

Syntax

``idxs = splitlabels(lblsrc,p)``
``idxs = splitlabels(lblsrc,p,'randomized')``
``idxs = splitlabels(___,Name,Value)``

Description

Use this function when you are working on a machine learning or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.

`idxs = splitlabels(lblsrc,p)` finds logical indices that split the labels in `lblsrc` based on the proportions or number of labels specified in `p`.

`idxs = splitlabels(lblsrc,p,'randomized')` randomly assigns the specified proportion of label values to each index set in `idxs`.

`idxs = splitlabels(___,Name,Value)` specifies additional input arguments using name-value pairs. For example, `'UnderlyingDatastoreIndex',3` splits the labels only in the third underlying datastore of a combined datastore.

Examples

Read William Shakespeare's sonnets with the `fileread` function. Extract all the vowels from the text and convert them to lowercase.

```
sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';
```

Count the number of instances of each vowel.

```
cnts = countlabels(vowels)
```
```
cnts=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4940     18.368
      e      9028     33.569
      i      4895     18.201
      o      5710     21.232
      u      2321     8.6302
```

Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.

```
spltn = splitlabels(vowels,[500 300]);

for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end
cntsn{:}
```
```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       500       20
      e       500       20
      i       500       20
      o       500       20
      u       500       20
```
```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       300       20
      e       300       20
      i       300       20
      o       300       20
      u       300       20
```
```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4140     18.083
      e      8228      35.94
      i      4095     17.887
      o      4910     21.447
      u      1521     6.6437
```

Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. All vowels are represented with the same weight across all three sets.

```
spltp = splitlabels(vowels,[0.5 0.3]);

for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end
cntsp{:}
```
```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      2470     18.367
      e      4514     33.566
      i      2448     18.203
      o      2855      21.23
      u      1161     8.6333
```
```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      1482     18.371
      e      2708     33.569
      i      1468     18.198
      o      1713     21.235
      u       696     8.6277
```
```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       988     18.368
      e      1806     33.575
      i       979       18.2
      o      1142     21.231
      u       464     8.6261
```

Read William Shakespeare's sonnets with the `fileread` function. Remove all nonalphabetic characters from the text and convert to lowercase.

```
sonnets = fileread("sonnets.txt");
letters = lower(sonnets(regexp(sonnets,"[A-z]")))';
```

Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.

```
type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";

T = table(letters,type,'VariableNames',["Letter" "Type"]);
head(T)
```
```
    Letter       Type
    ______    ___________

      t       "consonant"
      h       "consonant"
      e       "vowel"
      s       "consonant"
      o       "vowel"
      n       "consonant"
      n       "consonant"
      e       "vowel"
```

Display the number of instances of each category.

```
cnt = countlabels(T,'TableVariable',"Type")
```
```
cnt=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    46516    63.365
    vowel        26894    36.635
```

Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.

```
splt = splitlabels(T,0.6,'TableVariable',"Type");

sixty = countlabels(T(splt{1},:),'TableVariable',"Type")
```
```
sixty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    27910    63.366
    vowel        16136    36.634
```
```
forty = countlabels(T(splt{2},:),'TableVariable',"Type")
```
```
forty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    18606    63.363
    vowel        10758    36.637
```

Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Exclude the letter y, which sometimes acts as a consonant and sometimes as a vowel. Display the number of instances of each category.

```
splt = splitlabels(T,0.6,'Exclude',"y");

sixti = countlabels(T(splt{1},:),'TableVariable',"Type")
```
```
sixti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    26719    62.346
    vowel        16137    37.654
```
```
forti = countlabels(T(splt{2},:),'TableVariable',"Type")
```
```
forti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    17813    62.349
    vowel        10757    37.651
```

Split the table into two sets of the same size. Include only the letters e and s. Randomize the sets.

```
halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);

cnt = countlabels(T(halves{1},:))
```
```
cnt=2×3 table
    Letter    Count    Percent
    ______    _____    _______

      e       4514     64.385
      s       2497     35.615
```

Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as `A`, 30 as `B`, and 30 as `C`. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels.

```
dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);

cnt = countlabels(dsDataset,'UnderlyingDatastoreIndex',2)
```
```
cnt=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       40        40
      B       30        30
      C       30        30
```

Split the data set into two sets, one containing 60% of the numbers and the other with the rest.

```
splitIndices = splitlabels(dsDataset,0.6,'UnderlyingDatastoreIndex',2);

dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,'UnderlyingDatastoreIndex',2)
```
```
cnt1=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       24        40
      B       18        30
      C       18        30
```
```
dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,'UnderlyingDatastoreIndex',2)
```
```
cnt2=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       16        40
      B       12        30
      C       12        30
```

Input Arguments

`lblsrc`: Input label source, specified as one of these:

• A categorical vector.

• A string vector or a cell array of character vectors.

• A numeric vector or a cell array of numeric scalars.

• A logical vector or a cell array of logical scalars.

• A table with variables containing any of the previous data types.

• A datastore whose `readall` function returns any of the previous data types.

• A `CombinedDatastore` object containing an underlying datastore whose `readall` function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.

`lblsrc` must contain labels that can be converted to a vector with a discrete set of categories.

Example: ```lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"])``` creates the label source as a ten-sample categorical vector with four categories: `A`, `B`, `C`, and `D`.

Example: `lblsrc = [0 7 2 5 11 17 15 7 7 11]` creates the label source as a ten-sample numeric vector.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `logical` | `char` | `string` | `table` | `cell` | `categorical`

`p`: Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.

• If `p` is a scalar, `splitlabels` finds two splitting index sets and returns a two-element cell array in `idxs`.

  ◦ If `p` is an integer, the first element of `idxs` contains a vector of indices pointing to the first `p` values of each label category. The second element of `idxs` contains indices pointing to the remaining values of each label category.

  ◦ If `p` is a value in the range (0, 1) and `lblsrc` has Ki elements in the ith category, the first element of `idxs` contains a vector of indices pointing to the first `p` × Ki values of each label category. The second element of `idxs` contains the indices of the remaining values of each label category.

• If `p` is a vector with N elements of the form p1, p2, …, pN, `splitlabels` finds N + 1 splitting index sets and returns an (N + 1)-element cell array in `idxs`.

  ◦ If `p` is a vector of integers, the first element of `idxs` is a vector of indices pointing to the first p1 values of each label category, the next element of `idxs` contains the next p2 values of each label category, and so on. The last element in `idxs` contains the remaining indices of each label category.

  ◦ If `p` is a vector of fractions and `lblsrc` has Ki elements in the ith category, the first element of `idxs` is a vector of indices concatenating the first p1 × Ki values of each category, the next element of `idxs` contains the next p2 × Ki values of each label category, and so on. The last element in `idxs` contains the remaining indices of each label category.

Note

• If `p` contains fractions, then the sum of its elements must not be greater than one.

• If `p` contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.
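The selection rule for a vector of integers can be traced by hand on a small label set. The following is a minimal sketch in base MATLAB that only mimics the nonrandomized rule described above; it is not the implementation of `splitlabels`, and the variable names (`lbls`, `sets`, `edges`) are illustrative assumptions.

```matlab
% Ten labels: five "a", three "b", and two "c".
lbls = categorical(["a" "b" "a" "c" "a" "b" "a" "c" "b" "a"]');

% One label per category in each of the first two sets; the rest go to a
% third set. sum(p) = 2 does not exceed the smallest category count (2).
p = [1 1];

cats = categories(lbls);
sets = repmat({false(size(lbls))},numel(p)+1,1);   % one logical mask per set

for c = 1:numel(cats)
    pos = find(lbls == cats{c});          % positions of this category, in order
    edges = [0 cumsum(p) numel(pos)];     % chunk boundaries within pos
    for s = 1:numel(edges)-1
        sets{s}(pos(edges(s)+1:edges(s+1))) = true;
    end
end

find(sets{1})'   % returns 1 2 4: the first "a", "b", and "c"
find(sets{2})'   % returns 3 6 8: the second "a", "b", and "c"
find(sets{3})'   % returns 5 7 9 10: the remaining "a" and "b" labels
```

Each category is walked in its original order, so the first two sets hold every category with equal weight while the last set inherits whatever imbalance remains.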

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `'TableVariable',"AreaCode",'Exclude',["617" "508"]` specifies that the function splits labels based on telephone area code and excludes numbers from Boston and Natick.
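As a short sketch of the two calling forms described above, the following passes the same name-value arguments first as quoted, comma-separated pairs and then as `Name=Value`. The table `T` and its contents are hypothetical and exist only for illustration; the calls assume that `splitlabels` is available on the path.

```matlab
% Hypothetical table of phone records; variable names are illustrative.
T = table((1:8)',["617";"212";"508";"212";"617";"212";"508";"212"], ...
    'VariableNames',["Id" "AreaCode"]);

% Comma-separated form, with each name in quotes.
idxsA = splitlabels(T,0.5,'TableVariable',"AreaCode",'Exclude',["617" "508"]);

% Name=Value form, available in R2021a and later.
idxsB = splitlabels(T,0.5,TableVariable="AreaCode",Exclude=["617" "508"]);

% Without 'randomized', both calls describe the same split.
isequal(idxsA,idxsB)
```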

`Include`: Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in `lblsrc`. Each category in the vector or cell array must match one of the label categories in `lblsrc`.

`Exclude`: Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in `lblsrc`. Each category in the vector or cell array must match one of the label categories in `lblsrc`.

`TableVariable`: Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then `splitlabels` uses the first table variable.

`UnderlyingDatastoreIndex`: Underlying datastore index, specified as an integer scalar. This argument applies when `lblsrc` is a `CombinedDatastore` object. `splitlabels` splits the labels in the datastore at this index of the `UnderlyingDatastores` property of `lblsrc`.

Output Arguments

`idxs`: Splitting indices, returned as a cell array.

Version History

Introduced in R2021a