function with multiple datasets
Mostrar comentarios más antiguos
I have been asked to make af function which calculates the total and average tax payed in 15 different states. My data is imported from excel and looks like:
1 (5000x4 cell)
'Serialnumber' 'Name' 'Gender' 'State'
'4654194314624' 'Sara ' 'M' [ 14.00]
'4565196234864' 'Michael Sims ' 'M' [ 4.00]
'6548197914565' 'Mazon Watt ' 'M' [ 7.00]
2 (5000x2 cell)
'Serialnumber' 'Income'
'4567456766023' [ 76500.00]
'2567467775130' [ 90750.00]
3 (16x4 cell):
'State' 'lower tax' 'tax limit' 'higher tax'
'7' [27.00] [ 55900.00] [51.00]
'8' [25.00] [ 55100.00] [45.00]
'9' [26.00] [ 56000.00] [45.00]
'10' [28.00] [ 54000.00] [43.00]
To calculate the average and total tax i need all three files, but i haven't been able to see through it yet. I thought about creating a new cell containing all relevant data. I also tried the following, but that would only give me the total for a single state. Any simple ways to go about it?
for i=2:length(Income)
if strcmp(Income(i,1),Citizens(i,1))
personalIncome = Income{i,2};
region = Citizens{i,4};
end
if Citizens{i,4}==1
personalincome=Income{i,2};
incomesum=incomesum+personalincome;
people=people+Citizens{i,4};
end
for v=2:length(Taxes)
if strcmp(Taxes(v,1),num2str(region))
tax1=Taxes{v,2};
tax2lim=Taxes{v,3};
tax2=Taxes{v,4};
end
end
if incomesum<=Taxes{v,3}
taxesPaid = incomesum*tax1/100;
else
taxesPaid = taxesPaid+((tax2lim*tax1/100)+((personalincome-tax2lim)*tax2/100))
end
tax=incomesum/people*taxesPaid;
totaltax=totaltax+taxesPaid;
6 comentarios
dpb
el 29 de Mayo de 2017
I've given you two alternate paths to progress and much background discussion on how to normalize your data to rid it of the mixed data types across the tables (and the reasons therefore) at the previous query.
Unfortunately, you don't seem to be willing to heed such advice to "simplify, simplify" as suggested which would serve to make computing the above and other questions bound to arise directly computable across whatever section of the dataset desired.
Either of those two techniques would serve; the table data class as illustrated would be the better and even more so over the previous example as it supports grouping variables with varfun precisely for such computations.
Chriss
el 29 de Mayo de 2017
Show that work using the table, then. You "don't need no steenkin' loops" :) and the cell arrays should be long gone including the mishmash of data types across variables.
Show a short section of each of the tables you've created instead of the input data and it would be useful then if you attached enough of a dataset then that could use for demonstration.
Guillaume
el 29 de Mayo de 2017
Identical question asked by somebody else. Obviously, some homework.
dpb
el 29 de Mayo de 2017
Ewww...guess should've realized it would have been given the subject and data. :(
Well, didn't provide the total solution in "one swell foop" but tried to make sure Chriss did enough to be able to reproduce on own from some sample (similar to yours albeit taken a step further).
Seems great reluctance here to actually dive into Matlab itself rather than continue to poke at it from around the edges... :)
Chriss
el 29 de Mayo de 2017
Respuestas (1)
Repeating with only some minor changes from the previous to get you started...begin with the script to read the files and create the tables--this is what I showed earlier excepting I'd just created some local copies of text files that copied some of your data. The names in the spreadsheet may not be very handy to use from your earlier posting as some are pretty long and included spaces that'll get munged on by readtable to make acceptable variable names so you'll probably want to clean those up some to make easier to use.
CIT=readtable('Citizens.xlsx');
INC=readtable('Income.xlsx');
TAX=read('Taxes.xlsx');
"readtable creates one variable in T for each column in the file and reads variable names from the first row of the file. By default, the variables created are double if the entire column is numeric, or cell arrays of strings if any element in a column is not numeric."
Hence, the SSN, state/region and the tax data are all going to be numeric while the remaining will be cell strings. But that should be almost all it takes to make the table. You may find the SSN is displayed as floating point value given its magnitude; that can be solved by storing as int32 or categorical as chose in the earlier example.
Now, we've already discussed the desirability of merging the income data with that for the indivudals...
CIT_INC=join(CIT,INC); % that was easy, wasn't it???!!!
Now what do we need to compute? There's the tax for each individual based on income and region. Showed a function to do that on an individual basis, how to do it for all and then by groups is where the other functions come in.
You can either use those builtin features or use the routines as I wrote earlier that work on a given SSN and vectorize them to work over an array of SSN and then pass that based on the state. Either is pretty straightforward and requires only a minimal amount of code and the desired subsets can be selected by the '==' operator as I demonstrated before with either the table or the use of the array as long as you don't try to mix string and numeric representation of the same values (primarily the state/region was the culprit).
8 comentarios
Chriss
el 29 de Mayo de 2017
dpb
el 29 de Mayo de 2017
OK, so let's see the evidence! :) See the note I added to above comment on what to show us...as mentioned there, I'd suggest probably re-edit the original question to show the progress you have made with the table class and attach the suggested subset...
I'm willing to help; just want to teach something along the way that will be useful going forward in applying Matlab as it's intended to be used, not just find a workaround to solve a problem that only exists because didn't use the proper tool for the job.
Oh, I've gotta' go do some real work (it is corn-planting season and getting late, after all :) ) so will be later in day before can get back again but I'd suggest you could fruitfully invest some time into reading the background documentation on the table class and the examples and then focus in specifically on rowfun, varfun and the use of grouping variables including the examples. There are, in fact, examples there that mimic your problem pretty well in computing statistics over functional forms; the only thing that isn't directly covered is the use across tables but it shouldn't be too difficult to figure out how to combine those you have and then do the lookup across them if you apply the principles outlined.
HINT: One simplification you can make would be since there's only the single value of income for each taxpayer in the one table and it's 1:1 to the payer database, when you create your tables might as well merge those two into just one. Then you've only got two tables with which to work instead of three...
First trash the GLOBALs!!! CIT, TAX, INC are the local names for the three tables built prior to calling SSNLookup. They're local variables for the use of the function locally (and no, Matlab won't make unnecessary copies, it's too smart for that).
In my example code I tried to use all caps for the tables and lowercase for variables but didn't make it an inviolate rule; the SSN just "seems right" to also be capitals and shouldn't be confusing to use that way.
I had just written the prior code at the command line on the fly to illustrate concepts so initially the tables I created were lowercase; rename them appropriately in your real code.
You do NOT want to rebuild the database tables every time you do a lookup, that's only done once in the initial startup code section where the files are written, and it would then be far more efficient to save those as .mat files once done and only rebuild them when there is an update to the database.
So, remove all that stuff from the function as I had it before and build the input-reading code to create the tables incorporating the hint previously given.
"is it necessary two functions?"
"Necessary" as in "mandatory"? No. "Necessary" as in good code factorization and ease of coding, debugging, maintaining, using? Yes.
Mixing different functions into one conglomeration of code just makes a mishmash of code as does the treatment of the same variable in two datasets as different types did/does in creating all of your searching issues.
Then, having taken care of the above, respond to the comment I made earlier about what would be good to show and put the results of the above code and that data in the edit'ed original question.
I'm moving between fields at the moment; came to the house to check on some other activities as well and just saw you had responded.
Fix that up and will get back later on...
The key is twofold:
- Build the tables first and only once from the input data.
- PASS those tables to functions that need them, do not make them global
- Consider what you need to group over to compute the desired statistics and review the functions outlined above...work the examples and see how to apply to your data similar computation of tax.
- Once you've done 1. and 2., you can, in fact, clear the input cell arrays and be done with 'em--they're only useful for building the tables anyway.
Chriss
el 29 de Mayo de 2017
dpb
el 29 de Mayo de 2017
I'd wager without you can rewrite the functionality of all your code in about 10-15 lines at most. You've got cells; that's where your problems start because you haven't normalized them to be consistent.
All it takes to eliminate the globals is to pass the tables as I've shown already.
Chriss
el 29 de Mayo de 2017
Categorías
Más información sobre Data Type Identification en Centro de ayuda y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!