Possible Incorrect Documentation on ksdensity

David Gillcrist on 8 Oct 2024
Answered: Umar on 9 Oct 2024
I'm trying to implement a custom version of ksdensity. The documentation says that the default bandwidth is calculated via Silverman's Rule-of-Thumb, i.e. for a bandwidth h this rule would give $h = 0.9\,\min(\hat\sigma,\ \mathrm{IQR}/1.34)\,n^{-1/5}$.
This is according to the Wikipedia article on Kernel Density Estimation. However, after digging through the MATLAB files, I found that the default bandwidth is calculated in the MATLAB function matlab.internal.math.validateOrEstimateBW (run open matlab.internal.math.validateOrEstimateBW if you want to view it in its entirety). Lines 64–68 are the relevant ones; they are shown below with the enclosing block.
64  if isequal(bw, 'normal-approx')
65      if all(sigma>0)
66          % Default window parameter is optimal for normal distribution
67          % Scott's rule
68          bw = sigma * (4/((d+2)*N))^(1/(d+4));
69      else
70          ... % Unimportant
71      end
72  else
73      ... % Unimportant
74  end
'normal-approx' is the default setting for bandwidth estimation, so it should be the rule presented above. However, it is clearly different and is referred to as "Scott's Rule". It could be that Wikipedia references the wrong bandwidth calculation and that Scott's Rule is, in fact, the same as Silverman's Rule-of-Thumb, but proper confirmation has been hard to find. For example, this presentation from UBC has a different rule labelled as Silverman's Rule-of-Thumb, and I cannot find Silverman's original paper where he purportedly first introduced this rule. If someone could confirm whether this is an error in the code or an error in my understanding of the bandwidth calculation, I would be greatly appreciative.
2 comments
Torsten on 8 Oct 2024
You should address this question to the MATLAB development team, not to the forum members as poor end users.
the cyclist on 9 Oct 2024
This question triggered a distant memory. I searched and found this question and answer from 8 years ago.
Spoiler: It's not going to help.


Answers (1)

Umar on 9 Oct 2024

Hi @David Gillcrist,

I went through your comments and studied the documentation at the link below:

https://www.mathworks.com/help/stats/ksdensity.html?s_tid=doc_ta#btpl6_1-1

To clarify your question about the default bandwidth for kernel density estimation (KDE) in MATLAB versus the traditional statistical rules, let me go through each component:

Understanding Silverman’s and Scott’s Rules

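Both rules pick the bandwidth that would minimize the asymptotic mean integrated squared error if the data were normally distributed. They differ in the constant and in how the spread of the data is estimated.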
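
For reference, the rules are commonly written as follows. This is a sketch from the standard literature, not a copy of any formula originally shown in this thread. Here $\hat\sigma$ is the sample standard deviation, IQR the interquartile range, $n$ the sample size and $d$ the number of dimensions.

```latex
\begin{align*}
h &= 0.9\,\min\!\left(\hat\sigma,\ \frac{\mathrm{IQR}}{1.34}\right) n^{-1/5}
  && \text{(the form Wikipedia labels Silverman's rule of thumb)} \\
h &= \left(\frac{4}{3n}\right)^{1/5}\hat\sigma \;\approx\; 1.06\,\hat\sigma\, n^{-1/5}
  && \text{(univariate normal reference rule)} \\
h_i &= \hat\sigma_i \left(\frac{4}{(d+2)\,n}\right)^{1/(d+4)}
  && \text{(general } d\text{; equals the line above when } d = 1\text{)}
\end{align*}
```

Different sources attach both Scott's and Silverman's names to the 1.06 form, which is probably a large part of the confusion about the labels.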

MATLAB's Bandwidth Calculation

In your provided MATLAB snippet from matlab.internal.math.validateOrEstimateBW, it appears that MATLAB defaults to a bandwidth estimation method labeled as "normal-approx," which aligns more closely with Scott's Rule rather than Silverman's:

bw = sigma * (4/((d+2)*N))^(1/(d+4));

This formula indeed suggests that it uses Scott’s approach by employing a constant derived from normal distribution assumptions.
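
A quick numerical check may make this concrete. The sketch below evaluates the quoted line for univariate data and compares it with the 1.06 normal-reference form and with the 0.9·min(σ, IQR/1.34) form. The sample x is made up, and sigma is taken as the plain sample standard deviation, which is an assumption and may not match the sigma MATLAB computes internally.

```matlab
% Compare the quoted bandwidth line with two textbook rules for d = 1.
% Assumptions: x is made-up data; sigma is the plain sample standard
% deviation (MATLAB's internal sigma estimate may differ).
% Note: iqr may require the Statistics and Machine Learning Toolbox,
% depending on the release.
x = randn(500, 1);
N = numel(x);
d = 1;
sigma = std(x);

% The line quoted from validateOrEstimateBW
bw_code = sigma * (4/((d+2)*N))^(1/(d+4));

% Normal-reference form with the exact constant (4/3)^(1/5), about 1.06
c_normal  = (4/3)^(1/5);
bw_normal = c_normal * sigma * N^(-1/5);

% The 0.9 * min(sigma, IQR/1.34) * N^(-1/5) form
bw_robust = 0.9 * min(sigma, iqr(x)/1.34) * N^(-1/5);

fprintf('quoted code line:  %.4f\n', bw_code);
fprintf('normal reference:  %.4f\n', bw_normal);
fprintf('0.9*min(...) form: %.4f\n', bw_robust);
```

The first two values agree to rounding error, so for d = 1 the code line is the 1.06 rule. The third value is always smaller, because 0.9 is less than (4/3)^(1/5) and the min can only reduce the spread estimate.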

Clarification on Literature References

The confusion often arises because Silverman and Scott both give estimates based on similar principles, and the results differ slightly in their constants because of how they are derived. For instance, Silverman lowers the constant and uses the smaller of two spread estimates so that the rule copes better with non-normal data (skewed or bimodal, for example), while Scott focuses specifically on normal distributions. The UBC reference you mentioned likely conflates these methods or presents them in a different context.

Practical Implications

Your personal experience resonates with common practice among statisticians. Many practitioners prefer adjusting bandwidth downwards (e.g., using factors like 0.5 or lower) to avoid over-smoothing, especially with smaller sample sizes where finer details are crucial.
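
Since you are writing a custom version anyway, the effect of shrinking the bandwidth can be tried out with a hand-rolled Gaussian KDE. This is only a minimal sketch: the bimodal sample, the evaluation grid and the 0.5 factor are illustrative assumptions, and the estimator is a plain average of Gaussian kernels rather than whatever ksdensity does internally.

```matlab
% Hand-rolled Gaussian KDE to see the effect of halving the bandwidth.
% Assumptions: made-up bimodal data, plain sample std as sigma,
% and 0.5 as an arbitrary shrink factor.
x = [randn(150, 1) - 2; 0.5*randn(150, 1) + 2];   % column vector of samples
N = numel(x);
sigma = std(x);
h_ref = sigma * (4/(3*N))^(1/5);      % normal-reference bandwidth for d = 1

% Evaluation grid (row vector), padded by 3 bandwidths on each side
xi = linspace(min(x) - 3*h_ref, max(x) + 3*h_ref, 400);

% Density estimate on xi for bandwidth h: the average of Gaussian
% kernels centred at each data point (xi - x expands to N-by-400).
kde = @(h) mean(exp(-0.5*((xi - x)/h).^2), 1) / (h*sqrt(2*pi));

f_ref  = kde(h_ref);        % reference bandwidth
f_half = kde(0.5*h_ref);    % half the bandwidth, i.e. less smoothing

% Sanity check: both estimates should integrate to roughly 1
area_ref  = trapz(xi, f_ref);
area_half = trapz(xi, f_half);
fprintf('area (h): %.3f, area (0.5h): %.3f\n', area_ref, area_half);
```

Plotting f_ref and f_half against xi shows the narrow mode being flattened by the larger bandwidth. With ksdensity itself, the same experiment can be run by passing a value through the 'Bandwidth' name-value argument.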

Here are some additional insights I would like to share with you.

Depending on your data distribution characteristics (e.g., skewness or presence of outliers), you might want to explore robust bandwidth selectors beyond Silverman’s or Scott’s rules. For instance, adaptive methods can provide better performance in heterogeneous data contexts. Also, bear in mind that different statistical software packages may implement these rules with slight variations, leading to discrepancies in output. Therefore, when comparing results across platforms (e.g., R vs MATLAB), it's essential to understand these underlying implementations.

I also agree with @Torsten's comment: "You should address this question to the MATLAB development team, not to the forum members as poor end users."

Hope this helps.


Release: R2024a
