Histogram Bin-width Optimization | Shimazaki, PhD | Last Update 2010-02-18 0:17
This page INSTANTLY generates an optimized histogram of YOUR DATA.

*Solutions may be found in the FAQ. 07/06/16 Ver. 1.0 Web Application for Bin Size Selection © 2006 2007 2008 Hideaki Shimazaki
I. Divide the data range into N bins of equal width, and count the number of events k_i that fall into the i-th bin.

[Figure: An Optimized Histogram]
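When you make a histogram, you need to choose a bin size. How large (or small) should the bin size be? If the bin size is too small, the histogram is jagged because each bar is dominated by sampling fluctuation. If it is too large, the histogram is flat and cannot resolve the shape of the underlying distribution.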
There is an optimal bin size that remedies these two opposite problems. The Java applet below demonstrates the problem of bin size selection. In the applet, the data are shown in the upper box and their histogram in the bottom box. The data are an example from neuroscience: the timing of neuronal firing (spikes) obtained over repeated trials. Twenty sequences (trials) are drawn in the data box. However, the data can be anything (weight, height, test scores, etc.) and can consist of a single sequence. The histogram is drawn in red in the bottom box. The blue line is the distribution (or rate) that produces the data samples in the box above. Our aim is to choose a histogram that best represents this blue line.

As a first step, change the bin size of the histogram with the scroll bar at the bottom of the applet. With too small a bin size, you get a jagged histogram. With too large a bin size, you obtain a flat histogram. Now move the scroll bar to the left to make the bins narrow, and push the 'redraw' button several times. The 'redraw' button regenerates sample points from the blue density distribution. Confirm that the shape of the histogram changes drastically. If you were asked which of the two hills of the distribution is taller, could you tell? Although the right hill is taller than the left (blue line), the maximum height of the histogram quite often appears on the left side.

As a second step, move the scroll bar to the right to make the bins wide, and push 'redraw' several times. Confirm that the shape of the histogram does not change very much. In addition, you may notice that the right-hand hill is higher than the left-hand one most of the time. However, can you tell where the highest point of the hill is? The resolution of the histogram is not good enough. When you make a histogram, you need to choose a bin size that strikes a compromise between sampling error and resolution.

Now turn on the 'error' check box. A yellow area appears, which indicates the error between the histogram and the underlying distribution. You may turn off the histogram check box. Judge by eye where the total error is smallest. You will find that the error is smallest when the scroll bar handle is about one fourth of the total length from the left. Push the 'redraw' button several times to check the statistical fluctuation of the errors. The optimal bin size of a histogram is the bin size that produces the smallest total error (yellow area) on average over many realizations of the sampled data (i.e. averaged over many pushes of the 'redraw' button).

At this point, you might say, "Okay, I now understand what an optimal bin size is. But how can we know it? After all, we do not know the blue underlying distribution, so we cannot know the errors. We can never know the optimal bin size from data." This statement is not true. Indeed, the errors can be estimated from the data. The above method computes the estimated error for several candidate bin sizes and chooses the bin size that produces the smallest 'estimated' error. For the details of the method, please refer to Shimazaki and Shinomoto in Neural Computation.
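The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration that follows the cost function used in the MATLAB sample on this page, C = (2k - v) / D^2 with the biased variance v; the function names `bin_counts`, `cost`, and `optimal_bins` are hypothetical, and the sketch assumes the data are not all identical (otherwise the bin width is zero).

```python
def bin_counts(data, n_bins):
    """Count samples in n_bins equal-width bins spanning [min(data), max(data)]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The largest sample sits on the right edge; keep it in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts, width


def cost(data, n_bins):
    """Estimated error C = (2*mean - var) / width**2 for one candidate bin count."""
    counts, width = bin_counts(data, n_bins)
    mean = sum(counts) / n_bins
    var = sum((k - mean) ** 2 for k in counts) / n_bins  # biased: divide by N
    return (2 * mean - var) / width ** 2


def optimal_bins(data, n_min=2, n_max=50):
    """Return (number of bins, bin width) with the smallest estimated error."""
    best = min(range(n_min, n_max + 1), key=lambda n: cost(data, n))
    return best, (max(data) - min(data)) / best
```

Calling `optimal_bins(x)` on a list of samples returns the candidate with the smallest estimated error among the bin counts searched; the search range is a free choice, as in the sample programs.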
Q. It seems that the theory assumes a time histogram. Can I apply the proposed method to a histogram that estimates a probability (density) distribution?
A. Yes.

Q. I obtained a very small bin width, which is likely to be erroneous. Why?
A. You probably searched bin widths smaller than the sampling resolution of your data. Start the search at a bin width no smaller than your sampling resolution. In addition, check whether your data contain duplicated samples. A small bin width will be selected for such highly correlated samples. See below for a solution to duplicated data caused by low sampling resolution.

Q. I have data with low sampling resolution.
A. To obtain the optimal bin size correctly, it is recommended to replace your data x with x+r, where r is a uniform random variable drawn from [-dx/2, dx/2] and dx is the sampling resolution of data acquisition. Repeat this for all samples x, with r drawn independently each time. Use the randomized data to obtain the optimal bin size. To draw the histogram, use the original data. See the difference with the following examples.

Copy and paste the rounded data:
5.1 5.1 4.9 5 5 4.9 5 5 4.9 4.8 4.9 5 4.8 5.1 4.9 4.8 5 5.1 5 5 5 5 5.1 5 5.1 5.1 5 5 4.9 4.9 5 5 4.9 5.1 5 5.3 5 4.9 5 5 4.9 5 4.9 5 5 5.1 5.1 4.7 5 5.1 5.4 5.7 5.5 5.7 5.6 5.5 5.5 5.5 5.3 5.5 5.5 5.3 5.7 5.3 5.4 5.3 5.7 5.4 5.2 5.7 5.8 5.6 5.5 5.3 5.3 5.7 5.7 5.4 5.7 5.5 5.4 5.3 5.6 5.1 5.2 5.9 5.5 6 5.5 5.5 5.7 5.5 5.4 5.8 5.3 5.7 5.4 5.6 5.7 5.3
Try the jittered data.
5.1488 5.067 4.8758 4.9897 4.9574 4.9184 4.9902 5.0483 4.8902 4.8121 4.8654 4.9881 4.7661 5.1258 4.9371 4.7851 5.0186 5.0794 5.0031 5.0332 5.0097 4.9835 5.0799 4.9953 5.0923 5.086 5.0058 5.0243 4.8924 4.8929 4.9625 4.9524 4.879 5.0818 5.0154 5.3457 5.0436 4.8958 4.974 5.0264 4.9259 5.0241 4.9244 4.9606 5.0182 5.0963 5.0712 4.6599 5.0324 5.0675 5.3664 5.7166 5.5394 5.7017 5.6203 5.4654 5.5453 5.5041 5.318 5.4537 5.5309 5.3249 5.662 5.3025 5.3826 5.3046 5.6899 5.3915 5.1681 5.6755 5.7521 5.6424 5.5154 5.3433 5.2664 5.7421 5.7295 5.4077 5.694 5.4758 5.4252 5.2729 5.5564 5.1267 5.2171 5.9215 5.5142 5.9919 5.4891 5.5316 5.6817 5.5315 5.4289 5.8352 5.3006 5.7136 5.4451 5.5944 5.656 5.3367
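The jittering recommended above for low-resolution data can be sketched as follows. This is a minimal illustration, not the code used to produce the jittered sample; the function name `jitter` and its `seed` parameter are hypothetical, and `dx` must be set to the sampling resolution of your own data (0.1 would match the rounded sample above).

```python
import random


def jitter(data, dx, seed=None):
    """Return a copy of data with independent uniform noise from [-dx/2, dx/2] added."""
    rng = random.Random(seed)  # seed only makes the sketch reproducible
    return [x + rng.uniform(-dx / 2, dx / 2) for x in data]
```

Use the jittered copy only to select the bin size, and draw the final histogram from the original data.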
Q. Can I apply the method to data composed of integer values?
A. Yes. Please note that the resolution of your data is an integer. It is recommended to search only bin widths that are a multiple of the resolution of your sampled data. Do not search bin widths smaller than the resolution. (The current version of the web application can NOT be used for computing only integer bin widths.)

Q. I want to make a 2-dimensional histogram. Can I use this method?
A. Yes. The 'bin size' of a 2-d histogram is the area of a segmented square cell. The mean and the variance are simply computed from the event counts in all the bins of the 2-dimensional histogram. A MATLAB demo program for selecting the bin size of a 2-d histogram is available. (The current version of the web application can NOT be used for computing a 2-dimensional histogram.) See also 2-d kernel density estimation.

Q. Can I use the unbiased variance as the variance in the formula for bin size selection?
A. No. Please use the (biased) variance displayed in the method. Note that the built-in variance function of your software may return the unbiased variance by default.

Q. I used Scott's method, Optimal Bin Size = 3.49*s*n^(-1/3), and obtained a bin size that differs from the result of the present method.
A. They should be different. Three assumptions were made to obtain Scott's result. First, Scott's result is asymptotically true (i.e. it holds for large sample size n). Second, the scaling exponent -1/3 holds if the density is a smooth function. Third, the coefficient 3.49 was obtained assuming a Gaussian density function as a reference. The present method does not require these assumptions.

Q. What assumption was made in your method?
A. The method only assumes that the data are sampled independently of each other (an assumption of a Poisson point process). No assumption was made about the underlying density function (for example, unimodality, continuity, existence of derivatives, etc.).

Q. Is the assumption in this method distinct from that in other methods?
A. Yes. In classical density estimation, the sample size is fixed. Here, the total data size is not fixed but is assumed to obey a Poisson distribution. Note that the MISE (mean integrated squared error) is the ensemble performance of histograms over 'many' realizations of the experiment. Histograms constructed from Poisson point events have more variability than histograms constructed from a fixed number of samples. This leads to a wider optimal bin size for a histogram under the Poisson point process assumption. However, this difference becomes negligible as the data size increases.

Q. You provide a method for optimizing the kernel density estimate, too. Which of the methods, a histogram or a kernel, do you recommend for density estimation?
A. We generally recommend using the kernel density estimate. Please check the kernel density optimization page, too.
The FAQ reflects my (HS) own opinions. They are not the opinions of my collaborators or of the institutions I belong to.
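Two of the answers above concern quantities that are easy to mix up in practice: the biased versus unbiased variance, and Scott's reference rule. The sketch below uses Python's standard `statistics` module to show both. The function name `scott_bin_width` is hypothetical, and taking s to be the sample standard deviation (n-1 form) is an assumption, since the FAQ does not say which estimate of s is meant.

```python
import statistics


def scott_bin_width(data):
    """Scott's reference rule from the FAQ: 3.49 * s * n**(-1/3)."""
    s = statistics.stdev(data)  # assumption: s is the sample standard deviation
    return 3.49 * s * len(data) ** (-1 / 3)


# Event counts of a small example histogram with four bins.
counts = [3, 5, 4, 8]

# Biased variance (divides by N): the form required in the bin size formula.
biased = statistics.pvariance(counts)

# Unbiased variance (divides by N-1): the default in many packages; do not use it here.
unbiased = statistics.variance(counts)
```

For these four counts the two variances differ noticeably, and the gap is largest when the number of bins is small.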
A bar-graph histogram is constructed simply by counting the number of events that belong to each bin of width [...]

In this thesis, we assess the goodness of the fit of the estimator [...]

By segmenting the range [...]

The second term of Eq. (3.5) can further be decomposed into two parts, [...]

By replacing [...]
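For reference, the quantities that the sample programs on this page evaluate can be written out as follows. This is a transcription of the MATLAB sample (its variables `ki`, `k`, `v`, `D`, and `C`), not a recovery of the equations of the derivation; the symbols $N$, $\Delta$, $k_i$, $\bar{k}$, and $v$ are notation chosen here for illustration.

```latex
\bar{k} = \frac{1}{N}\sum_{i=1}^{N} k_i, \qquad
v = \frac{1}{N}\sum_{i=1}^{N} \left(k_i - \bar{k}\right)^2, \qquad
C(\Delta) = \frac{2\bar{k} - v}{\Delta^{2}}, \qquad
\Delta^{*} = \arg\min_{\Delta} C(\Delta)
```

Here $k_i$ is the event count in the $i$-th of $N$ bins of width $\Delta$, and $v$ is the biased variance (division by $N$, not $N-1$).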
The MATLAB function sshist.m is now available. [2008/11/19]
Below is a simple sample program, sample.m.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 2006 Author Hideaki Shimazaki
% Department of Physics, Kyoto University
% shimazaki at ton.scphys.kyoto-u.ac.jp
% Please feel free to use/modify this program.
%
% Data: the duration for eruptions of
% the Old Faithful geyser in Yellowstone National Park (in minutes)
clear all;
x = [4.37 3.87 4.00 4.03 3.50 4.08 2.25 4.70 1.73 4.93 1.73 4.62 ...
3.43 4.25 1.68 3.92 3.68 3.10 4.03 1.77 4.08 1.75 3.20 1.85 ...
4.62 1.97 4.50 3.92 4.35 2.33 3.83 1.88 4.60 1.80 4.73 1.77 ...
4.57 1.85 3.52 4.00 3.70 3.72 4.25 3.58 3.80 3.77 3.75 2.50 ...
4.50 4.10 3.70 3.80 3.43 4.00 2.27 4.40 4.05 4.25 3.33 2.00 ...
4.33 2.93 4.58 1.90 3.58 3.73 3.73 1.82 4.63 3.50 4.00 3.67 ...
1.67 4.60 1.67 4.00 1.80 4.42 1.90 4.63 2.93 3.50 1.97 4.28 ...
1.83 4.13 1.83 4.65 4.20 3.93 4.33 1.83 4.53 2.03 4.18 4.43 ...
4.07 4.13 3.95 4.10 2.27 4.58 1.90 4.50 1.95 4.83 4.12];
x_min = min(x);
x_max = max(x);
N_MIN = 4; % Minimum number of bins (integer)
% N_MIN must be more than 1 (N_MIN > 1).
N_MAX = 50; % Maximum number of bins (integer)
N = N_MIN:N_MAX; % # of Bins
D = (x_max - x_min) ./ N; % Bin Size Vector
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computation of the Cost Function
for i = 1: length(N)
edges = linspace(x_min,x_max,N(i)+1); % Bin edges
ki = histc(x,edges); % Count # of events in bins
ki(end-1) = ki(end-1) + ki(end); % histc counts x == x_max separately; fold it into the last bin
ki = ki(1:end-1);
k = mean(ki); % Mean of event count
v = sum( (ki-k).^2 )/N(i); % Variance of event count
C(i) = ( 2*k - v ) / D(i)^2; % The Cost Function
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Optimal Bin Size Selection
[Cmin idx] = min(C);
optD = D(idx); % *Optimal bin size
edges = linspace(x_min,x_max,N(idx)+1); % Optimal segmentation
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Display an Optimal Histogram and the Cost Function
centers = edges(1:end-1) + optD/2; % hist expects bin centers, not edges
subplot(1,2,1); hist(x,centers); axis square;
subplot(1,2,2); plot(D,C,'k.',optD,Cmin,'r*'); axis square;
Copy and paste the sample program sample.R, then run sshist(faithful[,1]).
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# 2006 Author Hideaki Shimazaki
# Department of Physics, Kyoto University
# shimazaki at ton.scphys.kyoto-u.ac.jp
# Please feel free to use/modify this program.
sshist <- function(x){
N <- 2: 100
C <- numeric(length(N))
D <- C
for (i in 1:length(N)) {
D[i] <- diff(range(x))/N[i]
edges <- seq(min(x), max(x), length.out = N[i] + 1) # N[i] bins need N[i]+1 edges
hp <- hist(x, breaks = edges, plot=FALSE )
ki <- hp$counts
k <- mean(ki)
v <- sum((ki-k)^2)/N[i]
C[i] <- (2*k-v)/D[i]^2 #Cost Function
}
idx <- which.min(C)
optD <- D[idx]
edges <- seq(min(x), max(x), length.out = N[idx] + 1) # N[idx] bins need N[idx]+1 edges
h <- hist(x, breaks = edges)
rug(x)
return(h)
}
Copy and Paste a sample program sample.nb, then ctrl+shift. \!\(<< \ Graphics`Graphics`\n |
Other applications for analyzing spike data: SULAB ( Prof. Shinomoto ) |
© 2006 2007 2008 2009 Hideaki Shimazaki All rights reserved.