Histogram

It's hard for me to remember the definitions and features of the two types of histograms, height-balanced (HBH) and frequency (aka value-based) histograms (FH or VBH). Here's my own summary.

        Criteria            |   Type   | Meaning of xxx_tab_histograms.endpoint_number^note1
----------------------------|----------|----------------------------------------------------------
Distinct values >  #buckets | HBH      | ordinal number of bucket (i.e. "Histogram bucket number")
Distinct values <= #buckets | FH (VBH) | cumulative count

According to Viewing Histograms in documentation, we can think of a ball-bucket analogy that helps remember the definitions. We have some same-size buckets and a lot of same-size balls. Each ball has a number written on it. We sort all balls based on the numbers on them and put them in the buckets in the sorted order.

If we put equal (or close to equal) number of balls into each bucket, regardless the numbers written on them, we're creating a HBH.
If we put balls in the buckets, in each of which the difference between the largest value and the smallest value written on the balls is about the same, we're creating a FH or VBH.

Although Oracle calls the second type frequency histogram in 10g documentation, the name value-based histogram as in 9i documentation better fits into our analogy; values or numbers on the balls in the same bucket span a same-width range for all buckets, not more, not less.

The name height-balanced histogram is a good one because you can imagine all buckets are filled to about the same depth. The real technical name equi-depth histogram is even better. (Jonathan Lewis Cost Based Oracle p.157 says, according to Wolfgang Breitling, the technical term in computer science is not height balanced histogram, but equi-depth histogram, instead.)

[Update 2013-07] In Oracle 12c, you can create a top frequency histogram if the number of distinct values exceeds 254.

___________________
[note1] Although for both types of histograms, one row in xxx_tab_histograms represents one bucket, the column endpoint_number has different meanings depending on histogram type. So I'm documenting it here. Documentation describes this column as "Histogram bucket number". But for FH, that's wrong; e.g., a column of 100 "1"'s, 1 "2" and 2 "3"'s will have 3 buckets with endpoint_number's 100 (of "1"'s), 101 (for 101-100=1 of "2"), and 103 (for 103-101=2 of "3"'s), respectively. Note that the column endpoint_value has the same meaning regardless histogram type, i.e., the largest value stored in this bucket.

10g xxx_tab_col_statistics (or xxx_tab_columns) has a histogram column that tells you what type of histogram you have. 9i doesn't have this. But the algorithm to determine the type in the definition of 10g xxx_tab_col_statistics works equally well in 9i.


select case when nvl(h.row_cnt,0) = 0 then 'NONE'
     when (h.bucket_cnt > 255
           or
           (h.bucket_cnt > h.distcnt and h.row_cnt = h.distcnt
            and h.density*h.bucket_cnt*2 <= 1))
         then 'FREQUENCY'
     else 'HEIGHT BALANCED'
end
from sys.hist_head$ h
where obj# = &object_id_for_the_table;

Documentation says "Frequency histograms are automatically created instead of height-balanced histograms when the number of distinct values is less than or equal to the number of histogram buckets specified". But the algorithm shown above tells us there's more to it than that simplistic criterion. I don't fully understand it. Distcnt is num_distinct in xxx_tab_columns. But row_cnt (number of rows in xxx_tab_histograms) and bucket_cnt (xxx_tab_columns.num_buckets) often differ from what you specify in the for columns size clause of gather_table_stats. This difference is something I didn't find the pattern about. Not only that, but you can't create a histogram with bucket count greater than or even equal to 255 so what's the purpose of the check h.bucket_cnt > 255?

It's interesting to note that beginning with 10.2.0.4, Oracle dropped *2 shown grayed out above. And beginning with 11gR1, <= changes to <, dropping =.

Beginning with 12c, the number of columns allowed for a frequency histogram can be as high as 2049.

[Original discussion at http://www.itpub.net/thread-1052695-1-1.html]

> In Jonathan Lewis's Cost Based Optimizer, on p.172, density can be calculated as

> sum of the square of the frequency of the nonpopular values /
> (number of nonnull rows * number of nonpopular nonull rows)

> Can anybody explain?

Jonathan says he's going to explain this in Volume 2 of his book series. For now, he just throws out this formula. Following that, he gave another formula that's supposed to do what the first formula does, but he seems to be missing some parentheses.

Patent 6732085 explains the Oracle algorithm in details. Why did the inventor (Ari Mozes) choose density to be "calculated as the sum of the square of the repetition counts for non-popular values divided by ..." instead of any other more mathematically convenient formula? He must have done some experiments and found that his current formula works best to predict the optimal SQL plan. Jonathan's recap of the patent statement is almost verbatim. But note that he replaced the phrase "repetition count" with "frequency".

Honestly, density when there's data skew is really a weird concept. If you have 1000 employees and 99% have salaries between 1k and 1.1k and 1% have 1.1k up to 5k, intuitively you really need two densities, one for the 1-1.1k range and the other for the 1.1-5k range. Having one single density makes no sense. But Oracle has to store one number in data dictionary. What do you think Oracle should do? Store density, or rather densities, in bucket-specific views, such as xxx_tab_histograms (so it's one row per bucket), instead of column-level views, xxx_tab_columns (one row per table column)? Currently Oracle stores one single density in a column-level view. What do you think that density is? An all-bucket density average? Since Jonathan shows that it's used for non-popular values (see Strategy 2 for Height-Balanced Histogram in his book, p.172), this single density for the entire column is an average for non-popular values only? Note that according to Jonathan's examples, whether the histogram is used depends on whether the constant in "where column=constant" is a popular value or not. If it is, look at the histogram (Strategy 1 in his book). If it is not, use the column-average density (Strategy 2).

[Original discussion at http://www.freelists.org/post/oracle-l/Density-statistic-calculation-in-case-of-histogram]

> Please tell me how density is calculated in case of histograms.
> 
> SQL> select col_skew, count(*) from tab_skew group by col_skew order by 1;
> 
>   COL_SKEW   COUNT(*)
> ---------- ----------
>          1         10
>          2         10
>          3         10
>          4         10
>          5         10
>          6         10
>          7         10
>          8         10
>          9         10
>         10       9910
> 
> 10 rows selected.
> 
> SQL> exec dbms_stats.delete_table_stats(user,'TAB_SKEW');
> 
> PL/SQL procedure successfully completed.
> 
> -- Creating width-based histograms, #buckets = #distinct values
> SQL> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW', method_opt=>'FOR COLUMNS COL_SKEW size 10');
> 
> PL/SQL procedure successfully completed.
> 
> SQL> select * from dba_tab_col_statistics where table_name='TAB_SKEW' and column_name='COL_SKEW';
> 
> ... DENSITY ... HISTOGRAM
> ... ------- ... ---------
> ... .00005  ... FREQUENCY
> 
> Question: How density is calculated here? Also it's equal to Selectivity of non-popular values.
> 
> SQL> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW', method_opt=>'FOR COLUMNS COL_SKEW size 5');
> 
> PL/SQL procedure successfully completed.
> 
> -- Creating Height-based histograms, #buckets < #distinct values
> SQL> select * from dba_tab_col_statistics where table_name='TAB_SKEW' and column_name='COL_SKEW';
> 
> ... DENSITY ... HISTOGRAM
> ... ------- ... ---------------
> ... .98209  ... HEIGHT BALANCED
> 
> Question: How density is calculated here and how selectivity will be estimated for
> popolar as well as non-popular values?
...
> After reading an excellent paper "Histograms - Myths and Facts" by Wolfgang
> Breitling, some points get cleared.
> 
> 1)  In case of width-based histograms, density= 1/ (2*number of distinct values)
> 
> Thus come the figure 0.00005
...
> In case of height-based histograms, I didn't get these calculations.
> density = Σ cnt² / ( num_rows * Σ cnt )

The calculation of density in case of a histogram is documented in the US patent 6732085. Jonathan Lewis's CBO book p.172 says the same: "sum of the square of the frequency of the nonpopular values / (number of nonnull rows * number of nonpopular nonull rows)", where "frequency" is called repetition count in the patent.

Intuitive understanding is beyond me. But it's not hard to apply the proposed formula. Note the non-popular value is one that does not show in two (or more) bucket end points. So in your case of 5 bucket height-balanced histogram, it is:

bucket#  1       2     3     4     5
col_skew 1-10    10    10    10    10

10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2
-------------------------------------------- = 0.001
10000 * 90

So the correct answer is 0.001. If you run your test in 10.2.0.2 and above, you'll get this number. In 10.2.0.1, you get 0.98209 for some reason. You would "accidentally" get that number if you applied the formula to your 10-bucket frequency (you call width-based) histogram:

bucket#  1    2    3    4    5    6    7    8    9    10
col_skew 1    2    3    4    5    6    7    8    9    10

10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+9910^2
--------------------------------------------------- = .98209
10000 * 10000

By the way, to add to the list of formulas of density calculation, Wolfgang Breitling has the classic pioneering work in his A Look under the Hood of CBO, and its later rewrite in Histograms - Myths and Facts, and there's the rarely mentioned, a "limited"-published Metalink document Note:43041.1; but that one may be applicable only to certain old versions. For those using 11g, Alberto Dell'Era's New Density calculation in 11g is a must read; it tells us how Oracle changes histogram calculation again.

[Original discussion at http://www.itpub.net/thread-1095499-1-1.html]

create table SMALL
( A VARCHAR2(4),
  B NUMBER,
  C VARCHAR2(4)
);
begin
  for i in 1..4000 loop
    insert into small values (mod(i,20),mod(i,40),i);
  end loop;
  commit;
end;
/
analyze table small compute statistics;

select column_name, num_distinct, low_value, high_value, density
from user_tab_col_statistics
where table_name='SMALL';

COLUMN_NAME        NUM_DISTINCT        LOW_VALUE        HIGH_VALUE        DENSITY
A                            20        30               39                   0.05
B                            40        80               C128                0.025
C                          4000        31               393939            0.00025

Question: Why is A's low_vale 30 instead of the value 1 stored in the database, and high_value 39 instead of 19 or '9'? Why is B's low_value 80 instead of 1 and high_value C128 instead of 39, and it has a character 'C'?

Varchar2 column A's low_value 30 (0x30) is the ASCII code in hex for character '0', and high_value 39 (0x39) is for '9'. Number column B has 80 because that's what 0 is stored as inside the database:

SQL> select dump(0,16) from dual;

DUMP(0,16)
---------------
Typ=2 Len=1: 80

Do the same dump for 39 and you get C128. You can use Greg Rahn's display_raw function to do conversion for you.

To my OraNotes Page