Getting a raw constant number of rows from oracle’s table sample function


You may of course know these two famous posts called To sample or not to sample… (part-2) about data sampling by Mark Hornick. Although very limited in scope, the two posts (imho) very well sketch why we may employ data sampling and how we may lift off table sampling in oracle.
In general, sampling is used to make a representative statement about a collection of data while only regarding a limited random selection, the sample. As long as you are ok to analyze just a sufficient subset of your 1o million rows table for an analysis, you will save your environment a lot of resources and time. On some other scenario, a limited random data selection may also serve verification or testing purposes where, however, not the representativeness but the randomness at a more or less constant sample size, determines the quality of the sample output. Again, as long as you are ok to not exceed this 15 minutes time window overnight, you will be allowed to run that live unit test on any table in question, on 1, 10 or 100 million rows.
In sql, selecting in regard to gain a representative statement will feed the sample function with a requested percentage of rows to sample from. This is what the oracle sample function already offers. Yet another sql to accept a requested actual number of rows to return, independent of the table size, is not available so far (although most people do expect exactly this behaviour when they spot the sql sample function for the first time, weird). The following text will outline a snippet of pl/sql to provide for a sample function to accept the expected number of rows as a parameter.

There is not much rocket science necessary here to do the logic. Since the original oracle sample function takes a percentage of rows, we just employ the rule of three to compute the actual input (#19), along with the overall count(*) rows (#14) and the requested rows, for some table and id-column (#6). Regard the checks in (#16) and (#21, #22). (#24) does some formatting to provide for a palatable sample percentage parameter and (#25) executes the actual data. (#30), finally, returns the set of randomly sampled id’s.

--drop type jm_mo_ids_tab force;
--/
create or replace type jm_mo_ids_tab as table of number;
/
create or replace function sample_rs(
    i_cnt in number, i_tab in string, i_col in string,
    i_whr in string default null) return jm_mo_ids_tab is
  v_ids jm_mo_ids_tab := jm_mo_ids_tab();
  v_smp varchar2(32767) := 'sample (:smpl_prc:)';
  v_cnt integer;
  v_prc number;
begin
  -- get the total of rows in question
  execute immediate 'select count(*) from ' || i_tab || ' ' || i_whr into v_cnt;
  -- prevent 0 divisor
  if (v_cnt > 0) then
    -- simple rule of three: rows_get/smpl% = rows_all/100%,
    --   smpl% = (rows_get * 100%) / rows_all
    v_prc := round(((i_cnt * 100) / v_cnt), 6);
    -- prevent sample over- or underflow, e.g. for (i_cnt > v_cnt)
    if (v_prc < 0.000001) then v_prc := 0.000001;
    elsif (v_prc >= 100) then v_prc := 100 - 0.000001; end if;
    -- get a sample of rows around smpl%
    v_smp := replace(v_smp, ':smpl_prc:', to_char(v_prc, 'fm990.999999'));
    execute immediate 'select /*+ CURSOR_SHARING_EXACT */ ' || i_col ||
      ' from ' || i_tab || ' ' || v_smp || ' ' || i_whr bulk collect
      into v_ids;
  end if;
  -- done
  return v_ids;
end;
/

Do especially pay attention to the usage of /*+ CURSOR_SHARING_EXACT */ in (#25). In fact, I once in a while experienced oracle always returning the same set of id’s on consecutive invocations of the code, no matter what I entered as sample input percentage. Even intentionally violating the range of the sample input parameter, i.e. 0.000001 < v_prc < 100, did not throw an exception whatsoever (and returned the same set again). It seems, that oracle applies the same force cursor sharing rules to the sample input parameter as it does to regular bind variable/values. This is giant bullshit, sorry for that, imagine you change the partition name in a select xyz from table partition(part) and data is still returned from the the original partition… I did not look up or raised a bug on that, draws me too much time and nerves for nothing nettable to my end, just worked around. I’m so dead tired out on that.

Concluding, here is a simple statement to verify or inspect the stability and the variation of the sets of returned id’s according to the given amount of rows (the where clause in (#6) serves forcing oracle to actually execute sample_rs for every row in tmp (#1), what it will not do / out-optimize otherwise). Three example sets of id’s are given, for 20, 50 and 100 rows, respectively, out of a 10 million rows, 6.7 gbyte table. Yes, i’s fast, nothing is as fast in random table sampling as the original oracle sampling function. Any homebrew approach, using dbms_random or ora_hash for example, will always fall behind in that respect.

with tmp as (select level as idx from dual connect by level <= 10)
select idx, count(id_col) as row_cnt, avg(count(id_col)) over() as row_cnt_avg,
  min(id_col) as id_col_min, max(id_col) as id_col_max
from tmp,
  (select value(t) as id_col from table(sample_rs(20, 'table', 'id-column')) t)
where idx != id_col
group by idx
;

idx | row_cnt | row_cnt_avg | id_col_min | id_col_max
1	26	20,6	12980373	170910684
6	16	20,6	13371461	171150593
2	20	20,6	20716524	170704329
4	20	20,6	13021294	1880612248
5	29	20,6	12980608	171665679
8	19	20,6	12989315	171271290
3	17	20,6	13014787	169664010
7	20	20,6	20785428	170992518
9	21	20,6	12919367	169563405
10	18	20,6	12897915	1879537404
2.453 secs

idx | row_cnt | row_cnt_avg | id_col_min | id_col_max
1	47	50,2	12960941	1880618218
6	60	50,2	12845778	171289614
2	51	50,2	12844193	1880388038
4	44	50,2	12849760	1880297350
5	43	50,2	13068108	1880134458
8	49	50,2	12875680	1880077868
3	62	50,2	13139775	1879554240
7	60	50,2	12906559	1880651013
9	46	50,2	12915736	1880597022
10	40	50,2	13020992	1880601611
3.176 secs

idx | row_cnt | row_cnt_avg | id_col_min | id_col_max
1	117	109,7	12854357	1880658888
6	107	109,7	12881207	171346627
2	94	109,7	12960234	1880610605
4	123	109,7	12848383	1880536929
5	109	109,7	12919576	1880571761
8	113	109,7	12871892	1880482529
3	99	109,7	13116822	1880155088
7	125	109,7	12884808	1880409808
9	98	109,7	12939815	1880406235
10	112	109,7	12885659	1880584015
2.874 secs

Have fun, Peter

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s