What is the best processor for Hybrid?

Borchuk Leonid:

We ran into quite a strange pattern in the behavior of our Essbase application. I have a special Essbase monitoring MDX query that calculates a single value in Hybrid mode. Here it is:

SELECT
{CROSSJOIN({[Period].[Jan]},CROSSJOIN({[Level].[LV_Total]},{[AccountCONS].[PL_NetInc]}))} ON COLUMNS,
{CROSSJOIN({[LegalEntity].[LE_PnL]},CROSSJOIN({[BudOwnerCONS].[BO_PnL]},CROSSJOIN({[CostCenterCONS].[CC_PnL]},
 CROSSJOIN({[Service].[SE_PnL]},CROSSJOIN({[Reserved].[LO_PnL]},CROSSJOIN({[Program].[PRG_PnL]},
 CROSSJOIN({[LegalEntityICP].[ICO_PnL]},CROSSJOIN({[Add].[Add_PnL]},CROSSJOIN({[Channel].[CH_PnL]},
 CROSSJOIN({[Contractor].[CO_PnL]},{[Currency].[RC_RUB]}))))))))))} ON ROWS
FROM H_ALLOC.H_ALLOC
WHERE ([Years].[FY16],[Scenario].[ACT],[Version].[VR_Work],[ReservedALLOC].[R2_ALLOC_NA],[Detail].[DET_NA]);

The issue is that query execution time differs across environments, and at first one couldn't understand why it is so variable:
On production (Intel Xeon E5-2650 v2, $1166.00 per CPU, http://ark.intel.com/products/75269/Intel-Xeon-Processor-E5-2650-v2-20M-Cache-2_60-GHz):
MaxL DML Execution Elapsed Time : [0.17] seconds

On the test environment (Intel Xeon E5-2660, $1329.00 per CPU, http://ark.intel.com/products/64584/Intel-Xeon-Processor-E5-2660-20M-Cache-2_20-GHz-8_00-GTs-Intel-QPI):
MaxL DML Execution Elapsed Time : [0.24] seconds

On the dev environment (Intel Xeon E5645, $554.00 per CPU, http://ark.intel.com/products/48768/Intel-Xeon-Processor-E5645-12M-Cache-2_40-GHz-5_86-GTs-Intel-QPI):
MaxL DML Execution Elapsed Time : [0.23] seconds

I know the hardware in our environments differs slightly, but judging by the execution times, the worst result comes from the most expensive CPU ($1329.00). A strange result that needs to be explored more closely.

For this I used the perf utility (the best resource about perf is http://www.brendangregg.com/perf.html), which writes CPU (and some kernel) counters to a perf.data file. Among other things, that file can store the calls to all shared-library functions (from stack traces). Using, for example, the cpu-cycles event, it is possible to calculate how many cycles were spent processing each function. It's not a trace, but quite close.
The Essbase main process actively uses shared libraries (shared libraries carry symbol names for each function, and those names show up in reports), so by watching and measuring these calls one can distinguish different queries. It's not a logical query execution plan, just a profile of the amount of physical operations performed while processing a query.
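To make this concrete, here is a generic sketch of the perf workflow (not the exact invocation used below; <pid> is a placeholder for the process to profile):

perf record -g -p <pid> -- sleep 10   # sample call stacks of the process for 10 seconds
perf report -n --stdio                # print the hottest functions with their sample counts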

Let's look at and compare the profiles for the test and dev environments.

First, record the data:

[root@hyp-ess2e ~]# perf record -T -a -g fp su - oracle -c "source /etc/hyperion/udf_env.sh; /var/lib/hyperion-paysys/MDX.py"
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.561 MB perf.data (~24495 samples) ]

and then take the recorded perf.data file and generate a report:

[root@hyp-ess2e ~]# perf report --sort=cpu --show-total-period -n --showcpuutilization --comms=ESSSVR

We set the sort mode --sort=cpu, which means we'll see the most expensive functions at the top.

Report on development:

# Overhead      sys     user  Samples       Period  CPU
# ........  .......  .......  .......  ...........  ...
#
    99.78%    0.98%   98.80%      982    581375243    2
    |
    |--17.56%-- GetMemIndv
    |--14.29%-- _ZN21QPDFOdometerRetriever15bsoProcessBlockEv
    |--8.17%-- adIndBinarySearchBtree
    |--6.02%-- adGetSeci
    |--5.41%-- _ZN26CoordRangeSetArrayIterator6seekGEEPKjS1_Pb
    |--5.31%-- EssSdSpinlockAcquireNonReentrant
    |--3.57%-- adIndFindNextKey
    |--2.86%-- _ZNK13CoordRangeSet12findGEOffsetEj
    |--2.65%-- _ZN16JerusalemOutline23getAscendantsAndWeightsEP11AD_THDCTX_TPvdR14QPDFChun

Report on testing:

# Overhead      sys     user  Samples       Period  CPU
# ........  .......  .......  .......  ...........  ...
#
    98.76%    2.11%   96.65%      833    555714282    2
    |
    |--17.70%-- GetMemIndv
    |--14.69%-- adIndBinarySearchBtree
    |--7.71%-- EssSdSpinlockAcquireNonReentrant
    |--5.66%-- adGetSeci
    |--5.42%-- _ZN21QPDFOdometerRetriever15bsoProcessBlockEv
    |--4.82%-- adIndFindNextKey
    |--3.85%-- _ZN26CoordRangeSetArrayIterator6seekGEEPKjS1_Pb
    |--3.37%-- adIndLocatePage
    |--2.65%-- _ZNK13CoordRangeSet12findGEOffsetEj
    |--2.65%-- EssWideUIntIncrProd32
    |--2.29%-- EssSdSpinlockAcquire
    |--1.92%-- _ZN16JerusalemOutline23getAscendantsAndWeightsEP11AD_THDCTX_TPvdR14QPDFChun

The key number is the sample count (982 on dev): it's the total number of samples in the perf.data file taken while the ESSSVR process was running. On dev, 17.56% of them, i.e. 982 × 0.1756 ≈ 172 samples, contain the GetMemIndv function in their stack trace; 14.29% ≈ 140 samples contain the _ZN21QPDFOdometerRetriever15bsoProcessBlockEv function, and so on.
From the names we can also conclude that all the MDX query actually did was walk through index leaves (adIndFindNextKey, adIndBinarySearchBtree) and fetch values (GetMemIndv, _ZN21QPDFOdometerRetriever15bsoProcessBlockEv).
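By the way, such counts can be checked by hand: perf script (a standard perf subcommand) dumps every recorded sample with its stack, so a rough count of the stacks mentioning a given symbol is one grep away. A sketch, using a symbol from the reports above:

perf script -i perf.data | grep -c GetMemIndv   # lines (roughly, samples) mentioning the symbol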

But let's return to comparing the test and dev environments.

The entry that looks suspicious to me is this one on dev:

|--5.31%-- EssSdSpinlockAcquireNonReentrant

and this one on the test environment:

|--7.71%-- EssSdSpinlockAcquireNonReentrant
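To focus on this single symbol in a recording, perf report can filter by symbol name (a sketch):

perf report -i perf.data --symbols=EssSdSpinlockAcquireNonReentrant --stdio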

We seem to be dealing with spinlocks. A thread waiting on a spinlock burns CPU cycles in a busy loop, so the slower the core, the larger the share of samples that land inside the lock. If so, the difference in EssSdSpinlock is quite simple to explain: as with a classical spinlock (https://en.wikipedia.org/wiki/Spinlock), it's either bad code or a slow processor.

Bad code is not our case: the Essbase version and the MDX query were the same. Let's look at the processors:

On dev:
# of Cores 6
# of Threads 12
Processor Base Frequency 2.40 GHz
Max Turbo Frequency 2.67 GHz

and on test:
# of Cores 8
# of Threads 16
Processor Base Frequency 2.20 GHz
Max Turbo Frequency 3.00 GHz

Test has more threads but a lower base frequency. Let's check the actual frequency value on test:

[root@tthyp-ess2e ~]# cat /proc/cpuinfo | grep MHz
cpu MHz : 2200.000

2200 is the base value, while the maximum is 3000. Changing the frequency may initially be a challenge; see the info about energy-efficient modes and the scaling governor at https://itpeernetwork.intel.com/how-to-maximise-cpu-performance-for-the-oracle-database-on-linux/

But it all boils down to changing the processor's energy-efficient mode in the BIOS and sometimes installing a scaling-governor daemon in RHEL (see a good example here: http://christian-hoekstra.blogspot.ru/2016/09/how-to-configure-cpu-options-in-uefi.html). After all these actions we could see in /proc/cpuinfo:

cpu MHz : 2700.000

The CPU frequency has increased from 2.2 GHz to 2.7 GHz, and total execution time on test has become 17% better.
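For reference, a minimal sketch of inspecting and changing the governor from Linux (the sysfs paths are the standard cpufreq interface; cpupower comes with the kernel-tools/cpupowerutils package, so details vary by distribution):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # current governor for CPU 0
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # current frequency, in kHz
cpupower frequency-set -g performance                       # switch all CPUs to the performance governor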

But we still have an oddity: production is 8% better than test, yet they show the same frequency in /proc/cpuinfo: 2700.

And it's impossible to find out why using the perf command line I showed earlier, because perf is not a tracer (and I'm not sure a trace would help here): it just samples the running process at intervals, collecting various types of counters.

Here are two consecutive executions of the script on the production environment:

# Samples: 771 of event 'cycles'
# Event count (approx.): 569121986
#
# Overhead      sys     user  Samples       Period  CPU
# ........  .......  .......  .......  ...........  ...
#
    99.80%    0.94%   98.86%      765    567971776    2
    |
    |--15.99%-- GetMemIndv
    |--12.71%-- adIndBinarySearchBtree
    |--12.32%-- _ZN21QPDFOdometerRetriever15bsoProcessBlockEv

# Samples: 793 of event 'cycles'
# Event count (approx.): 586212636
#
# Overhead      sys     user  Samples       Period  CPU
# ........  .......  .......  .......  ...........  ...
#
    99.88%    1.16%   98.72%      792    585517258   10
    |
    |--15.38%-- _ZN21QPDFOdometerRetriever15bsoProcessBlockEv
    |--14.37%-- GetMemIndv
    |--13.10%-- adIndBinarySearchBtree

The script and environment are the same, but the percentages differ slightly. Just by chance (the program happened to be inside that function when a sample was taken), the GetMemIndv operation appears more often in the first run than in the second.

It's hard to say why dev is slower than production using only the cpu-cycles counter. And here other counters enter the scene (the default perf stat output is quite useful).

[root@dhyp-ess2e ~]# perf stat -p 5313 -a su - oracle -c "source /etc/hyperion/udf_env.sh; /var/lib/hyperion-paysys/MDX.py"

Performance counter stats for process id '5313':

       240.001044  task-clock                #  0.398 CPUs utilized
               23  context-switches          #  0.096 K/sec
                0  cpu-migrations            #  0.000 K/sec
              321  page-faults               #  0.001 M/sec
      571 213 068  cycles                    #  2.380 GHz                     [83.37%]
      361 027 497  stalled-cycles-frontend   # 63.20% frontend cycles idle    [83.39%]
      209 636 426  stalled-cycles-backend    # 36.70% backend cycles idle     [66.75%]
      506 542 313  instructions              #  0.89 insns per cycle
                                             #  0.71 stalled cycles per insn  [83.37%]
       95 915 575  branches                  # 399.646 M/sec                  [83.47%]
        2 075 617  branch-misses             #  2.16% of all branches         [83.28%]

      0.603318809 seconds time elapsed

[root@hyp-ess2f ~]# perf stat -p 20555 -a su - oracle -c "source /etc/hyperion/udf_env.sh; /var/lib/hyperion-paysys/MDX.py"

Performance counter stats for process id '20555':

       193.229461  task-clock                #  0.417 CPUs utilized
               21  context-switches          #  0.109 K/sec
                0  cpu-migrations            #  0.000 K/sec
              302  page-faults               #  0.002 M/sec
      580 768 762  cycles                    #  3.006 GHz
      395 362 647  stalled-cycles-frontend   # 68.08% frontend cycles idle
  <not supported>  stalled-cycles-backend
      520 716 586  instructions              #  0.90 insns per cycle
                                             #  0.76 stalled cycles per insn
       98 560 689  branches                  # 510.071 M/sec
        1 013 256  branch-misses             #  1.03% of all branches

      0.463141531 seconds time elapsed

In this output (definitions from http://man7.org/linux/man-pages/man2/perf_event_open.2.html):
task-clock – reports a clock count specific to the task that is running;
context-switches – counts context switches;
cpu-migrations – reports the number of times the process has migrated to a new CPU;
page-faults – reports the number of page faults;
cycles – total cycles;
stalled-cycles-frontend – stalled cycles during issue;
stalled-cycles-backend – stalled cycles during retirement;
instructions – retired instructions;
branches – retired branch instructions;
branch-misses – mispredicted branch instructions.

Here we can see that the number of CPU cycles was about the same, ~580 000 000, but production operates at a real frequency of 3000 MHz, much more than the expected 2700 MHz.
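(The GHz figure perf prints is simply cycles divided by task-clock: on production, 580 768 762 cycles / 193.229461 ms ≈ 3.006 GHz; on dev, 571 213 068 cycles / 240.001044 ms ≈ 2.38 GHz.)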

The answer to the question of why production is faster than test is simple: Intel Turbo Boost technology (http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html). When the processor temperature is low (which means a light load), the frequency can be increased for a short period of time.
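This can be verified from Linux: turbostat (shipped with kernel-tools) reports the actual MHz per core while a command runs, and, if the intel_pstate driver is in use, a sysfs knob shows whether turbo is enabled. A sketch:

turbostat sleep 10                                   # observed average MHz per core over 10 seconds
cat /sys/devices/system/cpu/intel_pstate/no_turbo    # 0 = turbo enabled, 1 = disabled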

And finally, one more experiment: I tried to find out whether the processor cache size matters.

Dev, processor E5645, 12 MB cache:

[root@dhyp-ess2e ~]# perf stat -e L1-dcache-load,L1-dcache-load-misses -p 5313 -a su - oracle -c "source /etc/hyperion/udf_env.sh; /var/lib/hyperion-paysys/MDX.py"

Performance counter stats for process id '5313':

      178 073 504  L1-dcache-load
        9 390 025  L1-dcache-load-misses    #  5.27% of all L1-dcache hits

Production, processor E5-2650 v2, 20 MB cache:

      184 801 000  L1-dcache-load
       10 114 188  L1-dcache-load-misses    #  5.47% of all L1-dcache hits
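(The percentage perf prints is just misses divided by loads: 9 390 025 / 178 073 504 ≈ 5.27% on dev, and 10 114 188 / 184 801 000 ≈ 5.47% on production, i.e. practically the same miss rate despite the much larger cache.)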

And returning to the question: “What is the best processor for Hybrid?”

The number of threads, the cache size, and the prediction algorithms do not matter much. The processor with the highest frequency will be the best.

And if we have a choice:
1. E7-8890 v4
Launch Date Q2’16
Processor Base Frequency 2.20 GHz
Max Turbo Frequency 3.40 GHz
Cache 60 MB
Threads 48
Price $7174.00
http://ark.intel.com/products/93790/Intel-Xeon-Processor-E7-8890-v4-60M-Cache-2_20-GHz
2. E7-8893 v4
Launch Date Q2’16
Processor Base Frequency 3.20 GHz
Max Turbo Frequency 3.50 GHz
Cache 60 MB
Threads 8
Price $6841
3. E5-2667
Launch Date Q1’12
Processor Base Frequency 2.90 GHz
Max Turbo Frequency 3.50 GHz
Cache 15 MB SmartCache
Threads 12
Price $1552
http://ark.intel.com/products/64589/Intel-Xeon-Processor-E5-2667-15M-Cache-2_90-GHz-8_00-GTs-Intel-QPI

I vote for the E5-2667 (don't be put off that it's four years old; with AMD effectively out of the server market, little has changed since 2012). It has almost the same frequency as the E7-8893 v4 but is much, much cheaper.

P.S. Don't forget that all we'll achieve by changing the E5645 to the E5-2667 is a 25-30% performance increase, while software changes could improve performance by 1000%.

P.P.S. One could ask: "Why don't we see a big difference between processors with various cache sizes?" Part of the answer is that it depends on how the code was written; see for example http://www.cse.ust.hk/damon2011/proceedings/p5-sompolski.pdf