Meta-X86 - Evgeny Budilovsky's Blog

The blog has moved to budevg.github.io (2016-01-14): I've decided to move my blog to <a href="http://budevg.github.io/">budevg.github.io</a>

CPU Cache Essentials (2014-06-25)

This post came to my mind after watching Scott Meyers's excellent presentation "CPU Caches and Why You Care". It tries to summarize the ideas of the talk, so if you have some spare time you can just watch the <a href="http://vimeo.com/97337258">video</a>.<br />
<br />
To emphasize the importance of CPU caches in our daily work, we start with two examples:<br />
<br />
The first problem is simple traversal of a two-dimensional array. In a C-like language we can traverse the array row by row, or alternatively column by column.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #6ab825; font-weight: bold;">uint64_t</span> <span style="color: #d0d0d0;">matrix[ROWS][COLUMNS];</span>
<span style="color: #6ab825; font-weight: bold;">uint64_t</span> <span style="color: #d0d0d0;">i,j;</span>
<span style="color: #6ab825; font-weight: bold;">uint64_t</span> <span style="color: #d0d0d0;">ret</span> <span style="color: #d0d0d0;">=</span> <span style="color: #3677a9;">0</span><span style="color: #d0d0d0;">;</span>
<span style="color: #999999; font-style: italic;">/* row by row */</span>
<span style="color: #6ab825; font-weight: bold;">for</span> <span style="color: #d0d0d0;">(i</span> <span style="color: #d0d0d0;">=</span> <span style="color: #3677a9;">0</span><span style="color: #d0d0d0;">;</span> <span style="color: #d0d0d0;">i</span> <span style="color: #d0d0d0;"><</span> <span style="color: #d0d0d0;">ROWS;</span> <span style="color: #d0d0d0;">i++)</span>
<span style="color: #6ab825; font-weight: bold;">for</span> <span style="color: #d0d0d0;">(j</span> <span style="color: #d0d0d0;">=</span> <span style="color: #3677a9;">0</span><span style="color: #d0d0d0;">;</span> <span style="color: #d0d0d0;">j</span> <span style="color: #d0d0d0;"><</span> <span style="color: #d0d0d0;">COLUMNS;</span> <span style="color: #d0d0d0;">j++)</span>
<span style="color: #d0d0d0;">ret</span> <span style="color: #d0d0d0;">+=</span> <span style="color: #d0d0d0;">matrix[i][j];</span>
<span style="color: #999999; font-style: italic;">/* column by column */</span>
<span style="color: #6ab825; font-weight: bold;">for</span> <span style="color: #d0d0d0;">(j</span> <span style="color: #d0d0d0;">=</span> <span style="color: #3677a9;">0</span><span style="color: #d0d0d0;">;</span> <span style="color: #d0d0d0;">j</span> <span style="color: #d0d0d0;"><</span> <span style="color: #d0d0d0;">COLUMNS;</span> <span style="color: #d0d0d0;">j++)</span>
<span style="color: #6ab825; font-weight: bold;">for</span> <span style="color: #d0d0d0;">(i</span> <span style="color: #d0d0d0;">=</span> <span style="color: #3677a9;">0</span><span style="color: #d0d0d0;">;</span> <span style="color: #d0d0d0;">i</span> <span style="color: #d0d0d0;"><</span> <span style="color: #d0d0d0;">ROWS;</span> <span style="color: #d0d0d0;">i++)</span>
<span style="color: #d0d0d0;">ret</span> <span style="color: #d0d0d0;">+=</span> <span style="color: #d0d0d0;">matrix[i][j];</span>
</pre>
</div>
<br />
Strangely, for large arrays (> 10MB) traversing column by column leads to terrible performance.<br />
<br />
The second problem is parallel processing of data in a large array. We divide the array into X chunks and process each chunk in a separate thread; for example, we want to count the number of bytes that are set to 1. The following implementation doesn't scale when we run it on machines with more and more cores.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #6ab825; font-weight: bold;">char</span> <span style="color: #d0d0d0;">array[SIZE_10_MB];</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">X</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">NUM_OF_CORES;</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">results[X];</span>
<span style="color: #6ab825; font-weight: bold;">void</span> <span style="color: #447fcf;">chunk_worker</span><span style="color: #d0d0d0;">(</span><span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">index)</span>
<span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">i;</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">work_size</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0; line-height: 125%;">SIZE_10_MB</span><span style="color: #d0d0d0; line-height: 125%;">/X;</span>
<span style="color: #6ab825; font-weight: bold;">for</span> <span style="color: #d0d0d0;">(i</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">work_size</span> <span style="color: #d0d0d0;">*</span> <span style="color: #d0d0d0;">index;</span> <span style="color: #d0d0d0;">i</span> <span style="color: #d0d0d0;"><</span> <span style="color: #d0d0d0;">work_size</span> <span style="color: #d0d0d0;">*</span> <span style="color: #d0d0d0;">(index</span> <span style="color: #d0d0d0;">+</span> <span style="color: #3677a9;">1</span><span style="color: #d0d0d0;">);</span> <span style="color: #d0d0d0;">i++)</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">if</span> <span style="color: #d0d0d0;">(array[i]</span> <span style="color: #d0d0d0;">==</span> <span style="color: #3677a9;">1</span><span style="color: #d0d0d0;">)</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #d0d0d0;">results[index]</span> <span style="color: #d0d0d0;">+=</span> <span style="color: #3677a9; line-height: 125%;">1</span><span style="color: #d0d0d0; line-height: 125%;">;</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #d0d0d0;">}</span>
</pre>
</div>
<br />
This weird behavior can be explained once we learn about CPU caches.<br />
<br />
CPU caches are small amounts of unusually fast memory. A regular CPU has three types of caches:<br />
<ul>
<li>D-cache - cache used to store data</li>
<li>I-cache - cache used to store code (instructions)</li>
<li>TLB - cache used to store virtual-to-physical memory address translations</li>
</ul>
<div>
These caches are arranged in a typical 3 layer hierarchy:</div>
<br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"> <span style="color: #d0d0d0;">typical</span> <span style="color: #d0d0d0;">i7-</span><span style="color: #3677a9;">9</span><span style="color: #d0d0d0;">xx</span> <span style="color: #d0d0d0;">(</span><span style="color: #3677a9;">4</span> <span style="color: #d0d0d0;">cores)</span> <span style="color: #d0d0d0;">example</span>
<span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">shared</span> <span style="color: #d0d0d0;">by</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">shared</span> <span style="color: #d0d0d0;">by</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">|</span>
<span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">I/D</span> <span style="color: #d0d0d0;">cache</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">cores</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">hw</span> <span style="color: #d0d0d0;">threads</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">latency</span> <span style="color: #d0d0d0;">|</span>
<span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">L1</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">32</span><span style="color: #d0d0d0;">KB/</span><span style="color: #3677a9;">32</span><span style="color: #d0d0d0;">KB</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">1</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">2</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">4</span> <span style="color: #d0d0d0;">cycles</span> <span style="color: #d0d0d0;">|</span>
<span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">L2</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">256</span><span style="color: #d0d0d0;">KB</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">1</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">2</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">11</span> <span style="color: #d0d0d0;">cycles</span> <span style="color: #d0d0d0;">|</span>
<span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">L3</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">8</span><span style="color: #d0d0d0;">MB</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">4</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">8</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">39</span> <span style="color: #d0d0d0;">cycles</span> <span style="color: #d0d0d0;">|</span>
<span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">RAM</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">|</span> <span style="color: #d0d0d0;">|</span> <span style="color: #3677a9;">107</span> <span style="color: #d0d0d0;">cycles</span> <span style="color: #d0d0d0;">|</span>
</pre>
</div>
<br />
For example, L2 is a 256KB chunk of fast memory (11-cycle access time) that caches both data and instructions and is shared by the 2 hardware threads on a single core.<br />
<br />
<a href="http://1.bp.blogspot.com/-a9wT7NnLLew/U6nCe2QGiII/AAAAAAAAB48/gzg8fuuSlf8/s1600/georgie-pix-lightning.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://1.bp.blogspot.com/-a9wT7NnLLew/U6nCe2QGiII/AAAAAAAAB48/gzg8fuuSlf8/s1600/georgie-pix-lightning.jpg" height="200" width="133" /></a><i>By the way, there is one type of memory we didn't mention that beats all of these layers: the CPU registers.</i><br />
<i><br /></i>
Now, when we talk about making your programs fast and furious, the only thing that really matters is how well you fit into the cache hierarchy. It won't even matter that your machine has 64GB of RAM. In the hardware world smaller is faster: compact code and data structures will always be fastest.<br />
<br />
Since access to main memory is so expensive, the hardware brings in a whole chunk of memory at a time, called a cache line. A typical cache line is 64 bytes, so each time we read one byte of memory, 64 bytes of data enter our cache (and probably evict some other 64 bytes). Writing one byte of memory will <i>eventually</i> lead to writing 64 bytes back to memory.<br />
<br />
<a href="http://3.bp.blogspot.com/-6GbJT6Ri-Dk/U6rumH47AyI/AAAAAAAAB-k/TEOPOv5Zn7s/s1600/computer-aplications.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-6GbJT6Ri-Dk/U6rumH47AyI/AAAAAAAAB-k/TEOPOv5Zn7s/s1600/computer-aplications.gif" height="200" width="164" /></a>One interesting thing about cache lines is that the hardware is <i>pretty smart</i>: it prefetches the next cache line once it detects forward or backward traversal.<br />
<br />
Thinking back to the first problem of traversing a matrix, it is now clear why the column-by-column case performs so badly. When we traverse column by column we do not use each cache line effectively: we bring in a complete cache line just to access one element, and by the time we would access the next element of that line, the line has already been evicted from the small cache.<br />
<br />
Reasoning about the coherency of the different caches is an impossible task. Luckily we don't have to reason too much: the hardware takes care of synchronization as long as we use proper synchronization primitives (high-level mutexes, read/write barriers, etc.). Unfortunately, this simplification comes with a <b>cost</b>: <b>TIME</b>. The hardware spends precious time on synchronization, which reduces the performance of your program.<br />
<br />
<br />
<a href="http://1.bp.blogspot.com/-9wtm5TjcVvY/U6nCe-xfC8I/AAAAAAAAB5A/6f_kecOwKaY/s1600/sharing.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://1.bp.blogspot.com/-9wtm5TjcVvY/U6nCe-xfC8I/AAAAAAAAB5A/6f_kecOwKaY/s1600/sharing.jpg" height="128" width="200" /></a>Another effect of CPU caches is called "<b>False Sharing</b>". Suppose core 0 reads address A and core 1 writes to address A+1. Since A and A+1 occupy the same cache line, the hardware has to synchronize the caches by constantly invalidating the cache line and fetching it back. This is exactly what happens in problem 2, where:<br />
<br />
<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #d0d0d0;">results[index]</span> <span style="color: #d0d0d0;">+=</span> <span style="color: #3677a9;">1</span><span style="color: #d0d0d0;">;</span>
</pre>
</div>
<br />
invalidates the cache line on each increment.<br />
<br />
A quick fix, using a local variable to maintain each thread's result and writing it out once at the end, leads to a performance boost.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #6ab825; font-weight: bold;">char</span> <span style="color: #d0d0d0;">array[SIZE_10_MB];</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">X</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">NUM_OF_CORES;</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">results[X];</span>
<span style="color: #6ab825; font-weight: bold;">void</span> <span style="color: #447fcf;">chunk_worker</span><span style="color: #d0d0d0;">(</span><span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">index)</span>
<span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">i;</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">sum</span> <span style="color: #d0d0d0;">=</span> <span style="color: #3677a9;">0</span><span style="color: #d0d0d0;">;</span>
<span style="color: #6ab825; font-weight: bold;">int</span> <span style="color: #d0d0d0;">work_size</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">SIZE_10_MB/X;</span>
<span style="color: #6ab825; font-weight: bold;">for</span> <span style="color: #d0d0d0;">(i</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">work_size</span> <span style="color: #d0d0d0;">*</span> <span style="color: #d0d0d0;">index;</span> <span style="color: #d0d0d0;">i</span> <span style="color: #d0d0d0;"><</span> <span style="color: #d0d0d0;">work_size</span> <span style="color: #d0d0d0;">*</span> <span style="color: #d0d0d0;">(index</span> <span style="color: #d0d0d0;">+</span> <span style="color: #3677a9;">1</span><span style="color: #d0d0d0;">);</span> <span style="color: #d0d0d0;">i++)</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">if</span> <span style="color: #d0d0d0;">(array[i]</span> <span style="color: #d0d0d0;">==</span> <span style="color: #3677a9;">1</span><span style="color: #d0d0d0;">)</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #d0d0d0;">sum</span> <span style="color: #d0d0d0;">+=</span> <span style="color: #3677a9;">1</span><span style="color: #d0d0d0;">;</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #d0d0d0;">results[index]</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">sum;</span>
<span style="color: #d0d0d0;">}</span>
</pre>
</div>
<br />
To conclude, here are some tips you can use to boost performance by being aware of CPU cache tradeoffs:<br />
<br />
<h4>
Data cache tips:</h4>
<ul>
<li>Use linear array traversal. The hardware will often prefetch the data, so the speedup can be substantial</li>
<ul>
<li>Use as much of each cache line as possible. For example, in the following code, whenever the <i>else</i> clause is taken we throw away a complete cache line that was fetched just to read the <i>is_alive</i> member. A solution could be to ensure that most objects in the vector are alive (for example, by compacting dead objects out).</li>
</ul>
</ul>
<div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #6ab825; font-weight: bold;">struct</span> <span style="color: #d0d0d0;">Obj</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">bool</span> <span style="color: #d0d0d0;">is_alive;</span>
<span style="color: #d0d0d0;">...</span>
<span style="color: #d0d0d0;">};</span>
<span style="color: #d0d0d0;">std::vector<Obj></span> <span style="color: #d0d0d0;">objs;</span>
<span style="color: #6ab825; font-weight: bold;">for</span> <span style="color: #d0d0d0;">(</span><span style="color: #6ab825; font-weight: bold;">auto</span> <span style="color: #d0d0d0;">o:</span> <span style="color: #d0d0d0;">objs)</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">if</span> <span style="color: #d0d0d0;">(o.is_alive)</span>
<span style="color: #d0d0d0;">do_stuff(o);</span>
<span style="color: #6ab825; font-weight: bold;">else</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #999999; font-style: italic;">// just thrown a cache line</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #d0d0d0;">}</span></pre>
</div>
</div>
<ul>
<li>Be alert for false sharing in multi-core systems</li>
</ul>
<h4>
Code cache tips:</h4>
<div style="orphans: auto; text-align: start; text-indent: 0px; widows: auto;">
<ul>
<li>Avoid iterating over a heterogeneous sequence of objects with virtual calls. If we have such a sequence, the best thing is to sort it by type, so that executing the virtual function of one object fetches code that can be reused by the next object.</li>
<li>Make fast paths using branch-free sequences of code</li>
<li>Inline cautiously. </li>
<ul>
<li>Pros: fewer branches (which leads to a speedup), and more compiler optimizations become possible</li>
<li>Cons: code duplication reduces the effective size of the code cache</li>
</ul>
<li>Use profile-guided optimization (PGO) and whole-program optimization (WPO) tools; these automatic tools will help you optimize your code</li>
</ul>
</div>
<br />
<br />
<br />

Building Distributed Cache with GlusterFS - not a good idea (2014-02-12)

Cache is a popular concept. You can't survive in the cruel world of programming without using a cache. There is a cache inside your CPU. There is one inside your browser. Your favorite web site serves pages from a cache, and you probably have one in your brain as well :)<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-0GIQddDAnbs/Uvvq8RTjFVI/AAAAAAAABmE/BNy25Ho_88o/s1600/calmclearcache.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-0GIQddDAnbs/Uvvq8RTjFVI/AAAAAAAABmE/BNy25Ho_88o/s1600/calmclearcache.jpg" height="200" width="142" /></a></div>
<br />
<br />
Anyway, my mission was to build a distributed cache that could be used to speed up application X. The cache needed to be scalable, extendable by adding more compute nodes, built from cheap Amazon instances, and able to provide very low latency (time until the first byte of data arrives of around 5ms).<br />
<br />
Distributed file systems are a very cool piece of technology, so naturally it seemed like a good idea to use one to build this cache. After reading about the Ceph vs. Gluster wars I decided to settle on Gluster. I very much liked the idea of translators, which add functionality layer by layer: each translator is a simple unit that handles one task of a complicated file system.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-4F-eVPvrtQU/UvvryozPm9I/AAAAAAAABmQ/YSRLicOQuPQ/s1600/20120221105324808-f2df3ea3e3aeab8_250.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://4.bp.blogspot.com/-4F-eVPvrtQU/UvvryozPm9I/AAAAAAAABmQ/YSRLicOQuPQ/s1600/20120221105324808-f2df3ea3e3aeab8_250.png" /></a></div>
<br />
The DHT translator in Gluster distributes files between the different nodes. Later, when a client wants to read a file, it can locate the correct node with an O(1) hash computation and read the file directly from the node where it lives.<br />
<br />
On Amazon, I quickly created 5 m1.large instances and installed glusterfs. The installation is pretty easy. I used my favorite tool for repeating tasks on multiple compute nodes (<a href="http://docs.fabfile.org/">fabric</a>), and after a while installing X nodes became as simple as running:<br />
<br />
<div style="background: #f8f8f8; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">>> fab -H hostname1,hostname2,hostnamexx -P -f gluster.py install_gluster create_volume create_mount
</pre>
</div>
<br />
Next, I created a distributed glusterfs volume consisting of the 5 nodes:<br />
<br />
<div style="background: #f8f8f8; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">>> gluster volume create cache hostname1:/brick1/cache hostname2:/brick1/cache hostname3:/brick1/cache hostname4:/brick1/cache hostname5:/brick1/cache
</pre>
</div>
<br />
I modified application X's code to try the remote cache, and the algorithm became something similar to:<br />
<br />
<div style="background: #f8f8f8; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">fd <span style="color: #666666;">=</span> open(<span style="color: #bb4444;">"/local_cache/file1/block1.bin"</span>, O_RDONLY);
<span style="color: #aa22ff; font-weight: bold;">if</span> (fd <span style="color: #666666;">==</span> <span style="color: #666666;">-1</span>) {
fd <span style="color: #666666;">=</span> open(<span style="color: #bb4444;">"/gluster/file1/block1.bin"</span>, O_RDONLY);
...
pread(fd, ....)
</pre>
</div>
<br />
And it worked well while there was a small number of application X instances (around 10). However, when the number of instances was increased (8 servers with 15 instances each, a total of 120 applications continuously accessing glusterfs through the FUSE interface), access to the mounted glusterfs file system became very slow.<br />
<br />
Looking at the latency of glusterfs operations on the client side (thank goodness gluster has a translator, called <b>debug/io-stats</b>, which aggregates statistics), I noticed that some lookup operations would take 10-20 seconds! Lookup should be a very light operation: using the DHT translator, the client determines the exact server where the file should be located and queries it directly, which distributes the load between the 5 nodes. Each time I open a file, gluster sends a lookup command to determine where the file is located, and afterwards I can read or write it without a problem.<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">Fop Call Count Avg-Latency Min-Latency Max-Latency
--- ---------- ----------- ----------- -----------
STAT 58 2133.43 us 1.00 us 62789.00 us
MKDIR 112 1009804.77 us 18.00 us 14452294.00 us
OPEN 149 22108.78 us 2.00 us 1778858.00 us
READ 151 160008.16 us 7.00 us 7246801.00 us
FLUSH 149 13178.94 us 1.00 us 1594615.00 us
SETXATTR 2633 232459.26 us 719.00 us 10749344.00 us
LOOKUP 4725 <b>641604.43 us</b> 2.00 us <b>17630043.00 us</b>
FORGET 1 0 us 0 us 0 us
RELEASE 149 0 us 0 us 0 us
------ ----- ----- ----- ----- ----- --- ----- ----- ----- ---
</pre>
</div>
<br />
So why do lookups take so much time? I realized I needed to look at the network level. Luckily glusterfs has a <a href="https://github.com/nixpanic/gluster-wireshark-1.4">plugin for wireshark</a> which can dissect traffic and show glusterfs operations.<br />
<br />
Looking at the captured packets, I saw that when we access the file /file1/block1.bin, gluster invokes 3 lookup operations: first for the directory <b>/</b>, second for the sub-directory <b>file1</b>, and then for the file <b>block1.bin</b>.<br />
<br />
Now here is the bad news: when performing a lookup for a directory, gluster broadcasts the lookup request to <b>all</b> nodes and then waits until every node responds. In my case one of the nodes (not the one where the actual file sits) would be super busy and would respond only after 5 seconds, so the lookup request waits for 5 seconds before it can continue and actually communicate with the node where the file is located. This bad situation repeats for each directory component of the path and leads to awful lookup performance.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-2ucKmQ0RyXI/UvvwNP7UsvI/AAAAAAAABmc/fygijMCQ-Rs/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-2ucKmQ0RyXI/UvvwNP7UsvI/AAAAAAAABmc/fygijMCQ-Rs/s1600/1.png" height="489" width="640" /></a></div>
<br />
<br />
Tweaking glusterfs options, such as setting dht.lookup-unhashed=off, didn't help, since this is only relevant for the file component, and we still need to traverse at least two directory components before we reach the file component of the path.<br />
<br />
So in my opinion this behavior makes gluster very unsuitable for functioning as a cache, where checking whether data is present (lookup) is a frequent operation. I guess hacking gluster and modifying the DHT translator could solve the issue (and introduce <a href="https://twitter.com/irqed/status/358212928404586498">many other interesting issues</a> :)<br />
<br />
But that would be another interesting post...

Coroutines for the greater good (2013-06-10)

Functions are the basic building blocks of our daily code. This mechanism provides an extremely simple and powerful abstraction. We use the stack memory as our main workspace, storing local variables and the return address there. When a function returns, it pops the return address from the stack and jumps to it, allowing the previous function to continue executing.<br />
<br />
Once the function returns, all its internal state (local variables) is destroyed, and on the next invocation we basically start from scratch. We can work around this with some clever techniques, but there is a better solution.<br />
<br />
Coroutines allow us to return from a function while preserving its current context. A coroutine can yield, passing execution to another function, which may later decide to call the original function again; when this happens, execution resumes from the point of the yield.<br />
<br />
A simple example for coroutines is the producer-consumer pattern, which can be implemented like this:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #6ab825; font-weight: bold;">void</span> <span style="color: #d0d0d0;">coroutine__</span> <span style="color: #447fcf;">producer</span><span style="color: #d0d0d0;">()</span>
<span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">while</span> <span style="color: #d0d0d0;">(!queue_is_full(q))</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #d0d0d0;">item_t</span> <span style="color: #d0d0d0;">*item</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">new_item();</span>
<span style="color: #d0d0d0;">queue_insert(q,</span> <span style="color: #d0d0d0;">item);</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #d0d0d0;">coroutine_yield(consumer);</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #6ab825; font-weight: bold;">void</span> <span style="color: #d0d0d0;">coroutine__</span> <span style="color: #447fcf;">consumer</span><span style="color: #d0d0d0;">()</span>
<span style="color: #d0d0d0;">{</span>
<span style="color: #6ab825; font-weight: bold;">while</span> <span style="color: #d0d0d0;">(!queue_is_empty(q))</span> <span style="color: #d0d0d0;">{</span>
<span style="color: #d0d0d0;">item_t</span> <span style="color: #d0d0d0;">*item</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">queue_remove(q);</span>
<span style="color: #d0d0d0;">consume_item(item);</span>
<span style="color: #d0d0d0;">}</span>
<span style="color: #d0d0d0;">coroutine_yield(producer);</span>
<span style="color: #d0d0d0;">}</span>
</pre>
</div>
<br />
One interesting use of coroutines appears in the <a href="http://wiki.qemu.org/">qemu</a> project. Qemu is the workhorse of virtualization infrastructure and is used extensively to provide device emulation for hypervisors such as <a href="http://www.linux-kvm.org/page/Main_Page">kvm</a> and <a href="https://en.wikipedia.org/wiki/Xen">xen</a>.<br />
<br />
In January 2011, coroutines were <a href="http://git.qemu.org/?p=qemu.git;a=commit;h=00dccaf1f848290d979a4b1e6248281ce1b32aaa">introduced</a> into the qemu code base. The authors wanted to use the power of coroutines to overcome some of the complexity involved in handling asynchronous calls. Before coroutines, many operations that could block were performed asynchronously using callbacks. This led to complicated code, since logical chunks of code were split into single steps, each implemented as a separate callback. Temporary structures were used to pass parameters to the callbacks, and the whole picture looked complicated and unreadable.<br />
<br />
With the introduction of coroutines, a simple interface was established for invoking cooperative functions:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #202020; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #999999; font-style: italic;">/* Creating and starting a coroutine is easy: */</span>
<span style="color: #d0d0d0;">coroutine</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">qemu_coroutine_create(my_coroutine);</span>
<span style="color: #d0d0d0;">qemu_coroutine_enter(coroutine,</span> <span style="color: #d0d0d0;">my_data);</span>
<span style="color: #999999; font-style: italic;">/* The coroutine then executes until it returns or yields: */</span>
<span style="color: #6ab825; font-weight: bold;">void</span> <span style="color: #d0d0d0;">coroutine_fn</span> <span style="color: #447fcf;">my_coroutine</span><span style="color: #d0d0d0;">(</span><span style="color: #6ab825; font-weight: bold;">void</span> <span style="color: #d0d0d0;">*opaque)</span>
<span style="color: #d0d0d0;">{</span>
<span style="color: #d0d0d0;">my_data_t</span> <span style="color: #d0d0d0;">*my_data</span> <span style="color: #d0d0d0;">=</span> <span style="color: #d0d0d0;">opaque;</span>
<span style="color: #999999; font-style: italic;">/* do some work */</span>
<span style="color: #d0d0d0;">qemu_coroutine_yield();</span>
<span style="color: #999999; font-style: italic;">/* do some more work */</span>
<span style="color: #d0d0d0;">}</span>
</pre>
</div>
<br />
There are several implementations of the coroutine semantics inside qemu:<br />
<br />
<ul>
<li><b>coroutine-ucontext.c</b> - implements coroutines using the <b>swapcontext</b> call and the <b>sigsetjmp/siglongjmp</b> calls. The former creates and switches to a new stack, and the latter calls are used to jump from one context to another.</li>
<li><b>coroutine-gthread.c</b> - uses glib threads to create an abstraction of coroutines. The actual switch between coroutines stops the current thread and activates the next thread, which executes the cooperative task.</li>
<li><b>coroutine-win32.c</b> - the implementation on Windows, where coroutines are easy to implement using the fiber API calls.</li>
<li><b>coroutine-sigaltstack.c</b> - unlike the ucontext implementation, here the <b>sigaltstack</b> system call is used to switch stacks and create a context, which is later entered using <b>sigsetjmp/siglongjmp</b>.</li>
</ul>
<div>
Different programming languages have different implementations of the coroutine abstraction. One of the best resources on coroutines in Python is the "<a href="http://dabeaz.com/coroutines/">A Curious Course on Coroutines and Concurrency</a>" tutorial.</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3944306087401582339.post-73928063582460535392012-12-01T11:13:00.001-08:002012-12-01T11:13:32.213-08:00Using performance counters on linuxOne extremely useful feature linux has to offer is the ability to profile your system (user space and kernel) using the <a href="https://perf.wiki.kernel.org/index.php/Main_Page">perf</a> utility.<br />
<br />
In a nutshell, this utility allows you to count hardware and software events in the linux kernel. An additional bonus is that when you count these events, you can record the <a href="http://en.wikipedia.org/wiki/Instruction_pointer">cpu instruction pointer</a> at the time of each event. These instruction pointer records can later be used to generate a concise execution profile of kernel and user space code.<br />
<br />
Similarly to git, perf uses sub-commands to expose its various functionality:<br />
<br />
The first step is to list available events:<br />
<br />
<div style="background: #202020; background: black; border-width: .1em .1em .1em .8em; border: solid gray; color: white; font-size: 12px; font-size: 12px; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #aaaaaa;">root@dev-12:#</span> perf list
<span style="color: #cccccc;">List of pre-defined events (to be used in -e):</span>
<span style="color: #cccccc;"> cpu-cycles OR cycles [Hardware event]</span>
<span style="color: #cccccc;"> stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]</span>
<span style="color: #cccccc;"> stalled-cycles-backend OR idle-cycles-backend [Hardware event]</span>
<span style="color: #cccccc;"> instructions [Hardware event]</span>
<span style="color: #cccccc;"> cache-references [Hardware event]</span>
<span style="color: #cccccc;"> cache-misses [Hardware event]</span>
<span style="color: #cccccc;"> branch-instructions OR branches [Hardware event]</span>
<span style="color: #cccccc;"> branch-misses [Hardware event]</span>
<span style="color: #cccccc;"> bus-cycles [Hardware event]</span>
<span style="color: #cccccc;"> cpu-clock [Software event]</span>
<span style="color: #cccccc;"> task-clock [Software event]</span>
<span style="color: #cccccc;"> page-faults OR faults [Software event]</span>
</pre>
</div>
<br />
There are tons of events which we can divide into the following main categories:<br />
<ul>
<li>Hardware events</li>
<li>Hardware cache events</li>
<li>Software events</li>
<li>Tracepoint events</li>
</ul>
<div>
Tracepoint events are special places in the kernel, marked by developers as good positions to trace. Stopping there usually brings you to the location where some important kernel function starts or completes.</div>
<div>
<br /></div>
<div>
For example, the <i>block:block_rq_complete</i> tracepoint is hit when a block i/o request completes.</div>
<div>
<br /></div>
<div>
Hardware events make use of special cpu hardware registers, which count cpu-specific hardware events and trigger an interrupt when a certain threshold is crossed. Software events do not require special hardware support and are usually generated by kernel handlers which process special events such as page faults.</div>
<div>
<br /></div>
<div>
To watch statistics of events in the system, you can use the <i>stat</i> command:</div>
<div>
<br /></div>
<div style="background: #202020; background: black; border-width: .1em .1em .1em .8em; border: solid gray; color: white; font-size: 12px; font-size: 12px; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #aaaaaa;">root@dev-12:#</span> perf stat -e syscalls:sys_enter_write,page-faults \</pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #24909d;"> echo </span>hello world > 1.txt
<span style="color: #cccccc;"> Performance counter stats for 'echo hello world':</span>
<span style="color: #cccccc;"> 1 syscalls:sys_enter_write </span>
<span style="color: #cccccc;"> 165 page-faults </span>
<span style="color: #cccccc;"> 0.001013974 seconds time elapsed</span>
</pre>
</div>
<br />
Filtering is done with the <i>-e</i> flag, and it is also possible to count events on an existing process using the <i>-p</i> flag.<br />
<br />
Events can also be recorded for later analysis using the <i>record </i>command:<br />
<br />
<div style="background: #202020; background: black; border-width: .1em .1em .1em .8em; border: solid gray; color: white; font-size: 12px; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #aaaaaa;">root@dev-12:#</span> perf record -f hdparm -t /dev/vda > /dev/null
<span style="color: #cccccc;">[ perf record: Woken up 1 times to write data ]</span>
<span style="color: #cccccc;">[ perf record: Captured and wrote 0.038 MB perf.data (~1673 samples) ]</span></pre>
</div>
<br />
Later, a code execution profile can be created using the <i>report </i>command (each time an event occurs, the current instruction pointer is recorded as well):<br />
<br />
<div style="background: #202020; background: black; border-width: .1em .1em .1em .8em; border: solid gray; color: white; font-size: 12px; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #aaaaaa;">root@dev-12:/home/local/a/hdparm-9.37#</span> perf report
<span style="color: #aaaaaa;">#</span> Events: 1K cpu-clock
<span style="color: #aaaaaa;">#</span>
<span style="color: #aaaaaa;">#</span> Overhead Command Shared Object Symbol
<span style="color: #aaaaaa;">#</span> ........ ....... ................. ....................................
<span style="color: #aaaaaa;">#</span>
<span style="color: #cccccc;"> 25.80% hdparm [kernel.kallsyms] [k] copy_user_generic_string</span>
<span style="color: #cccccc;"> 8.21% hdparm [kernel.kallsyms] [k] blk_flush_plug_list</span>
<span style="color: #cccccc;"> 4.69% hdparm [kernel.kallsyms] [k] get_page_from_freelist</span>
<span style="color: #cccccc;"> 3.62% hdparm [kernel.kallsyms] [k] add_to_page_cache_locked</span>
<span style="color: #cccccc;"> 3.09% hdparm hdparm [.] read_big_block</span>
<span style="color: #cccccc;"> 2.77% hdparm [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore</span>
<span style="color: #cccccc;"> 2.51% hdparm [kernel.kallsyms] [k] kmem_cache_alloc</span>
<span style="color: #cccccc;"> 1.87% hdparm [kernel.kallsyms] [k] __mem_cgroup_commit_charge</span>
<span style="color: #cccccc;"> 1.76% hdparm [kernel.kallsyms] [k] file_read_actor</span>
<span style="color: #cccccc;"> 1.76% hdparm [kernel.kallsyms] [k] __alloc_pages_nodemask</span>
<span style="color: #cccccc;"> 1.55% hdparm [kernel.kallsyms] [k] alloc_pages_current</span>
</pre>
</div>
<br />
We can also restrict the report to our own binary and see where our program spends most of its time:<br />
<br />
<div style="background: #202020; background: black; border-width: .1em .1em .1em .8em; border: solid gray; color: white; font-size: 12px; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #aaaaaa;">root@dev-12:#</span> perf report -d hdparm
<span style="color: #aaaaaa;">#</span>
<span style="color: #aaaaaa;">#</span> Events: 64 cpu-clock
<span style="color: #aaaaaa;">#</span>
<span style="color: #aaaaaa;">#</span> Overhead Command Symbol
<span style="color: #aaaaaa;">#</span> ........ ....... ..............
<span style="color: #aaaaaa;">#</span>
<span style="color: #cccccc;"> 90.62% hdparm read_big_block</span>
<span style="color: #cccccc;"> 9.38% hdparm time_device</span>
</pre>
</div>
<br />
Also, by default the cpu-clock event is used as the sampling point for looking at code execution. Using different events can reveal interesting things, for example, the places in our code that cause page faults:<br />
<br />
<div style="background: #202020; background: black; border-width: .1em .1em .1em .8em; border: solid gray; color: white; font-size: 12px; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #aaaaaa;">root@dev-12:#</span> perf record -f -e page-faults -F 100000 hdparm -t /dev/vda > /dev/null
<span style="color: #cccccc;">[ perf record: Woken up 1 times to write data ]</span>
<span style="color: #cccccc;">[ perf record: Captured and wrote 0.006 MB perf.data (~278 samples) ]</span>
<span style="color: #aaaaaa;">root@dev-12:#</span> perf report
<span style="color: #aaaaaa;">#</span>
<span style="color: #aaaaaa;">#</span> Events: 87 page-faults
<span style="color: #aaaaaa;">#</span>
<span style="color: #aaaaaa;">#</span> Overhead Command Shared Object Symbol
<span style="color: #aaaaaa;">#</span> ........ ....... ................. ........................
<span style="color: #aaaaaa;">#</span>
<span style="color: #cccccc;"> 72.05% hdparm hdparm [.] prepare_timing_buf</span>
<span style="color: #cccccc;"> 10.96% hdparm ld-2.15.so [.] 0x16b0 </span>
<span style="color: #cccccc;"> 5.34% hdparm libc-2.15.so [.] 0x86fc0 </span>
<span style="color: #cccccc;"> 2.53% hdparm [kernel.kallsyms] [k] copy_user_generic_string</span>
<span style="color: #cccccc;"> 1.26% hdparm libc-2.15.so [.] __ctype_init</span>
<span style="color: #cccccc;"> 1.26% hdparm libc-2.15.so [.] _IO_vfscanf</span>
<span style="color: #cccccc;"> 1.26% hdparm libc-2.15.so [.] strchrnul</span>
<span style="color: #cccccc;"> 1.26% hdparm libc-2.15.so [.] mmap64</span>
<span style="color: #cccccc;"> 1.26% hdparm hdparm [.] main</span>
<span style="color: #cccccc;"> 1.26% hdparm hdparm [.] get_dev_geometry</span>
<span style="color: #cccccc;"> 1.26% hdparm [kernel.kallsyms] [k] __strncpy_from_user</span>
<span style="color: #cccccc;"> 0.28% hdparm [kernel.kallsyms] [k] __clear_user</span>
</pre>
</div>
<br />
To conclude, the <i>perf </i>utility is an extremely powerful tool. It provides accurate statistics with very little overhead, doesn't require recompilation, and can be used to fully understand the overheads of kernel modules and user space applications.<br />
<br />
What a nice addition to the arsenal of your favorite hacking tools :)<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3944306087401582339.post-73845431647996606802012-11-24T04:49:00.001-08:002013-08-29T22:59:42.719-07:00Software Craftsmanship vs HackersRecently it came to my mind that there are two prime schools of programmers in the world. These two groups have opposite values and possess orthogonal skills.<br />
<br />
The first group, called "<a href="http://manifesto.softwarecraftsmanship.org/">Software craftsmanship</a>", is focused on the art of writing beautiful and flexible code. The talented people who belong to this group build beautiful software designs, practice <a href="http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd">TDD</a>/<a href="http://behaviour-driven.org/">BDD</a>, and refactor their code until the last code smell disappears. The role models for this group are people like <a href="https://sites.google.com/site/unclebobconsultingllc/">Uncle Bob</a>, and they have all read the Pragmatic Programmer and Clean Code books.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-O8BqNknDjM8/UiA0lai9Z9I/AAAAAAAABd4/k5NlAOY9sFc/s1600/craftsmanship_en.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-O8BqNknDjM8/UiA0lai9Z9I/AAAAAAAABd4/k5NlAOY9sFc/s320/craftsmanship_en.png" width="226" /></a></div>
<br />
The other group is what we call "<a href="http://www.imdb.com/title/tt0113243/">hackers</a>": not only in the sense of people practicing computer security and reverse engineering, but also in the broader sense of people with a low-level orientation who can quickly understand how a closed system works and modify it to serve their purposes. These extraordinary people can write truly amazing software in a short period of time, or use one line of perl to destroy somebody's world (try the next piece of perl code if you feel lucky).<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; background: white; border-width: .1em .1em .1em .8em; border: solid gray; color: black; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-style: italic;">#!/usr/bin/env perl</span>
<span style="color: #008800; font-style: italic;"># Danger !!! This script can kill your root file system</span>
<span style="color: darkgoldenrod;">$?</span> ? s:;s:s;;<span style="color: darkgoldenrod;">$?</span>: : s;;<span style="color: #666666;">=</span>]<span style="color: #666666;">=></span><span style="color: darkgoldenrod;">%</span><span style="border: 1px solid #FF0000;">-</span>{<span style="color: #666666;"><-|</span>}<span style="color: #666666;"><&|</span><span style="color: #bb4444;">`{; ;</span>
<span style="color: #bb4444;">y; -/:-@[-`</span>{<span style="color: #666666;">-</span>};<span style="border: 1px solid #FF0000;">`</span><span style="color: #666666;">-</span>{<span style="color: #666666;">/</span><span style="border: 1px solid #FF0000;">"</span> <span style="color: #666666;">-</span>; ;
s;;<span style="color: darkgoldenrod;">$_</span>;see
</pre>
</div>
<br />
Now if you ask your fellow "hacker" to design a production system, there is a high probability that you will get your system soon enough, written in ASSEMBLY :). Or you will get it in your favorite c++ without classes, inheritance, templates, stl and all this fancy stuff real hackers never use. And god forbid, no unit testing or comments in the code, because it is written in plain c++ or c or whatever... And you should know how to read code! And it works! Unless there is a bug, which you can fix in 1 sec :)<br />
<br />
If you ask your enlightened friend from the "Software craftsmanship" group to find out why your mouse stops working whenever you send a UDP packet to port 666, there is a big chance that you will find him after 10 hours of <a href="http://lmgtfy.com/?q=why+my+mouse+stops+working+whenever+I+send+UDP+packet+to+port+666">googling</a> and sending mails to vendors and support teams all around the world, telling you that your mouse is stupid, that there is no real service on port 666, and that there is no problem with port 667, so who cares...<br />
<br />
So what should you do? Whom should you join? As with everything in the world, there is no black and white. In my opinion, a good programmer should have a balance of the two types of skills.
Don't compromise on the quality of your code, and don't be afraid of low-level stuff. Create quick prototypes when you need to, but remember that later real people will have to read and maintain your code.
And most important, remember that it is better to be healthy and rich than poor and sick (or just poor, or just sick).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-KhPQRvSuuBc/UiA0VCbex7I/AAAAAAAABdw/TOG0cuy0L9k/s1600/Yin-Yang.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://4.bp.blogspot.com/-KhPQRvSuuBc/UiA0VCbex7I/AAAAAAAABdw/TOG0cuy0L9k/s320/Yin-Yang.jpg" width="312" /></a></div>
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3944306087401582339.post-75444914948878085542012-11-09T02:33:00.000-08:002012-11-09T02:33:32.567-08:00Another one bites the blogSo after many, many years of not doing what the rest of the world is doing (eating hamburgers?), I finally decided to join the community and start my own technical blog.<br />
<br />
In this blog I hope to focus on some of the interesting stuff I enjoy in my profession: low-level programming, kernels, storage, virtualization and functional programming.<br />
<br />
The name of the blog, meta-x86, merges two things: the shortcut from my favorite editor (<a href="http://xkcd.com/378/">meta+x</a>) and the architecture I enjoy hacking.<br />
<br />
Get ready for quality posts :)<br />
<br />Unknownnoreply@blogger.com0