Author: Jean-Michel Richer
Institute: LERIA, University of Angers
email: jean-michel.richer@univ-angers.fr
http://www.info.univ-angers.fr/pub/richer

---------------------------------------------------------------------
TABLE OF CONTENTS
=================
- PURPOSE
- BINARY
- TESTS
- RESULTS
---------------------------------------------------------------------

PURPOSE
=======
This package compares the time needed to compute the parsimony score
of a tree on the CPU and on the GPU, using different methods.
Two tree implementations are provided: a dynamic one with binary nodes
and a static one where the tree is flattened into an array of nodes.
The static representation is used on the GPU to obtain coalesced memory
accesses.
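
To make the difference between the two representations concrete, here is
a minimal sketch in Python (not the package's own code). The names
BinaryNode, FlatNode and flatten are hypothetical, and the post-order
layout is an assumption chosen for the example: it stores children
before their parent, so the flat array can be evaluated bottom-up in a
single pass.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BinaryNode:
    """Dynamic representation: each node holds references to its children."""
    label: str
    left: Optional["BinaryNode"] = None
    right: Optional["BinaryNode"] = None


@dataclass
class FlatNode:
    """Static representation: children are indices into one array (-1 = none)."""
    label: str
    left: int = -1
    right: int = -1


def flatten(root: BinaryNode) -> List[FlatNode]:
    """Convert a dynamic tree into a flat array of nodes in post-order."""
    flat: List[FlatNode] = []

    def visit(node: Optional[BinaryNode]) -> int:
        if node is None:
            return -1
        left = visit(node.left)
        right = visit(node.right)
        flat.append(FlatNode(node.label, left, right))
        return len(flat) - 1  # index of the node just stored

    visit(root)
    return flat


# Hypothetical example tree: ((A, B), C)
tree = BinaryNode("root",
                  BinaryNode("x", BinaryNode("A"), BinaryNode("B")),
                  BinaryNode("C"))
flat_tree = flatten(tree)
```

In this example flat_tree holds the nodes in the order A, B, x, C, root,
and the root refers to its children by the indices 2 and 3.
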
Several methods to evaluate the parsimony score are defined,
depending on your architecture:

- reference method with no optimization other than what the
compiler provides
- assembly SSE2 version using 128-bit xmm registers
- intrinsics SSE2 version using 128-bit xmm registers
- assembly SSE4.2 version using popcnt
- assembly AVX2 version using 256-bit ymm registers
- intrinsics AVX2 version using 256-bit ymm registers
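
As an illustration of the kind of computation these methods perform, the
following minimal Python sketch does one scalar parsimony step in the
style of Fitch's algorithm. This README does not name the algorithm or
the data encoding, so the one-bit-per-nucleotide encoding (A=1, C=2,
G=4, T=8) and the names encode and fitch_step are assumptions made for
the example, not the package's code. With 8-bit data, a 128-bit xmm
register holds 16 such characters and a 256-bit ymm register holds 32,
which is where a vectorized version can gain over a loop like this one.

```python
from typing import Tuple

# Assumed encoding: one bit per nucleotide, so a set of states fits in a byte.
CODES = {"A": 1, "C": 2, "G": 4, "T": 8}


def encode(sequence: str) -> bytes:
    """Encode a DNA string with one byte per character."""
    return bytes(CODES[c] for c in sequence)


def fitch_step(left: bytes, right: bytes) -> Tuple[bytes, int]:
    """Combine two child sequences; return (parent sequence, added cost)."""
    parent = bytearray(len(left))
    cost = 0
    for i, (a, b) in enumerate(zip(left, right)):
        common = a & b          # states shared by both children
        if common:
            parent[i] = common  # keep the intersection, no extra cost
        else:
            parent[i] = a | b   # no shared state: take the union
            cost += 1           # and count one mutation
    return bytes(parent), cost


parent, cost = fitch_step(encode("ACGT"), encode("AGGA"))
```

For these two sequences the children differ at two positions, so the
step adds a cost of 2.
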


BINARY
======
To build the binary, first edit the file 'config/config.params'
to set the parameters for your configuration.
In particular:
- define the architecture (CFG_ARCHITECTURE), which should be 32 or 64 bits
- define the data size (CFG_DATA_SIZE), which should be 8 bits (DNA, RNA) or
32 bits (proteins)
- define the vectorization technique (CFG_CPU_VECTORIZE), which should be
sse2, sse42 or avx2
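
The allowed values listed above can be checked before building. The
Python sketch below is only an illustration: it takes the parameters as
a dictionary and makes no assumption about the syntax of
'config/config.params', which this README does not describe. The names
ALLOWED and check_params are hypothetical.

```python
# Allowed values, taken from the list above.
ALLOWED = {
    "CFG_ARCHITECTURE": {"32", "64"},
    "CFG_DATA_SIZE": {"8", "32"},
    "CFG_CPU_VECTORIZE": {"sse2", "sse42", "avx2"},
}


def check_params(params: dict) -> list:
    """Return a list of human-readable problems (empty if all is fine)."""
    problems = []
    for name, allowed in ALLOWED.items():
        value = params.get(name)
        if value is None:
            problems.append(f"{name} is not set")
        elif str(value) not in allowed:
            problems.append(f"{name}={value} (expected one of {sorted(allowed)})")
    return problems


# Hypothetical configuration: 64 bits, DNA data, AVX2.
issues = check_params({
    "CFG_ARCHITECTURE": "64",
    "CFG_DATA_SIZE": "8",
    "CFG_CPU_VECTORIZE": "avx2",
})
```
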

Then type the following:
$ ./config.sh
$ make clean
$ make 

The binary is generated in the 'bin' directory as 'main.exe'.

Note:
- You need CUDA 4.2 or CUDA 5.0 with gcc 4.6.
- On Ubuntu 13.04, you can use CUDA 6.0 with gcc 4.7.

TESTS
=====

1) run a simple test to see all methods available

$ ./bin/main.exe

...

===========================
=== test for all method ===
===========================

==== CUDA init ====
CUDA Device Query... there are 2 CUDA devices.
GPU 0: GeForce GTX 770, compute capability 3.0
GPU 1: GeForce GTX 660, compute capability 3.0
select default GPU
gpu_name=GeForce GTX 770

=== cuda parsimony init ===
gpu taxa_length_aligned=1152
gpu full_size=3456
threads per block=1024
blocks per grid=2

methods
--------------------
method::reference:513
method::compiler:513
method::sse2:513
method::sse_i:513
method:cuda:513

2) to run the CPU test:

$ time ./bin/main.exe --nbr-taxa=128 --taxa-length=1024 --tree-implementation=2 --test=2 --method=2

This uses the static tree implementation (--tree-implementation=2)
and the SSE2 evaluation (--method=2).


3) to run the GPU test:

$ time ./bin/main.exe --nbr-taxa=128 --taxa-length=1024 --tree-implementation=2 --test=3 --threads-per-block=512

This uses the static tree implementation (required for the GPU)
with the number of threads per block set to 512.
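
The sample output of test 1 reports taxa_length_aligned=1152 with 1024
threads per block and 2 blocks per grid, which is consistent with a
ceiling division. The Python sketch below shows that arithmetic only;
blocks_per_grid is a hypothetical helper, and the actual launch
configuration is computed by the program itself.

```python
def blocks_per_grid(length: int, threads_per_block: int) -> int:
    """Smallest number of blocks whose threads cover `length` elements."""
    return (length + threads_per_block - 1) // threads_per_block


# Values taken from the sample output of test 1.
blocks = blocks_per_grid(1152, 1024)
```

Under the same assumption, 512 threads per block and an aligned length
of 1152 would require 3 blocks.
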

4) to test the CPU, run the script 'tests_cpu.sh'.
Note that you can specify the method to use as a parameter of this script:

$ ./tests_cpu.sh 3

5) to test the GPU, run the script 'tests_gpu.sh'.
Note that you can specify the number of threads as a parameter of this script:

$ ./tests_gpu.sh 256


RESULTS
=======
To obtain an HTML file of all results, follow this procedure.

Once the GPU tests have been run for different numbers of threads, for example:
$ ./tests_gpu.sh 128
$ ./tests_gpu.sh 256
$ ./tests_gpu.sh 384
$ ./tests_gpu.sh 512
$ ./tests_gpu.sh 768

you can create a subdirectory in the 'results' directory to save the files.

Then run 'gpu_find_best.sh' from the root directory of this project
and specify the name of the GPU. Without an argument, the script reports
an error and lists the possible GPUs:

$ ./gpu_find_best.sh
error: gpu expected !
 
possible gpus are:
------------------
GeForce-GTX-560-Ti
GeForce-GTX-770
Tesla-K20m

$ ./gpu_find_best.sh Tesla-K20m
...
html file results/html/gpu_Tesla-K20m.html


============
TESTS ON CPU  
============
data  8 bits
method_name=compiler
generate results in results/cpu_Intel-Core-i5-4570-CPU-@-3.20GHz_compiler.txt
64; 1024; 2.4943;
64; 2048; 4.92506;
64; 4096; 9.9912;
64; 8192; 19.8508;
64; 16384; 39.5435;
64; 32768; 79.3558;
64; 131072; 319.209;
64; 262144; 633.103;
128; 1024; 5.07222;
128; 2048; 9.94798;
128; 4096; 20.3089;
128; 8192; 39.9082;
128; 16384; 78.5607;
128; 32768; 162.034;
128; 131072; 643.084;
128; 262144; 1284.42;
256; 1024; 10.0918;
256; 2048; 20.0271;
256; 4096; 40.2091;
256; 8192; 80.0165;
256; 16384; 160.222;
256; 32768; 323.116;
256; 131072; 1292.36;
256; 262144; 2569.13;
512; 1024; 20.1976;
512; 2048; 40.3441;
512; 4096; 80.5493;
512; 8192; 161.125;
512; 16384; 323.591;
512; 32768; 643.965;
512; 131072; 2578.32;
512; 262144; 5174.12;

============
TESTS ON CPU
============
data  8 bits
method_name=sse2
generate results in results/cpu_Intel-Core-i5-4570-CPU-@-3.20GHz_sse2.txt
64; 1024; 0.142707;
64; 2048; 0.267282;
64; 4096; 0.522201;
64; 8192; 1.00759;
64; 16384; 2.07571;
64; 32768; 4.12566;
64; 131072; 33.2769;
64; 262144; 65.3038;
128; 1024; 0.289614;
128; 2048; 0.565322;
128; 4096; 1.0778;
128; 8192; 2.07613;
128; 16384; 4.39072;
128; 32768; 11.3565;
128; 131072; 71.1015;
128; 262144; 130.066;
256; 1024; 0.594676;
256; 2048; 1.12411;
256; 4096; 2.12103;
256; 8192; 4.11571;
256; 16384; 10.7293;
256; 32768; 28.002;
256; 131072; 127.928;
256; 262144; 262.621;
512; 1024; 1.22205;
512; 2048; 2.2778;
512; 4096; 4.32503;
512; 8192; 10.8849;
512; 16384; 28.3288;
512; 32768; 60.5182;
512; 131072; 253.928;
512; 262144; 532.036;
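
The two result listings above can be compared line by line. The Python
sketch below is one possible way to do it, assuming that the three
fields of each line are the number of taxa, the taxa length and the
measured time (this README does not state the unit, so only ratios are
computed). The names parse_results and speedups are hypothetical; only
the first and last data lines of each listing are used here.

```python
def parse_results(text: str) -> dict:
    """Map (nbr_taxa, taxa_length) -> time from lines like '64; 1024; 2.4943;'."""
    results = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(";") if f.strip()]
        if len(fields) != 3:
            continue  # skip banners and other non-data lines
        try:
            taxa, length, elapsed = int(fields[0]), int(fields[1]), float(fields[2])
        except ValueError:
            continue  # three fields, but not numeric data
        results[(taxa, length)] = elapsed
    return results


def speedups(baseline: dict, optimized: dict) -> dict:
    """Ratio baseline/optimized for every configuration present in both."""
    return {key: baseline[key] / optimized[key]
            for key in baseline if key in optimized}


# First and last data lines of the 'compiler' and 'sse2' listings above.
compiler = parse_results("64; 1024; 2.4943;\n512; 262144; 5174.12;\n")
sse2 = parse_results("64; 1024; 0.142707;\n512; 262144; 532.036\n")
ratios = speedups(compiler, sse2)

for key, ratio in sorted(ratios.items()):
    print(key, round(ratio, 1))
```

On these lines the sse2 method is roughly 17.5 times faster than the
compiler method for 64 taxa of length 1024, and roughly 9.7 times faster
for 512 taxa of length 262144.
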

