2019-09-04  dyuret  <dyuret@login02.kuacc.ku.edu.tr>

	* Knet.jl: Calling CuArrays in __init__ helps with the stability of some devices.

2019-09-03  dyuret  <dyuret@login02.kuacc.ku.edu.tr>

	* cuarrays: failing on gitlab-ci sometimes, related to gpu type?
	Tesla-K20m-sm_35.log: CUDAnative:pass CuArrays:fail(1)/pass(4451) Knet:pass
	  gebrd!: Test Failed at /dev/shm/dyuret/julia/packages/CuArrays/wXQp8/test/solver.jl:172
	Tesla-K80-sm_37.log: CUDAnative:pass CuArrays:pass Knet:pass
	Quadro-M2000-sm_52.log: CUDAnative:pass CuArrays:pass Knet:pass
	GeForce-GTX-1080-Ti-sm_61.log: CUDAnative:pass CuArrays:fail(1)/pass(4451) Knet:fail(1)/pass
	  gebrd!: Test Failed at /dev/shm/dyuret/julia/packages/CuArrays/wXQp8/test/solver.jl:172
	  cpuconv: Test Failed at /dev/shm/dyuret/julia/dev/Knet/test/conv.jl:44
	Tesla-P4-sm_61.log: CUDAnative:pass CuArrays:pass Knet:hang/hang(11:18)
	Tesla-V100-PCIE-32GB-sm_70.log: CUDAnative:pass CuArrays:fail(4446/5) Knet:pass
	  Batch 2D (in 4D): Test Failed at /dev/shm/dyuret/julia/packages/CuArrays/wXQp8/test/fft.jl:62
	  2D: Test Failed at /dev/shm/dyuret/julia/packages/CuArrays/wXQp8/test/fft.jl:165 (Float32)
	  3D: Test Failed at /dev/shm/dyuret/julia/packages/CuArrays/wXQp8/test/fft.jl:165 (Float32)
	  2D: Test Failed at /dev/shm/dyuret/julia/packages/CuArrays/wXQp8/test/fft.jl:165 (Float64)
	  3D: Test Failed at /dev/shm/dyuret/julia/packages/CuArrays/wXQp8/test/fft.jl:165 (Float64)
	GeForce-RTX-2080-Ti-sm_70 *gitlab-ci* CUDAnative:fail/pass CuArrays:pass(4446 pass,5 broken) Knet:hang(10:56)
	  https://gitlab.com/JuliaGPU/Knet.jl/pipelines/80040815
	    /builds/JuliaGPU/Knet.jl/.julia/packages/CUDAnative/LkH1v/test/device/execution.jl:545
	  https://gitlab.com/JuliaGPU/Knet.jl/pipelines/80044768, 80054621
	    CuArrays:5-broken, Knet:hangs

2018-09-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* gc: The current gc has three problems:
	1. Waiting until memory runs out before calling gc.
	- try to GC at 1GB; if we don't get half the memory back, increase the threshold to 2GB etc.
	- need to keep track of how much memory is available.
	2. Hanging on to arrays that never get reused.
	- need a way to measure the last-used age for each bucket.
	3. Not being able to reuse arrays for slightly smaller requests.
	- exponential buckets waste too much memory.
	- we may not need this if #2 is resolved.
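	The threshold-doubling idea in item 1 above could be sketched as
	follows (GCPolicy and maybe_gc! are invented names, not Knet code):

```julia
# Hypothetical sketch of an adaptive gc trigger: collect at a threshold,
# and if less than half the memory comes back, double the threshold.
# GCPolicy and maybe_gc! are invented names, not Knet's implementation.
mutable struct GCPolicy
    threshold::Int   # bytes allocated before we trigger a collection
    allocated::Int   # bytes currently tracked as allocated
end

function maybe_gc!(p::GCPolicy, collect::Function)
    p.allocated < p.threshold && return false
    freed = collect()                  # run gc; returns bytes reclaimed
    p.allocated -= freed
    if freed < div(p.threshold, 2)     # got back less than half:
        p.threshold *= 2               # raise the trigger (1GB -> 2GB etc.)
    end
    return true
end
```

	This also keeps track of how much is allocated (the second bullet of
	item 1) and backs off when collections stop paying for themselves.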

2018-09-05  dyuret  <dyuret@login03.kuacc.ku.edu.tr>

	* examples:
	? Knet/tutorial
	+ charlm: deprecate, in tutorial
	+ cifar10-cnn
	+ dcgan-mnist
	+ DeepLearningFrameworks: Knet.CNN example has accuracy regression.
	? dynet-benchmark
	+ fashion-mnist
	+ housing-linreg
	+ julia-tutorial: check old commands, turn into notebook.
	+ knet-tutorial: deprecate -> Knet/tutorial
	+ lenet
	+ mnist-mlp
	+ optimizers
	x overfitting: deprecate this, it is replicated in tutorial.
	? reinforcement-learning
	+ resnet
	= rnnlm: update this with new interface, check for performance regression
	+ rnn-tutorial: check for performance regression
	+ synthetic-linreg
	+ variational-autoencoder
	+ vgg

2018-08-11  Deniz Yuret  <dyuret@ku.edu.tr>

	* julia7-compat-todo:
	- fix JLD.
	- unsafe_copy! is not in base any more? fix unsafe_copy!, unsafe_convert in karray looking in base
	- unary_nd, indexed_function, isequivalent, _dbg, ssize not in AutoGrad any more.
	- fix KnetDisplay (summary line should show KnetArray) and other display/show problems.
	- separate and move KnetArray to KnetML.
	- reorganize unary.jl and broadcast.jl
	- check TODO.
	+ seed! gc dir -- just use the same names but have Knet versions.
	- data
	+ deps
	- docs
	- examples
	- prof
	+ src: CPU:20/20
	+ test: CPU:14/14
	- limit max memory allocated by kptr
	- AutoGrad only uses broadcasted now, compare performance with using broadcast
	- try making karray <: AbstractArray and overriding get/setindex.
	- search for TODOs.
	- new AutoGrad interface.
	- test on other AD and GPUarray pkgs.
	- add using LinearAlgebra: lmul!, rmul! to test/linalg.jl
	- use global keyword in the for loops in tests
	- update travis.yml (and even better add gpu testing through #312)
	- add Project.toml
	- add Manifest.toml to .gitignore
	- update readme badges
	- eventually, slim down update! and rnn gpu tests

2018-08-09  Deniz Yuret  <dyuret@ku.edu.tr>

	* broadcast: we override broadcast and broadcasted for Rec and KnetArray.
	- dot operations turn into broadcasted expressions.
	- Rec overrides broadcasted to call broadcast_r directly.
	- KnetArray should override broadcasted to call broadcast directly.
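	A minimal sketch of the mechanism (EagerVec is a made-up demo type;
	KnetArray's real definitions differ):

```julia
# Dot syntax f.(x) lowers to Base.materialize(Base.broadcasted(f, x)).
# Overriding broadcasted to call broadcast directly makes broadcasting
# eager, bypassing the lazy fusion machinery.  EagerVec is hypothetical.
struct EagerVec
    data::Vector{Float64}
end
Base.broadcast(f, x::EagerVec) = EagerVec(map(f, x.data))
Base.broadcasted(f, x::EagerVec) = broadcast(f, x)  # eager, as described above

x = EagerVec([1.0, 4.0, 9.0])
y = sqrt.(x)   # broadcasted -> broadcast -> map; materialize passes it through
```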

2017-09-06  EC2 Default User  <ec2-user@ip-172-31-24-9.us-east-2.compute.internal>

	* julia6-compat-todo:
	+ branches for cuarrays and gpu arrays
	+ checkout autograd master in travis
	+ need to figure out how to handle cat in autograd.
	+ notebooks, vgg, resnet, prof
	+ test autograd examples
	x test all examples on 4,5,6
	+ update news: autograd done, knet done.
	- go thru issues: autograd done, issues left.
	- branch for reversediff?
	- test on latest
	- speed test
	- examples/optimizers.jl too slow.
	- examples/charlm.jl does not pass gradcheck.
	- new autograd interface
	- broadcast without broadcast_func symbols


2017-09-01  EC2 Default User  <ec2-user@ip-172-31-10-154.us-east-2.compute.internal>

	* julia6-compat-todo:
			julia4	julia5	julia6
	kptr		1	1	1
	gpu		1	1	1
	distributions	1	1	1
	update		1	1	1
	karray		1	1	1
	linalg		1	1	1
	conv		1	1	1
	broadcast	1	1	1
	unary		1	1	1
	reduction	1	1	1

2017-07-29  dyuret  <dyuret@cn6.kuacc.ku.edu.tr>

	* julia6-compat-todo:
	- fix precompile warnings: WARNING: deprecated syntax "Expr(:ccall)". Use "Expr(:call, :ccall)" instead.
	- Pkg.test passes: kptr, gpu, distributions, conv
	- fix Pkg.test warnings: linalg
	- fix Pkg.test errors: update, karray, broadcast, reduction, unary
	- fix Pkg.build errors: Pkg.build does not work.

2017-05-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO-KUparser:
	Speed issues with KUparser.
	First epoch slower: do we need to pre-allocate and not use cudaMalloc?
	Slow-down of long runs: is it gpu copy or parsing algorithm? (dynamic-oracle gives clues)
	update! slows down in runtests if run at the end!
	Fix dynet benchmarks, incorporate more (lstm, logp) from cudnn.
	Check out GPUArray and ReverseDiff.
	Start testing with Julia6; it may be better for speed.
	Need to figure out the new Julia6 broadcasting syntax.
	Also try the recommendation to detect one-out-of-k argument signature.

2017-04-07  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:

	* TODO8:
	# handle KnetFree==nothing in knetgc()
	# gpu(false) does not clear out Knet memory structures?
	# add mean/std for KnetArray.
	# implement nce. debug nce8 branch. try on large data/vocab. why doesn't it converge on ptb?
	# broadcast kernels: debug 16/17, add tests, add benchmarks.
	# check batch normalization in resnet, add it to src. -- waiting for ilker.
	# document load/save via JLD.
	# recover kuparser.
	# navigation: implement world representation.
	# apply new rnn ideas to doc, readme, tutorial, slides, ijulia.
	# fix reduction kernels for large matrices (10K-100K rows start giving bad answers): enis looking
	# Latest master julia6 failing on nightly; AutoGrad also fails.
	# try feeding rnnlm subset of embed matrix as weight.
	# functional lstm interface defining multi-output grad2 function.
	# Setup attobot for next release: https://github.com/attobot/attobot.
	# prepare and submit julia4/5 benchmarks: find out why benchmarks are slower on julia5/6: https://github.com/JuliaLang/julia/issues/18135 ?
	# Issue 89: reduced_dims -> reduced_indices in 0.5.1, stop using unexported functions from base. Other examples: to_indexes, dims2string, deepcopy_internal, LinearFast, show_backtrace, decolon (in AutoGrad).
	# AutoGrad docs convert from comments to docstrings in core.jl and include in Knet manual. wip in newdocs branch.
	# AutoGrad tests convert to the new test system.
	# add some test for deconv vs conv, gclip.
	# gpu switching back and forth does not seem to work. do we really need multiple handles for cublas, cudnn? do we need them for libknet8?
	# AutoGrad: run all tests with KnetArrays
	# AutoGrad: support for keys, values, next for dictionaries.
	# add benchmarks with new dynamic frameworks: Yoav tweet, Volkan email. assigned to Ilker, Enis.
	# docs todo: Julia tutorial. simplify examples. Baris Bozkurt's comments.
	# docs todo: perceptron. kernel perceptron. svm. lenet and vgg in cnn section.
	# mnist2d: implement/test sparse arrays
	# time doing a single im2col instead of N for conv4
	# replace T<: conditions in functions with generated code for each type
	# 0.7: rename the 73 functions, cpu tests (add conv), v0.5 compat. check old todo list
	# DL439: If one has access to numerical computation on complex numbers, then there is a very efficient way to numerically estimate the gradient by using complex numbers as input to the function (Squire and Trapp, 1998)
	# optimization:
	## make BLK,THR dependent on the input size? may improve final sum which is only 10x100 in this example.
	## extend benchmark tests to cover all combinations of 10,100,1000 dimensions.
	## optimize reduction and broadcasting kernels.
	## optimize logp / softmax.
	## optimize conv4 / matmul - arrayfire? cudnn conv instead of matmul? cudnn conv algorithms? fft paper from https://arxiv.org/abs/1601.06815.
	## try fusion: we can do layers in one kernel call: relu(wx+b).
	## try streams or multiple gpus.
	## for general arrays: broadcast, get/setindex, h/v/cat. Enis working on this.

2017-03-28  Deniz Yuret  <dyuret@ku.edu.tr>

	* examples/rnnlm.jl: Trying to replicate one of:
	http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf
	http://www.fit.vutbr.cz/~imikolov/rnnlm/asru_large_v4.pdf
	https://arxiv.org/abs/1312.6026
	https://arxiv.org/abs/1409.2329
	https://arxiv.org/abs/1508.06615

	(:epoch,28,:perp,103.928986f0,173.2658f0,157.65414f0) RNNLM.main("--seed 1 --epochs 30 --dropout 0.5 --hidden 200 --embed 200 --batch 100 --best foo.jld")
	(:epoch,14,:perp,52.10398f0,140.70792f0,130.77716f0)  RNNLM.main("--seed 1 --epochs 30 --dropout 0.5 --hidden 650 --embed 650 --batch 100 --best foo650.jld")
	(:epoch,27,:perp,75.37883f0,135.43819f0,127.377f0)    RNNLM.main("--seed 1 --epochs 30 --dropout 0.5 --hidden 200 --embed 200 --batch 100 --optim Adagrad()")
	(:epoch,5,:perp,69.288506f0,184.24977f0,171.84555f0)  RNNLM.main("--seed 1 --epochs 30 --hidden 200 --embed 200 --batch 100 --optim Adagrad()")
	(:epoch,30,:perp,113.68978f0,166.22272f0,154.77658f0) RNNLM.main("--seed 1 --epochs 30 --hidden 200 200 --embed 200 --batch 100 --dropout 0.5 --optim Adagrad()")
	(:epoch,25,:perp,181.603f0,239.55917f0,225.10878f0)   RNNLM.main("--seed 1 --epochs 30 --hidden 200 200 --embed 200 --batch 100 --dropout 0.5 --optim Sgd(lr=1,gclip=5)")

	Paper claims perp < 100. Potential reasons we don't get this:
	x Dropout not working. significant drop when turned off.
	- Different way to measure (exclude eos etc)
	- Adam bad, sgd/gclip good (adagrad works better). Paper uses SGD(lr=1,gclip=5), halving lr if devperp does not go down by 1.
	- Different lstm type?
	- Initialization different. [-0.05,0.05]
	- Batchsize different = 20.
	- Highway network.
	- BPTT 35 time steps: charlm style not sentence bound?
	- Number of layers = 2! Table 2 gives 2x300 for char model. 2x200 for word model?
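	The SGD schedule mentioned above (lr=1, gclip=5, halve lr when dev
	perplexity fails to improve by at least 1) can be sketched as follows;
	update_lr is a hypothetical helper, not part of Knet:

```julia
# Sketch of the halving schedule noted above: keep lr while dev
# perplexity improves by at least min_improvement, else halve it.
# update_lr is a hypothetical helper, not part of Knet.
function update_lr(lr, prev_devperp, devperp; min_improvement=1.0)
    devperp <= prev_devperp - min_improvement ? lr : lr / 2
end
```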

2017-03-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:


2017-03-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# charlm: Implement rnns using separate embedding vectors and a @primitive concat operation: fix charlm demo, implement efficient s2s demo, s2s with attention. Need to first solve minibatching. Is v/hcat efficient?  KnetArray implements only 2-arg version. AutoGrad calls cat, supports multiple args, grad uses uncat which uses getindex with an array of indices.  KnetArray probably does not support an array of indices.
	# charlm: try indexing op instead of sparse matmul.
	# charlm: try lstm time concatenation.
	# charlm: add adam: sgd(3.5) works faster. short bptt in initial epochs works faster.
	# Add dropout as a primitive. Need kernel.
	# charlm: add dropout
	# s2s profile.
	# check julia4 and julia6 compat, find out why julia pkgs marks knet as broken.
	# fix intermittent test errors: https://github.com/JuliaLang/julia/pull/20736#issuecomment-283834724

2017-03-15  Deniz Yuret  <dyuret@ku.edu.tr>

	* profile: 41% forw (20% logp), 44% back (21% logp), 14% sum_outgrads
	* flat: 2027 logp, 1761 sum, 1410 gemm, 1373 sum_outgrads

          4115 ...et/.julia/v0.5/AutoGrad/src/core.jl:88; forward_pass(::Function, ::Tuple{Dict{Symbol,Any},Arr...
           12   /Users/dyuret/knet/master/prof/s2s.jl:83; state = initstate(inputs[1], model[:state0]) # 14
           92   /Users/dyuret/knet/master/prof/s2s.jl:86; input = lstm_input(model[:embed1], input) # 85
           717  /Users/dyuret/knet/master/prof/s2s.jl:87; state = lstm(model[:encode], state, input) # 723
           9    /Users/dyuret/knet/master/prof/s2s.jl:91; input = lstm_input(model[:embed2], EOS) # 3
           692  /Users/dyuret/knet/master/prof/s2s.jl:95; state = lstm(model[:decode], state, input) # 702
           80   /Users/dyuret/knet/master/prof/s2s.jl:100; input = lstm_input(model[:embed2],output) # 61
           39   /Users/dyuret/knet/master/prof/s2s.jl:102; state = lstm(model[:decode], state, input) # 30
           1    /Users/dyuret/knet/master/prof/s2s.jl:106; gold = vcat(outputs..., EOS) # 1
           2473 /Users/dyuret/knet/master/prof/s2s.jl:107; sumlogp = lstm_output(model[:output], preds, gold) # 2441
            43   /Users/dyuret/knet/master/prof/s2s.jl:114; pred1 = vcat(preds...) # 46
            248  /Users/dyuret/knet/master/prof/s2s.jl:115; pred2 = pred1 * param[1] # 242
            144  /Users/dyuret/knet/master/prof/s2s.jl:116; pred3 = pred2 .+ param[2] # 145
            2037 /Users/dyuret/knet/master/prof/s2s.jl:117; sumlogp = logprob(gold, pred3) # 2006
             2027 ...rs/dyuret/knet/master/prof/s2s.jl:135; o1 = logp(ypred,2)     # 1999
               833 ...ret/.julia/v0.5/Knet/src/unary.jl:176; x1 = maximum(x,d...)
               120 ...ret/.julia/v0.5/Knet/src/unary.jl:177; x2 = x .- x1
               122 ...ret/.julia/v0.5/Knet/src/unary.jl:178; x3 = exp(x2)
               827 ...ret/.julia/v0.5/Knet/src/unary.jl:179; x4 = sum(x3,d...)
               1   ...ret/.julia/v0.5/Knet/src/unary.jl:180; x5 = log(x4)
               123 ...ret/.julia/v0.5/Knet/src/unary.jl:181; x6 = x2 .- x5
             6    ...rs/dyuret/knet/master/prof/s2s.jl:136; o2 = o1[index]         # 4
             4    ...rs/dyuret/knet/master/prof/s2s.jl:137; o3 = sum(o2)           # 2

          4409 ...et/.julia/v0.5/AutoGrad/src/core.jl:231; backward_pass(::AutoGrad.Rec{Dict{Symbol,Any}}, ::Aut...
           932  ./<missing>:0; *(::Type{AutoGrad.Grad{2}}, ::Knet.KnetArray{Float32...
           1    ./<missing>:0; +(::Type{AutoGrad.Grad{1}}, ::Knet.KnetArray{Float32...
           410  ./<missing>:0; .*(::Type{AutoGrad.Grad{2}}, ::Knet.KnetArray{Float3...
           148  ./<missing>:0; .+(::Type{AutoGrad.Grad{2}}, ::Knet.KnetArray{Float3...
           21   ./<missing>:0; getindex(::Type{AutoGrad.Grad{1}}, ::Array{Any,1}, :...
           2142 ./<missing>:0; logp(::Type{AutoGrad.Grad{1}}, ::Knet.KnetArray{Floa...
            818  ...uret/.julia/v0.5/Knet/src/unary.jl:195; dx1 = sum(dy,d...)
            120  ...uret/.julia/v0.5/Knet/src/unary.jl:196; dx2 = exp(y)
            1038 ...uret/.julia/v0.5/Knet/src/unary.jl:197; dx3 = dx2 .* dx1
            166  ...uret/.julia/v0.5/Knet/src/unary.jl:198; dx4 = dy - dx3
           245  ./<missing>:0; lstm_input(::Type{AutoGrad.Grad{1}}, ::Knet.KnetArra...
           143  ./<missing>:0; sigm(::Type{AutoGrad.Grad{1}}, ::Knet.KnetArray{Floa...
           6    ./<missing>:0; sum(::Type{AutoGrad.Grad{1}}, ::Float32, ::Float32, ...
           106  ./<missing>:0; tanh(::Type{AutoGrad.Grad{1}}, ::Knet.KnetArray{Floa...
           10   ./base.jl:151; vector_any()
           2    ...AutoGrad/src/base/abstractarray.jl:0; cat(::Type{AutoGrad.Grad{17}}, ::Knet.KnetArray{Floa...
           85   ...AutoGrad/src/base/abstractarray.jl:85; cat(::Type{AutoGrad.Grad{11}}, ::Knet.KnetArray{Floa...

          1373 ...et/.julia/v0.5/AutoGrad/src/core.jl:233; backward_pass(::AutoGrad.Rec{Dict{Symbol,Any}}, ::Aut...
           360 ...lia/v0.5/AutoGrad/src/interfaces.jl:71; sum_outgrads(::Void, ::AutoGrad.UngetIndex)
           569 ...lia/v0.5/AutoGrad/src/interfaces.jl:87; sum_outgrads(::Dict{Symbol,Any}, ::AutoGrad.UngetIndex)
           3   ...lia/v0.5/AutoGrad/src/interfaces.jl:92; sum_outgrads(::Array{Any,1}, ::AutoGrad.UngetIndex)
           77  ...uret/.julia/v0.5/Knet/src/karray.jl:905; sum_outgrads{T}(a::KnetArray{T},b::KnetArray{T})=(a+b)
           346 ...uret/.julia/v0.5/Knet/src/karray.jl:908; c = sum_outgrads_karray(a, b.value, b.index...)

2017-03-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# fix repeated index gradient bug.
	# s2s: Add a generic s2s example to Knet/examples. Need to solve minibatching / concat first.
	# docs: write rnn chapter. fix earlier chapters.


2017-03-11  Deniz Yuret  <dyuret@ku.edu.tr>

	* indexing: repeated indices are not handled properly by
	ungetindex: they need to be summed in the backward pass, but
	setindex! just overwrites.  Let's understand how Julia indexing works:

	a[:,i]
	522 ./abstractarray.jl:752; getindex(::Array{Float64,2}, ::Colon, ::Array{Int64,1})
	?   ./multidimensional.jl:270; _getindex(::Base.LinearFast, ::Array{Float64,2},...
	522 ./multidimensional.jl:291; _unsafe_getindex(::Base.LinearFast, ::Array{Float64,2},...
        149 ./multidimensional.jl:296; macro expansion
        373 ./multidimensional.jl:298; macro expansion
        373 ./multidimensional.jl:340; _unsafe_getindex!
        373 ./multidimensional.jl:348; macro expansion
        6   ./cartesian.jl:62; macro expansion
        367 ./cartesian.jl:64; macro expansion
        45  ./cartesian.jl:62; macro expansion
        7   ./multidimensional.jl:349; macro expansion
        315 ./multidimensional.jl:350; macro expansion

	_unsafe_getindex is a @generated function which uses @nexprs, @ncall, to_index.

	http://docs.julialang.org/en/latest/manual/metaprogramming.html#Generated-functions-1

	@generated functions return quoted expressions that get compiled
	when the function is first called.  Only the types of the
	arguments (not their values) can be accessed in their body.

	@ncall 3 func a
	==> func(a_1, a_2, a_3)
	@ncall 2 func a b i->c[i]
	==> func(a, b, c[1], c[2])

	@nexprs 4 i -> y[i] = A[i+j]
	==> y[1]=A[1+j]; y[2]=A[2+j]...
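	Both macros are exported from Base.Cartesian, so the expansions
	above can be checked directly (f, a_i, y, A are just demo names):

```julia
# Runnable check of the @ncall/@nexprs expansions shown above.
# f, a_1..a_3, c, y, A are demo names, not from Knet.
using Base.Cartesian

f(xs...) = sum(xs)
a_1, a_2, a_3 = 10, 20, 30
@assert (@ncall 3 f a) == 60             # f(a_1, a_2, a_3)

c = [100, 200]
a, b = 1, 2
@assert (@ncall 2 f a b i->c[i]) == 303  # f(a, b, c[1], c[2])

y = zeros(Int, 3); A = collect(1:10); j = 2
@nexprs 3 i -> y[i] = A[i+j]             # y[1]=A[1+j]; y[2]=A[2+j]; y[3]=A[3+j]
@assert y == [3, 4, 5]
```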

	to_index defined in operators.jl, converts to Array,Colon,Int
	Numbers -> Int
	Colon is kept
	BitArray -> find(I) -> Vector{Int} # although there is a special _unsafe_getindex which skips using find
	Other arrays are kept (int or cartesianindex)

	index_shape(A,I_1,I_2,...) gives the destination shape

	indices(b) => (Base.OneTo(10000),Base.OneTo(10000))
	eachindex(b) => Base.OneTo(100000000)

	Finally we call _unsafe_getindex!(dest,A,I_1,I_2,...) at multidimensional.jl:340
	J = decolon(src,I_1,I_2,...) converts colons to explicit indices.

	@nloops N itersym rangeexpr bodyexpr
	@nloops N itersym rangeexpr preexpr bodyexpr
	@nloops N itersym rangeexpr preexpr postexpr bodyexpr

	# this indexes dest with linear, src with cartesian indices
        D = eachindex(dest)
        Ds = start(D)
	for j_2 in J_2
	  for j_1 in J_1
	    d,Ds = next(D,Ds)
	    dest[d] = getindex(src,j_1,j_2)
	  end
	end

	We need to write an accumulating version of setindex! (addindex!) for ungetindex.
	_setindex! for N-D indices L364
	_setindex! for 1-D indices L368 uses _maybe_reshape(A)
	_unsafe_setindex! is called
	_unsafe_batchsetindex! is called with (A,_iterable(x),to_indexes(J...)...)  L420

	Here is the macroexpanded version for N=2:
	X is iterated over; A is written one element at a time.
	We just need to zero out A and add for AutoGrad.
	We need to write a KnetArray specific version in Knet.

quote  # none, line 2:
    begin 
        I_1 = I[1]
        I_2 = I[2]
    end # none, line 3:
    idxlens = index_lengths(A,I_1,I_2) # none, line 4:
    setindex_shape_check(X,idxlens[1],idxlens[2]) # none, line 5:
    J = decolon(A,I_1,I_2) # none, line 6:
    begin 
        J_1 = J[1]
        J_2 = J[2]
    end # none, line 7:
    Xs = start(X) # none, line 8:
    begin 
        $(Expr(:inbounds, true))
        begin  # cartesian.jl, line 62:
            for j_2 = J_2 # cartesian.jl, line 63:
                nothing # cartesian.jl, line 64:
                begin  # cartesian.jl, line 62:
                    for j_1 = J_1 # cartesian.jl, line 63:
                        nothing # cartesian.jl, line 64:
                        begin  # none, line 9:
                            (v,Xs) = next(X,Xs) # none, line 10:
                            setindex!(A,v,j_1,j_2)
                        end # cartesian.jl, line 65:
                        nothing
                    end
                end # cartesian.jl, line 65:
                nothing
            end
        end
        $(Expr(:inbounds, :pop))
    end # none, line 12:
    A
end
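	For the plain-Array case the accumulating version could look like
	this (addindex! is a hypothetical name; the KnetArray-specific
	kernel would be separate):

```julia
# Sketch of the accumulating setindex! described above.  addindex! is
# a hypothetical name, shown here only for vectors of Int indices.
function addindex!(A::Array, v, I::AbstractVector{Int})
    for (k, i) in enumerate(I)
        A[i] += v[k]        # accumulate instead of overwrite
    end
    return A
end

# With a repeated index, setindex! keeps only the last write, but the
# gradient of getindex must sum the contributions:
A = zeros(3)
addindex!(A, [1.0, 2.0, 5.0], [1, 2, 2])   # index 2 appears twice
```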


2017-03-10  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Added KnetArray indexing support for: Int, Colon, UnitRange, StepRange, CartesianIndex, Array{Int}, Array{Bool}, Array{CartesianIndex}. Multidimensional indexing incomplete.

2017-03-03  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Implement a decent hill climbing algorithm for hyperopt: do not repeat, have acceleration, independent step sizes, guarantee local minima, one dimension at a time, pick dimension/move using bandits or a smart queue. Wikipedia has good pseudocode. Bandits may work better.
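	A minimal sketch of the coordinate-wise variant with per-dimension
	step sizes and acceleration (hillclimb is a hypothetical name, not
	the hyperopt implementation):

```julia
# Hedged sketch of the hill climbing ideas above: one dimension at a
# time, independent step sizes, accelerate on success, shrink all
# steps after a full failed sweep.  hillclimb is not Knet code.
function hillclimb(f, x0; accel=1.2, iters=1000, tol=1e-6)
    x, fx = copy(x0), f(x0)
    steps = fill(1.0, length(x))
    for _ in 1:iters
        improved = false
        for d in 1:length(x), sgn in (1, -1)
            cand = copy(x); cand[d] += sgn * steps[d]
            fc = f(cand)
            if fc < fx
                x, fx = cand, fc
                steps[d] *= accel     # accelerate a winning dimension
                improved = true
            end
        end
        if !improved
            steps ./= accel           # stuck: shrink all step sizes
            maximum(steps) < tol && break
        end
    end
    return x, fx
end
```

	Picking the next dimension with a bandit instead of the round-robin
	sweep above is the refinement suggested in the note.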

2017-03-01  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Add gclip to update!.

2017-02-26  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Issue 88: better handling of different element types on data vs weights in update!

2017-02-23  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# bug in AutoGrad: broadcast.jl: $f(x1::Rec,x2::AbstractArray)=$f(x1,x2.value)
	# gradcheck fails on: bmax(x,y)=broadcast(max,x,y) or bmin(x,y)=((y.<x).*y+(x.<y).*x)
	# grad of convert (AutoGrad issue), implemented in karray.jl, need to port.

2017-02-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# automatically generate README.md from tutorial.md.
	# make examples easier to load (turn off Pkg install for vgg etc, docs take too long)
	## no easy way to do it without affecting source links.

2017-02-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# use mocha for cpu conv/pool. wip.
	# cpuconv todo:
	# ok: need to get rid of mask in pool_back
	# ok: reimplement conv4 in terms of im2col
	# ok: need low level blas call with pointers
	# ok: reimplement conv4x conv4w using col2im?
	# ok: need separate cpu and gpu libraries: condition makefile on finding nvcc, also cond openmpi like mocha/dep

2017-02-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# add tests for new update interface.

2017-02-16  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# permutedims ambiguity in AutoGrad in Julia 0.4.
	# docs imagesize: The ![]() syntax doesn't support image size. However, you should be able to use the @raw block to insert custom HTML with an <img> tag as follows:
	```@raw html
	<img src="..." height="..." width="...">
	```
	# add x=unpool(pool(x)) test,
	# add davide fix to knet before release
	# Contents does not show on README.md, @ref doesn't work etc. New README?
	# fix unclear docs for optimization, tutorial example? better yet, impl improved update interface.

2017-02-15  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# KnetArray ==, ===, isapprox etc. missing.
	# copy! for KnetArray missing. (we have fill!, rand! etc. is copy! supported by AutoGrad? should refuse for Rec)
	# deepcopy does not work for KnetArray
	# override KnetArray and Array instead of cpu2gpu etc.
	# add KnetArray{Float32}(3) type init like Array and deprecate the other.
	# Make BaseTestNext conditional on julia version. If not possible, create header.jl in test/ and check version/pkg.

2017-02-13  Deniz Yuret  <dyuret@ai-test.hpc.ku.edu.tr>

	* DONE:

	# Latest master failing on Julia 0.4 (convtest) and nightly distros.
	# TonyKelman on Compat: If you use Compat in your tests, it needs to be in test/REQUIRE or REQUIRE.
	# TonyKelman on removing nightly from travis: you can make this an allowed failure so it'll run but won't make your status red

2017-02-12  Deniz Yuret  <dyuret@ai-test.hpc.ku.edu.tr>

	* unit-test-TODO:
	+ unary.jl: cpu=25s gpu=41s
	+ broadcast.jl: cpu=19s gpu=34s
	+ reduction.jl: cpu=18s gpu=28s (6 fail)
	+ karray.jl: cpu=(fail) gpu=12s
	+ linalg.jl: cpu=(fail) gpu=15s
	+ update.jl: cpu=6.6s gpu=19s
	+ kptr.jl: cpu=2.5s gpu=2.5s
	+ gpu.jl: cpu=3s gpu=2.4s
	+ distributions.jl: cpu=3s gpu=3s
	+ conv.jl: cpu=(fail) gpu=13s (10 broken)
	+ runtests.jl: switch to new tests, figure out why so slow
	+ karray,linalg: cpu tests failing
	+ reduction: gpu tests failing
	+ conv.jl: debug unpool, add cpuconv

2017-02-10  Deniz Yuret  <dyuret@ai-test.hpc.ku.edu.tr>

	* DONE:

	# doc warnings about missing broadcast, reduction, restricted cat, indexing, no bool array etc.
	# figure out rtfd doc setup or forwarding.
	# figure out no KnetArray in cpu problem
	# KnetArray does not work on cpu-only (should it? no. operators overloaded assuming gpu)
	# add back cpu convolution
	# gputests.jl and cputests.jl are broken.
	# resolve issues

2017-02-09  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:

	# missing docs for distributions
	# the update methods not documented yet.
	# 0.8.1: tag new version for the paper when all is done.
	# add transpose to KnetArray.
	# source links? (especially for examples)
	# docs todo: warn against overwriting arrays.
	# use original docstrings for examples readme, put examples readme in docs.


2017-02-08  Deniz Yuret  <dyuret@ku.edu.tr>

	* Documenter.jl: (TODO) Julia documentation moved back to MD
	supported by Documenter.jl.  Supports latex, doctest, xrefs,
	search.  Hosted on github pages built through Travis (no need for
	Python).  PDF output?  PDF:
	https://juliadocs.github.io/Documenter.jl/stable/lib/internals/writers.html
	Automatic conversion from rst? Try Pandoc.  Use markdown_github as
	output.
	Using sips command on osx to resize images.

	- doctest, does it accept ... ?
	- how do we generate docs from docstrings? ```@docs
	- can use function f end to document a zero method function
	- you can document fields of a type.
	- how do you refer to julia function docs from knet docstrings?
	- @doc "..." foo is used when foo is defined by generated code
	- can we default link text to link content?
	- $(EXPORTS) used in module doc to list exported symbols.
	- using ```@contents, @index, @docs
	- where are the source links?

2017-02-07  Deniz Yuret  <dyuret@ai-test.hpc.ku.edu.tr>

	* src/Makefile (CFLAGS): code refactoring:

	Function lists (cuda??.jl) and cuda code generators
	(cuda??_gen.jl) do not go well together.  The reason is that the
	same function list (e.g. binary array ops) is used by more than
	one type of cuda kernel (same-size vs broadcasting kernels).  It
	makes more sense to collect function lists in semantically named
	files (unary, broadcast, reduce etc.).

	cuda1: abs2,abs,acos,... => unary
	cuda10: add,sub,mul,...  => broadcast
	cuda11: add,sub,mul,...
	cuda12: add,sub,mul,...
	cuda20: sum,prod,...     => reduction
	cuda21: sum,prod,...


2016-10-24  dyuret  <dyuret@ku.edu.tr>

	* julia.st: for pretty-printing Julia use:
	enscript -Ejulia -M A4 -2rGC -o foo1.ps core.jl
	ps2pdf foo1.ps

	enscript does not come with a julia format.  You can create one
	using matlab.st (for keywords) and python.st (for strings,
	comments) under /usr/share/enscript/hl/julia.st.

2016-10-21  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# implement adam.
	# gpu(true) leaves memory footprint on each device.
	# Paulito old cudnn interface support.
	# update the rest of the documentation.
	# vgg: change default padding for conv4 to be (n-1)/2
	# Davide push/vcat bug.


2016-10-05  dyuret  <dyuret@ku.edu.tr>

	* DONE:
	# write paper for nips
	# document KnetArray in readme. finish the under the hood section.
	# put function references in documentation.
	# try to measure back functions one by one as well for gpu profiling.
	# implement axpy! for KnetArray if it is worth it to get faster updates.

2016-10-03  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# delete knet.ji in build script: no need.
	# use Knet.dir set in Knet.jl.
	# check if benchmarks keep minibatches in gpu.
	# implement a install_extras command: no need instead we do following:
	# automatic loading of packages by demos (use Pkg.add or introduce installExtras)
	# implement vggnet demo.
	# housing.jl bias fails gradcheck (because of Float32)
	# check warnings in cpu-only knet.
	# change charlm default winit.

2016-09-30  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# use Float32 in housing by default.
	# charlm can transfer all minibatches to gpu before timing (slows down due to gc)
	# reimplement lenet using loop.
	# charlm: profile speed, add dropout, nlayer.

2016-09-20  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# broadcasting KnetArray bug when array size 1x1.
	# extend readme with examples. revise intro. publish intro on blog. use the presentation.

2016-09-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# new amazon aws image
	# finish cudnn, curand etc.; if possible eliminate dependence on them.
	# logp should take a second argument like sum.

2016-09-16  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# lenet: move cudnn calls to cuda44.jl
	# karray: checkbounds without abstractarray.
	# document KnetArray inline.
	# remove importall Base.
	# implement more efficient lstm making sure getindex does a view (transpose?)
	# citation
	# write examples/README
	# charlm fails on cpu.  change its default data file to something smaller.  small files fail gradcheck.
	# test gpu/cpu, osx/linux, 0.4, 0.5, 0.6. (currently failing on Travis).
	# clean up old tags and register Knet

	* Release:

	Eliminated old tags.  Just tag one 0.7 and one 0.8 version.

	For each tagged version:
	- minimize REQUIRE
	- test on v0.4 v0.5 v0.6
	- test on cpu vs gpu

	julia 0.4 knet 0.7 cpu: ok
	julia 0.4 knet 0.7 gpu: fail: mnist4d,copyseq
	julia 0.4 knet 0.8 cpu: ok
	julia 0.4 knet 0.8 gpu: ok
	julia 0.4 autograd cpu: ok
	julia 0.4 autograd gpu: ok
	julia 0.5 knet 0.7 cpu: error
	julia 0.5 knet 0.7 gpu: error
	julia 0.5 knet 0.8 cpu: ok
	julia 0.5 knet 0.8 gpu: ok
	julia 0.5 autograd cpu: ok
	julia 0.5 autograd gpu: ok
	julia 0.6 knet 0.7 cpu: error
	julia 0.6 knet 0.7 gpu: error
	julia 0.6 knet 0.8 cpu: ok
	julia 0.6 knet 0.8 gpu: ok
	julia 0.6 autograd cpu: error, wip compat0.6, 0-dim array problem
	julia 0.6 autograd gpu: error, wip compat0.6

	julia 0.6 knet 0.8 cpu:
	using Knet
	WARNING: log{T <: Number}(x::AbstractArray{T}) is deprecated, use log.(x) instead.
	WARNING: cos{T <: Number}(x::AbstractArray{T}) is deprecated, use cos.(x) instead.
	WARNING: abs2{T <: Number}(x::AbstractArray{T}) is deprecated, use abs2.(x) instead.
	Tests pass.

	julia 0.6 autograd master cpu:
	using AutoGrad: ok
	Pkg.test("AutoGrad")
	WARNING: max{T1 <: Real}(x::Real,y::AbstractArray{T1}) is deprecated, use max.(x,y) instead.
	WARNING: abs{T <: Number}(x::AbstractArray{T}) is deprecated, use abs.(x) instead.
	This is a common error I need to fix for other Julia versions as well:
	WARNING: (AutoGrad.ungetindex,0.18437601415789295,[0.228639,0.085112],(2,),"MethodError(convert,(Array{Float64,N},OH416_A660_2_(2,)_0.18437601415789295))")
	ERROR: LoadError: MethodError: no method matching erfinv(::Array{Float64,1})

	julia 0.4 knet 0.7 gpu: fails mnist4d and copyseq

	julia 0.4 knet 0.7 cpu: should add cpu conv test but they pass.

	julia 0.5 knet 0.7 cpu:
	using Knet
	WARNING: Method definition randn!(Base.Random.AbstractRNG, AbstractArray{#T<:Any, N<:Any}) in module Random at random.jl:1207 overwritten in module Knet at /mnt/ai/home/dyuret/.julia/v0.5/Knet/src/util/array.jl:36.
	WARNING: could not import Base.lastidx into LegacyStrings
	WARNING: Base.writemime is deprecated. likely near /mnt/ai/home/dyuret/.julia/v0.5/Knet/src/net.jl:186
	WARNING: deprecated syntax "[a=>b for (a,b) in c]". Use "Dict(a=>b for (a,b) in c)" instead.
	Pkg.test("Knet")
	WARNING: symbol is deprecated, use Symbol instead./mnt/ai/home/dyuret/.julia/v0.5/Knet/examples/linreg.jl:25
	WARNING: Knet.Kfun.(:wdot) is deprecated; use Knet.Kfun.:wdot or getfield(Knet.Kfun, :wdot) instead./mnt/ai/home/dyuret/.julia/v0.5/Knet/src/compiler.jl:33
	ERROR: expecting assignment expression got  # /mnt/ai/home/dyuret/.julia/v0.5/Knet/src/kfun.jl, line 43:
	in _comp(::Expr, ::Dict{Symbol,Symbol}, ::Dict{Symbol,Any}, ::Expr) at /mnt/ai/home/dyuret/.julia/v0.5/Knet/src/compiler.jl:97

	julia 0.6 knet 0.7 cpu: (similar to julia 0.5)
	using Knet
	WARNING: Method definition randn!(Base.Random.AbstractRNG, AbstractArray{#T<:Any, N<:Any}) in module Random at random.jl:1281 overwritten in module Knet at /mnt/ai/home/dyuret/.julia/v0.6/Knet/src/util/array.jl:36.
	WARNING: could not import Base.lastidx into LegacyStrings
	WARNING: Base.writemime is deprecated.  likely near /mnt/ai/home/dyuret/.julia/v0.6/Knet/src/net.jl:186
	WARNING: deprecated syntax "[a=>b for (a,b) in c]".Use "Dict(a=>b for (a,b) in c)" instead.
	Pkg.test("Knet")
	WARNING: Method definition randn!(Base.Random.AbstractRNG, AbstractArray{#T<:Any, N<:Any}) in module Random at random.jl:1281 overwritten in module Knet at /mnt/ai/home/dyuret/.julia/v0.6/Knet/src/util/array.jl:36.
	WARNING: symbol is deprecated, use Symbol instead.
	WARNING: Knet.Kfun.(:wdot) is deprecated; use Knet.Kfun.:wdot or getfield(Knet.Kfun, :wdot) instead.
	ERROR: expecting assignment expression got  # /mnt/ai/home/dyuret/.julia/v0.6/Knet/src/kfun.jl, line 43:
	in _comp(::Expr, ::Dict{Symbol,Symbol}, ::Dict{Symbol,Any}, ::Expr) at /mnt/ai/home/dyuret/.julia/v0.6/Knet/src/compiler.jl:97

	julia 0.5 knet 0.7 gpu:
	WARNING: Base.SparseMatrix is deprecated. (in CUSPARSE)
	WARNING: Method definition (::Type{Knet._CudaArray})(CUDArt.CudaArray{#T<:Any, #N<:Any}) in module Knet at /state/partition1/dyuret/knet/publish/v0.5/Knet/src/util/cudart.jl:127 overwritten at /state/partition1/dyuret/knet/publish/v0.5/Knet/src/util/cudart.jl:128.
	WARNING: Base.writemime is deprecated. likely near /state/partition1/dyuret/knet/publish/v0.5/Knet/src/util/cudart.jl:133
	ERROR: LoadError: LoadError: LoadError: UndefVarError: TopNode not defined /state/partition1/dyuret/knet/publish/v0.5/Knet/src/util/deepcopy.jl, in expression starting on line 21
	WARNING: Method definition randn!(Base.Random.AbstractRNG, AbstractArray{#T<:Any, N<:Any}) in module Random at random.jl:1207 overwritten in module Knet at /state/partition1/dyuret/knet/publish/v0.5/Knet/src/util/array.jl:36.
	WARNING: deprecated syntax "[a=>b for (a,b) in c]". Use "Dict(a=>b for (a,b) in c)" instead.
	WARNING: could not import Test.default_handler into Main
	WARNING: could not import Test.Success into Main
	WARNING: could not import Test.Failure into Main
	ERROR: LoadError: LoadError: UndefVarError: Success not defined
	WARNING: symbol is deprecated, use Symbol instead. /state/partition1/dyuret/knet/publish/v0.5/Knet/examples/linreg.jl:25
	WARNING: Knet.Kfun.(:wdot) is deprecated; use Knet.Kfun.:wdot or getfield(Knet.Kfun, :wdot) instead. /state/partition1/dyuret/knet/publish/v0.5/Knet/src/compiler.jl:33
	catastrophic failure of tests.

2016-09-14  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# lenet broken: chased bug down to KnetArray <: AbstractArray, what do we inherit?
	# charlm: slowed down
	# charlm: experiment with other forms of lstm
	# try cpu sum and vcat see if better.
	# check multi-gpu support on KnetArrays: can we copy, free, etc with a non-active device?
	# charlm: optimize params
	# 0.7: upper limit julia 0.5, remove downloading mnist from runtests.
	# 0.7 has tests failing on Julia 0.4 and errors on Julia 0.5.

2016-09-13  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	## support cat and sub with KnetArray.

2016-09-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# mnist2d: sampling gradcheck
	## need a good gradcheck to make sure, other TODO items in charlm.
	# mnist2d: implement efficient softmax -- seems ok for now.
	# load/save using JLD: need to write KnetArray handlers.
	## turn gpu on by default if exists.
	# eliminate dependence on CUDArt.
	## support multiple gpus with KnetArrays

	# need an efficient softmax, logsumexp.
	## test on mnist --fast:
	## cpu before: 2.75  gpu before: 2.25
	## cpu after : 2.75  gpu after : 2.10
	## test on lenet --fast:
	## gpu before: 10.15  gpu after: 9.84 (mostly because of relu)
	## after using 10^8 for knetgc limit: 9.30
	## test on charlm with 10k lines of shakespeare (one epoch time) (10,2.419187890389808):
	## compare with 10.6 secs/epoch for train in Knet7:
	## gpu before: test: 2.32 train: 6.58
	## gpu after : test: 2.33 train: 6.30
	## bitarrays : test: 2.15 train: 6.20
	## fixes     : test: 2.16 train: 6.14
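	The "efficient softmax, logsumexp" item above comes down to the standard max-shift trick; a generic sketch in current Julia syntax (not the Knet kernel, which runs on the GPU):

	```julia
	# Numerically stable logsumexp via the max-shift trick: subtracting
	# the maximum before exponentiating avoids overflow for large inputs.
	function logsumexp(x; dims=:)
	    xmax = maximum(x; dims=dims)
	    return xmax .+ log.(sum(exp.(x .- xmax); dims=dims))
	end

	# softmax then follows directly as exp(x - logsumexp(x)):
	softmax(x; dims=:) = exp.(x .- logsumexp(x; dims=dims))
	```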

	## charlm:
	# loss results do not match Knet7 because of epsbump
	# cpu=gpu tested, but knet7=knet8 only forward up to epsbump; knet8 doesn't have keepstate or gclip
	## implemented maximum but AutoGrad.ungetindex failing tests


2016-09-10  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# charlm keepstate problem: state has Values so gc does not work unless we getval or reinit state between iterations!
	# charlm loss problem: knet7!=knet8 because of epsbump.

2016-09-06  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# mnist2d: cpu support, in general for all of Knet; test with v0.4 and v0.5
	# mnist2d: cpu/gpu, Float16,32,64 options
	# mnist2d: update documentation
	# put info about additional packages for examples and gpu in README
	# mnist4d
	# gc problem.

	* gc-problem: mnist4d explodes memory.

2016-09-04  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# mnist2d: add hidden option
	# mnist2d: comparison with knet7

2016-08-29  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# profile AutoGrad more, figure out memory problem, good look at unbroadcast. memory use with/without forw.  closures?
	# make AutoGrad completely generic?
	# tmpfree is dangerous, user visible variables from a=b*c may be overwritten!
	# figure out why forw adds 0.38
	# determine and minimize autograd overhead.

2016-08-28  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# move cuda -> src, src -> src7
	# should we introduce our own CudaArray type?
	# define KnetArray. (KA) use instead of CudaArray.
	# test cuda version compare with af.
	# we could just support 0,1,2 dims or 0,1,N dims for map and reduce.
	# finish cuda2arg, cublas
	# need to write cuda21 vector reductions before full AutoGrad test.

	* ArrayFire: gave up on it, my code is faster:
	# can we get AF kernels to dump out?
	# try c++ ArrayFire mnist example?
	# will arrayfire memory management still work with rnns?
	# eventually look at arrayfire convolution.
	# thrust and tensorflow also opensource.

	* JIT: could look at this if libknet8 gets too big:
	# JIT compile kernels as needed: https://blog.maleadt.net/2015/01/15/julia-cuda/
	## http://docs.nvidia.com/cuda/nvrtc/index.html#basic-usage
	## http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#mathematical-functions-appendix


2016-08-27  Deniz Yuret  <dyuret@ku.edu.tr>

	* cuda: timing tests on 100K repetitions on 1000x100 output array with simple ops:
	BLK=256,THR=256 for all (except reductions cuda20,cuda21 use 128,128)

	F32	F64	where
	0.73	1.40	cuda1  (unary)
	0.73	1.40	cuda01 (scalar,array)
	0.80	2.06	cuda11 (same size array)
	0.86	2.08	cuda12 (same size array, broadcasting)

	broadcasting (cuda12):
	F32		F64
	mat+mat	0.86	mat+mat	2.08
	mat+col	1.06	mat+col	1.53
	mat+row	1.28	mat+row	1.64
	row+col	1.53	row+col	1.57

	scalar reduction (cuda20):
	F32 2.98
	F64 3.08

	vector reduction (cuda21): (for 1000x100 matrix)
		F32	F64
	mat>col 3.54	3.66
	mat>row 0.86	1.00

2016-08-26  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# we should write a general profiler to compare cpu and general gpu libraries, primitives and mnist.
	# can we have fine control over arrayfire memory management? deviceMemInfo, deviceGC. If you are calling from a garbage-collected language like Julia, you will also need to call the garbage collector of that language beforehand to clean up the references.
	# can we improve arrayfire mnist results by using external kernels for xentloss and relu?  find better xentloss design.  using quadloss for now.
	# profile ArrayFire.
	# read ArrayFire docs.
	# find ArrayFire kernels. can we use arrayfire kernels outside?  (matmul, broadcast and reduce). They are built by JIT!


2016-08-25  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	# Finish math.jl in AutoGrad.
	# - Implement broadcast1 for sin,cos etc. - arrayfire?
	# - Take a look at Julia broadcast implementation. - arrayfire?
	# - Implement 2+arg broadcasting better than CUDNN. - arrayfire?

	# ArrayFire may do these better!!!
	## It does memory management: https://arrayfire.com/forums/viewtopic.php?f=17&t=43223
	## Can it be extended with missing functions?  Can we write new kernels?
	## Can we run CUDNN code on it?
	## Try sending AFArrays to libknet kernels or CUDNN functions to see.
	## Benchmark comparing it to CUBLAS and CUDNN.
	## Figure out the bug in the mnist example.
	## http://arrayfire.com/custom-kernels-with-arrayfire/
	## http://arrayfire.com/arrayfire-cuda-interoperability/

	# ArrayFire should have {T,N} instead of {T,4} but:
	## ambiguity for ^ need to be fixed.
	## display for vectors broken.
	## zeros(AF) gives regular array.

	# ERROR: ArrayFire Error (101) : Device out of memory
	## It seems I'll need my memory manager after all?  Does ArrayFire support target variables?
	## The problem was fixed when I enabled gc in timeit!!!
	## train0 still having issues. (1) I can copy their kernels. (2) I can call their gc?
	## axpy was the culprit, back to normal.

2016-07-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO7:

	0000 add code citation for bibtex like autograd
	0000 debug: master and dev give different results on MNIST4D.main("--epochs 1")
	0000 debug: (initialization?) master and dev differ on CopySeq.main("--epochs 1 --dense "*Pkg.dir("Knet/data/seqdata.txt"))
	0000 debug: liblinear and knet give different results for softmax regression
	0000 change aws documentation to use regular instead of spot pricing
	0000 CUBLAS should be extended with all the allocating ops like *,+ etc. for debugging.
	0000 gpu() should take a gpu id.  Knet should automatically use the most empty gpu.
	0000 should be able to work with both cpu and gpu models/arrays. gpu() is not flexible enough. Array types should decide.
	0000 BUG: repeat is buggy, it doesn't work if we remove o... from the definition of mlp.
	0000 BUG: soft is buggy, its derivative won't work if used inside the network, go back to xentloss/softloss design.
	0000 LoadError: unrecognized keyword argument "nrepeat" https://github.com/denizyuret/Knet.jl/issues/3
	0000 Make Knet.gpu() take an int option and set the device on multi-gpu machine, by default have it search for the device with the least memory load.
	0000 NaN problem: julia ModelC-S-xyz.jl --lrate 0.263519 --decay 0.131408 --dropout 0.263145
	0000 Register project under Julia (probably after fixing the CUSPARSE incompatibility and HTTPClient issue and documentation TODO fixes).
	0000 The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model parameters during training. The approach is based on a simple idea: if the partial derivative of the loss, with respect to a given model parameter, remains the same sign, then the learning rate should increase. If the partial derivative with respect to that parameter changes sign, then the learning rate should decrease. Of course, this kind of rule can only be applied to full batch optimization.
	0000 Write Knet paper, send somewhere.
	0000 add cpu conv tests to cputest.jl, test HTTPClient, register with Julia
	0000 bmul is incomplete, complete and test it.
	0000 check the momentum definition compared to the Bengio book. does alpha=0.99 mean the same thing?
	0000 decide ml course concept map and book chapters
	0000 define soft73loss to be the old xentloss, i.e. softmax + cross-entropy loss, define xentloss to follow soft layer?  for binary input we have the same problem, logisticloss vs sigmloss for sigmoid followed by logistic loss?
	0000 fix new cusparse.jl v4
	0000 installation on a cpu machine produces many warnings and errors, find a way to conditionally load gpu support.
	0000 logisticloss: need a loss function to use with a single probability output with sigm.
	0000 transfer to balina (tenten, wikidump)
	0000 unit test optimizers from http://cs231n.github.io/neural-networks-3/#ada
	0015 - profile lstm and figure out multi-gpu: amazon has multi-gpu machines: http://www.nvidia.com/docs/io/116711/sc11-multi-gpu.pdf
	0025 speed comparison with barret code on charlm
	0040 - op/loss.jl: Fix/test all losses.  Update comments del l.y.  No need for forw.  Just loss, which returns loss and sets gradient.  Check tmp use avoid alloc.  Retire/fix logploss, xentloss, percloss, scalloss. -- src/op/loss.jl: cleanup, move out of op.
	0050 use templated code for cuda kernels
	0060 initforw: does not check sharing or read-before-write registers when cond/minibatch changes.
	0085 ai-mtg: kernels and perceptrons. kperceptron.jl: reimplement within the new framework. perceptron and structured perceptron examples. perclossback: these should be scaled 1/nx, why isn't our gradient check complaining? reimplement averaging. for perceptron.
	0090 docs: 3 sections: tutorial, examples, reference
	0090 docs: 3-arg loss functions (user does not need)
	0090 docs: ArgParse and scripts
	0090 docs: add convexity discussion to the ml book
	0090 docs: add to README: what is Knet and why should you bother.  compositional models.  benchmarks.
	0090 docs: anatomy of a knet function.
	0090 docs: colon and symbols
	0090 docs: convolution and pooling: explanation
	0090 docs: data types net, reg, stackentry access functions etc. don't forget to export whatever is mentioned in docs.
	0090 docs: each example can be its own doc
	0090 docs: explain the acronyms for wbf cbfp etc.
	0090 docs: find the paper that shows tradeoff for minibatching.
	0090 docs: fix knet.svg
	0090 docs: how to add new (1) comp ops, (2) prim ops, (3) updates, (4) rgens. (5) loss fns.
	0090 docs: knet function anatomy.
	0090 docs: link Julia functions to Julia doc
	0090 docs: nce
	0090 docs: netprint?
	0090 docs: optimizing parameters
	0090 docs: perceptron, kernel perceptron
	0090 docs: predicting with lenet
	0090 docs: s2c, s2s
	0090 docs: sell speed using benchmarks
	0090 docs: setseed and replicatability
	0090 docs: size with without dims option.
	0090 docs: structured learning
	0090 docs: update!: parameter averaging
	0101 example: aliya: try bit representation on ipa
	0101 example: character convolution models: its got good notation, lstm, conv, highway networks, language modeling... lm paper: github.com/yoonkim/lstm-char-cnn 1508.06615v4.pdf convlm: work on char-conv based lm/mt model? - is batch norm a better gclip alternative? - do we need rmsprop or adam? - how many hidden states does the conv char lm paper use? - small=2x300, large=2x650 - for the V=10k mikolov data. - bptt 35 time steps (words?) - batch size of 20 and 100 - 25 epochs - dropout=0.5 - gclip=5 - ashish said that didn't work without highway networks, do we need them?
	0101 example: many-to-one and one-to-many mt training.  mt models that fix the hidden state.  swapping input-output languages fixing the hidden state.  learning paraphrase models from mt models.
	0102 example: yonatan supertagging
	0105 example: onur/ozan ner bidirectional model version 2. try dropout and other architectures for ner.
	0107 example: aliya ipa example: "Bi-directional conversion between graphemes and phonemes using a joint n-gram model.", "http://arxiv.org/pdf/1506.00196v3.pdf"
	0124 example: attention s2s model: http://arxiv.org/abs/1508.04025
	0124 example: translation s2s model: http://arxiv.org/abs/1409.3215
	0125 example: ntm and variants: http://arxiv.org/abs/1410.5401 http://arxiv.org/abs/1505.00521
	0126 example: ctc apply speech model to mt: https://github.com/baidu-research/warp-ctc http://www.cs.toronto.edu/~graves/icml_2006.pdf
	0127 example: image captioning example http://arxiv.org/abs/1411.4555 
	0129 example: rnn based parser. ashish.
	0130 example: language learning in minecraft
	0132 example: learning to interpret python, find other examples: http://arxiv.org/abs/1410.4615
	0133 example: net2net: http://arxiv.org/pdf/1511.05641
	0220 saman: add lcn example to knet, push lcn branch to master. lcn: add example: http://papers.nips.cc/paper/4773-convolutional-recursive-deep-learning-for-3d-object-classification.pdf
	0220 saman: add msra init: arXiv:1502.01852, http://www.jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf; https://github.com/BVLC/caffe/blob/master/include/caffe/filler.hpp#L129
	0220 saman: profile lcn: /mnt/kufs/scratch/szia13/mnist/foo.jl; You can pass lcn_mode as true or false to use it. The directory also contains the gaussian kernel for lcn. I timed one convolution layer followed by softmax with and without lcn and here are the timings: without lcn: 56.18 for 20 epochs (2.81 seconds per epoch on avg); with lcn: 131.69 for 20 epochs (6.58 seconds per epoch on avg)
	0220 saman: ~/knet/lcn/examples/: mnist4d("--actf relu --gcheck 10 --xscale 255 --lr 0.001") gives lots of gradient errors.
	0221 saman: waiting for saman to debug: implement lrn from cudnn (*LRN* and *Divisive*): Anyway LRN is local response normalization. Its mentioned in the following paper: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf; This is similar to LCN but its more like brightness normalization rather than contrast normalization and its used in ImageNet. So i think it would be good to incorporate it in Knet. Divisive Normalization is also associated with it and it performs normalization using a precomputed means matrix. So far only this has been implemented. If there would be a way to compute means and dev itself, this could be used for the division part of LCN. check out cudnnLRNMode_t and cudnnDivNormMode_t in the latest cudnn
	0800 update: Barrett's training tricks from Google: 1. Element wise clip the cell states so they do not go above 50, 2. Clip the norm elementwise at a threshold value (I do not know what they use), 3. Take the total norm of all matrices except the input embeddings and rescale the norm so it is <=5
	0800 update: Ozan's optimization tricks: We are using adam. There is no common choice for optimization method. People use adam, rmsprop, momentum or their own learning rate schedule. Some tricks: - Norm clipping - Truncated backprop for large sequences - Batching sequences - Initializing hidden to hidden weights with identity matrix for elman type networks with relu activation - Large initial values for forget gate bias in LSTMs.  In response to: are you guys using adam, rmsprop, adadelta, adagrad etc?  what tricks to people prefer for training rnns?
	0800 update: batch normalization: http://arxiv.org/abs/1502.03167 http://arxiv.org/abs/1510.01378, cudnn4 has it.
	0800 update: callbacks for update: registering the operation with atreplinit(): : registry look at atexit etc for examples from julia. update: update.jl: figure out how to extend update with callbacks so new optimization tricks can be added.  would it be just easier to leave this to the user: they write their own train function anyway.  they have access to setp.  just provide some examples.
	0800 update: gradient noise http://arxiv.org/pdf/1511.06807v1.pdf: basically try all combinations of weight noise, gradient noise, activation noise.  prevent blowup by normalization/clipping.
	0800 update: implement adadelta.  test all of them.
	0800 update: maxnorm should give a detailed report on all out/dif arrays rather than a single number. Probably makes sense to do this inside the Net. Could be like a training option.
	0800 update: understand xavier/glorotuniform: it works so well, may inform gradient/weight clipping and reporting, i.e. if xavier is based on some norm, clip based on that norm.
	0800 update: update.jl: add norm clipping and other Google tricks: clip types: calc(per-element, per-matrix, all-recurr, all-model) x clip(g, w, x) x (all, recurr) ?
	0800 update: we have not tested adagrad etc. in Knet7 yet.  waiting for callback update implementation.
	0900 speed: barret speed: for sparse input, keep dw dense, but use a mask to make w += dw faster.
	0901 speed: barret speed: implementing the whole lstm with single matrix multiplication: subarray operation?
	0902 speed: gemm can do incremental updates: pass incr back to ops instead of creating tmp.
	0950 speed: mul2: can probably be implemented using blas sbmv with diagonal m.  (else change order to x1,x2,y) also check out scale! with vector scale argument.
	0999 - update docs, remove dead code, update comments, check src TODO, - src/op/compound.jl: finish documentation for all ops.
	1000 Figure out how to debug Julia source -- cannot see local variables.
	1000 Figure out how to run on multi-gpu machines: both how to choose single gpu, or use multi-gpu: gpu.jl was running init on all gpus, fixed that.
	1000 Fix "Parsing the Penn Treebank in 60 seconds" blog post to work with the new Knet, or have a Knet version to checkout for this demo.
	1000 It takes 24 methods to define a loss function!  Need better array library so we can write generic code.
	1000 Train bidirectional rnn to predict substitutes, use for scode.
	1000 comparison: another deep learning evaluation kit: https://github.com/zer0n/deepframeworks/blob/master/README.md
	1000 comparison: convnet speed comparison: https://github.com/soumith/convnet-benchmarks. The author explains the fastest-running system according to the benchmark scores above in the comment below. http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/#comments
	1000 comparison: http://blog.udacity.com/2016/01/putting-deep-learning-to-work.html tensorflow course?
	1000 comparison: http://googleresearch.blogspot.com/2015/11/tensorflow-googles-latest-machine_9.html
	1000 comparison: https://github.com/Lasagne/Lasagne
	1000 comparison: https://github.com/Microsoft/CNTK (microsoft)
	1000 comparison: https://github.com/baidu-research/warp-ctc
	1000 comparison: https://github.com/dmlc/mxnet (baidu?)
	1000 comparison: https://velesnet.ml/
	1000 comparison: https://www.projectoxford.ai/
	1000 comparison: sell shorter implementation by comparing with other frameworks: http://josephpcohen.com/w/visualizing-cnn-architectures-side-by-side-with-mxnet/
	1000 compiler: warn for unused constants
	1000 cpu conv/pool support.
	1000 cpu sparse support: mnist2dx does not run: Need to implement: A_mul_B!(::Array{Float32,2}, ::Array{Float32,2}, ::SparseMatrixCSC{Float32,Int32}); the other direction is implemented in sparse/linalg.jl.  take a look at how '*' works.
	1000 cusparse.jl: check efficiency of resizecopy for different array types.
	1000 handout-mtg Replace 0's with ? or _ in dimensions.
	1000 highway networks
	1000 ijulia notebooks: https://github.com/JuliaLang/IJulia.jl
	1000 implement an integrated optimizer like hyperopt within knet, dhc maybe?
	1000 implement maxnorm activation function.
	1000 initforw/forw: write in-place transpose for dy or decide on extra space for ops that need temp space (use tmp?).
	1000 inittype: fix inittype to check parameters as well as inputs, error if inconsistent, use Float32 if not found.  Add detection of sparse array types. does initsize play well with stype?  what is sparse other than input? What if there is a conflict with par.init or a previous eltype? initforw: type/size inference in inputless networks?
	1000 investigate irnn/relu gradcheck failure (adding, mnistpixels) make sure it is not a bug
	1000 julia: timholy: If you can come up with a solution that exploits Requires.jl, do you think that would be better? (does this solve conditional gpu library dependency problem for Knet?)
	1000 linalg.jl: A_mul_B!{T}(C::CudaMatrix{T}, A::CudaMatrix{T}, B::CudaSparseMatrixCSC{T}) allocates.
	1000 loss should give total loss, the ycols normalization makes it too confusing.  back can do the normalization?
	1000 net/util.jl: look into julia nullables to make this nothing=zero matrix thing better
	1000 pool.jl: function cudnnGetPoolingNdForwardOutputDim(pd::PoolingDescriptor, input::AbstractCudaArray) check if added to the library. will be fixed in v4.
	1000 probably should move cuda and julia code from linalg into add/mul etc.
	1000 setp/getp should work with register names, indices to set a single register.
	1000 src/netcomp.jl: convert all asserts to meaningful error messages.
	1000 src/op/conv.jl,pool.jl: test 3D-4D-5D convolution and pooling.
	1000 src/op/conv.jl: fix old versions to run comparison test.
	1000 test and integrate cudnn v4, it has batch normalization, lcn, lrn.  may have fixed some bugs.
	1000 testing: add a testnet.jl that constructs random nets and does gradient testing: testlayers could not find the pool bug
	1000 urnn, a more sophisticated irnn but same norm preserving idea: http://arxiv.org/pdf/1511.06464v2.pdf (not only norm preserving helps convergence but rnn without stack may be possible with reversible ops! also look at the other reversible paper: arXiv:1502.03492)
	1008 Julia: auto_unbox error: seems quadloss2 related. why do we get at exit (cudnn? other finalizers?): error in running finalizer: ErrorException("auto_unbox: unable to determine argument type"). check julia-dev response. gdb not showing symbols... opened CUDArt issue. Running gradcheck prevents the error (because it forces gc?).  Using softmax also prevents error (again, there are temp variables that force gc?)
	1008 Julia: cublas latest master broke knet, highlevel.jl conflicts with linalg.jl?  broke adding.jl. ArgumentError: output matrix must not be aliased with input matrix
	1008 Julia: cusparse latest master broke knet: UndefVarError: SparseArrays not defined. Check author's email.
	1009 add streams for speed-up: the problem is not as urgent as we thought 3500 vs 3750 wps.
	1300 package KUdense as DynamicArrays.jl and post it: add gpu conditionals and cpu only tests, sparse?; should turn constructors into convert.
	1500 add copytenten to runtests: src/data/S2SData.jl: add epochsize argument so copytenten can stop early.
	1500 examples/adding.jl: check out unstability of 400. share results with authors.
	1500 examples/mnist4d.jl,mnist2d.jl,rnnlm.jl: test with dropout
	1500 examples/rnnlm.jl: did not get good results on the large experiment, debug.
	1500 examples: ArgParse cannot catch error: fix is in https://github.com/carlobaldassi/ArgParse.jl/issues/24
	1500 examples: ArgParse, args: consolidate example options in one file.  fix train and predict.  turn libsvm2sparse into a data generator.
	1500 examples: reimplement predict, train, and tutorial.jl.
	1501 - add ipa model train and predict to runtests.jl.
	1501 - all sparse code slowed down? (lcn, master test).  due to xavier.  find out why. rest of mnist speed-up due to not using ItemTensor?
	1501 - examples: consolidate train/test functions, maybe demo options. use as_symbols option for parse_args
	1501 - test dropout and lcn by adding them to mnist2d, mnist4d?
	1501 addingirnn: slow, keep profiling addirnn, initforw probably responsible for remaining difference?
	1501 analyze array sharing in lstm and irnn, compare to master
	1501 copyseq: get rid of ystack alloc (l.180); see if add2 can pass dy back twice without copying; see if any more array sharing possible for add.back.
	1501 ipa: port to Knet7
	1501 julia mnist2d.jl --epochs 1 --batchsize 1 does not learn, figure out why.
	1501 nce needs mask
	1501 ncelm: port to Knet7
	1501 ner: port to Knet7
	1501 rnnlm: simpler rnnlm example without keepstate, with mask.
	1501 s2s needs nce
	1501 test/rnntest.jl: back gives eq for layers 18..8, approxeq for layers 7..1: TODO: investigate why
	5000 runtime improvements: find more register sharing for dif in initback.
	9999 find out why profiling does not work: @time does not mix well with @profile.  valgrind detects memory bugs.

2016-06-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* TRACE-DESIGN-2:

	The design falls out of the following requirements:

	- No compiler: the user runs regular Julia code for forward.
	- All ops on the path from a parameter to a loss need to be recorded for backward.
	- Don't know when and where the losses are going to appear.
	- All ops involving parameters and their descendents need to be recorded.
	- Create recording methods for each op marked with argument types.
	- Mark parameters with Par and descendents with Dat data types.
	- Par and Dat are only there to mark the need for recording and to indicate what/how to update.
	- Par is read-only during forward, modified only during init and update.
	- Dat can be overwritten multiple times during forward. (e.g. in RNN; record has to handle this)
	- Dat is needed because after a Par-Array op, there could be Array-Array ops before loss.
	- Both Par and Dat can interact with regular Array.
	- We can't just use special ops instead of Par/Dat types, user error may prevent recording using regular ops.
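	The requirements above can be sketched in a few lines: wrapper types dispatch to recording methods, and any op touching a Par or Dat returns a Dat so descendants of parameters stay on the tape.  Names here (Par, Dat, TAPE, val) are illustrative, not AutoGrad's actual API:

	```julia
	struct Par; val; end      # parameter: read-only during forward
	struct Dat; val; end      # descendant of a Par: must be recorded
	const TAPE = Any[]        # global execution trace for backward

	val(x) = x                # unwrap marked values, pass others through
	val(x::Union{Par,Dat}) = x.val

	# Recording method: marked arguments trigger dispatch here, the result
	# is wrapped in Dat so downstream ops keep getting traced.
	import Base: *
	function *(a::Union{Par,Dat}, b)
	    y = Dat(val(a) * val(b))
	    push!(TAPE, (:*, a, b, y))   # record (op, inputs, output)
	    return y
	end
	```

	With this, w*x is recorded whenever w is a Par, and backward can replay TAPE in reverse without any compiler support.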

	Problems: We want to preserve Julia syntax options (* vs
	A_mul_B! etc.) and just add recording.  As soon as we introduce
	a new type, many parts of the hierarchy need to be defined.
	With two types it is even worse.  Can we do this without
	defining new types?  A macro like @into!?  An instrumenting
	@profile?  We would still need to define back for each op, but
	these can be generic and at least not needed for each arg type.
	What about for loops and conditionals?  Leaving where to put
	the macro to the user is dangerous.

	The problem with the instrumenting profile solution is that it
	records the top level lines, not the low level operations like we
	want.  We want to instrument few low level operations (A_mul_B*,
	broadcast, conv, pool etc.) and build everything out of these.
	The datatypes inform the user which operations have been
	instrumented.  Using Julia names or syntax on this primitive
	operations is optional and probably not very practical.  So if we
	want to write tanh(w*x.+b), we'll have to define these operations
	for the new datatypes.  We can define/instrument them in bulk
	using macros.

2016-06-16  Deniz Yuret  <dyuret@ku.edu.tr>

	* TRACE-DESIGN-1:

	Instead of pre-compilation, why don't we do runtime trace?  Record
	all operations that involve parameters and their children.  Once
	back is called go backwards on the whole execution history.  Any
	part of the code that is not affected by parameters (for
	loops with constants, constant conditionals) need not be recorded.

	Can we use the trace mechanism?  None exist except @profile.
	Can we record everywhere or just inside knet functions?
	Is it possible just to override some functions and leave for loops alone?
	Can we get rid of forw and just call knet functions?

	write @knet8 macro.
	replace array ops with instrumented versions that save state.
	save execution trace to global stack.
	call back with gold output and loss function?
	how do we distinguish parameters?
	How do we mark children of parameter operations?
	Record everything and just go back on the useful ops (useful means on the path from a parameter to loss).
	Need to be able to turn off recording for test and have reset, save-state?

	If we record everything, do we need to mark parameters?  They are
	just like any other array.  Except we want to take derivative wrt
	them.  How do we represent identity of arrays?  Two variables can
	point to the same array.  One array can change over time...

	We can have a specific parameter type, which takes care of
	specializing functions, but what about array-array intermediate
	ops without parameters (mark those with a special type as well?)

	How do we give the target array in the syntax?  Revert back to
	ugly Julia?

	We could use InplaceOps again.  Need to extend it with relu etc.
	Or relu etc. may be inplace by default anyway.

	If we are going to type @into! for each expression, we may as well
	do @knet and do what we want (recording etc) thru special
	functions?  vs. control things thru special array types?  But
	@into! will be optional for readability, we'll allow using
	A_mul_B! etc.

	Two array types (one for parameters, one for data arrays) seem the
	better solution.  We could also revive the dynamicarrays library and
	implement e.g. blobs of caffe with both gpu and cpu pointers etc.
	convert can be used to get cpu or gpu pointers on demand.  Should
	look at caffe blobs in more detail to see if this is worth it.

	Are two array types sufficient to construct the full comp graph?
	Ops can be: arr=op(par,arr), arr=op(arr).  par never on the left
	side, it is a constant.  Anything that leads from par to loss
	should be recorded.  Don't know where loss is going to be called
	during forward pass, so record everything that involves par or
	arr. Any ops that involve regular arrays or constants?  par gets
	initialized and updated outside of recording.

	We'll have to write a lot of methods for a lot of array
	combinations again.  Each will have to know how to record as
	well.  Can't we do something more minimal?

	If we do everything with a @knet macro, can we avoid defining new
	types and ops?  The @knet expressions will check the global save
	variable and save everything automatically.  Leave if/for alone.
	Could also do what @into! does.

	Who does the array allocation?  We want to get as close as we can
	to somebody writing a natural julia function to go forward.

	InplaceOps takes arithmetic ops, converts them to internal ops
	(add! etc.) which in turn gets converted to A_foo_B or broadcast.
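	For reference, the in-place pattern that InplaceOps lowers to is
	the A_mul_B! family, renamed mul! in Julia 1.x: the product is
	written into a preallocated output.

```julia
using LinearAlgebra

# In-place multiply: the result is written into C, no output allocation.
# (mul! is the Julia 1.x name; A_mul_B! above is the pre-1.0 equivalent.)
A = [1.0 2.0; 3.0 4.0]
B = [1.0 0.0; 0.0 1.0]      # identity, so C should come out equal to A
C = similar(A)
mul!(C, A, B)
```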

	Check this out: The syntax @generated function enables generation
	of specialized methods based on argument types.
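	A standard (non-Knet) illustration: the body of a @generated
	function runs once per argument-type combination and returns the
	expression to compile, so e.g. a tuple sum can be unrolled for
	each tuple length N.

```julia
# The generator body sees only types; N is the tuple length, so the
# returned expression is a fully unrolled sum specialized per length.
@generated function tuplesum(t::NTuple{N,Int}) where {N}
    ex = :(0)
    for i in 1:N
        ex = :($ex + t[$i])
    end
    return ex
end
```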


2016-04-01  Deniz Yuret  <dyuret@ku.edu.tr>


	* DONE:
	DONE implement gru using con. or axpb?
	DONE use download instead of HTTPClient or Request!  Remove those from the requirement list.
	DONE switch back to HTTPClient in REQUIRE and in tutorial, solve problem at koc: libcurl fails, libcurl/master passes (this is because it doesn't test anything, the regular version calls HTTPClient.test), (cancel: see what changed, apply fix to httpclient), with regular libcurl v0.2.0 osx and ubuntu works, centos6 does not. cancel all, julia has a download function!
	DONE xavier was buggy with FC layers, fixed it and added a scale parameter.
	DONE update: implement alternatives to adagrad: adam, rmsprop, adadelta. test on bparser.

2016-02-06  Deniz Yuret  <dyuret@ku.edu.tr>

	* cpu-conv:
	dyuret@iui-5-0:~/knet/dev[0]$ git pull
	remote: Counting objects: 21, done.        
	remote: Compressing objects: 100% (21/21), done.        
	remote: Total 21 (delta 9), reused 0 (delta 0), pack-reused 0        
	Unpacking objects: 100% (21/21), done.
	From github.com:denizyuret/Knet.jl
	c6481ec..c1cfa23  dev        -> origin/dev
	Updating c6481ec..c1cfa23
	Fast-forward
	examples/mnist4d_debug.jl  | 245 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
	examples/mnist4d_sample.jl | 230 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
	src/Knet.jl                |   1 +
	src/util/conv_pool_cpu.jl  | 120 +++++++++++++++++++++++++++++++++++++++++++
	test/cpu_cptests_rand.jl   |  63 +++++++++++++++++++++++
	test/cpu_cptests_simple.jl |  58 +++++++++++++++++++++
	6 files changed, 717 insertions(+)
	create mode 100644 examples/mnist4d_debug.jl
	create mode 100644 examples/mnist4d_sample.jl
	create mode 100644 src/util/conv_pool_cpu.jl
	create mode 100644 test/cpu_cptests_rand.jl
	create mode 100644 test/cpu_cptests_simple.jl

	dyuret@iui-5-0:~/knet/dev[0]$ git log | head
	c1cfa23 2016-02-06 Merge pull request #4 from kuruonur1/conv_cpu
	c928733 2016-02-06 conv gemm
	7c68461 2016-02-05 conv cpu works
	c6481ec 2016-02-05 updates

2016-01-25  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE cusparse v4 broken: quick fix with REQUIRE version spec
	DONE try new cudnn 404

2016-01-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE Turn on travis testing
	DONE - work on yonatan example?
	DONE decide ml course examples
	DONE example: barret: check on tur/uzb experiments. try pre-training for low resource mt: we can use rnnlm or copy, may need permutation matrix before encoder. work on pretraining for low resource translation using s2s or lm models.
	DONE example: need something for turkish translation to report to tubitak
	DONE example: try s2s model in turkish translation: jonmay sent link to low density translation data.
	DONE example: yonatan: language learning model
	DONE merge pull requests: https://help.github.com/articles/checking-out-pull-requests-locally/

2016-01-13  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE docs:  rename charlm minibatch to something else.
	DONE docs:  sforw/sback
	DONE docs: do not use randn, Float64 trouble later.
	DONE docs: gutenberg does not let download from certain machines.
	DONE docs: load/save data in jld
	DONE docs: load/save model in jld
	DONE docs: repeat operator
	DONE docs: update!: gradient clipping
	DONE docs: generate pieces from docs/shakespeare.jld
	DONE docs: sforw vs forw (forw is used during test for rnns!)

2016-01-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE demo not working in *julia*.
	DONE master fails the cpu tests again.
	DONE to_host undefined error during tutorial on cpu.
	DONE docs: 2-arg loss functions
	DONE docs: @knet function as model
	DONE docs: @knet function as new op
	DONE docs: @knet return
	DONE docs: back
	DONE docs: bug-feature request (github issues)
	DONE docs: classification and softloss, zeroone loss
	DONE docs: compile
	DONE docs: conditionals, kwargs to forw
	DONE docs: contributing (fork/pull)
	DONE docs: data normalization
	DONE docs: defining new primitives
	DONE docs: dot/add vs */+
	DONE docs: dropout
	DONE docs: elementwise and broadcasting operations: .+ .- .* ./
	DONE docs: forw
	DONE docs: get registers and looking at network internals
	DONE docs: init options to par: array, rgen
	DONE docs: installation
	DONE docs: julia array indexing, colon operator.
	DONE docs: keyword args in knet function
	DONE docs: keyword args to compile
	DONE docs: list of primitive ops
	DONE docs: minibatching
	DONE docs: other training options
	DONE docs: read-before-write ops, sequences, rnns: language model? karpathy? smaller character based word model?
	DONE docs: reading data: readdlm, HTTPClient.get
	DONE docs: regression and quadloss (housing example)
	DONE docs: setp and learning rate
	DONE docs: splat
	DONE docs: sum, mean, std with dims option.
	DONE docs: train script
	DONE docs: trn/tst split
	DONE docs: update!
	DONE docs: xavier, uniform etc.
	DONE docs: zero size, size inference: could have a tutorial section on this following new operators.
	DONE docs: add save and best options to charlm, write a generator with temperature.
	DONE docs: call the option dropout, not training.
	DONE docs: commit charnn once experiments done
	DONE docs: compile kwargs
	DONE docs: finish dropout example
	DONE docs: forw kwargs
	DONE docs: get rid of loadnet savenet
	DONE docs: have a table of defined kfun.
	DONE docs: have both dropout and a separate conditional section with rnn example?
	DONE docs: introduce compiler kwargs in one paragraph, example.
	DONE docs: list compound ops
	DONE docs: list prob dist
	DONE docs: minibatch source.
	DONE docs: play with charlm batch size!
	DONE docs: put dropout example back where it was.
	DONE docs: setup amazon machine.
	DONE docs: show source for minibatch, make it more readable.
	DONE docs: size inference
	DONE docs: wrap charlm up and start yonatan.
	DONE docs: write karpathy demo for rnns, or rewrite rnnlm, or copy with words
	DONE work on lorelei tokenization?
	DONE docs: add intro/conclusion at all levels. 
	DONE docs: amazon machine, pull/fork, issues.
	DONE docs: fix doctest again.
	DONE docs: installation link is broken: http://www.sphinx-doc.org/en/stable/markup/inline.html
	DONE docs: keyword args to compile(), 
	DONE docs: keyword arguments. 
	DONE docs: primitive ops. 
	DONE docs: ref links do not show up in github, neither does :math:; this is normal, it happens on Julia docs as well.
	DONE docs: rnn1: would be nice to use 0 for xsize at this point.  Also this is the second time we are using Xavier etc without much explanation.
	DONE docs: size inference?
	DONE docs: broadcasting, explain in minibatch. even earlier we have broadcasting in lenet.
	DONE docs: introduce table of distributions, Bernoulli etc.
	DONE docs: update options


2016-01-10  Deniz Yuret  <dyuret@ku.edu.tr>

	* amazon:
	3 types of gpu instances:
	Instance type 	vCPUs 	Memory (GiB)	Storage (GB)	Weighted capacity 	Total bid price  	% of On-Demand 
	  cg1.4xlarge	16	22	2 x 840	1	$0.07 	3%
	  g2.2xlarge	8	15	1 x 60 SSD	1	$0.07 	11%
	  g2.8xlarge	32	60	2 x 120 SSD	1	$0.07 	3%

	Model	GPUs	vCPU	Mem (GiB)	SSD Storage (GB)
	g2.2xlarge	1	8	15	1 x 60
	g2.8xlarge	4	32	60	2 x 120

	High Frequency Intel Xeon E5-2670 (Sandy Bridge) Processors
	High-performance NVIDIA GPUs, each with 1,536 CUDA cores and 4GB of video memory

	cg1.4 is previous generation, only available US east, $2.
	g2.2 is $0.07, g2.8 is $0.29

	g2.2 has 1 GRID K520 NVIDIA GPU, 4GB GPU RAM, 16GB CPU RAM, 1 CPU
	with 4 physical cores.  Looks like 7 with hyperthreading. Intel(R)
	Xeon(R) CPU E5-2670 0 @ 2.60GHz.

	g2.8 has 2 sockets, 8 cores per socket, 16 physical 32 virtual cores.
	$ egrep -e "core id" -e ^physical /proc/cpuinfo|xargs -l2 echo|sort -u
	60GB CPU RAM.  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz.
	4 K520 GPUs each with 4GB RAM.

	nvidia ami: Cannot be made public!!!
	amzn-ami-graphics-hvm-2015.09.1.x86_64-ebs-d3fbf14b-243d-46e0-916c-82a8bf6955b4-ami-b0b8c8da.2 (ami-943956f4)
	Driver version 340.32
	libcublas.so.6.5.14 (under /opt/nvidia/cuda/lib64)
	added:
	emacs-24.3
	git-2.4.3
	perl-5.6.13
	python-2.7.10
	mlocate
	sudo yum --enablerepo=epel install hdf5
	cmake (for MbedTLS)

	Requesting ec2 instance:
	- Request spot instance.
	- Choose Amazon Linux AMI with NVIDIA GRID GPU Driver.

	http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cluster_computing.html#gpu-operating-systems
	https://aws.amazon.com/ec2/spot/getting-started/
	https://console.aws.amazon.com/billing/home?#/
	http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html

	Login command: ssh -i /path/my-key-pair.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com

	AMI creation:
	ami-809feae0 snap-ecd005d4 vol-696e1f91 /dev/xvda
	ami-4f9fea2f snap-d81561e1 vol-696e1f91 /dev/xvda Knet-0.7.2 cannot make public!

	Start with: ami-d5ea86b5
	Amazon Linux AMI 2015.09.1 (HVM), SSD Volume Type - ami-d5ea86b5
	amzn-ami-hvm-2015.09.1.x86_64-gp2 (ami-d5ea86b5)
	[ec2-user@ip-172-31-9-42 ~]$
	ami-b99beed9 snap-53dfb770 vol-f2bccd0a /dev/xvda Knet-0.7.2 public image!
	ami-149bee74 snap-90d01fb2 vol-f2bccd0a /dev/xvda Knet-0.7.2a more up to date.


2016-01-08  Deniz Yuret  <dyuret@ku.edu.tr>

	* charlm:
	512x3: (98,0.0523347633027361,1.1037882519571802,1.254777816141009)  source ../setup.sh; julia charlm.jl 100.trn 100.dev --epochs 1000 --hidden 512 --embedding 512 --nlayer 3 --lr 1.0 --gclip 5.0 --dropout 0.5
	best @10 epoch: (8,1.0,1.0524426423170234,1.2930993204383174) tmp/source_setup_sh_julia_charlm_jl_100_trn_100_dev_epochs_10_hidden_1024_embedding_256_nlayer_1_lr_1_0_gclip_5_0_dropout_0_0.out:
	best @20 epoch: (18,0.81,1.0498852644873522,1.263663904705879) tmp/source_setup_sh_julia_charlm_jl_100_trn_100_dev_dropout_0_5_batchsize_128_hidden_1024_embedding_256_nlayer_1_lr_1_0_gclip_5_0_epochs_20.out:(18,0.81,1.0498852644873522,1.263663904705879)
	best @20 epoch: (20,0.9,1.118612103332397,1.2526385033000718) source_setup_sh_julia_charlm_jl_100_trn_100_dev_dropout_0_5_batchsize_128_hidden_1024_embedding_256_nlayer_2_lr_1_0_gclip_5_0_epochs_20.out
	foo.jld with (8,1.0,1.0533660260935827,1.3006370929184012) generates some pretty good Shakespeare.
	So 1.3 seems reasonable.  Figure out how to get there fastest.

	* DONE:
	DONE loadnet, savenet
	DONE example: karpathy character based models, generating wikipedia etc. inputless networks with random generators


	* savenet:
	example charlm model with:
	Dict{Symbol,Any}(:lr=>1.0,:savefile=>nothing,:loadfile=>nothing,:dropout=>0.0,:bestfile=>nothing,:embedding=>256,:gclip=>5.0,:hidden=>256,:epochs=>1,:nlayer=>1,:decay=>0.9,:seqlength=>100,:seed=>42,:batchsize=>128,:datafiles=>Any["100.dev","100.dev"])
	172895296 save size
	147279959 compressed save
	166699081 getbytes
	We could cut this to a third if we did not save tmp and dif0.
	If we zero these out initback won't realize and won't reallocate.
	If we zero all but persistent we can get more savings.
	2694376 after zeroing all but persistent arrays
	2720536 when compressed: takes more space!

2016-01-05  Deniz Yuret  <dyuret@ku.edu.tr>

	* intro.rst: 3613 words = 8.5 pages.  425 words/page.

2016-01-04  Deniz Yuret  <dyuret@ku.edu.tr>

	* DOCS:
	.. - kfun as model: linear regression.
	.. - kfun as new ops: mnist lenet.
	.. - compile time parameters: 
	.. - runtime parameters: conditionals: dropout? on mnist lenet?
	.. - rbw registers: rnn intro, rnnlm (char based).
	.. - conditionals: copyseq or adding or dropout?
	.. 
	.. - linear regression?  uci?  https://archive.ics.uci.edu/ml/datasets/Housing
	.. - or do we do artificial data generation: cpu/gpu conversion may be difficult.
	.. - mnist definitely
	.. - mnist4d for convolution
	.. - maybe something else for simple nnet?
	.. - copyseq to introduce rnns
	.. 
	.. DONE:
	.. 
	.. - we need to talk about installation somewhere.
	.. - Other requirements like v0.4.0, cuda libraries, cpu compatibility etc.
	.. - DONE: Install latest v0.4.2.
	.. - DONE: Update packages.
	.. - DONE: Figure out no-gpu installation (CUDA* requirements)
	.. - DONE: Create an amazon aws image for easy gpu work.
	.. .. see http://sphinx-doc.org/ext/doctest.html
	.. .. testcode for regular doctest for prompted examples
	.. .. http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#directives

2015-12-30  Deniz Yuret  <dyuret@ku.edu.tr>

	* tur-preinit-fails:
	- rnd results too good:
	-- maybe they would be with these params for fr-en, en-en too.
	- pre results too bad:
	-- buggy trainer
	-- buggy big model
	-- buggy vocab copy
	-- buggy corpus: tokenization capitalization domain etc
	--- check most frequent tokens, check lm cross entropy
	--- maybe fr-en would not do as good if en side was not identical
	- fixed copy problem but still bad pretraining results
	-- running pre2 to confirm
	-- english vocab different or distribution different

2015-12-29  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE cpu test domask
	DONE test cpu compatibility for all ops, numerical comparison with gpu: add, conv, pool.
	DONE cpu: lstm does not work: cpu mulback not implemented yet
	DONE cpu: addirnn does not converge: (40000,0.16878323f0,6.2745414f0,0.25430006f0) (1) weight init different: ok, (2) gradcheck buggy, fixed. (3) lstm passes. (4) relu fails, fixed. (5) sigm fails. but lstm has sigm units??? noshare gives correct result in cpu. sharing pattern same as gpu. 5,6,8,9 shared. 5=4*9 which works in cublas not in blas? should fanout>1 be shared? (6) tanh fails. all fixed.
	DONE compiler: cannot add three terms, fix in compiler (a+b+c): easy fix, convert all to .+ which takes binary ops. but too late by the time it gets to the compiler.
	DONE compiler: compound statements that involve axpb or alpha/beta in add/mul.  + == .+ ; - == .- ; * != .* ; / != ./ ; the last two are only equal when one arg is a number.  need to support number+array, array+number and array+array versions.
	DONE lstm gives different results when overwrite=false.  cpu fails as well.

2015-12-28  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE - make semicolon optional: cannot do it, foo(a=1) is not the same as foo(;a=1) in julia. no, that is true in definition, not in call!

2015-12-26  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE - bmul!, baddforw!: make broadcasting ops take small array in either position, fix infersize by having a second pass that assumes add/mul are not broadcasting.  test infersize with lstm and minibatch=1. look at julia broadcasting.

2015-12-25  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/op/broadcast.jl:
	# Broadcasting elementwise add and mul
	# This should eventually replace add.jl and mul.jl
	# TODO: figure out CUDNN_SAME_C, here or in the bias definition?  bias can't resize?
	###but we can define the model with par size (1,1,b,1). Support full Julia broadcasting.
	# avoid reshape_to_match if possible.
	# Do we need to support full broadcast?  All we need so far is elementwise and bias and bias has its own rules (SAME_C) which breaks the julia broadcast pattern when ndims(x1)=1 and ndims(x2)>2.
	# Do we need alpha,beta when we have axpb?  Only advantage is to be able to use a single kernel like axpy for both scaling and adding.
	# alpha,beta for mul is not fully general anyway because of negative numbers. Should we define sub and div instead?
	###Just use axpb? Get rid of alpha/beta.
	# sub may mean subarray which may allow us to implement single op lstm.
	###Define everything in terms of broadcast!, then extend broadcast! to CudaArray.
	# julia broadcast also accepts number instead of array

	Finish this before documentation as it changes the interface:
	- no more alpha,beta in add/mul
	- full support of Julia broadcast semantics with no exceptions (use (1,1,n,1) for SAME_C)

	Use mnist2d, mnist3d, and copyseq for speed tests, each has different broadcasting requirements.
	Fix mnist4d and cbfp to use 4D bias.
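	The (1,1,n,1) convention works because Julia broadcast expands
	singleton dimensions of the smaller array, so a per-channel bias
	needs no SAME_C special case:

```julia
# Per-channel bias as a (1,1,channels,1) array broadcast against a 4-D
# (width, height, channels, batch) tensor: singleton dims expand to match.
x = zeros(5, 5, 3, 2)
b = reshape([10.0, 20.0, 30.0], 1, 1, 3, 1)
y = x .+ b
```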

	mnist2d test: passes, with slight numerical nondeterminism due to
	atomicAdd.  speed=4.46 compared to master=3.97 which uses
	cudnnAddTensor going forward, cudnnConvolutionBackwardBias going
	backward.

	DONE: Figure out their implementation. Main diff in forw: 3.60 vs 4.27.  Back is 3.60 vs 3.70.  So our baddforw is slow.
	DONE: Add size checks for cudnn calls: just let them fail.
	DONE: test switching bias and array.
	DONE: adding.jl needs add2,add3, but not working.
	DONE: need to figure out AddTensor and BiasBack to make mul more efficient?  (But mul is always elementwise in lstm)
	DONE: test infersize: lstm+, lcn, dropout, batch=1, other saman example?
	POST: Implement CPU versions: forw already done, need back.

	* DONE:
	DONE - find the three-parameter elementwise mul in julia and rename mul2 -- can't, it calls broadcast

2015-12-23  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE Julia: I tried Knet.ji for faster load (see Module doc) but it fails.  Fix it.  Two problems.  Kenv should be split: rename Kfun in kfun.jl, cp compound.jl kfun.jl, put @knet macro and defs, including op defs in it.  CU* should be compiled.  However latest CU* have issues.


2015-12-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE new par: consolidate set and setopt. mnist2d.jl: setopt! should work with register names as well - fix the relationship between par/arr and the new reg, use plist for par update options, maybe merge par and reg? par can use plist to avoid all the unused members, update callbacks can use plist for extensibility as well.
	DONE src/op/input.jl: try replacing methods with nothing. we should not be using these?


2015-12-20  Deniz Yuret  <dyuret@ku.edu.tr>

	* refactoring:
	Net: reg,stack,sp,lastforw,lastback
	+get->reg, +out->get, +dif->dif, +reg(f,i), reg(f,n), length ok, define +get/dif for i.
	input_registers -> inputregs, registers -> regs.
	outs,difs not exposed. fine.
	params:return regs instead of Par, Par arrays will be eliminated.
	stack_length etc. is ok.

	Reg: op, name, args, cond, argv, plist, out,out0,dif,dif0,tmp
	get(r,s) should be getprop? could be getp/setp/incp.
	frequent flags like forw can be fields instead
	how about setopt? merge with setp.

	Par: dims; init; initialized; out; dif; lr; l1reg; l2reg; adagrad; ada; momentum; mom; nesterov; nes; average; avg;
	use lots of flags and plist
	these are on the register or the par?
	out,dif on register
	the rest could be on plist, eliminate all par storage, put all on reg.
	how about initialized? forw only has access to array not register, so init/initialized/dims need to be in the op.
	how about options passed at creation? vs later with setopt,setp?
	par.back should never run, it has no input. same goes for arr, input, rnd, etc.
	cannot change par without changing update.
	Input: copies external input every call
	Arr: initializes once, never changes
	Par: initializes once, changes with update
	Rnd: generates random every call

	* update-callbacks:
	update: if all plist is on reg, params needs to return the reg,
	not the op. and update needs the reg.  Current update ops are:
	bunch of ops that scale dif, axpy(scale,dif,out), averaging hack.
	keeping general scale separate is efficient, we can keep doing
	that using a difscale field that callbacks can write to. In case
	of momentum, adagrad etc. the auxiliary matrix is also modified.
	we may need an order of ops, priority. the averaging trick comes
	after the update.  we may have post-filters that do parameter
	clipping. this should be per net, not global.  how does the user
	modify it?  do we have all callbacks on net by default?  or is it
	per parameter?  per register? that way we can support param, grad,
	activation clipping.  param/activation clipping follows forw.
	grad clipping follows back (write after dif is written, before it
	is read?). no we don't want to clip during back, we want to clip
	right before update.  param clipping is right after update.  how
	about batch normalization?  do we have before/after
	forw/back/update all available?  This may take a while to sort
	out, let us focus on simpler refactoring tasks for now.

	Here is how julia callbacks work:

const atexit_hooks = []

atexit(f::Function) = (unshift!(atexit_hooks, f); nothing)

function _atexit()
    for f in atexit_hooks
        try
            f()
        catch err
            show(STDERR, err)
            println(STDERR)
        end
    end
end
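	A per-parameter version of the same hook pattern (all names
	invented; just a sketch of the "clip right after update" idea
	discussed above, not Knet's interface):

```julia
# Hypothetical: each parameter keeps an ordered list of hooks run
# immediately after its gradient step, e.g. for parameter clipping.
mutable struct Hooked
    w::Vector{Float64}
    dw::Vector{Float64}
    hooks::Vector{Function}
end

function update!(p::Hooked; lr=0.1)
    p.w .-= lr .* p.dw        # plain SGD step
    for f in p.hooks          # post-update callbacks, run in order
        f(p)
    end
end
```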


	* DONE:
	DONE cusparse: check out library changes that the author emailed.
	DONE cpu-only support
	DONE try softloss in cpu loss.jl:93: gpu wins with copyloss and rnnlm
	DONE rnnlm still has too much alloc: only because I was testing dense.
	DONE test batch resize: minibatch resizing: cusparse.jl: rename all Base.copy! to resizecopy!; define ones for Array, CudaArray; use them in resizing.
	DONE back.jl: seq and f.sp>0 redundant in back, do we need sback?
	DONE I get Nan's back when I run copyseq with winit=Uniform(-10,10).  Gradient check fails with winit=Uniform(-1,1)
	DONE rethink out/get/reg/dif/getprop set/setprop/setopt naming.  - use better get/set function names in net/*.jl, search and replace all a.b type references.


2015-12-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* examples/copyseq.jl: looking at cudart operations.
	batchsize=128.  running on /tmp/foodata with 128 sequences.
	Meaning 1 epoch = 1 minibatch.  Longest sequence has 20 tokens.
	One minibatch takes 42 forward steps with two eos tokens.  The
	model has 77 registers, stack has 3234=77*42 entries.  51 unique
	out0, 51 unique dif0, 31 unique tmp.  76 :grad, 31 :incr (27
	Par + 4 multi)

	1 epoch = 1 minibatch.
	:CudaArray   => 637
	:reinterpret => 336
	:fill!       => 170
	:copy!       => 799

	2 epochs:
	:CudaArray   => 742 (+105)
	:reinterpret => 672 (+336)
	:fill!       => 332 (+162)
	:copy!       => 1598 (+799)

	calling train from outside is different?:
	:CudaArray   => 42
	:copy!       => 757
	:fill!       => 162
	:reinterpret => 336

	+ why two different results for train? losscnt vs back only.
	+ what are the delta ops each epoch?

	Prevent multiple alloc for ygold and mask: could do in copyseq
	script, but a more general solution is better.  The loss2/loss3
	interface.  We want grad without loss: train.  We want loss
	without grad: test.  Storing ygold/mask with net?  Still need to
	copy multiple times for loss/grad.  If copyseq uses gpu we have
	single copy.  vs two copies and a general solution.  Implement
	both?  loss is not part of the net interface at all!  It just acts
	on arrays.  OK, we copy once in copyseq.  Still we have 42 allocs
	because of ystack.

	CudaArray:
---	-21 53 37 71 198 sback1 softloss3 domask copies mask: done.
---	-21 51 37 71 198 sback1 softloss3 axpy copies ygold: done.
---	-21 89 172 softloss2 copies ygold to gpu: done.
---	-21 77 89 172 softloss2 copies mask: done.
---	-21 78 89 172 softloss2 allocs tmp: prevent by having softlosstemp done.
***	+42 c180 ystack

	copy!
ok	=21 f42 f34 c159 encoder sforw copy input
ok	=1  a49 f44 f34 c159 encoder sforw add.forw(x1,nothing)
ok	=189 n109 f47 f34 c159 encoder sforw push!
ok	=21 f42 f34 c164 decoder sforw copy input
ok	=210 n109 f47 f34 c164 decoder sforw push!
---	-21 l89 c172 softloss2 copies ygold to gpu
---	-21 l77 l89 c172 softloss2 copies mask to gpu
ok	=21 l50 b37 b71 c198 sback1 softloss3 ypred->xgrad
---	-21 l51 b37 b71 c198 sback1 softloss3 axpy copies ygold to gpu
---	-21 l53 b37 b71 c198 sback1 softloss3 domask copies mask to gpu
?	=21 a61 b52 b71 c198 sback1 add.back copies dy to dx2: can we solve this with sharing?
?	=21 a61 b52 b71 c205 sback2 add.back copies dy to dx2
?	=105 a155 a65 b52 b71 c198 sback1 add baddback:153?dy->db (dx1) for symmetric add2 dx1=dx2, can we use the same array?
?	=105 a155 a65 b52 b71 c205 sback2 add baddback:153?dy->db (dx1)
***	+21 c180 ystack
***	+21 c180 ystack
ok	+21 c138,151 copytogpu ygold/mask
ok	+21 c138,150 copytogpu ygold/mask

	fill!
ok	23 n93 c135,124,54 reset! incr
ok	8  n93 c135,124,54 reset! incr
ok	4  d19 b52 b71 c205 dot.back x2==nothing
ok	63 b58 b71 c198 sback2 incr
ok	63 b58 b71 c205 sback1 incr
ok	1  m48 b52 b71 c205 mul.back x1==nothing

	* DONE:
	DONE rnnlm: WARNING: Stack input has forw=false. solved.  results ok.
	DONE 1. rnnlm: speed: 23.227981 vs 25.988030.  gcheck failing. fixed.
	DONE 2. copyseq: speed is slow. (15.232240 vs 11.432132) probably due to copying ygold to dense.
	DONE - check/compare number of arrays allocated for forw/back lstm, count number of useless copies and allocations
	DONE Start investigating (1) slightly different results, (2) speed and memory issues before going any further. get all other models to work, sanity check with no array sharing version.  Could make this an option in runtests?
	DONE test cpu softloss. (no, it is better to copy sparse ygold to gpu than dense ypred to cpu) compare quadloss on bigger problem. copy!: softloss@loss.jl:67 ygold copied for gradient. however this may happen twice one for 2arg one for 3arg version of softloss.  prevent that. (for not modified copyseq)
	DONE - back: problem with zeroing out the dif0 for back.
	DONE replace Expr(:&&) with true in compiler. not necessary.
	DONE - fix compiler to reuse registers and share intermediate tmp
	DONE - initsize: ignore forw=false
	DONE - optimize alloc, copy, fill.
	DONE - runtests.jl: update timing
	DONE - src/model/fnn.jl: predict has not been tested
	DONE - src/model/s2s.jl,tagger.jl,s2c.jl: implement predict, i.e. decoder. in all models.
	DONE get rid of input() and use constructor arguments instead?	can we?
	DONE ai-mtg: Copy instruction for going back two time steps.
	DONE - start to clean up data and models. retire the Model interface if ok, only expose Net,forw,back,reset!,update!, let people write their own train/test functions. model.jl: add predict, load, save:
	DONE move compound, loss out of op. par and loss should probably get out of op.  at least loss.
	DONE - runtests.jl: update timing, simplify scripts (no need for norms everywhere), merge with test/runtests.jl. edit test/runtests.jl to run the examples instead.
	DONE src/net.jl: TODO: eventually get rid of all similar, similar!, issimilar, dsimilar, dtype, dsimilar! etc. defined in array.jl.
	DONE Julia: for load/save check timholy's message about extending jld with custom types - the first and second test give different results after load from file
	DONE - cleanup net/util.jl
	DONE lcn->master, master->v0.6, dev->master and/or v0.7

2015-12-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE copyseq: fix winit bug, reverts to Xavier even if we call compile with Gaussian.  done: was a bug in copyseq definition.
	DONE copyseq: figure out the small diff between master and dev: close enough, numerical.
	DONE copyseq: figure out why both master and dev fail gradcheck: we broke it at cca8b4f 2015-12-05 improved softmax gradient: using dx=q-p.  how come mnist is not broken?  final layer broken ok, but what about the previous layers? mask problem, fixed.
	DONE solve master gcheck issues: gcheck fails on some tests: ok only adding with relu fails, known issue.
	DONE copyseq: gives slightly different result. gradcheck does not work. need to keep around input indices!  fixed.  gradcheck still failing.  fixed.
	DONE solve master speed issues: significant slow down on some tests: xavier slowed down sparse mnist2dx and copyseq!? some machine difference. copyseq culprit: cca8b4f 2015-12-05 improved softmax gradient: using dx=q-p: probably because ygold is densified.
	DONE - reimplement S2C using conditionals: adding, mnistpixels.
	DONE - solve the incr problem, user based register sharing may be in trouble.
	DONE - wrap examples in modules so multiple can be loaded?

2015-12-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* copyseq: The encoder difs are always nothing.  Potential
	problem: going back stack tells us which ops were skipped.
	However it doesn't tell us what the inputs were.  The inputs to
	ops are dependent on the conditions.  So the stack may need to
	keep around the inputs as well.  No: we need the indices of the
	inputs.  Is there anything else that is condition dependent we are
	missing that back uses?

2015-12-16  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE rnnlm: gives slightly different result. fails gradcheck.
	DONE axpb -> copy (already defined as an actf?)
	DONE addinglstm: gives slightly different result. could be due to cpu quadloss: checked, it is not. lstm add2 difference.  fixed.
	DONE mnistpixels: gives slightly different result.
	DONE ninputs(Net) does not consult :forw, it should.

	* examples/rnnlm.jl: debug diff with master:
	>> how much variance is there?
	dev:
	(ep,perp...,m...,lr) = (1,815.7953052044443,545.0401611328125,267.5125427246094,123.92666625976562,1.0)
	(ep,perp...,m...,lr) = (1,806.9491109456006,554.669677734375,267.5966796875,123.9271240234375,1.0)
	(ep,perp...,m...,lr) = (1,810.3839144385606,544.9676513671875,267.5242614746094,123.92684936523438,1.0)
	master:
	(ep,perp...,m...,lr) = (1,824.9219871743913,533.6437377929688,267.20458984375,136.92347717285156,1.0)
	(ep,perp...,m...,lr) = (1,820.244987897238,565.3406982421875,267.2564392089844,136.92347717285156,1.0)
	(ep,perp...,m...,lr) = (1,826.4401933306086,531.721923828125,267.1590881347656,136.9234619140625,1.0)
	>> run with dense to reduce variance:
	dev:
	(ep,perp...,m...,lr) = (1,810.3214702395429,541.9421997070312,267.4609069824219,123.92681121826172,1.0)
	master:
	(ep,perp...,m...,lr) = (1,824.2065024120675,531.85888671875,267.1900329589844,136.9234619140625,1.0)
	>> see if nosharing makes a difference: no difference.
	>> run with gradient check: dev fails, master does not!
	>> print vecnorm to find out where things diverge.
	+ why do we have an extra copy at 41 and why is it axpb? Because of return h.  Fixed axpb->copy.
	- changing the add2 definition changed the dense dev result to: (still failing gcheck). why?
	(ep,perp...,m...,lr) = (1,826.4387381368634,545.0924072265625,267.61993408203125,131.5108184814453,1.0)
	+ compare forw: ok
	+ compare forw+back1: ok
	+ compare forw+back up to update: ok but copy and mul dif are not the same going back?
	h=mul() has fanout > 1, so it has :incr
	its copy on the other hand has fanout=1 and has no :incr
	going back copying the dif from one to the other erases incr info
	the copy operation is dangerous: either fix it or don't use it.
	to not use it we can:
	- return last expr value like in julia
	- use return like a normal variable in func def
	- have return and h be aliases rather than different variables
	these still do not solve the copy problem.
	no no no no no
	this is a non-problem.
	h has fanout=5.  This turns on :incr.  This makes back write to tmp and increment.  All is ok.
	+ compare forw+back+update+forw
	>>> keepstate means out was not 0 when we started, stack_inputs assumes it was.
	+ turn off keepstate on master and dev to show this is the case: passed.
	julia rnnlm.jl --dense --nosharing --gcheck 3 ptb.valid.txt
	(ep,perp...,m...,lr) = (1,705.1714829159305,261.08807373046875,92.81856536865234,1.0)
	+ check gradients with keepstate off: still failing? something wrong with dev gcheck. fixed.
	+ find a way to implement keepstate (probably pushing an initial state to the stack).
	- put readbeforewrite to avoid unnecessary stack copy.
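
	The fanout/:incr point above can be sketched outside Knet: in
	reverse mode, a value read by several consumers must receive the
	sum of their gradients, so its dif buffer starts zeroed and is
	incremented rather than overwritten.  A toy Python illustration
	(all names invented, not Knet's API):

```python
# Toy reverse-mode sketch (not Knet code): a value with fanout > 1 must
# *accumulate* gradients from each consumer, mirroring the :incr flag.

def backward(consumers, dy):
    """Each consumer contributes a gradient for the shared input h."""
    dh = 0.0                      # dif0 starts zeroed, like dif0=0 above
    for grad_fn in consumers:
        dh += grad_fn(dy)         # :incr -- increment, never overwrite
    return dh

# h feeds two ops: y1 = 2*h and y2 = 3*h, so dJ/dh = 2*dy + 3*dy.
consumers = [lambda dy: 2.0 * dy, lambda dy: 3.0 * dy]
print(backward(consumers, 1.0))   # 5.0
```

	Overwriting dh instead of incrementing would keep only the last
	consumer's contribution, which is exactly the bug the :incr flag
	guards against.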

2015-12-15  Deniz Yuret  <dyuret@ku.edu.tr>

	* addirnn.prof: Generated by foo6.jl.  3 epochs of addirnn.
	sback:2356, but when you add all the back.jl lines we get 1300?
	The flat profile is confused because sback calls itself, so its time is counted twice.
	The real costs look like:

	total: 3056
	compile: 147
	forw1: 1082
	forw2: 237
	back2: 208
	back1: 1074

2015-12-10  Deniz Yuret  <dyuret@ku.edu.tr>


	* examples/adding.jl: Got the timing with new stack down to:
	12.077214 seconds (21.69 M allocations: 951.524 MB, 1.71% gc time)
	vs master:
	10.396677 seconds (21.08 M allocations: 874.771 MB, 1.84% gc time)
	- The remaining diff is probably due to the extra initforw.
	- However results don't match. First fix that.

	Try sforw/sback on linreg: works
	Try sforw/sback on mnist2d: works
	Try single iter adding:
	(16,1.5236576f0,2.3667753f0,1.5722239f0) single iter with master
	(16,1.5236576f0,2.3667827f0,1.5743105f0) single iter with dev
	(16,1.5236576f0,2.3667827f0,1.5743105f0) single iter dev with copy-all

	The problem is we are using future values of inputs in back.
	Fixed it.  Now we have: (slight diff due to cpu quadloss)
	12.071895 seconds (21.19 M allocations: 922.258 MB, 1.80% gc time)

	With gpu quadloss:
	12.307355 seconds (21.19 M allocations: 922.106 MB, 1.74% gc time)

	* DONE:
	DONE check on vivi 16e9.cnt
	DONE length(Adding.net.sdict) => 91 after one minibatch, normal? No longer using sdict.
	DONE should we perform memory management for stack by keeping around some arrays that get reused?
	DONE start barrett translation experiments

2015-12-09  Deniz Yuret  <dyuret@ku.edu.tr>

	* seq-vs-rnn: Another point is seq vs rnn.  rnn means we have
	read-before-write in the network.  seq means we are running it
	forw multiple steps before calling back.  They are not the same,
	we can have seq with fnn, or run rnn for a single step.  When do
	we need the following, and when can we avoid them:

			r+s	!r+s	r+!s	!r+!s
	par=incr	y	y	n	n	this is only necessary for sequences where par gets read multiple times
	stack		y	y	n	n	if we ever rewrite out, only happens in sequences (not counting array sharing)
	out=void	y	n	y	n	this is only necessary if there is read-before-write but it is cheap, if keepstate out=out0 instead
	out0=0		n	n	n	n	never needed
	tmp=0		n	n	n	n	never needed
	dif=void	n	n	n	n	never needed, will never read before write?
	dif0=0		i	i	i	i	need if incr
	reg=incr	o	o	o	o	always needed for multi-output

	When the user compiles a net, we know whether or not it is an rnn.
	Actually we may need to wait until the first forw to see which
	cond are true, which may change the rnn property.  When the user
	calls forw we don't know if this is part of a sequence yet.  back
	can tell this from stack depth if forw uses stack.  We can assume
	rnn's always go with sequences (otherwise read-before-write does
	not make sense).  Or always use the stack.  In any case par=incr
	only necessary if stack depth is more than prog length.  Stack
	always necessary, it also gives us information about cond, just
	avoid copy.

	* reg-types: special treatment may be required for registers:

	- multi-output: these registers are read at multiple locations in
	the program.  Depends on :cond.  :incr flag set for both sback and
	back.  The return registers are read by the user and compared to
	gold, which should count as 1.  All Par registers become
	multi-output during sequences and need :incr in sback.
	Multi-output registers can forwoverwrite but should not be
	overwritten.  More specifically they could be overwritten by the
	last reader, but that is probably not worth the complexity.

	- need-for-back: these registers are detected using back_reads_x
	and y.  Also the return register is always in this category
	because of the loss gradient.  Depends on :cond.  Their :save flag
	is set.  However to speed up initforw we can compute this once and
	be conservative in array sharing.  They can forwoverwrite but
	should not be forwoverwritten in fnns.  In rnns they can be
	forwoverwritten after copying to stack.  If push happens at
	creation time this is ok.  If it happens at read time, this is
	risky, the register could be overwritten before it is read.
	Better to be conservative and not overwrite these.

	- persistent: Par and Arr registers.  They do not change during
	forwback, so no need to copy them in sforw when pushing.  Cannot
	overwrite or be overwritten, the par and arr ops already have
	canoverwrite=false so no need to separately check.

	- isreturn: should count 1 extra output and always have
	need-for-back.

	- read-before-write: these registers exist in rnns.  They should
	be zeroed out at the beginning of sequences by reset!.  They can
	forwoverwrite and be overwritten.  FNNs with this type of register
	do not make sense, if they exist the user should call reset.

	* addingirnn: broken.  Turning off array sharing fixes it, but
	gives a slightly different result, possibly due to the change in
	quadloss.  backoverwrite is the problem, forwoverwrite is ok.

	Problem1: initforw does array sharing.  then forw pushes one of
	the shared arrays on the stack.  next iteration all the other
	users of that array realloc.  The sharing is lost.  This is a
	performance problem; it does not affect the answer.
	Solution1a: go back to copy-on-save instead of copy-on-write.  The
	sharing structure is preserved.  Con: useless copying for fnn.
	Solution1b: Reintroduce the flags to stop push/pop for fnns.
	Solution1c: consider overwrite before realloc in forw.  need to
	check if input is on stack as well.
	Implementing solution 1c... 1 alloc and 3 pointers every iteration,
	might be slow.

	? Solution1d: do not share problem registers in the beginning, that
	way nobody has to realloc.

	Problem2: initback let incr register dif0 to get overwritten.
	This is fixed now.

	Getting the correct result, but slow: 10.54 secs vs 22.07 secs.
	Could be the extra alloc for the stack, knet6 reuses stack
	arrays.  Could be the extra shuffling due to Solution2.

	? why slow?
	? why axpb?
	Solution2a: try stack array reuse, rather array pool for initout0.
	? WARNING: Sdict not empty. WARNING: Object not in dict.
	+ remember: stack tells back which ops were executed, cannot get rid of stack for fnns without major redesign.
	? conjecture: stack always has unique arrays?
	x hypothesis: copy_on_write for stack ops responsible.

	Timing experiments:
	- original: 22.535261 seconds (37.51 M allocations: 1.407 GB, 3.15% gc time)
	- without decref!: 21.918482 seconds (37.51 M allocations: 1.407 GB, 3.12% gc time)
	- without copy_on_write: 19.962486 seconds (36.14 M allocations: 1.368 GB, 1.67% gc time)

	This is still far from master: 10.544112 seconds (21.14 M allocations: 876.542 MB, 2.00% gc time)

	- compare registers with master: master has different out0 for
	relu9.  dev has different dif0 for dot3.

	- could it be the cpu quadloss?  with gpu quadloss:
	22.765790 seconds (37.52 M allocations: 1.407 GB, 3.12% gc time)

	I am out of ideas.  Profiling results:

			dev		master
	rnn.forw	L43:1268	L20:898
	fnn.forw	L45:275 	L22:41
	fnn.back	L47:437 	L35:76
	rnn.back	L48:2337	L36,38:118+1248
	update!		L51:23		L43:18

	Low level functions (src/op) seem comparable.  dev is a bit slow
	on add.back probably because of the one dot3.dif0 sharing it
	missed.

	- Is it the conditional evaluation?  Do we run initforw twice
	per minibatch?

	- forw has 3 differences:
	1. initforw (18 vs 187): more calls?
	2. inputs (27 vs 199): map expensive, fixed with for loop.
	3. copy_on_write (0 vs 253): alloc cost?

	- back:
	1. pop is expensive (0 vs 228).
	2. dxx is expensive (0 vs 157).
	3. op.back (968 vs 1799): missed sharing?  extra ops? fix gpusync().
	4. axpy (212 vs 376)
	5. fill (78 vs 121)
	6. get1 (0 vs 767): calling methods was a bad idea.

	d77a960: 14.171792 seconds (22.24 M allocations: 925.864 MB, 4.10%
	gc time) At this point back seems as fast as master (dev:1146,
	master:1362).  The big differences seem to be initforw and
	copy_on_write.

	without copy_on_write: 12.10 seconds.
	without initforw: I don't know how to test that.

	And we still have the realloc bug. (Solution 1d).

	- pop can delete sdict when stack empty.
	- arrays in sdict can be copied to sfree and reused.
	- have to be careful if sharing with out0. there will be sharing.

	- may switch to copy-on-save: no sharing between stack and out0.
	no need to check sdict.  sharing in out0 preserved, no realloc.
	can reuse stack arrays through sfree.  find some other solution to
	prevent copy for fnn.  detect 2nd forw as seq?  have rnnforw and
	rnnback?  fnn will need some other way to detect conditions.

	OK new design: separate forw/back from sforw/sback.  forw/back
	does not use stack, back can use :forw flag to find out which ops
	to undo.  Removes the need for apply.  Still need to be careful
	for: (1) read-before-write, (2) multi-output, (3) sharing arrays
	that back will use.  FNN is done.

	copy-on-save has the following problem: the same array may get
	copied multiple times (one for x, one for y, imagine an output of
	relu going into a dot).  copy-on-write is more elegant for saving
	records of operations.  copy-on-write has three problems: (1) hash
	lookup cost, probably insignificant.  (2) Allocation cost, would
	go away if we found a safe and quick way to reuse arrays.
	copy-on-save also has to reuse arrays, but reuse is easier to
	implement because there is no sharing between stack and out0. (3)
	when a shared array is pushed, all siblings reallocate.  Can solve
	by not sharing :save registers, which will cost in-place ops.
	There is also the old design of pushing only output array rather
	than a record of operations.  That requires knowing the :save
	flag.  The :save flag depends on :cond.  :cond may change during
	multiple iterations.  But pushing happens during forw, so that's
	ok.  Let's try this with copy-on-save.
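
	For contrast with copy-on-save, here is a toy copy-on-write stack
	(purely illustrative Python, not Knet's implementation; the real
	version tracks CudaArrays in sdict keyed by object identity).
	push is free; the copy happens lazily, only when a writer touches
	an array the stack still holds:

```python
# Toy copy-on-write stack: push() records the array without copying;
# write() copies first only if the stack still references the array.

class COWStack:
    def __init__(self):
        self.stack = []
        self.saved = {}           # id(arr) -> refcount, like sdict

    def push(self, arr):
        self.stack.append(arr)
        self.saved[id(arr)] = self.saved.get(id(arr), 0) + 1

    def pop(self):
        arr = self.stack.pop()
        n = self.saved[id(arr)] - 1
        if n == 0:
            del self.saved[id(arr)]
        else:
            self.saved[id(arr)] = n
        return arr

    def write(self, arr, values):
        """Overwrite arr in place, copying first if the stack holds it."""
        if id(arr) in self.saved:
            arr = list(arr)       # realloc: the writer gets a fresh copy
        arr[:] = values
        return arr

s = COWStack()
a = [1, 2, 3]
s.push(a)
b = s.write(a, [9, 9, 9])   # a is on the stack, so b is a new array
print(s.pop(), b)           # [1, 2, 3] [9, 9, 9]
```

	This also shows the hash-lookup and allocation costs listed as
	problems (1) and (2): every write pays an id lookup, and a write
	to a saved array pays a fresh allocation.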

	- try copy-on-save: pushing array if save, nothing if not.  That
	does not distinguish noforw from nosave.  Fix using :skip.  Also
	this does not avoid copying persistent arrays.  Check for that.

	- fix slow initforw.

2015-12-08  Deniz Yuret  <dyuret@ku.edu.tr>

	* mnist4d: dev gives out-of-memory.  master and lcn are different?
	cudnn difference? nope.  lcn was stuck at 405b7d7.  We have
	changed default conv and softgrad since then.  dev still giving
	oom.  The size and array sharing match.  There must be extra
	allocs.  It was because I forgot to reset!, which meant sdict was
	not emptied, which meant at iter>1 it reallocated all arrays
	thinking they were still in the stack.  Probably should make sdict
	count refs.  mnist4d had reset but was using forw instead of
	forwtest in test.
	DONE - Make this more idiot proof. Call forwtest->apply?
	DONE - broke mnist2dy and mnist2xy: only when module Knet is commented out.


	* mnist2dx: slow in dev.  slower in master?  At some point between
	Nov18 and Dec08 it went from 12 to 19 secs.  Got messed up in
	405b7d7.  The only thing I did there was to switch from Gaussian
	to Xavier?  And switching back to Gaussian fixes the speed!  We
	have w1(64x784) in +-0.216 with xavier (std=0.125).  w2(10x64) in
	+-0.546 (std=0.310).  These are >10 times larger than
	gaussian(0,.01).  We can observe the slowdown with Gaussian with
	similar std.  In fact w1 can be kept xavier.  When w2.std exceeds
	.1 we start slowing down.
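
	Sanity check on the quoted numbers (assuming the xavier init here
	draws uniformly on ±a, for which std = a/√3): ±0.216 corresponds
	to std ≈ 0.125, and ±0.546 to std ≈ 0.315, roughly the values
	quoted above.

```python
# Arithmetic check: a uniform distribution on [-a, a] has std a/sqrt(3).
import math

for a, quoted_std in [(0.216, 0.125), (0.546, 0.310)]:
    print(round(a / math.sqrt(3), 3), "vs quoted", quoted_std)
```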

	* DONE:
	DONE mnist2dy: slow
	DONE mnist2dx: slow
	DONE mnist2dxy: slow
	DONE saman: cannot infer size error. drop(r,x,y) has no way to infer r.  drop(x,r,y) has no way to infer y.  how did this work in ipa?
	DONE mnist4d: giving out-of-memory error.

	* examples/mnist2d.jl: compiler optimizations.
	At init:
	2+9 alloc initout0
	2 fill! for bias
	8+2 alloc initback

	In 100 batches:
	+100 copy! @forw.jl:42: forw copies input ok
	+100 copy! @back.jl:41 softloss.back copies ygold ok
	-200 copy! @add.jl:60  back copies dx2->dy, array sharing should solve this
	-200 copy! @add.jl:116: baddforw1, forw array sharing should solve this.
	-100 copy! @actf.jl:98: softback copies ygold again?  array sharing?

	After forwoverwrite: 7/11 unique out0, wbf share same out0, no copy for stack.
	After backoverwrite: 7/11 unique dif0, wbf share same dif0, 2 copy! per iter.

	* DONE:

	DONE initback: removing the ability for getdx from a subset of the inputs during back, add later if necessary.
	DONE make forwoverwrite the default after testing, still leave it as an option
	DONE rename overwrites -> canoverwrite
	DONE linreg: slow
	DONE mnist2d: slow
	DONE reimplement array sharing, make it an optional pass for debugging
	DONE shows that +2 CudaArray comes from add.jl:75 reshape_to_match.  Does this allocate?  No, but it would be still better to get rid.  Also mul.  Also symmetric broadcasting.
	DONE turn CUDArt back to normal
	DONE - add a copy instruction: but sometimes making variables identical is different from copy, avoid actual copy when user says a=b or return c if possible.
	DONE - back/initback: get rid of seq flag if possible, can we use save?
	DONE rename train->train! ?? not exporting this function any more!

2015-12-07  Deniz Yuret  <dyuret@ku.edu.tr>

	* linreg: compiler optimizations.
	master:0.74s -> dev:1.04s (for 5 epochs)
	+ initialization only? no, for 20 epochs 2.88->3.92s, means init:26.6ms->80ms, iter:142.6ms->192ms
	+ multiple initialization?  no, confirmed initforw and initback only entered once.
	+ extra alloc/copy? 5 epoch, 10K epochsize, 20 batchsize, 500 batch/epoch, 2500 batches, expect few alloc, 5000 copy.
	got 2505 alloc, 7500 copy: 1 alloc, 3 copy per batch + 5 alloc at init.
	copy!: loss:134(ygold->ygrad ok), forw:42(input->reg ok), loss:144 (ygold->temp in quadloss2)
	alloc: 2*initforw:125, 2*initback:19, +1*initforw:125, loss:143(repeats)
	OK, reduced it to 3 copy per iter, got rid of quadloss temp alloc.
	time improved to 0.90s for 5 epochs, 3.58 for 20 epochs
	cpu quadloss improved this further.
	+ biyofiz-4-0 new timing:
	master:662ms dev:772ms
	+ stack cost?
	if I take out stack dev:720ms
	20 epochs: dev:2780ms master:2590ms


2015-12-05  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE saman: kunet-vs-knet speed and results:  /mnt/kufs/scratch/szia13/mnist/mnisttst-KuNet.jl /mnt/kufs/scratch/szia13/mnist/mnisttst-Knet.jl
	DONE nvidia bug check: actf.jl: cudnnSoftmaxBackward off by a factor of 2? came back saying there is no bug.  was my mistake. also report the cusparse transpose problem. 
	DONE cudnn softmax function: take a second look at it, see what it does, measure its speed, use it if possible.
	DONE pool.jl: CUDNN_POOLING_AVERAGE_COUNT_EXCLUDE_PADDING is buggy when padding > 0 or else I have no idea what it is doing. KevinKang@NVIDIA: This issue has been fixed now in the development versions, and the fix would be available for you in the next cuDNN release(v4). Thanks for reporting this issue to us!
	DONE - examples/mnist4d.jl (lemlp): debug softloss carefully, mlp diff has grown.
	DONE merge to dev: saman: loss.jl: solve the mask issue. change softloss to pass back the gold probabilities.  have soft back do the p-q numerically stable gradient calculation.  for 2arg softloss see if cpu or gpu is faster. DONT:implement and use xentloss, numerically much more stable than softloss, no need for epsilon.

	* softmax:
	Currently soft computes: qi = exp(xi)/Σ exp(xj)
	Then softloss computes: J  = Σ -p log q
	Then softgrad computes: ∂J/∂q = 1-p/q
	;; Derivation requires making normalization explicit:
	;; J = Σi -pi log (qi/Σqj)
	;; = Σi -pi log qi + pi log Σqj
	;; = (Σ -pi log qi) + log Σ qj
	Then softback computes: ∂J/∂xk = Σi (∂J/∂qi)(∂qi/∂xk)
	;; ∂qi/∂xk = [(i=k)(exp xi)(Σ exp xj) - (exp xi)(exp xk)] / (Σ exp xj)^2
	;; = (i=k) qi - qi qk
	But we can instead do: ∂J/∂x = q-p
	;; ∂J/∂xk = Σi (∂J/∂qi)(∂qi/∂xk)
	;; = Σi (1-pi/qi)((i=k) qi - qi qk)
	;; = Σi (qi-pi)((i=k)-qk)
	;; = Σi ((i=k) qi - qi qk - (i=k) pi + pi qk)
	;; = qk - qk - pk + qk = qk - pk
	So have softloss pass back p instead of ∂J/∂q which is numerically unstable.
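
	A quick numeric check of the derivation (plain Python, independent
	of Knet): the analytic gradient q - p matches a central finite
	difference of J = Σ -p log q with q = softmax(x):

```python
# Numeric sanity check of dJ/dx = q - p for J = -sum(p * log(softmax(x))).
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def loss(x, p):
    q = softmax(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

x = [0.5, -1.2, 2.0]
p = [0.2, 0.3, 0.5]
q = softmax(x)

eps = 1e-6
for k in range(3):
    xp = list(x); xp[k] += eps
    xm = list(x); xm[k] -= eps
    num = (loss(xp, p) - loss(xm, p)) / (2 * eps)
    print(abs(num - (q[k] - p[k])) < 1e-6)   # True for each k
```

	This is the same comparison a gradcheck does, and it confirms that
	passing back p (rather than the unstable 1-p/q) loses nothing.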

	How do we handle masks?  A column mask asks to ignore certain
	columns.  Any gradients on these columns need to be 0.  For softmax
	we can simply set p=q for these columns.  The new version:
	Soft computes: qi = exp(xi)/Σ exp(xj)
	Then softloss computes: J  = Σ -p log q
	Then softgrad computes: dq = (mask=1 ? p : q)
	Then softback computes: dx = q-p
	and we get zeros in masked columns.

	But this assumes only loss has mask available.  If back has mask
	available soft can use it to zero out mask=0 columns.  That
	simplifies softgrad.

	How do we handle masks in quadloss?
	Simply set dx=0 for columns with mask=0.
	This is like softback.  We need a mask kernel.
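
	The column-mask rule can be sketched as follows (hypothetical
	helper, not the actual kernel): for softloss, setting p = q in a
	masked column already yields dx = q - p = 0 there; for quadloss we
	zero the column directly:

```python
# Sketch of a column-mask kernel: zero every gradient column whose
# mask entry is 0.  dx is represented as a list of columns.

def apply_mask(dx, mask):
    return [[0.0] * len(col) if m == 0 else col
            for col, m in zip(dx, mask)]

dx = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
print(apply_mask(dx, [1, 0, 1]))
# [[0.1, -0.2], [0.0, 0.0], [-0.5, 0.6]]
```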

	* kunet-vs-knet: debugging mnist4d example.
	- kunet converges, knet does not.
	- kunet with XentLoss converges, with Soft+SoftLoss does not.
	- first diff in winit, changing Knet winit to Gaussian(0,0.01).
	- kunet softloss back2 gives NaN, xentloss does not.
	- knet back1 does not match kunet.xentloss back1.

	- knet: 134.3331f0.soft.10.654804f0 | 0.1144394f0
	- if ygrad=(1-ygold/ypred)/ycols is correct then for 1st iter we have:
	- vecnorm(ygold)=11.313708f0
	- vecnorm(ypred)=10.654804f0
	- vecnorm(ygrad)=156209.28f0 (with epsbump)
	- vecnorm(ygrad)=2.463973f6  (without epsbump): this is roughly the kunet.softloss result.

	- The gradient wrt the input to soft:
	- kunet.xent: 0.116006486f0
	- kunet.soft: 0.116006486f0
	- knet: 0.1144394f0 (small difference expected due to epsbump)

	Other than that the runs are identical during the first iter.
	For relu1.back we have:
	0.004115438f0->0.0030872861f0 (kunet)
	0.004064399f0->0.0030463908f0 (knet)

	2nd iter:
	forw identical for soft,xent ending with 2610.2563f0 presoftmax.
	knet ends with 2546.7175f0 presoftmax.
	soft immediately fails going back with NaN.
	knet ypred=9.008366f0, ygold=11.313708f0, ygrad=471141.4f0
	ygrad without epsbump gives nan because some ypred==0.
	knet presoftmax gradient=0.0733809f0, relu1 gradient=0.002412369f0->0.0019412006f0, conv1 3.7847784f0.
	xent presoftmax gradient=0.0962289f0, relu1 gradient=0.005368307f0->0.0044503217f0, conv1 31.956705f0.
	The knet gradients seem to be smaller than the xent gradients by a factor of 0.5 to 0.2.
	+ Can we converge if we bump the lr?  No change whatsoever, the acc is stuck at 0.097.  Why?
	- The gradients all go to 0?  Why would they go to 0 instead of increasing?
	+ Decreasing the lr helps.  If we use .0001 instead of .001 gradients no longer vanish, convergence slower and nondeterministic.
	+ Can we fix softloss to act more like xentloss?  Is xentloss accurate?  Why is epsbumped softloss so bad, does it change direction? xentloss is q-p, softloss is 1-p/q, use xentloss.
	+ Would it help if we used doubles? YES.  In fact xent and soft give identical results: (double.jl)
	[:lr=>0.001,:scale=>255.0,:gcheck=>0,:iters=>0,:soft=>false,:epochs=>3,:float64=>true,:seed=>42,:nbatch=>128]
	(1,0.9621,0.9626666666666667)
	(2,0.9713,0.974)
	(3,0.9753,0.9792)
	+ With identical settings Knet gives similar (but not identical) results if we turn off epsbump:
	[note that kunet gives tst/trn, knet gives trn/tst in 2nd and 3rd position]
	julia> include("mnisttst-Knet.jl")
	julia> mnist4d("--ftype Float64")
	INFO: Testing lenet (convolutional net) on MNIST
	Dict{AbstractString,Any}("epochs"=>3,"lr"=>0.001,"nbatch"=>128,"ftype"=>"Float64","seed"=>42,"gcheck"=>0,"iters"=>0,"scale"=>255.0)
	(1,0.9564469818376068,0.9571314102564102,0.5512143746688302,18.867612666745313,1275.3388801827127)
	(2,0.9718215811965812,0.9708533653846154,0.10425938282183883,18.89275776985493,36.76887070582188)
	(3,0.9784154647435898,0.9742588141025641,0.07478685399078502,18.91234710013741,27.65146895969362)
	(0.07478685399078502,18.91234710013741,27.65146895969362)
	- why not identical?
	+ can we converge if we only epsbump the denominator? (yi-dyi)/yi => no the top doesn't make a difference anyway.
	+ would it help if we use a smaller epsilon? => no still have the same problem.
	- xentloss much more numerically stable, implement that.

	Hi Saman,
	I checked the KUnet vs Knet difference.  Here is the story:
	First the difference between xentloss vs softloss in KUnet:
	KUnet implements xentloss (q-p) and softloss (1-p/q) gradients exactly.  These should give identical results except with xentloss the network output is unnormalized, with softloss it is normalized.  When you use a scale of 255, this pushes softloss to the boundary of Float32 (remember p/q?  p is the gold probability, q is your estimate, and if q gets too close to zero p/q blows up), resulting in NaNs.  Xentloss has no such problem, and computes correctly.  If you use Float64, there is no overflow and they learn the exact same network.
	Now for Knet: I haven't implemented Xentloss yet, I thought softloss would be enough.  To make softloss resistant to the overflow I changed the formula to (1-p/(q+epsilon)) when q is too small.  This avoids the NaNs, but gives slightly different gradients.  This seems to be enough to prevent convergence in your 255 scaled problem.  I went through the first couple of iterations.  The gradient norms start a bit smaller compared to the correct norms.  Then for some reason I don't yet understand all gradients go to 0 and the model stops learning.
	I confirmed this by turning off the epsilon in Knet and observing it produces similar results with KUnet in the 255 scaled case.  (Not identical but close enough, this could be due to library differences).  I think the right solution is to always use xentloss because it is much more numerically stable.  I will implement that in Knet as well.
	best,
	deniz

2015-12-04  Deniz Yuret  <dyuret@ku.edu.tr>

	* compiler.jl: conditionals and input assignment.  The inputs to
	an operation may come from different places depending on the
	runtime conditions.  Therefore we cannot compile exact input
	indices into the Net.  We have to do this at initforw.

	Interestingly we just switched from in-place semantics to julia
	semantics for a=f(x).  If multiple assignments are made to the
	same variable, this matters.

	For copyseq we have two rnns that share a hidden state.  All other
	weights are different.  Could do it by defining a two input rnn
	which takes (x,h) and pass h from one to the other.  However we
	can't have two outputs.  y is the only return as it will be
	compared with ygold for loss.  I don't think we could do it
	without spilling the guts of lstm into the definition.  Here is an
	attempt:

	@knet function affine2(x,y; o...)
	    a = par(; o...)
	    b = par(; o...)
	    c = par(; o...)
	    return a*x+b*y+c
	end

	@knet function s2s(x; o...)
	    if decoding
	        hx = affine2(h,x; o...)
	    else
	        hx = affine2(h,x; o...)	# this op has different weights
	    end
	    h = relu(hx)	# share hidden state
	    if decoding
	        return soft(wdot(h; out=vocab))	# only return if decoding
	    end
	end

	* DONE:
	DONE - fix gradcheck
	DONE - fix repeat

2015-12-03  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:

	DONE net.jl: implement incref! decref! copy-on-write
	DONE net.jl: implement get(Net,Symbol)
	DONE compiler.jl: symbols to numbers
	DONE fix util.jl: reset!
	DONE fix initforw
	DONE fix forw
	DONE fix initback
	DONE fix back
	DONE net.jl: delete Reg
	DONE rename Ins -> Reg
	DONE avoid making par incr for fnn
	DONE back.jl: are we sure y.out has not been changed? must use ysave with loss.
	DONE can't use get(f,:return) in back, the :forw register is no longer valid, use the stack to see the operations
	DONE fix lastback rethink lastforw==lastback in a seq context
	DONE return needs to always save its output to compute loss. should we have a return operator as well as a register?  does nothing except pass its input, but has back_reads_y=true.  do this with copy.  no need for op, did it as extra condition in forw.

2015-12-02  Deniz Yuret  <dyuret@ku.edu.tr>

	* zero: We need to be careful about when registers are zeroed and copied.
	r.tmp never needs to get zeroed.
	r.out needs zero if read-before-write, otherwise no need.
	In that case do we still need out vs out0?
	r.dif needs zero if :incr. (1) after creation, (2) after each use.
	par.out never zeroed after initialization.
	par.dif can be zeroed after update!.

	Current count after 1 epoch in mnist2d:
	:CudaArray => 10222
	:fill!     => 2408
	:copy!     => 11597
	:copy      => 1797

	This includes forwback(dtrn), 2x test(dtrn), 2x test(dtst)
	dtrn has 600, dtst has 100 minibatches.  Too complicated.

	foo.jl: single minibatch.  After forw:
	:CudaArray => 13
	:fill!     => 2
	:copy!     => 3

	net.prog has 11 instructions.  net.reg has 11 registers.  All unique.

	forwtest.init: 11 CudaArray, 2 fill!
	forwtest.iter: +2 CudaArray, +3 copy!

	Using Base.show_backtrace(io, backtrace()):
	+2 CudaArray comes from add.jl:75 reshape_to_match
	2 fill! comes from rgen filling the biases with 0.
	+1 copy! comes from input copying.
	DONE: +2 copy! comes from add.jl:116 baddforw1, bias adding copies the array!  we need write-in-place.

	Changing wbf to reuse h reduced init CudaArray to 9.
	CudaArray: init=9 iter=+2 (input copying + reshape_to_match)
	fill!: init=2
	copy!: init=0 iter=+1 (input copying)

	But it should be less, minimum is 7:
	x  = input()
	w1 = par()
	b1 = par()
	h  = f(w1*x+b1)
	w2 = par()
	b2 = par()
	return f(w2*h+b2)

	DONE: Two problems:
	- return always introduces a new register.
	- compound expressions introduce useless registers.
	- will go away when we go back to array sharing.

	forw.init: CudaArray=9 fill!=2
	forw.iter: CudaArray=+2,+5 copy!=+1,+4 copy=+0,+3
	saved: 2w, x, tmp, return
	w's do not get copied, they are persistent.
	need return: output of soft
	need x: dw = dy * x'
	need tmp: output of relu
	So 3 copies are normal, and the fact that we had 0 copies in iter 1 is also good.
	+1 copy! comes from input copying, normal.
	+5 CudaArray is 3 stack copy, 1 input copy!, 1 reshape_to_match.

	forwback:
	forw: (:CudaArray=>11,:fill!=>2,:copy!=>1) ok.
	back: (:CudaArray=>28,:fill!=>14,:copy!=>4)
	forw: (:CudaArray=>33,:fill!=>14,:copy!=>8,:copy=>3) need to undo saved when popping! during initback? reset!?
	back: (:CudaArray=>36,:fill!=>18,:copy!=>11,:copy=>3)
	forw: (:CudaArray=>41,:fill!=>18,:copy!=>15,:copy=>6)
	back: (:CudaArray=>44,:fill!=>22,:copy!=>18,:copy=>6) seems 3*CudaArray, 4*fill!, 3*copy!

	Fixed reg.saved in reset:
	reset:  ()
	forw:   (:CudaArray=>11,:fill!=>2,:copy!=>1)
	back:   (:CudaArray=>28,:fill!=>14,:copy!=>4)  # Initial alloc?
	update: (:CudaArray=>28,:fill!=>14,:copy!=>4)
	reset:  (:CudaArray=>28,:fill!=>20,:copy!=>4)  # Do we need all 6 fills?
	forw:   (:CudaArray=>30,:fill!=>20,:copy!=>5)  # copy fixed, 1*copy! for input, no fill!, 1*reshape_to_match
	back:   (:CudaArray=>33,:fill!=>24,:copy!=>8)  # 3 copy! 4 fill!
	update: (:CudaArray=>33,:fill!=>24,:copy!=>8)
	reset:  (:CudaArray=>33,:fill!=>30,:copy!=>8)
	forw:   (:CudaArray=>35,:fill!=>30,:copy!=>9)
	back:   (:CudaArray=>38,:fill!=>34,:copy!=>12)
	update: (:CudaArray=>38,:fill!=>34,:copy!=>12)

	reset:
	DONE: 4 fills are for parameters, can be avoided if we had non-incr parameters
	DONE: 2 fills are for h => overwritten registers should not be counted as multi-output???

	back:
	initially 3 copy!, 12 fill!, 17 CudaArray
	each iter 3 copy!, 4 fill!
	We have 9 registers, 6 of them incr, 8 of them grad
	so we have 8 dif0, and 6 tmp allocated = 14 CudaArray (the other 3 CudaArray come from copy!)
	and explains the 8 fill.
	1 copy!: softloss@loss.jl:67  normal, ygold is copied for gradient calc.
	ygold is copied for the gradient; however this may happen twice, once for the 2-arg and once for the 3-arg version of softloss.  Prevent that.
	2 copy!: back@add.jl:60
	DONE: this is bias back and we are not sharing dx and dy?  We should be sharing them, it is the same register!? Oh, but it is incr so one is dif one is tmp.
	4 fill!: back@back.jl:62
	DONE: It zeros each h twice!  Will be fixed with incr vs overwrite issue.
	DONE: avoid work in initback if nothing changed

	Big question: we want to reuse registers to avoid copy and take advantage of overwrite.
	This makes them multi-output and brings incr cost?
	Think carefully about overwritten registers, this is a new phenomenon.

	OK, we switched from array sharing to register sharing.  With
	array sharing each register (operation output) was distinct, we
	could reason about their derivatives going back.  If the output of
	one op was used as input to multiple ops we had :incr.  With
	register sharing they are no longer distinct.  Multiple ops may
	overwrite the same register.  So don't think 'gradient wrt a
	register', still think 'gradient wrt an operator output'.
	Registers are just storage, they are not the real variables.  An
	op-output is the real variable.  Only mark with :incr the operator
	outputs that are used more than once (before getting overwritten).
	Is overwriting going to cause other trouble?
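
	The rule above -- only mark with :incr the operator outputs read
	more than once before their register is overwritten -- can be
	prototyped as a small pass (toy Python with invented data
	structures, not the actual compiler):

```python
# Toy analysis: an operator output needs :incr only if it is read more
# than once before its register is overwritten by a later op.

def needs_incr(prog):
    """prog: list of (output_register, list_of_input_registers)."""
    incr = set()
    last_writer = {}      # register -> index of the op that last wrote it
    reads = {}            # op index -> how often its output has been read
    for i, (out, ins) in enumerate(prog):
        for r in ins:
            w = last_writer.get(r)
            if w is not None:
                reads[w] = reads.get(w, 0) + 1
                if reads[w] > 1:
                    incr.add(w)
        last_writer[out] = i      # overwriting starts a new op-output
        reads[i] = 0
    return incr

# h (written by op 0) is read by ops 1 and 2 before op 3 overwrites it,
# so op 0's output needs :incr.
prog = [("h", ["x"]), ("a", ["h"]), ("b", ["h"]), ("h", ["a", "b"])]
print(needs_incr(prog))   # {0}
```

	Because last_writer is updated on every write, a register reused
	by two ops yields two distinct op-outputs, which is exactly the
	"registers are just storage" view.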

	So should :grad and :incr be flags on operators or flags on
	registers?  They should be on operators, registers are just
	storage!  So does that mean a single register may need multiple
	difs?  We have no problem with outs, if the value is needed later,
	it is saved on the stack.  What about the dif?  The execution
	trace has (op, x, y) on the stack that reflect the input and
	output of the operation at the time of the forward pass.  difs
	should not be on registers but on ops as well?  This is almost
	back to array sharing, except we have array names controlled by
	the user.  What about pars?  Back pass reads dy and writes dx
	given (op, x, y).  Why are we sharing arrays anyway?  To avoid
	extra copying and to save memory.  What about dif sharing, if we
	have difs assigned to operators, nobody will share?  Should we go
	back to automatic sharing and revive findout, finddif, findtmp?
	What storage we use, what name the user accesses are two different
	issues.  But this goes back very deep.  Gradient needs to know not
	just which registers were the input, but which operator outputs
	were they at the forw time.  forw() does not need that
	information, it can just keep overwriting registers as the user
	wants.  back() has to be more careful.  Ignore register names,
	compute which operator fed its output to which operators input.
	Difs get assigned to operators.  Multi output operators get tmps
	and incr.  Why not assign out to operators as well?  No storage
	assigned to registers?  Is this going back to the original design?
	Dissociate names completely, handle them separately.  This sharing
	by user was a horrible idea.  All arrays go back to instr.  Reg
	becomes a hash that points to instr as well, or search in the
	"name" field of instr.  The register type disappears.  How do we
	handle stack and saved in the presence of array sharing?  We can
	have a pointer hash.  All the way back to the compiler :( Are you
	sure, can this be salvaged, can we just do difs?

	DONE: use ObjectIdDict for copy-on-write.

2015-11-30  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE - do we stop running forw when we hit return? Yes.
	DONE - rename getprop -> get
	DONE - switch to user based array sharing,
	DONE - fix back and initback,
	DONE - handle sparse.
	DONE - util.jl:reset!
	DONE - back.jl: where does Par.dif get zeroed out?  reset?
	DONE - initforw detects when there is a change and does not compute otherwise.  initback should do the same.
	DONE saman: change default convolution to gemm, no speed diff but handles 3D. confirm 3D works.


2015-11-29  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/net/back.jl: return design:
	We used to have the result of the last op always be returned.
	We now introduced conditionals and a return statement.
	Q: where is back going to match its ygold?
	Q: can we still do back going from N:-1:1?
	Q: how about operations with :forw=false?
	Q: do we allow multiple return statements?
	We have a separate return register
	Q: how can we avoid unnecessary copying for (return x) or (y=x)?
	Q: should return be an operator like input?
	As an op, it already is, writes to external register provided by caller.
	As a model, it is currently allocated internally -- inconsistent.
	If multiple returns are allowed they should have consistent size/type.
	Q: for multiple returns does the first one terminate forw?
	With julia semantics it should.
	initforw would need to set :forw according to cond instead of forw: it already does.
	Q: can we remove the return statement to avoid confusion?
	As a model it is meaningless anyway, we query any register we want.
	Q: what about back?  It needs to know what to match to ygold!
	As an op we can readopt the convention of returning the last value.
	The compiler does not know what the last operation is!  It depends on runtime conditions.
	Does that make return mandatory for ops?
	Q: return as register vs return as op.  multiple returns.  return size consistency.  avoid copy.
	Q: can we rely on ops running in order with some skipped due to cond? yes, no goto statements.
	Q: is it ever the case that return is not the last stmt?
	Q: do we have return?  multiple return?  return with var vs return with op?
	Q: if last-op and cond, which register?

	D1: julia: have return, allow multiple, exit on first.
	exit on first can be implemented by initcond setting the rest :forw=false.
	check for size consistency under all conditions: this will happen automatically if we use the return register.
	- however it may necessitate unnecessary copying.
	back can go N:-1:1 skipping ops with :forw=false and match ygold with first active op encountered.
	if no return, last value is returned in julia: compiler needs to be modified for this to work for ops.
	- do we insert return ops so size check happens?
	- do we just insist for a return?  in which case back needs to check first active op is return.

	@knet function drop2(x; pdrop=0, o...)
	    # implicit x=input() here
	    if training
	        r = rnd(; rgen=Bernoulli(1-pdrop, 1/(1-pdrop)))
	        x = mul(r,x)
	    end
	    return x
	end

	@knet function drop1(x; pdrop=0, o...)
	    if training
	        r = rnd(; rgen=Bernoulli(1-pdrop, 1/(1-pdrop)))
	        return mul(r,x)
	    else
	        return x
	    end
	end
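
	The inverted-dropout behavior encoded by rgen=Bernoulli(1-pdrop,
	1/(1-pdrop)) above, sketched in Python/NumPy (function name
	illustrative, not Knet API):

```python
import numpy as np

def drop(x, pdrop=0.0, training=False, rng=np.random.default_rng(0)):
    # inverted dropout: keep each unit with prob 1-pdrop and scale the
    # kept units by 1/(1-pdrop), so the expectation matches test time
    if not training or pdrop == 0:
        return x
    r = rng.binomial(1, 1 - pdrop, size=x.shape) / (1 - pdrop)
    return r * x
```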

2015-11-28  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE - fix or delete net.jl and finish @knet macro
	DONE - warn for reassigned locals: this is normal now, it is how the user does register sharing.
	DONE - note that output variables are no longer unique, either with conditions or without more than one instr may overwrite the same variable

	* display: Here is the sequence of calls following display(::Array):
	It looks like we need to override writemime to change the way things are displayed.
	display(x)
	@try_display
	2375 multimedia.jl; display; line: 151	macro?
	2374 multimedia.jl; display; line: 120	display(d::TextDisplay, x)
	display(d::TextDisplay, x) = display(d, MIME"text/plain"(), x)
	display(d::TextDisplay, M::MIME"text/plain", x) = writemime(d.io, M, x)
	2372 replutil.jl; writemime; line: 28	writemime(io::IO, ::MIME"text/plain", v::AbstractArray) = with_output_limit(()->showarray(io, v, header=true, repr=false))
	1958 show.jl; showarray; line: 1238


2015-11-27  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE kunet vs knet: trying to setup kunet, testlayers fails, different cuda lib version? lenet1.jl fails. saman: difference in results of kunet and knet: I was trying to solve the scaling issue regarding Knet and Kunet. I tried to run the example mnist4d.jl without scaling mnist data (i.e. not dividing it by 255) and ran it with lr=0.001 instead of the 0.1 we use for scaled data; it does not learn anything. If I apply the same changes in KUnet it can still learn the same way with unscaled data. Is it possible for you to look into this? It might be causing issues with LCN since I am still not able to learn anything despite trying the things you mentioned.
	DONE saman: cudnn 3D errors
	DONE + work on compound statements
	DONE + work on arithmetic operators
	DONE + environment table, instruction type with fields for original name, calling function, etc.
	DONE + design conditional language, test concept on nce, s2c, s2s, tagger, dropout, att, ntm, ctc
	DONE + ai-mtg: Being able to access arrays with original variable names.
	DONE + It is difficult to stitch networks or share variables or implement conditional ops. Can we implement s2s/s2c with conditionals?  Is it easier with networks sharing variables?  One network feeding into another?


2015-11-24  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE - get spearmint to work in knet or reimplement in julia: https://github.com/HIPS/Spearmint
	DONE -- vivi concatenation
	DONE saman: I was trying to solve the scaling issue regarding Knet and Kunet. I tried to run the example mnist4d.jl without scaling mnist data (i.e. not dividing it by 255) and ran it with lr=0.001 instead of the 0.1 we use for scaled data; it does not learn anything. If I apply the same changes in KUnet it can still learn the same way with unscaled data. Is it possible for you to look into this? It might be causing issues with LCN since I am still not able to learn anything despite trying the things you mentioned.
	DONE arr does not work in saman's lcn model: another infersize problem
	DONE large lr pushes weights to inf, use gclip=100 for safety. In 1e9layer1 spearmint why do I still get NaNs with quadloss? Gradcheck fails after initialization just like mnist4d. julia train.jl 1e9data.jld --epochs 20 --lr 1 --drop1 0.45
	DONE compare hyperopt with spearmint

2015-11-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE -- saman kernel for lcn
	DONE - saman: local contrast normalization, manual weight bug:
	DONE - add the other cudnn bias options for broadcasting add.
	DONE - add.jl: document broadcasting and make sure forw and back works in all cases: ok nvidia cudnnAddTensor_v3 does not work as advertised.  waiting to hear back from nvidia.  also there is no general broadcasting add back in cudnn.  cudnnConvolutionBackwardBias only works for 1-D bias.  should look at caffe etc.
	DONE - for aliya ipa model we definitely need s2s decoder.
	DONE saman: fix lstm size inference in lcn branch and push to master
	DONE saman: make xavier or msra default init, test on lcn, push to master


2015-11-07  Deniz Yuret  <dyuret@ku.edu.tr>

	* conditionals:
	- do we handle them in the macro or the compiler?
	- what is the final format?
	-- (Op, in1, in2, ..., out) is the old format out of the compiler
	-- (cond, out, op, in1, ...) could be the new format, just write some accessors so it doesn't matter.
	- cond needs to be a full logical expression:
	-- we want no block structure in compiler output
	-- so each instruction will have to have a complete condition as an expression.
	-- these will go into net.cond array of expressions.
	- the return in the middle problem is different, handle later.
	- the macro produced function handles gensyms and adding the return variable.
	- the compiler does subroutine expansion and gets a sequence of primitive actions.
	- who handles the conditionals?
	- should we turn an if else end sequence into multiple independent if's.
	- what environment are we going to evaluate the conditionals against?
	- we'll let the compiler handle conditionals, it can pass the current condition to each recursive compile call
	- the macro just takes care of variable renaming.

	- we could define the knet macro just to quote the function definition and do the rest in the compiler:
	macro knet(x); esc(Expr(:(=),x.args[1].args[1],Expr(:quote,x))); end
	- so the compiler goes from function definition and keyword arguments to Net.
	- the problem is with the keyword arguments which need to be evaluated by the compiler.
	- the compiler can no longer rely on the individual ops to do the substitutions.
	- the value of a keyword argument like out=100 gets passed down layers
	- the compiler can carry it down like a global environment
	- maybe the runtime can use the same trick for the conditionals.
	- this way our function definitions do not need kwargs or o...?
	- which primitive ops use kwargs?
	-- axpb: a,p,b (other actf none)
	-- add: alpha,beta
	-- arr: init
	-- conv: padding, stride, upscale, mode
	-- dot: none
	-- input: none
	-- mul: none
	-- nce: none
	-- par: dims, init, lr, ...
	-- pool: window, padding, stride, mode
	-- rnd: rgen, testrgen (to be deprecated)
	- all these optional keyword arguments customize the behavior of the op
	- and they are frozen at compile time! is that necessary?  yes, different op, different net.
	- they go into the functor.
	- and we like passing these things from compound ops to primitives.  or if primitives have this format so should the compounds.
	- so the problem is can our compiler do this.
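
	A toy Python sketch of the flat instruction format above, where
	each instruction carries its complete condition and the runtime
	evaluates it against an environment (names hypothetical, not the
	actual compiler output):

```python
from collections import namedtuple

# (cond, out, op, ins): every instruction carries a full guard expression
Instr = namedtuple("Instr", "cond out op ins")

def run(program, env, regs):
    # no block structure: walk the flat sequence, skipping any
    # instruction whose condition is false in the given environment
    for i in program:
        if eval(i.cond, {}, dict(env)):
            regs[i.out] = i.op(*[regs[a] for a in i.ins])
    return regs

# dropout-like example: the mul only fires when training is true
program = [
    Instr("training", "r", lambda: 0.5, ()),
    Instr("training", "x", lambda r, x: r * x, ("r", "x")),
]
```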


2015-11-06  Deniz Yuret  <dyuret@ku.edu.tr>

	* add-prof: different models favor different add kernels:
	The three numbers in the column headings represent [addforw,baddforw,baddback] versions
	Using anything other than baddback1 is disastrous for mnist4d

			111		112		113		121		211		311		312
	linreg		 0.734662	 0.737347	 0.738270	 0.734238	 0.738811	 0.731515+++	 0.732703
	mnist2d		 7.026818+++	 7.155074	 7.242237	 7.646728	 7.060876	 7.031675	 7.151208
	mnist2dy	 8.764879	 8.473809+++	 8.831164	 9.292171	 8.644037	 8.542426	 8.798958
	mnist2dx	12.221623	12.065425	12.383539	13.012722	11.904061+++	12.005169	12.226228
	mnist2dxy	13.587995	13.508498	13.746538	14.250780	13.361326	13.339761+++	13.618278
	mnist4d		17.094644+++	25.352326	25.423771	18.618623	17.145873	17.135514	25.325004
	mnistpixels	 2.803704	 3.422196	 3.336632	 3.608828	 2.754792	 2.713217+++	 3.300772
	addinglstm	 2.287262	 2.680547	 2.749945	 2.704410	 2.266927	 2.246450+++	 2.567801
	addingirnn	10.609234+++	12.760513	13.137940	12.825036	10.870026	10.703243	12.231000
	rnnlm		23.604012	20.699631	21.767035	22.242024	23.096798	22.972406	20.174617+++
	copyseq		12.510439	13.091394	12.128401	14.763487	11.693091	11.658034+++	12.153890

	* DONE:
	DONE add ncelm to runtests
	DONE add ner to runtests:
	DONE examples/runtests.jl: should run the whole set if no options given.
	DONE fix timing problem
	DONE release nce version
	DONE check vivi job status: 1e9 running, start tenten after.

2015-11-05  Deniz Yuret  <dyuret@ku.edu.tr>

	* jonmay-tok:
	~jonmay/projects/lorelei/unitok/[ben,eng] (also tur and tam but i haven’t done much with those yet)
	eng has eng.mono.xxx.[train|test] for xxx of cdectok, notok, ulftok, unitok. ulftok should be best, notok should be worst. i guess cdectok should be close to ulftok and better than unitok.
	ben has flat, cdectok, ldctok, unitok. cdectok is the character tokenization and should be worst. flat is notok. unitok and ldctok should be close and the best.
	morphsyn: http://groups.csail.mit.edu/rbg/code/morphsyn/

	* cuda-streams:
	http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
	https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu
	http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

	* copytenten.prof:
	1	  11872	total
	0.729616  8662	cudaDeviceSynchronize
	0.66939	  7947	s2s_bptt
	0.521732  6194	op.back
	0.291358  3459	op.forw
	0.215212  2555	dot.back.At_mul_B.dense
	0.206284  2449	dot.back.A_mul_Bt.dense
	0.172675  2050	dot.forw.A_mul_B.dense
	0.0560142 665	add.forw.baddforw2!
	0.0441375 524	add.back.baddback2!
	0.0328504 390	net.back.axpy

	* saman.bug: dyuret@parcore-6-0:/mnt/kufs/scratch/szia13/resizedata148/preprocesseddata/Knettestpretrain.jl
	~/tmp/saman-setup.sh

	* DONE:

	DONE --- ashish nce model: can implement as a compound model with different train and test final layers!
	DONE implement profile nce, s2s/nce model
	DONE jonmay tokenization problem
	DONE profile s2s
	DONE solve saman bug

2015-11-03  Deniz Yuret  <dyuret@ku.edu.tr>

	* nce-prof:

	secs	H	B	N	T	V	desc
	1.69	100	128	128000	20	10000	lstm only
	12.12	1024	128	128000	20	10000	lstm only 10561 wps
	2.96	256	128	128000	20	10000	lstm only
	4.68	512	256	128000	20	10000	lstm only
	7.09	512	64	128000	20	10000	lstm only
	3.53	512	64	64000	20	10000	lstm only
	4.00	256	64	128000	20	10000	lstm only
	5.41	512	128	128000	10	10000	lstm only
	5.50	512	128	128000	40	10000	lstm only
	5.47	512	128	128000	20	10000	lstm only 23400 wps
	11.36	512	128	128000	20	10000	lstm only 2 layers
	17.25	512	128	128000	20	10000	lstm only 3 layers
	6.66	512	128	128000	20	10000	lstm + w.encoder 19219 wps
	6.69	512	128	128000	20	20000	lstm + w.encoder
	6.91	512	128	128000	20	100000	lstm + w.encoder
	12.54	512	128	128000	20	10000	lstm 2 layers + w.encoder
	10.87	512	128	128000	20	10000	lstm + w.encoder + w.decoder
	19.94	512	128	128000	20	10000	lstm + w.encoder + w.decoder + bias
	20.52	512	128	128000	20	10000	lstm + w.encoder + w.decoder + bias + soft
	8.12	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder + bias + soft
	7.97	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder + bias
	3.69	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder
	1.62	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder + bias: fake cudnnAddTensor
	3.69	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder: with cudnnAddTensor
	1.62	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder: with fake add
	1.79	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder: with my add kernel
	1.90	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder + bias: with my add kernel
	2.04	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder + soft: with my add kernel
	2.15	512	128	128000	20	10000	test: lstm + w.encoder + w.decoder + bias + soft: with my add kernel
	20.27	512	128	128000	20	10000	train: lstm + w.encoder + w.decoder + bias + soft: with cudnn addforw and cudnn addback
	14.16	512	128	128000	20	10000	train: lstm + w.encoder + w.decoder + bias + soft: with my addforw and cudnn addback: forw=2.15 back=12.30
	8.87	512	128	128000	20	10000	train: lstm + w.encoder + w.decoder + bias + soft: with my addforw and atomicAdd addback
	8.44	512	128	128000	20	10000	train: lstm + w.encoder + w.decoder + bias + soft: with my addforw and cublas addback
	8.07	512	128	128000	20	10000	train: lstm + w.encoder + w.decoder + bias + soft: with fake forw/back add: forw=1.62 back=6.45 biasback=5.85!

	The final bias is very expensive.  Takes y[V,B] and adds b[V,1]
	forward.  Takes dy[V,B] and accumulates into db[V,1] going
	backward.  The back involves atomic ops but even the forward is
	way too slow.
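
	In NumPy terms, the bias forw/back pair above is just a broadcast
	add and a row-wise sum (the sum is the reduction that the
	atomicAdd/cublas/cudnn variants above implement differently):

```python
import numpy as np

V, B = 10000, 128
y = np.zeros((V, B))
b = np.ones((V, 1))

# forward: broadcast b over the B columns of y
y_out = y + b                         # shape (V, B)

# backward: accumulate dy over the batch dimension into db
dy = np.ones((V, B))
db = dy.sum(axis=1, keepdims=True)    # shape (V, 1); here each entry == B
```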

	mnist2d gives contradictory results?
	7.71	addback=cublas
	7.65	addback=cudnnConvolutionBackwardBias
	7.31	addback=my-atomicAdd-kernel (gives non-deterministic results)

2015-11-02  Deniz Yuret  <dyuret@ku.edu.tr>

2015-10-30  Deniz Yuret  <dyuret@ku.edu.tr>

	* nce: Here are the operations:

	Train:
	0. starting with h.
	1. pick which noise words go with a minibatch.
	2. pick rows of the decoding matrix (minibatch words + noise words)
	3. do the matrix product of hidden layer h with those rows to get s(y)
	4. calculate loss: what exactly is the input here?
	   for real  words: exp(s(y))/(exp(s(y))+kq(y))
	   for noise words:  (k q(y))/(exp(s(y))+kq(y))
	   So the loss function needs k (actually that can be computed from size(x)) and q(y) (or maybe I and q(I)), and a way to distinguish noise from real.
	5. calculate gradient:
	   for real  words: 1 - exp(s(y))/(exp(s(y))+kq(y))
	   for noise words:   - exp(s(y))/(exp(s(y))+kq(y))
	   Again, the same things are needed.

	Test:
	0. starting with h.
	1. do the matrix product of hidden layer h with the whole decoding matrix.
	2. apply the softmax transformation to the output.
	3. use softmax loss and softmax gradient.
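
	Steps 4-5 of the train pass can be sketched as follows (Python;
	names are illustrative: s is the score s(y), logkq is log(k q(y)),
	is_real distinguishes minibatch words from noise words, and the
	log-space form is used for stability):

```python
import numpy as np

def nce_posterior(s, logkq):
    # P(data | y) = exp(s) / (exp(s) + k q(y)) = sigmoid(s - log(k q(y)))
    return 1.0 / (1.0 + np.exp(logkq - s))

def nce_loss(s, logkq, is_real):
    # -log P(data|y) for real words, -log P(noise|y) for noise words
    p = nce_posterior(s, logkq)
    return np.where(is_real, -np.log(p), -np.log(1.0 - p))

def nce_grad(s, logkq, is_real):
    # dlogL/ds: real words get 1 - p, noise words get -p (as above)
    p = nce_posterior(s, logkq)
    return np.where(is_real, 1.0 - p, -p)
```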

	It seems adding conditional statements to the compiler is more elegant?
	The same mechanism could also improve dropout and rnd.
	Or do we have two networks with shared weights?

	h = dot(u,x)  # or whatever gives the hidden layer, dims=(hdims,nbatch)
	w = par(; dims=(vocab,hdims))
	a = dot(w,h) if !trn # (vocab,nbatch)
	b = bias(a) if !trn
	c = soft(b) if !trn
	r = nce(?; dims=(k+nbatch,vocab)) if trn # or maybe rnd(; dims=(..), sparse=true, rng=OneHotRows())?  but we need the batch words in there.  maybe two separate r matrices?  we have no concat op.
	s = dot(r,w) if trn  # (k+nbatch,hdims)
	t = dot(s,h) if trn  # (k+nbatch,nbatch)

	nceloss(ypred,ygold,ygrad)
	nceloss(ypred,ygold)

2015-10-28  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE axpb back is not working -- fixed bug
	DONE unused variables seem to cause problem -- fixed bug
	DONE fix infersize
	DONE figure out saman's scaling problem, why do we need lr=0.5/255 if xrange is [0:255]?
	DONE saman: also could you please add absolute value rectification as well? i would be needing it to replicate the socher model:  /mnt/kufs/scratch/szia13/resizedata148/Knettest1.jl
	DONE src/model/s2s.jl (setrow!): these assume one hot columns, make them more general: no need to do this in s2s, it constructs one-hot.


2015-10-27  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE --- src/op/drop.jl: implement rnd.  implement dropout using rnd and mul: none of the ops are using trn right now! we don't even need a trn flag for ops except for rnd.
	DONE handout-mtg Also conv/pool/rnd/con in the handout have not yet been implemented: no need for con, we have axpb.
	DONE mul.jl:54 when we add rnd for dropout this should be x2,x2,x2; we are not doing broadcasting.
	DONE need train/test switch for nce and dropout: for dropout we use the trn flag, for nce we'll think later.
	DONE src/net.jl: handle con and rnd in tosave: no need for con, we have axpb.  for rnd we need it for back calc because it changes every iteration.  drop does this because in mul we have back_reads_x. back:56: if r.toincr[n] && !isa(r.op[n], Par) fill!(r.dif[n],0): check, rnd should never have dif.  but we have back.jl:30 if r.dif[n] == nothing set inputs to nothing: that's fine, rnd has no inputs. ok, we need to debug the NaN.
	DONE why do we get nan?  copyseq gives nan with gradient clipping for lr=2 gclip=10,5 on ptb.  can we prevent this?

2015-10-26  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE add can also take a scaling argument?
	DONE put warning to s2s/tagger about not being onehot: no need, they start with sequence generators that produce tokens, not arbitrary feature vectors.
	DONE scale learning rate for gclip
	DONE -- ai-mtg: Onur and Ozan's bidirectional model.
	DONE noop bug: why didn't axpb work as a noop?  ner16.jl. -- fixed back.jl
	DONE primitive ops cannot be used as models?  Net(axpb) failing: That is normal, primitive ops and knet functions have different types of output: axpb(:a,:b) => (Knet.Axpb(1,1,0),:a,:b); noop(:a,:b) => quote; b = axpb(a); end
	DONE do not expose Net to users it is confusing: knet functions and models (FNN, RNN, etc.) should be sufficient.


2015-10-25  Deniz Yuret  <dyuret@ku.edu.tr>

	* ner-prof:

	7895 .../.julia/v0.4/Knet/src/model/tagger.jl; train; line: 11

	1324 .../.julia/v0.4/Knet/src/model/tagger.jl; tagger_loop; line: 20 m.forw
	1341 .../.julia/v0.4/Knet/src/model/tagger.jl; tagger_loop; line: 21 m.back
	313  .../.julia/v0.4/Knet/src/model/tagger.jl; tagger_loop; line: 22 m.pred
	4872 .../.julia/v0.4/Knet/src/model/tagger.jl; tagger_loop; line: 24 m.bptt

	1280 ....julia/v0.4/Knet/src/model/tagger.jl; tagger_forw; line: 36 m.forw.net.forw
	1306 ....julia/v0.4/Knet/src/model/tagger.jl; tagger_forw; line: 36 m.back.net.forw
	259 .../.julia/v0.4/Knet/src/model/tagger.jl; tagger_forw; line: 36 m.pred.net.forw
	637  ....julia/v0.4/Knet/src/model/tagger.jl; tagger_bptt; line: 62 m.pred.net.back
	2099 ....julia/v0.4/Knet/src/model/tagger.jl; tagger_bptt; line: 66 m.back.net.back
	2071 ....julia/v0.4/Knet/src/model/tagger.jl; tagger_bptt; line: 69 m.forw.net.back

        1076 ...dev/.julia/v0.4/Knet/src/net/forw.jl; forw; line: 23 m.forw.op.forw
        1135 ...dev/.julia/v0.4/Knet/src/net/forw.jl; forw; line: 23 m.back.op.forw
        195 ...dev/.julia/v0.4/Knet/src/net/forw.jl; forw; line: 23  m.pred.op.forw
        320 ...dev/.julia/v0.4/Knet/src/net/back.jl; back; line: 39  m.pred.op.back
        45  ...dev/.julia/v0.4/Knet/src/net/back.jl; back; line: 44  m.pred.axpy
        1266 ...ev/.julia/v0.4/Knet/src/net/back.jl; back; line: 39  m.back.op.back
        559  ...ev/.julia/v0.4/Knet/src/net/back.jl; back; line: 44  m.back.axpy
        1279 ...ev/.julia/v0.4/Knet/src/net/back.jl; back; line: 39  m.forw.op.back
        564  ...ev/.julia/v0.4/Knet/src/net/back.jl; back; line: 44  m.forw.axpy

	1045 op/dot.jl         forw              19	=> A_mul_B!
	 239 util/linalg.jl     A_mul_B!          51	=> dense x sparse (similar)
	 201 util/linalg.jl     A_mul_B!          54	=> dense x sparse (csrmm2!)
	 121 util/linalg.jl     A_mul_B!          55	=> dense x sparse (geam!)
	 114 util/linalg.jl     A_mul_B!          56	=> dense x sparse (free)
	 # dense x dense does not even compare, switch to all dense

	547 util/linalg.jl     axpy!            219	=> add_csr_dns_atomic_32
	545 op/add.jl          back              51	=> biasback (cudnnConvolutionBackwardBias)
	511 op/dot.jl          back              23	=> A_mul_Bt!
	485 op/dot.jl          back              24	=> At_mul_B!
	475 util/linalg.jl     At_mul_B!	 35	=> dense
	462 op/add.jl          forw              31	=> biasforw (cudnnAddTensor)
	319 util/linalg.jl     A_mul_Bt!         34	=> dense
	284 op/actf.jl         sigmback          35	=> (cudnnActivationBackward)
	232 util/linalg.jl     mul2!            228	=> mul2_32
	 125 op/mul.jl          forw              18
	 125 op/mul.jl          back              30
	 108 op/mul.jl          back              41
	217 op/loss.jl         softloss          90	=> softlossback32csc (sparse ygold)
	119 op/actf.jl         tanhback          41	=> (cudnnActivationBackward)
	105 util/linalg.jl     A_mul_Bt!        125	=> dense x sparse

2015-10-24  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/op/add.jl: cudnnAddTensor and cudnnAddTensor_v3:

	CUDNN v3 uses NCHW and NCDHW orders by default.  (Reverse these for Julia)

	The deprecated version 2 only supports 4-D and had the following modes:

	11HW: CUDNN_ADD_IMAGE or CUDNN_ADD_SAME_HW: In this mode, the bias tensor is defined as one image with one feature map. This image will be added to every feature map of every image of the input/output tensor.
	1CHW: CUDNN_ADD_FEATURE_MAP or CUDNN_ADD_SAME_CHW: In this mode, the bias tensor is defined as one image with multiple feature maps. This image will be added to every image of the input/output tensor.
	1C11: CUDNN_ADD_SAME_C: In this mode, the bias tensor is defined as one image with multiple feature maps of dimension 1x1; it can be seen as a vector of feature maps. Each feature map of the bias tensor will be added to the corresponding feature map of all height-by-width pixels of every image of the input/output tensor.
	NCHW: CUDNN_ADD_FULL_TENSOR: In this mode, the bias tensor has the same dimensions as the input/output tensor. It will be added point-wise to the input/output tensor.

	‣ Except for the CUDNN_ADD_SAME_C mode, the dimensions h,w of the two tensors must match.
	‣ In the case of CUDNN_ADD_IMAGE mode, the dimensions n,c of the bias tensor must be 1.
	‣ In the case of CUDNN_ADD_FEATURE_MAP mode, the dimension n of the bias tensor must be 1 and the dimension c of the two tensors must match.
	‣ In the case of CUDNN_ADD_FULL_TENSOR mode, the dimensions n,c of the two tensors must match.
	‣ In the case of CUDNN_ADD_SAME_C mode, the dimensions n,w,h of the bias tensor must be 1 and the dimension c of the two tensors must match.

	The new version has a simpler interface:
	Each dimension of the bias tensor must match the corresponding dimension of the srcDest tensor or must be equal to 1.

	The question is: can the new version simulate everything that the old version did, and does it support Saman's requirement?
	I think by setting the appropriate dimensions to 1, the answer to both questions is yes.

	However CUDNN still assumes that the two tensors are of the same dimensionality, while Knet supports different dimensionalities.
	So when the bias is shorter, what exactly do we do?
	Currently only 1-D is supported and it is assumed to be 1C11.
	It seems we can assume the leftmost cudnn dimensions are 1 for the other cases to get v2 behavior.
	What are Julia's broadcasting rules?
	Julia assumes the missing rightmost (trailing) dimensions are 1 when ndims differ; otherwise each dimension must match or be 1.
	In the ADD_SAME_C case if we have NC, and the bias is C we are ok.
	But if we have NCHW, and the bias is C that requires special handling.
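
	The trailing-singleton rule and the two cases above, as a small
	Python check (julia-style padding on the right; shapes hypothetical):

```python
def broadcast_ok(x_shape, b_shape):
    # Julia-style: pad the shorter shape with trailing 1s, then every
    # bias dimension must equal the corresponding dim or be 1
    n = max(len(x_shape), len(b_shape))
    x = tuple(x_shape) + (1,) * (n - len(x_shape))
    b = tuple(b_shape) + (1,) * (n - len(b_shape))
    return all(bd == xd or bd == 1 for xd, bd in zip(x, b))

# NC input (Julia dims (C,N)) with a C-vector bias: ok.
# NCHW input (Julia dims (W,H,C,N)) with a C-vector bias: the vector
# lines up with W, not C -- the special handling mentioned above.
# Reshaping the bias to the 1C11 layout (Julia (1,1,C,1)) fixes it.
```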

2015-10-23  Deniz Yuret  <dyuret@ku.edu.tr>

	* convolution:

	matlab 1-D y=conv(x,w,'full')  "Full convolution (default)."
	y[k] = sum[j] x[j] w[k-j+1]
	k=1:x+w-1
	j=max(1,k+1-w):min(k,x)
	cudnn: ndims=2, padding=w-1, stride=1, upscale=1, mode=CUDNN_CONVOLUTION
	Note: ndims=1 is not implemented; just use ndims=2 and take the central (only non-zero) column of the result.

	matlab 1-D y=conv(x,w,'same')  "Central part of the convolution of the same size as u."
	y[k] = sum[j] x[j] w[k-j+w/2+1]
	k=1:x
	j=max(1,k-w/2+1):min(k+w/2,x)
	cudnn: ndims=2, padding=w/2, stride=1, upscale=1, mode=CUDNN_CONVOLUTION
	Note: This works only if length(w) is odd.  There is no setting to get 'same' with w even.

	matlab 1-D y=conv(x,w,'valid')  "Only those parts of the convolution that are computed without the zero-padded edges."
	y[k] = sum[j] x[j] w[k-j+w]
	k=1:x-w+1
	j=k:k+w-1
	cudnn: ndims=2, padding=0, stride=1, upscale=1, mode=CUDNN_CONVOLUTION
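
	The three matlab modes map directly onto numpy.convolve, which
	makes a quick sanity check (numpy, like matlab conv, flips the
	filter, i.e. true convolution rather than cross-correlation):

```python
import numpy as np

x = np.array([1., 2., 3., 4.])
w = np.array([1., 0., -1.])

full  = np.convolve(x, w, 'full')    # len(x)+len(w)-1 = 6 outputs
same  = np.convolve(x, w, 'same')    # central len(x) = 4 outputs
valid = np.convolve(x, w, 'valid')   # len(x)-len(w)+1 = 2 outputs
```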

2015-10-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* COPY-EXPERIMENTS:
	common-options: ("hidden"=>512,"lr"=>2.0,"dense"=>false,"batchsize"=>128,"winit"=>"Gaussian(0,0.01)","fast"=>true,"gclip"=>10.0,"gcheck"=>0,"epochs"=>100,"ftype"=>"Float32","seed"=>42,"datafiles"=>Any["ptb.train.txt","ptb.valid.txt"])
	10202219-copy256.out: best dev-perplexity 21.8159 @28 epochs 85 secs/epoch: copyseq.jl --fast --hidden 256 --epochs 100 ptb.train.txt ptb.valid.txt
	10210212-copy512.out: best dev-perplexity 16.4968 @28 epochs 138 secs/epoch: copyseq.jl --fast --hidden 512 --epochs 100 ptb.train.txt ptb.valid.txt
	10231733-copy1024.log: best dev-perplexity 24.1769 @16 epochs 584 secs/epoch: copyseq.jl --hidden 1024 --epochs 100 --gclip 10 --lr 1 ptb.train.txt ptb.valid.txt
	10211111-copy1024.out: buggy, actually ran with hidden=512.
	10220837-tenten512copy.log: best perplexity 2.2888 @262(iter)*128(batch)*1000(nbatch)=33,536,000 words, ran for 770 iters total with no further improvement: lr=2 gclip=10 vocab=10024 hidden=512
	10231022-tenten1024.log: best perplexity 1.02315 @846(iter)*128(batch)*1000(nbatch)=108,288,000 words, ran for 858 iters total, probably need to decrease lr to get closer to 1.0: lr=2 gclip=5 vocab=10024 hidden=1024

	* DONE:
	DONE s2s.jl: make lossreset a reporting optional argument.
	DONE copyseq.jl: --fast is broken in dev: julia copyseq.jl --fast --hidden 1024 --epochs 40 ptb.train.txt ptb.valid.txt
	DONE write mikolov ptb downloader like mnist. maybe use github lfs?
	DONE Pkg.test("CUDArt") fails? fixed at 0.2.3.
	DONE check copy results in ~/knet/v0.6/examples/
	DONE consider moving from y=f(x) to f(x,y) for the knet language. -- harder to read.
	DONE src/op/add.jl,mul.jl: handle scalar input going forw and back: implemented ax^p+b in actf instead.
	DONE linalg.jl: axpy!{T}(a,x::CudaSparseMatrixCSR{T},y::CudaSparseMatrixCSR{T}) should check sparseness pattern. no longer relevant.
	DONE util/cudnn.jl: these changes should go into CUDNN

2015-10-21  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	src/data/S2SData.jl (target): reopen when restart.

2015-10-20  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE wdot wconv bias can be just single parameter versions of dot, conv, add.  no  need for New names: decided this wasn't a good idea, it is only going to effect dot/add/conv and it is going to create more confusion.
	DONE download/install barrets code for testing.
	DONE compare s2s with barret without gclip
	DONE reimplement mlp/rnnlm with the repeat instruction
	DONE complete gradcheck, predict in addition to train/test for models other than fnn
	DONE examples/mnistpixels.jl: gcheck gives terrible results, check.
	DONE examples/mnistsparse.jl: gcheck does not work?
	DONE examples/rnnlm.jl: BUG: start does not handle dense arrays. ArgParse does not handle Real.
	DONE find out why csru breaks load without module: workspace() function clears Main and solves problem (losing all variables)
	DONE fix padding for s2s
	DONE gradcheck for sequences does not work: it has to go through a whole sequence, implement as option or separately.
	DONE s2s.jl: add block sorting
	DONE src/model/s2s.jl: implement gcheck.
	DONE src/op/loss.jl: fix softloss to ignore zero columns.

	* src/model/s2s.jl: we need to handle padding for short sequences
	in a minibatch.  This means certain columns of x/y need to be
	ignored.  During forw calculation different columns do not
	interfere, so nothing to be done, we can leave noise in those
	columns.  During back calculation if we can get the loss gradient
	to be zero for those columns, that would propagate back the
	message "I do not care what is in those columns" and do the right
	thing.  We can rely on zero columns on ygold or use a mask.  (1)
	Zero columns only work for softloss, not quadloss.  They take time
	to produce and detect.  (2) The mask will change with every token,
	so it would have to be stored with ygold[t] (even during encoding
	when ygold is nothing).  Both encode and decode would have to copy
	and push this on the stack.  bptt would have to pop it and pass it
	to back, maybe as a keyword option during both decode and encode
	back.  Actually we don't need it during encoder back, since
	ygold=nothing there is no need to mask it.  Will the noise columns
	interfere during encoder back?  The cell and hidden arrays get
	passed forward, and they have noise columns.  If the loss gradient
	for these columns is 0 we should be ok.  biasback simply sums
	ygrad -> ok.  dotback has dw=dy*x'.  The noise columns in x will
	get multiplied by the zero columns in dy -> ok.  Are dy=0 columns
	preserved going back?  They should but check.  So encoder needs no
	mask, and we can store mask just with dy.  In both cases we need
	s2s and loss modified in coordination.  (3) We could use negative
	numbers.
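
	Option (2) above, sketched in NumPy (illustrative names; dy is the
	softloss gradient and mask marks the real, non-padded columns):

```python
import numpy as np

def masked_softloss_back(ypred, ygold, mask):
    # standard softloss gradient per column, normalized by batch size
    dy = (ypred - ygold) / ypred.shape[1]
    # padded columns (mask == 0) propagate "do not care": zero gradient
    dy[:, mask == 0] = 0.0
    return dy
```

	dotback then computes dw = dy*x', so the noise columns in x get
	multiplied by the zero columns in dy and drop out, as argued above.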

2015-10-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE barett s2s test (copying english?)
	DONE examples/copyseq.jl: implement train/test.
	DONE profile s2s
	DONE s2s.jl: add warning if not using the whole dataset.
	DONE src/net/initforw.jl: create and use toforw to avoid unnecessary forw calculation. Eliminate S2C! It turns out you can't. Forw doesn't know if ygold=nothing.
	DONE add copyseq to runtests.

	* copyseq-profile:
	Do not use a batchsize less than 100.
	Powers of two do help.

		number of sentences
	batch	10000	5120	6400	sparse
	1025	21.63
	1024	16.33	7.68
	1000	18.10
	512	17.50	7.22
	500	18.24
	256	19.09	7.52	10.41	8.36
	200	20.06	7.53	12.81	8.83
	128	20.44	9.62	10.78	9.17
	100	22.29	10.54	11.85	10.02
	64	25.48	9.87	13.69	11.47
	32	36.50	14.14
	16	57.60

	Further improvements using batchsize=128 corpus=6400 from 9.17
	6.20 iw=csr dw=dense removed decoder output bias looking at copyseq.3.prof
	6.06 if we use dw=ArrayAccumulator
	7.67 if we use iw=dw=dense
	3.82 if we use iw=csru dw=ArrayAccumulator
	4.01 if we turn loss/norm calculation back on
	11.13 iw=dw=dense in comparison
	6.42 iw=csr dw=dense
>>>	3.96 iw=csru dw=dense
	2.55 gpusync=nothing
	2.06 batchsize=256

	iw=csru dw=dense gpusync=nothing whole ptb.train.txt, trying batchsize
	41.01 128 (1,1085.8583984375,6.990126f0,0,0)
	38.90 128 noloss
	31.88 256 noloss
	24.64 512 noloss
	21.82 1024 noloss

	iw=csru dw=dense gpusync=nothing whole ptb.train.txt, batch=128, gclip=0, trying lr
	0.1 (1,1085.8583984375,6.990126f0,0,0)
	0.2 (1,956.5218505859375,6.8633037f0,0,0)
	0.5 (1,958.371826171875,6.865236f0,0,0)
>>>	1.0 (1,674.5174560546875,6.5139976f0,0,0)
	1.5 (1,NaN,NaN32,0,0)

	...lr=1, trying gclip
	inf (1,674.5174560546875,6.5139976f0,0,0)
	100 (1,763.9644165039062,6.638521f0,0,55.73273f0)
	50  (1,757.4146728515625,6.629911f0,0,270.44965f0)
	20  (1,681.4532470703125,6.5242276f0,0,95.472626f0)
>>>	10  (1,633.58251953125,6.4513903f0,0,109.699844f0)
	5   (1,681.5367431640625,6.52435f0,0,128.67555f0)
	2   (1,748.1823120117188,6.6176467f0,0,84.69178f0)
	1   (1,755.4359741210938,6.627295f0,0,101.31935f0)

	...gclip=10, trying lr
	.5  (1,743.931884765625,6.6119494f0,0,101.189766f0)
	1   (1,633.58251953125,6.4513903f0,0,109.699844f0)
	1.5 (1,560.0001831054688,6.327937f0,0,55.12631f0)
	2   (1,541.9075927734375,6.2950954f0,0,33.691235f0)
	2.5 (1,550.0517578125,6.3100123f0,0,55.207405f0)
	3   (1,581.8964233398438,6.3662925f0,0,55.3924f0)
	6   (1,711.7434692382812,6.5677176f0,0,28.626701f0)

	lr=2, final gclip check
	5   (1,585.214111328125,6.371978f0,0,188.39285f0)
	10  (1,541.9075927734375,6.2950954f0,0,33.691235f0)
	20  (1,582.9039306640625,6.3680224f0,0,59.838856f0)

	lr=2, gc=10, impact of batchsize on time and accuracy
	1024 23.05 (1,2182.025390625,7.688009f0,0,104.297005f0)
	512  25.93 (1,1387.432861328125,7.2352104f0,0,79.66049f0)
	256  33.57 (1,887.585693359375,6.788505f0,0,42.632828f0)
	128  41.15 (1,541.9075927734375,6.2950954f0,0,33.691235f0)
	64   64.62 (1,315.03179931640625,5.7526736f0,0,56.093933f0)
	32  112.83 (1,173.69186401367188,5.157283f0,0,91.488815f0)
	16  213.02 (1,116.3692398071289,4.756768f0,0,179.33853f0)

	where do they get in 2 epochs?
	1024 46.06 (2,949.5081787109375,6.855944f0,0,48.838074f0)  < 128
	512  52.42 (2,684.6193237304688,6.528863f0,0,191.49686f0)  < 128
	256  66.53 (2,399.59967041015625,5.9904633f0,0,36.83023f0) < 64
	128  82.07 (2,207.21353149414062,5.33375f0,0,54.024696f0)  ?
	64  129.33 (2,123.24575805664062,4.8141804f0,0,56.24083f0) ?
	32  224.85 (2,67.45431518554688,4.2114506f0,0,135.91908f0) > 16

	and 3 epochs
	1024 69.14 (3,717.0039672851562,6.5750813f0,0,34.406246f0) < 64
	512  79.04 (3,454.5836181640625,6.119382f0,0,34.60331f0)   < 128
	256  99.90 (3,218.51165771484375,5.3868394f0,0,60.949215f0) < 128
	128 123.82 (3,119.21369934082031,4.7809176f0,0,68.25633f0) > 64
	64  194.36 (3,75.53480529785156,4.3245935f0,0,92.08937f0)
	32  333.22 (3,44.343299865722656,3.7919617f0,0,203.678f0)

	* best: lr=2 gclip=10 batch=128 iw=csru dw=dense
	1      46.8345 603.903 353.193 345.621 5.84534 0      69.7023 
	2      89.9217 203.996 229.068 216.359 5.37694 0      52.1384 
	3      133.053 116.824 119.512 117.12 4.7632 0      41.745 
	4      176.464 84.6054 90.0316 87.8274 4.47537 0      77.5709 
	5      220.23 64.2483 67.7716 65.6518 4.18437 0      69.8705

	playing with hidden size, comparing output at epoch=20:
	100  20     869.202 19.9098 32.8415 31.4728 3.44912 0      139.686 
	200  20     1320.63 10.4122 18.7426 18.1979 2.90131 0      140.008 
	512  20     2400.58 4.05138 18.1581 17.5052 2.8625 0      139.964
	1024 20     5028.53 2.59843 24.3922 23.2176 3.14491 0      116.486 

2015-10-15  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE @knet function add2(x,y;out=0) ... end syntax for ops and models.
	DONE ess bugs
	DONE examples/adding.jl: --nettype lstm does not work? - fixed bug
	DONE src/op/compound.jl: add manual init to wconv and wdot -- using init.

2015-10-14  Deniz Yuret  <dyuret@ku.edu.tr>

	* ess-bug: (DONE) happens after constructors: Net( or Constant( etc.  changing fieldnames->names works but breaks 0.3.  find a solution that will work on both.
	Debugger entered--Lisp error: (wrong-type-argument sequencep names)
	#[(s1 s2) "G	GW\207" [s1 s2] 2](names WARNING:)
	sort((WARNING:) #[(s1 s2) "G	GW\207" [s1 s2] 2])
	ess-julia-eldoc-function()
	#[0 "\302 \204 \205G \303\304!\210\304\207	\203 \303	 !\207\305 \306 \211\204$ \304\202B @=\2038 \307\310\"\206B \311!\202B \311!\206B \307\310\"\303!\266\203\207" [eldoc-last-message eldoc-documentation-function eldoc-display-message-p eldoc-message nil eldoc-current-symbol eldoc-fnsym-in-current-sexp apply eldoc-get-fnsym-args-string eldoc-get-var-docstring] 5 "\n\n(fn)"]()
	funcall(#[0 "\302 \204 \205G \303\304!\210\304\207	\203 \303	 !\207\305 \306 \211\204$ \304\202B @=\2038 \307\310\"\206B \311!\202B \311!\206B \307\310\"\303!\266\203\207" [eldoc-last-message eldoc-documentation-function eldoc-display-message-p eldoc-message nil eldoc-current-symbol eldoc-fnsym-in-current-sexp apply eldoc-get-fnsym-args-string eldoc-get-var-docstring] 5 "\n\n(fn)"])
	eldoc-print-current-symbol-info()
	#[0 "\205 \301 \207" [eldoc-mode eldoc-print-current-symbol-info] 1 "\n\n(fn)"]()
	apply(#[0 "\205 \301 \207" [eldoc-mode eldoc-print-current-symbol-info] 1 "\n\n(fn)"] nil)
	byte-code("r\301\302H\303H\"\210)\301\207" [timer apply 5 6] 4)
	timer-event-handler([t 0 0 100000 t #[0 "\205 \301 \207" [eldoc-mode eldoc-print-current-symbol-info] 1 "\n\n(fn)"] nil idle 0])

	* eldoc: (DONE) after display(:
	Debugger entered--Lisp error: (wrong-type-argument sequencep text/plain)
	#[(s1 s2) "G	GW\207" [s1 s2] 2](")},x)" text/plain)
	sort(("(d::Display,mime::AbstractString,x)") #[(s1 s2) "G	GW\207" [s1 s2] 2])
	ess-julia-eldoc-function()
	#[0 "\302 \204 \205G \303\304!\210\304\207	\203 \303	 !\207\305 \306 \211\204$ \304\202B @=\2038 \307\310\"\206B \311!\202B \311!\206B \307\310\"\303!\266\203\207" [eldoc-last-message eldoc-documentation-function eldoc-display-message-p eldoc-message nil eldoc-current-symbol eldoc-fnsym-in-current-sexp apply eldoc-get-fnsym-args-string eldoc-get-var-docstring] 5 "\n\n(fn)"]()
	funcall(#[0 "\302 \204 \205G \303\304!\210\304\207	\203 \303	 !\207\305 \306 \211\204$ \304\202B @=\2038 \307\310\"\206B \311!\202B \311!\206B \307\310\"\303!\266\203\207" [eldoc-last-message eldoc-documentation-function eldoc-display-message-p eldoc-message nil eldoc-current-symbol eldoc-fnsym-in-current-sexp apply eldoc-get-fnsym-args-string eldoc-get-var-docstring] 5 "\n\n(fn)"])
	eldoc-print-current-symbol-info()
	#[0 "\205 \301 \207" [eldoc-mode eldoc-print-current-symbol-info] 1 "\n\n(fn)"]()
	apply(#[0 "\205 \301 \207" [eldoc-mode eldoc-print-current-symbol-info] 1 "\n\n(fn)"] nil)
	byte-code("r\301\302H\303H\"\210)\301\207" [timer apply 5 6] 4)
	timer-event-handler([t 0 0 100000 t #[0 "\205 \301 \207" [eldoc-mode eldoc-print-current-symbol-info] 1 "\n\n(fn)"] nil idle 0])


2015-10-13  Deniz Yuret  <dyuret@ku.edu.tr>

	* @knet: new interface issues:
	- @knet takes a function definition, ends up defining a function,
	with the given name, but one that returns interpolated
	expressions, not evaluating them.
	- the arguments will have to create input expressions.
	- do we still need the input op?
	- keyword args in the body need careful handling.
	- we have ops and knet functions, any interface difference?
	- knet functions used to define both models and new ops, any
	interface difference?
	- at what point do we use gensyms?

	@knet function logreg(x; out=0)  # compiler can handle x without explicit input stmts; we still need input registers so input ops will be in the compiler output
	    y = wdot(x; out=out)         # careful interpolating keyword args, rhs interpolates, lhs doesn't
	    soft(y)                      # assignment could be optional for last stmt?
	    # soft(wdot(x; out=out))     # support embedding?
	end

	@knet function wdot(x; out=0, winit=Gaussian(0,.01), o...)
	    w = par(out,0; init=winit, o...) # careful interpolating o...
	    y = dot(w,x)
	end

	- compiler actually calls these functions during compilation
	- netstmt separates args and params, params for constructor, args for op
	- if the result is an op, it is inserted into ops,inputs,output
	- if the result is an Expr, it is further compiled by subcomp.
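
	A toy version of this compile-by-calling scheme (Python sketch;
	the Compiler class and register naming are invented for
	illustration, not Knet's actual compiler).  Each knet-style
	function, when called during compilation, emits instructions into
	a flat ops list instead of computing values, and sub-functions
	like wdot expand inline:

```python
# Toy "compile by calling": functions return registers and append
# (output, op, inputs, params) tuples; registers are gensym-like.
class Compiler:
    def __init__(self):
        self.ops, self.n = [], 0
    def reg(self):                      # fresh register name
        self.n += 1
        return f"%{self.n}"
    def emit(self, op, *args, **params):
        out = self.reg()
        self.ops.append((out, op, args, params))
        return out

def wdot(c, x, out=0):
    w = c.emit("par", out, 0)           # parameter: created, not computed
    return c.emit("dot", w, x)

def logreg(c, x, out=0):
    y = wdot(c, x, out=out)             # sub-function expands inline
    return c.emit("soft", y)

c = Compiler()
x = c.emit("input")                     # input op stays in compiler output
top = logreg(c, x, out=10)
for instr in c.ops:
    print(instr)
```

	Running this gives four instructions (input, par, dot, soft),
	mirroring how netstmt separates args (for the op) from params
	(for the constructor).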

	* DONE:
	DONE turn off gpusync() when done with profiling

2015-10-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE decided to keep dw dense for now, speed is similar to ArrayAccumulator and no problems with vecnorm.
	DONE cudnn: update doc and release cudnn
	DONE expensive sparse axpy: check if sparsity pattern remains the same, and do elementwise on nzval when you can. examples/mnistsparse.jl; investigate sparsity patterns for increment (not being used in mnist).  examples/mnistsparse.jl: axpy!(::Int64, ::CUSPARSE.CudaSparseMatrixCSR{Float64}, ::CUDArt.CudaArray{Float64,2})
	DONE performance improvements based on rnnlm profile: need better sparse back: A_mul_Bst->sparse and axpy

	* rnnlm-profile: using sparse input and accumulator dw matrix.
	Note that turning on gpusync for profiling slows it down from 16
	secs to 23 secs.

	1.0000	   22855 ...nn/examples/rnnlm.jl; rnnlm; line: 41
	0.3590	    8206  ...net/src/model/rnn.jl; train; line: 13	# forw
	0.3405	     7781 ...Knet/src/net/forw.jl; forw; line: 23
	0.1380	      3155 .../Knet/src/op/add.jl; forw; line: 32	# biasforw
	0.1380	       3154 .../Knet/src/op/add.jl; biasforw; line: 40
	0.1439	      3288 .../Knet/src/op/dot.jl; forw; line: 18	# A_mul_B! only 166 from sparse
	0.5826	    13315 ...net/src/model/rnn.jl; train; line: 21	# back
	0.4278	     9777 ...Knet/src/net/back.jl; back; line: 41
	0.1069	      2443 .../Knet/src/op/add.jl; back; line: 52	# biasback
	0.0851	      1946 .../Knet/src/op/dot.jl; back; line: 22
	0.0811	       1853 .../src/util/linalg.jl; A_mul_Bt!; line: 36	# dense
	0.1410	      3223 .../Knet/src/op/dot.jl; back; line: 23
	0.1401	       3201 .../src/util/linalg.jl; At_mul_B!; line: 37	# dense
	0.0476	     1087 ...Knet/src/net/back.jl; back; line: 51	# axpy! (gpusync)

	* (ERROR): vecnorm buggy for Acc.  use --max_grad_norm 0 --lr 0.1 until fixed.

	* rnnlm-timing: using ptb.test.txt only: dw += iw; w += dw; w always dense.
	15.27 sec: Float32, iw:sparse, dw:dense, kernel mul, regulr axpy, loss=700.67 buggy: kernel mul produces csru, needs atomicAdd
>>>	15.39 sec: Float32, iw:sparse, dw:dense, kernel mul, atomic axpy, loss=688.35
	15.75 sec: Float32, iw:sparse, dw:spacc, kernel mul, atomic axpy, loss=699.17 ??? spacc vecnorm bug
	16.34 sec: Float32, iw:sparse, dw:dense, cusprs mul, regulr axpy, loss=687.96
	16.39 sec: Float32, iw:sparse, dw:dense, cusprs mul, atomic axpy, loss=687.96
	16.88 sec: Float32, iw:sparse, dw:spars, kernel mul, sparse axpy, loss=693.77 ??? axpy needs uniq?
	19.39 sec: Float32, iw:sparse, dw:spars, cusprs mul, sparse axpy, loss=687.67
	19.57 sec: Float32, iw:dense,  dw:dense, dense  mul, dense  axpy, loss=686.37622
	(ep,perp...,wmax,gmax,lr) = (1,686.376220703125,256.7232f0,153.58725f0,1.0) # dense reference

	19.47 sec: Float64, iw:sparse, dw:dense, kernel mul, regulr axpy, loss=704.61 buggy
	19.57 sec: Float64, iw:sparse, dw:dense, kernel mul, atomic axpy, loss=694.39
	20.08 sec: Float64, iw:sparse, dw:spacc, kernel mul, atomic axpy, loss=700.01 ???
	20.86 sec: Float64, iw:sparse, dw:dense, cusprs mul, regulr axpy, loss=694.28
	20.67 sec: Float64, iw:sparse, dw:dense, cusprs mul, atomic axpy, loss=694.39
	20.65 sec: Float64, iw:sparse, dw:spars, kernel mul, sparse axpy, loss=700.01 ??? axpy needs uniq?
	24.33 sec: Float64, iw:sparse, dw:spars, cusprs mul, sparse axpy, loss=694.42
	28.14 sec: Float64, iw:dense,  dw:dense, dense  mul, dense  axpy, loss=694.351734
	(ep,perp...,wmax,gmax,lr) = (1,694.3517340040148,256.38582340534157,159.3564629264121,1.0) # dense reference

2015-10-10  Deniz Yuret  <dyuret@ku.edu.tr>


	* DONE:
	DONE do we need 2 forw?  1 forw may make the predict interface easier. we could have a keepstate like option for sequences.
	DONE ai-mtg: We need a predict function: new forw can return out, back return dx for net stitching?
	DONE op/loss.jl: Take loss out of the model, make it a train/test option.
	DONE examples/adding.jl: figure out why the new data generator did not work.
	DONE allow tuples of any length for data, (y,) for unsupervised/inputless, (x1,x2,y) for multi-input
	DONE examples/rnnlm.jl: loss per sequence or loss per token?
	DONE examples/s2c.jl: find better way to stitch networks together without exposing guts.
	DONE embedded expr: do we really need to?  abstraction takes care of most variable hiding.
	DONE handout-mtg Avoid variables by allowing embedded expressions: we created compound instead.
	DONE model.jl: handle the case where the data has tuples of length 1 or more than 2.
	DONE src/net/forw.jl: get rid of sequence forw? have one type of forw with options (seq vs keepstate do we need both?)
	DONE src/net/forw.jl: make yout optional last argument to match the op interface? returning internal y instead.
	DONE take loss out: that way there is no need for accuracy vs test difference.  deprecate accuracy.  consider sequence vs item loss.  implement zeroone, perceptron, etc.
	DONE need better net combo: created model/* instead.

	* examples/mnistpixels.jl: How does s2c work in the new interface?
	[1] The sequence proceeds with x=pixel, y=nothing.  At the end we
	could have x=nothing, y=class.  That could be the sign to run
	net2.  [2] Conceptually simpler might be to always have net1/net2
	in a stack.  Maybe the compiler figures out net2 is not necessary
	to run as long as y=nothing.  With the last pixel we would have
	y=class.  Either that could trigger backprop or an extra
	item=nothing after the last pixel.  How much of this is
	generalizable to s2s?  There we definitely need item=nothing to
	signal end of sequence.  We could imagine stacking net1 and net2
	there as well?  No, because during decoding input is directly fed
	into net2.  So we will have to have different train routines?
	That ruins our original train(model,data) interface.  Unless we go
	back to data elements possibly being sequences, is there any way
	to salvage that?  It seems we are moving complexity from forw/back
	into train/test.  New item-based picture.  Models only input/output
	individual items, not sequences.  Data only generates individual
	items, not sequences.  Then if we are doing different things with
	different items, train/test has to deal with it?  I should look at
	NTM and the attention models now so I don't regret design
	decisions later.  But right now it looks like model.jl is
	deprecated, each model type will have to have its own
	data/train/test.  There is no common interface.

2015-10-09  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	DONE initback: minimize array zeroing.
	DONE src/op/conv.jl: implement xavier in par.jl.
	DONE src/op2: start a collection of compound ops: wdot, bias, lstm, wbf...

	* forw-loss-back: What is the best interface?
	forw(m,x) => y
	loss(m,ygold) => ygrad
	back(m,ygrad) => xgrad

	There is no point in making loss and back functional by providing
	the output y.  The output y along with a number of internal
	outputs have to be remembered by the network.  Even if we made it
	functional, giving back a different y would not work because the
	internal outputs would not be consistent with it.

	Pre-allocating arrays outside (y for forw, ygrad for loss, xgrad
	for back) results in unnecessary copying.  Instead we resign
	ourselves to the fact that the model shares internal arrays and
	copy them if
	necessary.  What about loss vs grad calculation?  We can use two
	functions:

	loss(m,ygold) => loss value
	grad(m,ygold) => ygrad array

	Or use a single function that has lots of options.  We need to be
	able to choose from a number of loss functions.  These could be
	keyword arguments, or we could have quadloss(m,ygold),
	softloss(m,ygold) etc.  The sequence of events is:

	train: x->forw->y, ygold->grad->ygrad, ygrad->back->xgrad.

	No need for loss value unless asked for.  y, ygrad, xgrad are
	internally allocated, x and ygold come from outside.  We are
	losing the ability to state which xgrad we want from
	back.

	test: x->forw->y, ygold->loss->lossval
	predict: x->forw->y, there is no ygold

	hcat: x->forw1->y1, y1->forw2->y2 (no need to alloc/copy y1 if it
	is already on gpu).  ygold2->grad2->ygrad2, ygrad2->back2->ygrad1,
	ygrad1->back1->xgrad1.  (again should not have two copies of
	ygrad1).
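
	A minimal sketch of the stitched forw/back chain above (Python;
	the Net class here is an invented linear stand-in, not a Knet
	type).  Each net remembers its own x, and back2's return value is
	fed to back1 directly, with no second copy of ygrad1:

```python
import numpy as np

class Net:
    """Minimal affine net that remembers its input, as back requires."""
    def __init__(self, nin, nout, rng):
        self.w = rng.normal(scale=0.1, size=(nout, nin))
    def forw(self, x):
        self.x = x                      # internal state needed by back
        return self.w @ x
    def back(self, dy):                 # returns dx for net stitching
        self.dw = dy @ self.x.T
        return self.w.T @ dy

rng = np.random.default_rng(2)
net1, net2 = Net(4, 3, rng), Net(3, 2, rng)
x = rng.normal(size=(4, 5))
ygold2 = rng.normal(size=(2, 5))
y1 = net1.forw(x)                       # x -> forw1 -> y1
y2 = net2.forw(y1)                      # y1 -> forw2 -> y2 (no copy of y1)
ygrad2 = y2 - ygold2                    # quadloss grad
ygrad1 = net2.back(ygrad2)              # ygrad2 -> back2 -> ygrad1
xgrad1 = net1.back(ygrad1)              # ygrad1 -> back1 -> xgrad1
assert xgrad1.shape == x.shape
```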

	Ability to use different loss functions:

	test: x->forw->y, ygold->loss->lossval (loss should be an arg to test).
	train: x->forw->y, ygold->grad->ygrad (grad should be an arg to train).

	Low level loss fn is defined as loss(y,ygold)->lossval
	Low level grad fn is defined as grad(y,ygold,ygrad)->ygrad

	Defining a single loss function signature:
	loss(y,ygold) -> computes loss
	loss(y,ygold,ygrad) -> computes loss and grad
	loss(y,ygold,ygrad; getloss=false) -> computes grad

	Is there any advantage to merging loss and grad together?  For
	softloss the calculations are independent:
	loss = -sum(ygold * log(y))
	grad = 1 - ygold/y

	quadloss not so; grad is necessary for loss.
	grad = y-ygold
	loss = (1/2)*sum((y-ygold)^2)

	xentloss: both need ynorm at some point
	loss = -sum(ygold * log(ynorm)) where ynorm is normalized exp(y)/z
	grad = ynorm - ygold
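
	Transcribed into runnable form (NumPy sketch; the softloss grad is
	kept verbatim from the note above, which assumes y is already
	column-normalized), with a finite-difference check on the
	xentloss pair:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max(axis=0))       # column-wise stable softmax
    return e / e.sum(axis=0)

def softloss(y, ygold):                 # y: probabilities
    return -np.sum(ygold * np.log(y))

def softloss_grad(y, ygold):            # as in the note: 1 - ygold/y
    return 1 - ygold / y

def quadloss(y, ygold):
    return 0.5 * np.sum((y - ygold) ** 2)

def quadloss_grad(y, ygold):            # the residual, needed by loss too
    return y - ygold

def xentloss(y, ygold):                 # y: unnormalized scores
    return -np.sum(ygold * np.log(softmax(y)))

def xentloss_grad(y, ygold):            # both share ynorm = softmax(y)
    return softmax(y) - ygold

# finite-difference check of the xentloss gradient
rng = np.random.default_rng(1)
y = rng.normal(size=(3, 2))
ygold = np.eye(3)[:, [0, 2]]
g = xentloss_grad(y, ygold)
eps = 1e-6
for i in range(3):
    for j in range(2):
        yp = y.copy(); yp[i, j] += eps
        ym = y.copy(); ym[i, j] -= eps
        num = (xentloss(yp, ygold) - xentloss(ym, ygold)) / (2 * eps)
        assert abs(num - g[i, j]) < 1e-4
```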

	Mainly we want to be able to:
	test(model,data; loss=quadloss)
	test(model,data; loss=percloss)
	test(model,data; loss=xentloss)
	train(model,data; loss=softloss)
	train(model,data; loss=quadloss)

	We don't want to have to compute lossval unless asked for.

	function train(model,data; lossfn=quadloss)
	    for (x,ygold) in data
	        y = forw(model,x)
	        l = loss(model,ygold; lossfn=lossfn) # optional
	        ygrad = grad(model,ygold; lossfn=lossfn)
	        xgrad = back(model,ygrad)
	        update!(model)
	    end
	end

	Do we need to pass around ygrad?  grad(model,ygold) sets dif[N].
	Just like forw sets out[N].  We don't pass out[N] to back, why
	should we pass dif[N]?

	Or should we do back(model,ygold; lossfn=lossfn)?  That would make
	more sense and mirror forw.  No need for extra grad fn.  This is
	getting close to having our own loss layer.  But loss layer wasn't
	doing anything in forw pass anyway.  And it was doing something
	unusual in the back pass (not treating the input as gradient).  So
	treating it as a special fn is ok.

	Two questions: (1) the xgrad problem and (2) lossval problem.
	lossval (2.1) needs extra space and (2.2) sometimes shares
	computation with grad.  lossval needs y (on gpu) and ygold (on
	cpu).  We could choose to do it on cpu like we used to.  In any
	case the two arg version of lossfn should take care of this.  If
	we do cpu we won't need temp space, if we do gpu we will.  We
	should look at relative speed etc.  (3) what does back call for
	initial gradient?  the three argument version of loss?  (4) how
	does train get its lossval?  through separate call? or getloss
	option to back? (1) how do we tell back we want xgrad?  getgrad
	option? (TODO)  affects the value of toback.  instead of passing
	arrays, pass booleans for dx.

	xloss(y,ygold) -> lossval
	xloss(y,ygold,ygrad) -> overwrites ygrad

	train, test, and back take loss keyword arg.

	(5) sequence vs item training: train needs to know how often to
	run back.  every iteration?  wait until end of sequence
	(item==nothing)?  every n iterations?  split these into separate
	training functions?  how much code in common?  how about
	keepstate?

	seq flag: cannot get rid of it in train (when to run back), and in
	set_toincr (parameters only need incremental updating for seq
	input).  So might as well make it a precondition for push/pop.
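
	One possible answer to (5), sketched in Python with an invented
	stream convention (item None marks end of sequence, echoing
	item=nothing above): forw runs every step, back/update only when
	the sequence ends:

```python
def train_seq(model, stream):
    """Run back once per sequence: items arrive one at a time and
    a None item marks end-of-sequence (the seq-flag convention)."""
    buffered = 0
    for item in stream:
        if item is None:                 # end of sequence: run bptt
            if buffered:
                model.back()             # pops the stack, accumulates dw
                model.update()
                buffered = 0
        else:
            x, ygold = item
            model.forw(x, ygold)         # pushes state on the stack
            buffered += 1

class FakeModel:                         # records calls, for illustration
    def __init__(self): self.log = []
    def forw(self, x, ygold): self.log.append("forw")
    def back(self): self.log.append("back")
    def update(self): self.log.append("update")

m = FakeModel()
train_seq(m, [(1, 1), (2, 2), None, (3, 3), None])
# m.log: forw, forw, back, update, forw, back, update
```

	Item-based and every-n-iterations variants would only change the
	condition guarding the back/update call.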

2015-10-08  Deniz Yuret  <dyuret@ku.edu.tr>

	* rnnlm-profile:

	msec	ratio	where
	5515	1.0000	model.train
	1620	0.2937	 forw.sforw:48
	10	0.0018	  forw.sforw.initforw:49
	1610	0.2919	  forw.sforw.forw:55
	205	0.0372	    forw.push:20
	800	0.1451	    forw.opforw:25
	119	0.0216	     actf
	7	0.0013	     softmax
	47	0.0085	     add
	127	0.0230	     bias
	243	0.0441	     dot
	117	0.0212	      dense.dot.dense
	102	0.0185	      dense.dot.sparse (y=w*xS)
	11	0.0020	     loss.copy
	24	0.0044	     mul
	13	0.0024	     par.init
	512	0.0928	    forw.oploss:32
	66	0.0120	     soft.similar (fix tmp?)
	31	0.0056	     soft.softloss64csc
>>>	403	0.0731	     soft.asum (rewrite asum: did not speed up)
	3808	0.6905	 train.back:49
	46	0.0083	  back.initback:55
	3762	0.6821	  back.back:58
	2606	0.4725	   back.opback:30
	75	0.0136	    sigmback
	41	0.0074	    softback
	78	0.0141	    tanhback
	116	0.0210	    biasback
	1909	0.3461	    dot.A_mul_Bt
	154	0.0279	     dense.dense->dense
>>>	1256	0.2277	     dense.sparse->sparse:79+3 (sparsify?, write kernel? /home/nlg-05/dy_052/julia/latest/base/sparse/linalg.jl:110)
	457	0.0829	     dense.sparse->sparse:80 (gemm!)
	123	0.0223	    dot.At_mul_B
	1007	0.1826	   back.axpy:33
	183	0.0332	    axpy.dense
>>>	562	0.1019	    axpy.sparse.geam:89 (could just add nzVal?)
	251	0.0455	    axpy.sparse.free:90
	30	0.0054	 train.wnorm:51
	28	0.0051	 train.gnorm:52
	26	0.0047	 train.update!:53

	* linreg-profile:
	1. before the vecnorm optimization
	2. after the vecnorm optimization

	ratio1	msec1	ratio2	msec2
	1.0000	1676	1.0000	1303	train
	0.0525	88	0.0660	86	train.next
	0.0961	161	0.1105	144	train.gradcheck
	0.2942	493	0.3231	421	train.forw
	0.0066	11	0.0138	18	train.forw.initforw
	0.0066	11	0.0069	9	train.forw.x
	0.0292	49	0.0284	37	train.forw.copy
	0.0525	88	0.0698	91	train.forw.forw
	0.0280	47	0.0261	34	train.forw.forw.dot
	0.0227	38	0.0215	28	train.forw.forw.dot.gemm!
	0.0119	20	0.0246	32	train.forw.forw.loss
	0.0119	20	0.0230	30	train.forw.forw.loss.copy! ??? why do we copy here? will change when we take losslayer out of model.
	0.1957	328	0.1865	243	train.forw.quadlossloss
	0.0209	35	0.0545	71	train.forw.quadlossloss.similar
	0.0131	22	0.0246	32	train.forw.quadlossloss.copy!
	0.0125	21	0.0200	26	train.forw.quadlossloss.axpy!
>>	0.1492	250	0.0844	110	train.forw.quadlossloss.vecnorm
	0.1390	233	0.2571	335	train.back
	0.0507	85	0.1167	152	train.back.initback
	0.0442	74	0.0944	123	train.back.initback.fill!.dif0
	0.0107	18	0.0161	21	train.back.copy!.dif0.dy
	0.0686	115	0.0975	127	train.back.back
	0.0215	36	0.0361	47	train.back.back.dot
	0.0322	54	0.0384	50	train.back.back.quadloss
>>	0.1897	318	0.0913	119	train.wnorm
>>	0.1981	332	0.0867	113	train.gnorm
	0.0298	50	0.0637	83	train.update!

2015-10-02  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	+ fix pdf
	+ setup test harness.
	+ make conv accept all cudnn options.
	DONE handout-mtg: initforw/initback: The array sharing optimizations are currently turned off.
	DONE deprecate/fix KUdense?  edit colops and linalg to fit.
	DONE put optimizations back.
	DONE profile rnnlm and check out the number of registers in lstm: come up with steps and time spent on each step for various models.


2015-09-28  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE: fix conv/pool. solve compiler namespace problem.

2015-09-27  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/KUnet.jl: DONE: figure out the namespace issue.  compiler
	does not work when inside the module.

	* examples/adding.jl: DONE: make sure back does not call ops with
	nothing.

	* examples/mnist2d.jl (net): DONE: find out where the slight
	difference comes from. bias=0.

	* src/netcomp.jl: for special par ops do we have valid flags?
	tosave: false, never necessary since we do not change between iterations
	toincr: always true if seq, true when multi otherwise
	toback: always true
	tozero: not necessary for anyone with nothings

	for regular ops during non-seq runs:
	tosave: always false, no need for stack
	toincr: multi should be true
	toback: true if par descendant. cycles? handled.
	tozero: deprecated

	following should work:
	tosave: !par & seq & back_needs_it: netcomp sets !par &
	        back_needs_it, use seq as extra condition for push/pop. ok.
	toincr: multi || (par & seq): netcomp gives multi,
	        we need to handle par & seq
	toback: par or descendant (fixed): netcomp sets it.

	* DONE: toincr is a problem:
	- netinit:33 difsparse says no if toincr, why?
	- parameters are incrementally updated in a sequence.
	- however we want to use sparse dw to go with sparse x.
	- we have to do this (hoping sparsity pattern stays the same)
	- finddif also uses it to not pick multi ops for sharing (also
	excludes par).
	- netback:9  toincr[N] is it possible for output to be a par?
	- this also affects which tmp are created.
	- should we take this out of net and make it a keyword arg?
	- we can forbid seq from switching from call to call,
	  just like we don't allow sparsity or element types to change.
	- seq vs no seq affects tosave/stack, toincr, thus tmp.

	* src/op/pool.jl: DONE: rewrite.

	* src/op/conv.jl: DONE: rewrite.

	* src/netback.jl:
	# DONE: use dx=nothing instead of returndx=false, that way we can choose which dx to return for multi-input case.
	# DONE: use path analysis to stop back/dx calculation for any path that does not lead to a parameter.

	* docs/rnn.md (example): DONE:
	## Interface issues:

	op interface:
	forw(op, x..., y)
	back(op, dy, dx...; x, y)

	net interface:
	forw(net, x...; yout, ygold)
	back(net, dy; dx)

	* src/netinit.jl: I should test with unoptimized version where all
	arrays are distinct and all are zeroed at the beginning, slowly
	introducing optimizations.

	* DONE:
	./KUnet.jl
	./data.jl
	./model.jl
	./model/*: update -> examples
	./model/irnn.jl: update -> examples
	./model/kperceptron.jl: replace
	./model/lstm.jl: update -> examples
	./model/s2c.jl: update -> examples
	./netback.jl: REWRITE
	./netcomp.jl: done
	./netforw.jl: done
	./netinit.jl: complete back
	./nettest.jl: deprecate
	./netutil.jl: done
	./op.jl: done
	./op/actf.jl: test
	./op/add.jl: test
	./op/conv.jl: update
	./op/dot.jl: test
	./op/drop.jl: replace
	./op/input.jl: done
	./op/loss.jl: test
	./op/mul.jl: test
	./op/par.jl: opts, averaging, adagrad etc.
	./op/pool.jl: update
	./update.jl: REWRITE -> par?
	./util/*: check usage
	./util/array.jl: check usage
	./util/colops.jl: check usage
	./util/cudart.jl
	./util/curand.jl
	./util/cusparse.jl
	./util/deepcopy.jl
	./util/dense.jl: deprecate?
	./util/gpu.jl
	./util/linalg.jl

2015-09-26  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/netforw.jl:
	# DONE
	# + performance: figure out when no back needed, no returndx needed
	# + performance: do better register optimization
	# + if an op is using an external array, it should not store it.
	# + if we take xi/yi as a parameter for back maybe the net would not have to remember it?
	# x rename train->trn for ops: decided mode=:train
	# x rename compiler->net, net->netfunc?
	# + change calling convention to forw(y,x...)
	# + figure out par resizing.

2015-09-25  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/net.jl: netforw.jl, netback.jl, netinit.jl/netcomp.jl/net.jl?
	(psize): work on size/type inference for both initforw and the compiler.

	* nothing: DONE: nothings are a pain.  we use them to prevent
	unnecessary allocation and copying of zero arrays.  there has to
	be a better way.  maybe constrain them only inside net and don't
	have ops deal with them?
	- nothings are back.  without them there is too much unnecessary zeroing, copying, pushing, popping etc.

	* src/op/par.jl: OK, this is too much too fast, things will be
	broken for too long.  Better to transition gradually.  Use
	existing calling conventions, existing ops.  First make things up
	and running with the new compiler, then think of consolidating the
	code.

2015-09-24  Deniz Yuret  <dyuret@ku.edu.tr>

	* examples/zaremba14.jl: @@hpc
	13311883.hpc-pbs.hpcc.  dy_052      isi      rnnlm1            52076     1      1    --   24:00:00 R  00:09:11   hpc3004/0
	13311993.hpc-pbs.hpcc.  dy_052      isi      zaremba1              0     1      1    --   24:00:00 R  00:05:53   hpc3005/0
	13311994.hpc-pbs.hpcc.  dy_052      isi      zaremba2              0     1      1    --   48:00:00 R  00:05:53   hpc3007/0
	13312198.hpc-pbs.hpcc.  dy_052      isi      rnnlm2              --      1      1    --   48:00:00 Q       --     --


2015-09-23  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE: big plan
	- book: find markdown viewer, code pretty printer, latex code formatter?
	- rnnlm
	- new compiler
	- profiler
	- debugging
	- code todo

	* docs/rnn.md (rnd): DONE: Profile and stop useless dx calc

2015-09-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/op/mmul.jl: DONE: adding.jl complains about issparse(void), why is
	x void?

2015-09-21  Deniz Yuret  <dyuret@ku.edu.tr>

2015-09-20  Deniz Yuret  <dyuret@ku.edu.tr>

	* examples/mnistsparse.jl: NO: fix sparse arrays.  Rename KUdense
	-> DynamicArrayCPU/GPU.  SparseArrayCPU/GPU.  Decided to retire
	them instead.

2015-09-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* test/testmnist.jl: collect data and code in one place.

	* test/cusparse.jl:

	# nh = 1000
	# nd = 100000
	# nb = 100
	1. GPU forw using csrmm2!(x',w,y')+transpose 0.05ms
	2. GPU back (all sparse) using gemm!(x,dy',dw') 2.13ms
	3. GPU back (all sparse) using gemm!(dy,x',dw) 7.26ms

	For the first we copy sparse minibatch x to x' in gpu,
	which is a direct copy since gpu is csr.  w already right shape.
	Need to transpose y before going further, needs extra space. (THINK)
	There is really no alternative to #1 because sparse x has to come first.

	For the second we get regular dy which we need to transpose, same
	problem as the first.  Now we also need to transpose x and dw.
	Maybe I can get away with dw by setting a transpose bit since I
	have to write w+=dw kernel anyway.  (Cancelled: check update and
	gradient clip).  Transposing x is a pain (THINK).

	The third one is slower but we already have dy, x', and dw comes
	out right.  In fact with other shapes (nh=1500, nd=10000, nb=20)
	third one is faster, so we go with it.

	DONE:
	- write w+=dw kernel for update.
	- update ops should be compatible with sparse dw.
	- get rid of KUsparse, all code uses CUSPARSE.
	- fix loss to deal with sparse dw.
	- assume user provides all arrays in cpu.
	- test sparse mnist.
	- complete lm.
	- write new assembler.

2015-09-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	- new assembly language: (1,Mmul(50),2), (3,Mul2(),4,5)

	subroutine convention: takes input from 0,-1 etc. leaves output at
	last register.  no stack.  could allow symbols, variables? no
	need.  consecutive numbers enforced? makes them redundant.
	allowing symbols makes sense: (:a,Mmul(50),:b,:c).  But then what
	are the inputs?  We could use the Julia parser to get fancy
	:(a=Mmul(50,b)).  Who needs 26 variable names (lstm), numbers are
	just fine for now.  We could have (1,Input()), (2,Input()) to
	prevent arbitrary 0,-1 etc. numbering.  Good.  Eliminate the need
	for empty parens, i.e. (3,Mul2,4,5).  Detect type and create
	object.  Good.  Ops no longer manage input output storage, why not
	take params out as well: (1,Input), (2,Param), (3,Mmul,1,2).  We
	already have a Param type!  Only mmul,bias,conv currently use
	params.  They specify sizes differently, Bias(), Mmul(50),
	Conv(5,20).  Param can init an empty array using same args: Bias
	creates (0,)->(n,), Mmul creates (m,0)->(m,n), Conv creates
	:(x,0,o)->(x,x,i,o).  We can totally do the same!  Param creates
	empty array by adding one more zero dimension than supplied
	:(modify the constructor).  No, actually the compiler should take
	care of this, just like it adds an evaluation to Mul2.  But that's
	only
	possible if we use quoted expressions.  Explicitly marking params is cool.
	It increases the number of lines (but there is no bias without
	param!).  Ops having no internal memory is cool.  (x's and y's are in
	registers).  (dx's are in registers).  Difference between x's and
	w's?  x's are ephemeral, w's stay.  The Param's are never used by
	anything but the associated op.  Unless we have Param sharing!?
	e.g. each word in the window (represented by one-hot) gets
	multiplied by the same embedding matrix!  ok.  sold.  Nothing
	except Param needs size args, most ops can be written without
	parens, except when we need keyword arguments.  but those arguments
	are always passed to param!  Param lines do not create registers,
	other lines do (or we can call them param registers that are more
	permanent and have associated update options).  Or do create
	registers for them, just don't zero them out!  No need for diff,
	inc, they use the same mechanism as other registers.  Same multi
	detection etc?  Or rather keep them separate as (1) they are not
	zeroed in initforw, (2) they are not zeroed (dif) for incremental
	every back step.  But that's fine.  We'll just have conditions.
	So Input and Param will have to be ops to preserve the op/register
	alignment.  How to keep train options? (TODO: change setparam to
	setoption and call them options).  Have a separate option array
	for them?  We are distributing related info in different locations
	rather than one param object.  setoption will have to be provided
	an optional number? (need to remember the original numbers or keep
	them consecutive).  (but there is number xform when we compile
	high level units).  setoption cannot be specified per parameter
	after construction.  global setoption is still possible.  that's
	fine.  what about splitting dropout?  need rand layer. need to
	figure out trn/tst difference.  then just mul2.  Example prog:
	a=:[(1,Input),(2,Param(50)),(3,Mmul,1,2),(4,Param),(5,Bias,3,4),(6,Relu,5)]
	Composition? will change all the numbers.  so we could in
	principle not make param an op.  input is an op that produces
	something.  most likely executed by the net.  but param does not
	produce anything at least anything different every iteration.
	should it crowd the op list even if it is a noop?  Alternative is
	to push them to the end of the list like we do now?  Keep a
	separate set of param registers?  and difparam registers?  naah.
	Param is an op that outputs a constant.  With Input and Rand these
	ops produce an output without reading other registers.
	ok next: compiling from expressions is good, but what about
	composition?  Do we want LSTM to be a symbol as well?  No, LSTM is
	another expression.  That can be embedded: (3,LSTM,5) etc.  Cannot
	be, LSTM has to be a function that takes a size
	parameter. (3,LSTM(10),5).  Could be a constructor.  Compiler will
	have to evaluate and get an expression.  Then can be merged.
	foo(h)=:[(1,Param($h))] works.  If it is not an op, it should be a
	function that returns a net expression.  Nets cannot be composed,
	expressions can be composed.  But what if the LSTM expression has
	embedded expressions?  We would get that compilation for free if LSTM
	were a net.  If numbers don't have to be consecutive, merging is
	easy, just renumber, match inputs and outputs.
	merge mul2 and drop -> mul
	merge add2 and bias -> add
	mmul -> gemm
	conv, pool
	actf, loss
	define higher levels out of these
	sigm(n)= par(n) gemm par() add sigm
	sigm2(n) = par(n) gemm par(n) gemm add par() add sigm
	drop(p) = rand(p) mul
	can we define gru?
	(1-z) h1 + z h2
	either represent 1-z as a separate op axpy?
	or have an option in mul?
	axpy is par mul par add, except pars are scalar and constant
	(lr=0)
	we need mul and add to be very smart broadcasters.
	and par have some smart init options (ndims?)
	par(5,0) par(0) par(5,5,0,10) always use 0 for unk dimension.
	this way a scalar is par(1). and conv for difft dims is possible.
	add2 : par(0,0), bias: par(0), mul2: par(0,0), rnd(0,0) vs rnd(0)
	no implicit zeros!  dims always clear.
	how do we init par to a known value?  give an array? par([1]).
	par(a) where a is a variable?  compiler passes to eval?
	averaging, adagrad, dropout difft trn tst behavior.  test=false
	option?  test=average option?  all par/rnd options.
	Consider allowing both integers and variables.
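	The register notation sketched above could be interpreted roughly as
	follows (toy Julia; the tuple format and op names follow the notes,
	none of this is actual KUnet code):

```julia
# Each line is (output_register, op, inputs...); par dims use 0 for an
# unknown dimension, to be filled in at initforw time.
sigmnet(n) = [(1, :input),        # x comes from the net
              (2, :par, (0, n)),  # W, in-dim unknown until first x
              (3, :gemm, 2, 1),   # W * x
              (4, :par, (n,)),    # b
              (5, :add, 4, 3),    # W*x .+ b
              (6, :sigm, 5)]      # output is the last register

# Toy forward pass over dense arrays, with params pre-bound in `pars`.
function toyforw(net, x, pars)
    reg = Dict{Int,Any}()
    for line in net
        r, op, args = line[1], line[2], line[3:end]
        if op == :input
            reg[r] = x
        elseif op == :par
            reg[r] = pars[r]          # dims in args are ignored here
        elseif op == :gemm
            reg[r] = reg[args[1]] * reg[args[2]]
        elseif op == :add
            reg[r] = reg[args[1]] .+ reg[args[2]]
        elseif op == :sigm
            reg[r] = 1 ./ (1 .+ exp.(-reg[args[1]]))
        end
    end
    reg[net[end][1]]
end
```

	Merging a sub-expression is then just renumbering its registers and
	matching inputs/outputs, as noted above.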


	- sparse operations: lm needs:
	: efficient forw w:dense x:sparse => y:dense
	- compare my implementation against cusparse: cusparse a lot faster.
	- cusparse produces transposed output: need temp space to transpose.
	- use something like dif1?
	- for in-place transpose:
	http://dl.acm.org/citation.cfm?id=2555253
	https://github.com/BryanCatanzaro/inplace: still needs some space,
	probably not worth it.
	: efficient loss y:dense + dy:sparse => dx:dense: write your own
	: efficient back dy:dense x':sparse => dw:sparse?
	: efficient update w:dense + dw:sparse =>  w:dense: write your own, same as loss.

	- test cusparse again.
	- update to cudnn v3.
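	For reference, the forw/back/update shapes listed above with a CPU
	sparse x, sketched with the SparseArrays stdlib (modern Julia; this is
	not the CUSPARSE path discussed in the notes):

```julia
using SparseArrays

# forw: dense w times sparse x dispatches to a sparse-aware kernel that
# only touches the stored entries of x.
w  = randn(3, 5)
x  = sparse([1, 4], [1, 2], [1.0, 2.0], 5, 2)   # 5x2 input, 2 nonzeros
y  = w * x                                       # dense 3x2

# back: dw = dy * x' has nonzero columns only where x had nonzero rows.
dy = randn(3, 2)
dw = dy * x'

# update: w += -lr*dw gives a dense result either way.
wnew = w .- 0.1 .* dw
```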

	* src/op/loss.jl: DONE: accept sparse target.

	* src/param.jl: DONE: rename to Param. actually wait for the new assembler, probably end up with par.

2015-08-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* FIXED: /home/nlg-02/data07/eng/google-ngram/CompoundWords.dbg:
	julia my_train_and_random_test.3.jl train.data.part test.data.1
	gives illegal memory access error.  This was an issue with 32 vs
	64 bit integer indices with sparse matrices.

2015-08-15  Deniz Yuret  <dyuret@ku.edu.tr>

	* RNN:
	- an rnn is still a vector of micro-layers.
	- we need two new layers, adder and elementwise multiplier for lstm?
	- each layer (except the adder) has a single input
	- the input is specified by two indices:
	-- index into rnn (absolute?), index into time (relative)
	- the adder can have a list of indices. (should we allow all to
	have this?)
	- x, y become a vector of matrices rather than a single matrix
	- we may want to use different variables for backward
	compatibility: xlist ylist?
	- dx, dy?
	- resetting the vectors?
	- feeding the input?
	- backprop?

2015-08-14  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE: implement RNNs.

	* DONE: doc before release:
	- Perceptron.
	- Layers.
	- Update (including averaging etc.)
	- ScalLoss?

2015-08-02  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/bias.jl: DONE: Create a @gpu macro.

2015-07-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/kperceptron.jl:
	- klinear != perceptron, averaging difference?
	+ klinear,single,sparse,cpu gets stuck: changed slow algorithm.

	* src/util/sparse.jl: Frequent problem I have been having is due
	to the following: when a function parameter has a parametrized
	type like KUsparse{Array}, that is parametrized by another type,
	it is better to make the inner type a variable.  i.e. instead of
	foo(a::KUsparse{Array}) use foo{A<:Array}(a::KUsparse{A})
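	In modern Julia syntax the same pitfall looks like this (Box stands in
	for KUsparse; hypothetical type, not the actual KUnet one). The root
	cause is that type parameters are invariant:

```julia
struct Box{A}
    a::A
end

f(b::Box{Array}) = :concrete           # matches only Box{Array} exactly
g(b::Box{A}) where {A<:Array} = :tvar  # matches any Box wrapping an Array

b = Box([1.0, 2.0])   # a Box{Vector{Float64}}
# f(b) throws a MethodError: Box{Vector{Float64}} is not a subtype of
# Box{Array} because parameters are invariant; g(b) works.
```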

2015-07-20  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/util/linalg.cu (_A_mul_Bs_32): DONE: Test this.

	* src/mmul.jl: DONE:
	+ not using averaging for prediction.
	+ cudaarray is not giving the same answer as array
	+ sparse does not work

	* src/net.jl: conversions in train and predict: do we end up with
	the same array type we started with?

2015-07-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* test/tutorial.jl: profile, did it get slower?

	* DONE:
	+ test mmul with dense and sparse
	+ fix kperceptron, perceptron
	+ make layers generic, support regular arrays, if possible
	in-place sparse arrays.
	+ unit test cadd!

	* src/util/linalg.jl: DONE:
	+ add linalg tests to dense and sparse.

2015-07-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* test/testsparse.jl (density): DONE:
	+ implement full
	+ type cannot be constructed: use ::Type{T} notation.
	+ implement uniq! for KUsparse

	* test/testdense.jl: DONE: add colops and linalg? tests.  Write
	testsparse.

2015-07-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* BUGS:
	+ cpucopy/gpucopy: fixed
	- load/save: doubles the storage.
	- savenet is failing.
	- reshape is missing.
	- cpucopy leaves KUparam{CudaArray}
	- cpucopy does not work with net
	- savenet causes seg fault
	- test for cpu only machine
	- introduce update_list and skip_list
	+ turn all cu params into double

	* layers: DONE: Layers are functions.  They should be generic.
	Their output types should match their input types.  Internally
	they can do whatever they want.  If the input type does not match
	the parameters the parameters should be moved?  No, better if only
	flexible at init.  We should support regular arrays not just
	KUarray's for input/output.  KUarray's just make certain
	operations faster, so train/predict can use them during
	minibatching.  We should specify the array types supported by each
	layer.

2015-07-16  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/dense.jl: ok, we have three types of arrays: sparse, dense,
	param.  These have different members, so they can't just be
	parametrizations of a common type.  However each of these has
	array type, element type and dimensions as parameters.

	* src/bias.jl: should rename param for consistency.  data->arr.
	Define GPUparam and CPUparam?

	* src/mmul.jl: where do we need extensible arrays?  x, y,
	dataflow.  Train batches.  Not w params.  Except kernel perceptron
	support vectors and weights.  If we used regular arrays for
	params, load/save problem would go away.  Do we replace each l.x
	with l.x.arr?  Or do we define ops that work on KUarrays?  Then we
	need to consider mixtures.  Best to just work with your own arrays...

	OK, best of both worlds: we have two types of arrays, Param and
	Data arrays.  They have different requirements.  Just define the
	necessary functions for each.  Data arrays are resizeable whereas
	Param arrays have derivatives and associated update parameters.

2015-07-15  Deniz Yuret  <dyuret@ku.edu.tr>

	* ARRAYS:
	We want to support cpu/gpu, sparse/dense, float32/64, 2D/4D dynamic
	(efficiently extendible) array types in KUnet.  The relevant part of
	the current array hierarchy in Julia is as follows:

	AbstractArray
	  DenseArray
	    Array
	  AbstractSparseArray
	    SparseMatrixCSC
	  SubArray

	The CUDArt hierarchy is disconnected:

	AbstractCudaArray
	  CudaArray
	  CudaPitchedArray
	AbstractArray
	  HostArray

	CUDArt also defines two convenience types:

	typealias CdArray{T} Union(DenseArray{T},HostArray{T},AbstractCudaArray{T})
	typealias ContiguousArray{T} Union(Array{T},HostArray{T},CudaArray{T})

	Notes:

	So far I have added CudaDynArray and CudaSparseMatrixCSC to this
	list.

	CPU Arrays currently do not handle resize and hcat efficiently.
	They probably need a special type as well.  We might as well
	define all 4 types.

	Sparse arrays are needed only as inputs and possibly support vector
	matrices.  Things quickly turn dense after that.

	-- Layers Should Adapt to Their Inputs --
	Declaring a model with specific input output array types, vs
	declaring an abstract model which then takes the appropriate shape
	when first batch of data is seen.  Certainly the abstract
	declaration has an ease of use advantage, the constructions are
	more readable.  The more abstract the model code the better.
	However this means a lot of low level initialization code goes
	into layer definitions (initforw etc.).  What if input type
	changes during the lifetime of the model?  We adapt to size
	changes, why not type changes? eltype changes?  Device, sparsity,
	array implementation (there could be more than one sparse array
	type), eltype, ndims: layers should adapt to their input and be
	able to change (maybe with a warning) midstream.  The output types
	(both y and dx) should match the input types (except y may not
	match x in ndims.)

	Operations to be supported:
	train: shufflexy!, size, ndims, x2b
	predict: b2y
	layers: similar! (for changing type), resize (for changing size)
	kperceptron: hcat! (bparse also needs this)
	mmul: A_mul_B! A_mul_Bt!

	Each layer should document which array types supported.

	* DONE: organize code around operations (array.jl, linalg.jl) or
	data types (sparse, cusparse)?

	* DONE: release perceptron.
	+ do a pre-release
	- finish documentation.
	- implement structured perceptron? just a difft training rule?
	- test mmul+percloss, need averaging, implement ASGD.

2015-07-09  Deniz Yuret  <dyuret@ku.edu.tr>

	* qsub: automatic spawn of worker machines in qsub.  From JonMay:
	Once a job starts you will have the environment variable
	PBS_NODEFILE and this will contain a list of machines, one per
	line.

	e.g. (sorry, it happened to be a one-node job and i didn’t want to
	wait for my 2-node demo to start)

	[jonmay@hpc1457 quake]$ echo $PBS_NODEFILE
	/var/spool/torque/aux//12554804.hpc-pbs.hpcc.usc.edu
	[jonmay@hpc1457 quake]$ cat
	/var/spool/torque/aux//12554804.hpc-pbs.hpcc.usc.edu
	hpc1457
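	One way this could drive worker spawning from Julia, using the
	Distributed stdlib (a sketch; untested on an actual PBS cluster):

```julia
using Distributed

# Each line of PBS_NODEFILE is a hostname; a repeated hostname means
# multiple slots on that node, which maps to one worker per line.
if haskey(ENV, "PBS_NODEFILE")
    hosts = readlines(ENV["PBS_NODEFILE"])
    addprocs(hosts)   # ssh-based workers, one per listed slot
end
```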

2015-07-05  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/util.jl: DONE: use size! instead of realloc in similar!
	look at size! vs resize! again
	make sure regular arrays will work
	similar! does not match array type?  reconsider its semantics.

	* net.jl: profiling kunet.  nothing much to do.

	# train: most time spent copying data to gpu
	gpu() && gc()  # L46: 76
	for b = 1:batch:ninst # L47: 1
	    e = min(ninst, b + batch - 1)
	    xx = x2b(xx, x, b:e) # L49: 25641
	    yy = x2b(yy, y, b:e) # L50: 39
	    backprop(net, xx, yy; o...) # L51: 516
	    update(net; o...) # L52: 160
	    (iters > 0) && (e/batch >= iters) && break
	    gpu() && (gpumem() < (1<<30)) && gc() # L54: 32
	end
	strip!(net)
	gpu() && gc()

	# predict: most time spent on copying back from gpu.
	ninst = size(x, ndims(x))
	(batch == 0 || batch > ninst) && (batch = ninst)
	xx = yy = nothing
	for b = 1:batch:ninst
	    e  = min(ninst, b + batch - 1)
	    xx = x2b(xx, x, b:e)  # L68: 382
	    yy = forw(net, xx; predict=true, o...) # L69: 150
	    y  = b2y(y, yy, b:e, x) # L70: 4150
	end
	return y


2015-07-03  Deniz Yuret  <dyuret@ku.edu.tr>

	* test/simpleIPC/fooipc.cu (main): seems like the only choices are
	using lightweight threads or memcopy.  a memory handle seems
	sharable by only one other process?

	* src/cumatrix.jl: DONE: is it better to split code by function
	rather than datatype?  more similarity.

	* src/sparse.jl: DONE: better interface for hcat!, last two args
	should be optional.

	* test/runtests.jl: DONE: test different cpu/gpu sparse/dense
	float32/float64 combos.

	* src/util.jl: DONE: use CudaDynArray everywhere and minimize the realloc.

	* conv.jl: DONE: conv fails tutorial.jl code.  fix
	initialization: should figure out input size/type from first x
	like bias/mmul.

	* cusparse.jl: TODO: implement uniq! for sparse perceptron.

	* perceptron.jl: TODO: implement addx! for gpu perceptron.

	* usage.md: TODO: write loss documentation.

	* usage.md: TODO: write perceptron documentation.

	* param.jl: DONE: add initxavier for conv.jl

	* param.jl: TODO: fix adagrad initialization.

	* src/kperceptron.jl: TODO: make uniq! part of training somehow.

	* net.jl: profile train to see if gpumem hurts: it doesn't.  most
	of the time is spent during x2b.


2015-06-24  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/perceptron.jl: keep w the same orientation as x, makes it
	easier to add.  Use w'*x like kperceptron.  Decided otherwise;
	wrong on both counts: w is dense, so not easier to add in either
	direction, and kperceptron holds it in w*k(x) position.

2015-06-23    <dyuret@ku.edu.tr>

	* src/kperceptron.jl: TODO: gpu/dense does not work with
	kpoly/kgauss yet.  gpu/sparse needs to be written.  Define new
	type CudaSparseMatrixCSC and test:
	- train (x2b,b2y)
	- initforw
	- forw
	- initback
	- back
	- update
	- hcat!
	- kpoly (gpu sparse/dense)
	- kgauss (gpu sparse/dense)

2015-06-22    <dyuret@ku.edu.tr>

	* DONE: xavier: https://github.com/BVLC/caffe/blob/master/include/caffe/filler.hpp#L129
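	The linked Caffe filler uses fan_in only: uniform in [-s, s] with
	s = sqrt(3 / fan_in).  A one-line Julia version of that variant (a
	sketch, not the param.jl implementation):

```julia
# Caffe-style xavier fill: uniform in [-s, s], s = sqrt(3 / fan_in).
function xavier(fanout, fanin)
    s = sqrt(3 / fanin)
    2s .* rand(fanout, fanin) .- s
end

w = xavier(50, 784)
```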

2015-06-21    <dyuret@ku.edu.tr>

	* test/testkperceptron.jl:

	Q: linear kperceptron and perceptron do not give the same result?
	The difference is due to kperceptron not having a bias.  Removing
	bias from perceptron makes the results equal to numerical
	accuracy.

	Q: is K sparse or dense?  sparse.  +0 makes it dense.

	Q: should we add bias back to kperceptron but make it optional?
	it helps klinear.  does it help kpoly?  why does it hurt kgauss?

	Q: for loop is much slower than sparse mmul?

2015-06-20    <dyuret@ku.edu.tr>

	* kperceptron.jl: To move any further we need to sort out this
	array type business.  Since we introduced sparse arrays we are no
	longer limited to two types, Array and CudaArray.  That means the
	atype mechanism is no longer ok.

	First of all the original data comes in a cpu array.  It gets
	copied into minibatches by train and predict.  It can be full or
	sparse.  Nothing after the original data needs to be sparse except
	support vectors.  The minibatches could be on cpu or gpu (if gpu
	usage is specified).  We will keep the sparseness of the input in
	minibatches.  The user's preferences should specify the cpu/gpu and
	the eltype that should be used internally in layer computations.
	Do we allow ftype to be different?  If not we could eliminate
	that and take it from the input as well?  So the input data
	determines the ftype and sparseness of layer calculations.  GPU
	used if present.  User has the option to turn gpu off.

	Conversions take place in param.jl, net.jl (train/predict).
	conv/pool only work with gpu arrays right now.
	perceptron/kperceptron only works with cpu arrays right now.

	So get rid of atype/ftype.  Get it from the input or during
	initialization.  So how do we initialize an mmul layer?  By
	specifying number of outputs.  Just like the perceptron.  The
	weight matrix gets constructed when the first input is received in
	initforw.  Mmul(array), Mmul(out), Mmul(out,in),
	Mmul(ftype,out,in) could be the initializers (modeled after
	zeros).  cpu/gpu is decided based on GPU setting (make it
	non-constant as before).  ftype defaults to Float64 as the rest of
	Julia (e.g. zeros(dims...), rand(dims...) etc).  So (out,in) can
	create a matrix.  Only (out) cannot, it will have to wait.  Param
	can play a more passive role?  Or we pass info to param?
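	The deferred-init idea above can be sketched like this (hypothetical
	Mmul, not the KUnet code): Mmul(out) waits for the first input to
	learn the in-dim and eltype in initforw.

```julia
mutable struct Mmul
    out::Int
    w::Any                   # nothing until the first forw
end
Mmul(out::Int) = Mmul(out, nothing)

function forw(l::Mmul, x)
    if l.w === nothing       # initforw: size and eltype come from the data
        l.w = randn(eltype(x), l.out, size(x, 1))
    end
    l.w * x
end
```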


2015-06-09    <dyuret@ku.edu.tr>

	* src/kernel.jl: Poly kernels working with sparse/full arrays.
	* DONE: Try on CudaArray: mul, xpose, hcat vs does not work.
	* DONE: Try on SparseCudaArray (cpu sparse is 5x slower than cpu
	full on mnist) -- postponed.
	* DONE: rbf kernel

2015-06-03    <dyuret@ku.edu.tr>

	* src/percloss.jl: Added perceptron loss.  A multiclass perceptron
	is just Mmul followed by PercLoss.
	* DONE: write unit test for PercLoss.
	* DONE: implement and test kernel perceptron next.
	* DONE: test subarray and concat with full/sparse on cpu/gpu.
	* DONE:	test KUnet with full/sparse, cpu/gpu, Float32/64 -- postponed
	* TODO: perform kuparser experiment comparing dense vs sparse
	features.
	* TODO: test onur's 4D code.
	* TODO: write doc on xent loss
	* TODO: write doc on perceptrons

2015-05-17    <dyuret@ku.edu.tr>

	* src/sigm.jl: done: make sure back returns dx.

	* src/logp.jl: done: gpu impl for back. why doesn't runtests catch
	this?  because dx=dy.

	* src/KUnet.jl: done: import copy etc.

	* src/net.jl:
	? add y=x and dx=dy optional args to forw and back.
	? the problem is do we copy if we don't modify?
	+ rename the options: fx=>dropout, dx=>returndx
	+ add o... options to all forw,back,copy,update,setparam!,loss

	* test/runtests.jl: done: cpu-only test.

	* test/runtests.jl: TODO: julia4 test.

	* src/param.jl: done: find a solution to copy.

2015-05-16    <dyuret@ku.edu.tr>

	* docs: TODO: update docs.

	* src/jldio.jl: done: update for new version.

	* src/net.jl: done: add shuffling back to train.

	* src/param.jl: done: compile cuda parts.

	* test/lenet.jl: done: need xavier init? the training does
	not take off until epoch 7 - turns out larger lr needed.

	* issimilar: done: add issimilar checks to all forw/back.

2015-05-15    <dyuret@ku.edu.tr>

	* test/runtests.jl: passed:
	+ bias.jl
	? conv.jl: only gpu, only 4D, no gradcheck
	+ drop.jl
	+ logp.jl
	+ logploss.jl
	+ mmul.jl
	? pool.jl: only gpu, only 4D, no gradcheck
	+ quadloss.jl
	+ relu.jl
	+ sigm.jl
	+ soft.jl
	+ softloss.jl
	+ tanh.jl
	+ xentloss.jl

	* src/xentloss.jl: Implementing loss functions as layers.  forw
	only records the outgoing y.  back takes the desired answer and
	overwrites it with the gradient of y wrt loss.  Note that this has
	two problems: (1) the actual loss value is never returned, i.e. we
	just compute gradients for training, use a separate loss function
	for eval. (2) the semantics of back is different: a regular
	layer's back takes dy, the loss gradient wrt output y, and returns
	dx, the loss gradient wrt input x.  A loss layer takes p, the gold
	answers, and returns dx, the loss gradient wrt input x.
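	The convention above could look like this (hypothetical SoftLoss, not
	the actual xentloss.jl code): forw is the identity but records y; back
	takes the gold answers p and returns dx, the loss gradient wrt the
	input.

```julia
mutable struct SoftLoss
    y::Any
end
SoftLoss() = SoftLoss(nothing)

forw(l::SoftLoss, y) = (l.y = y; y)   # identity, but remembers y

# Using the combined softmax + cross-entropy gradient (y - p), averaged
# over the minibatch.  Note problem (1) above: no loss value is returned.
back(l::SoftLoss, p) = (l.y .- p) ./ size(p, 2)
```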

2015-05-14    <dyuret@ku.edu.tr>

	* TODO:
	+ prob layer and loss fns: logploss, probloss, mseloss?, softmax?
	+ gradient check
	+ float64 support for Drop and Logp
	+ modify net.jl with the new loss convention.
	+ loss layers in gpu
	- conv pool gradcheck.
	- caffe comparison
	- conv pool cpu
	- conv pool 5D
	+ clean up float32:

	* test/runtests.jl: use Atype and Ftype to get gpu/cpu and
	float32/float64 behavior.

	* done: add a prob layer that computes normalized probabilities.
	Then rename three different softmax layers whether their input is
	unnormalized logp, logp, or prob.

2015-05-12    <dyuret@ku.edu.tr>

	* design: I have a new design:
	+ split every operation, including bias and activation.
	+ basically every operation in forw becomes its own "layer".
	+ each "layer" implements forw, back, update, setparam, copy.
	+ each "layer" overrides forw/back for arrays, cudaarrays, tensors.
	+ rename AbstractLayer -> Layer

2015-05-11    <dyuret@ku.edu.tr>

	* src/conv.jl:
	# TODO: How to transition from convolutional to fully connected layer?
	# Does a network pass around tensors with the same number of dimensions?
	# Can we write the code generic enough so it can deal with 2d matrices, 4d, 5d tensors?
	# In any case fc layer is different from conv layer...
	# n instances with c features is represented in caffe using a (1,1,C,N) tensor.
	# i.e. a 0-D image with C channels.
	# 2-D images are represented as (W,H,C,N).
	# 3-D images are represented as (D,W,H,C,N).
	# Is there any use for just (H,C,N)?
	# What is convolution for <2D or fc for >=2D?
	# Locality is important for convolution, i.e. dimensions other than
	# C,N give "neighboring" pixels so we can do convolution.
	# In a regular feature vector, there are no "neighbors"; all features
	# are equally far from each other, which is why we use C,N.  There
	# would be no convolution operation in that case either.
	# N = instances in general (each instance leads to one class
	# prediction)
	# C = features in general
	# All other dimensions represent local neighborhoods.
	# So more generic data structure is reverse(N,C,I1,I2,I3,...)
	# And our regular layers are in fact 0-D, and conv can only be defined on >= 1-D.
	# We'd need to implement all except 2-D right now.
	# What does FC mean for >= 1-D?
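	# The convention above, in Julia's column-major order: N instances
	# with C plain features form a (1,1,C,N) tensor, and the conv-to-fc
	# transition is just dropping the singleton image dims:

```julia
x4d = randn(1, 1, 20, 32)    # 0-D "image": 20 channels, batch of 32
x2d = reshape(x4d, 20, 32)   # same data viewed as a fc-layer matrix
```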

	# TODO: do we really need the ConvolutionDescriptor

	# TODO: cudnn supports both Float32 and Float64, the rest of KUnet
	should too.

2015-03-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO:
	- Accept tuples for newnet and setparam to specify different
	values for different layers.  At least modify train.jl to be more
	similar to KUparser/test/train.jl.
	- sizeof, print for nets?
	- put an option for f of final layer (ffinal).
	+ add options to copy for testing (no need for training params)
	and saving (no need for transient fields).
	+ ERROR: CUDA runtime API library cannot be found - on yunus.
	- train.jl: allow length(v)==2*length(net) for param spec each w,b


2015-03-12  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO:
	- In KUnet, can we avoid reallocating everything unless we need more space?
	#    If batch gets smaller, just run the big batch through and copy part of the result?
	#    This needs some more thinking especially for training.


2015-03-01  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO:
	- make number type generic, test Float64
	- implement rmsprop: https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides/lec6.pdf
	- implement adam: http://arxiv.org/pdf/1412.6980v2.pdf
	- understand adam math. what to do with g1?  what to do with g2?
	these are not stationary and our estimates are noisy.  what to do
	if we had perfect information?  does this correspond to newton
	with diagonal covariance matrix?  volkan's slides to adam email.
	- implement copynet cpu/gpu.
	- write parameter documentation.
	- implement hinge loss
	- implement squared loss
	- implement gradient clipping: pascanu and mikolov 2012.
	- implement rnn
	-- implement lstm
	-- implement gru
	-- steeper gates nips: lyu and zhu (piecewise linear)
	- orthogonal initialization: andrew sax
	- can we do piecewise linear approx to softmax? (hinge?)
	- try on machine with CUDArt but no gpu.
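	For reference, the update from the linked adam paper (Kingma & Ba),
	with g1/g2 being the first/second moment estimates mentioned above
	(plain sketch; defaults follow the paper, not any KUnet code):

```julia
function adam!(w, g, m, v, t; lr=0.001, b1=0.9, b2=0.999, eps=1e-8)
    @. m = b1 * m + (1 - b1) * g      # g1: biased first-moment estimate
    @. v = b2 * v + (1 - b2) * g^2    # g2: biased second-moment estimate
    mhat = m ./ (1 - b1^t)            # bias correction for the zero init
    vhat = v ./ (1 - b2^t)
    @. w -= lr * mhat / (sqrt(vhat) + eps)
    return w
end
```

	The noisy-estimate worry above is what the bias-correction terms
	address: m and v start at zero, so early estimates are scaled up.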

2015-02-24  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO:
	+ start writing documentation.
	+ try install/test on a new gpu/cpu machine.
	+ build tests based on mnist.
	x compare with matlab/caffe if they exist.
	x what other tests?  gradient?  store answers?

2015-02-23  Deniz Yuret  <dyuret@ku.edu.tr>

	* src/KUnet.jl:
	+ reconsider the constructors: they should only allow meaningful
	fields to be set, and they should call setparam for updateparams.
	- implement convnet: ConvLayer <: Layer
	+ centralize memory allocation
	- hdf5 save for whole net: use jld?

2015-02-22  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	x Make InplaceOps work without patch using CUBLAS generics.

2015-02-20  Deniz Yuret  <dyuret@ku.edu.tr>

	* TODO:

	- implement/test maxout?
	- Write blogpost on overfitting (needs mnist)
	+ Write blogpost/README: deep learning in 250 lines of julia (needs mnist)
	+ Cleanup hdf5 files.
	+ Figure out automatic testing.
	+ Make softloss, get rid of soft layer.
	+ Add other losses
	+ make loss a training option.
	+ Add sigmoid layer.
	+ Make b and yforw conditional?
	+ Figure out if we have a gpu and if we are using a gpu, test code on no-gpu machine
	+ Export functions
	+ Make layer constructors that take size and generate random matrices
	+ Make layer constructors that take arbitrary matrices, h5 files
	x Error checking in cuda.jl
	+ pull request for InplaceOps
	+ pull request for CUBLAS
	+ pull request for ArgParse
	x Cleanup kernel calls in kunet.cu
	x Have kernel softmax return loss?
	x Cleanup hdf5 format in kunet_h5.cu, get rid of xfunc, yfunc,
	+ make dropout a layer option.
	+ Make train call backprop
	x implement/test maxnorm?
	+ use mnist for regression testing.

2015-02-19  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	- Verify generic functions in cuda.jl
	- Try to make update.jl more readable
	+ HDF5 should store the name of the functions
	+ Find a good way to handle dropout during training and testing.
	x maybe xforms should be part of the trainer not the layer.
	x caffe has it as another layer
	x i have tried as a separate fn or as part of forw/back before.

	+ implement/test dropout
	+ gpuseed not working, but random numbers start at the same place?
	  need to recreate RNG.
	+ cuda and julia not equal?
	x change dropout in cuda as well to use xmask for storage
	- change / check adagrad, nesterov options in cuda
	- implement/test maxnorm (cuda/matlab, no caffe test)

2015-02-18  Deniz Yuret  <dyuret@ku.edu.tr>

	* DEBUG:
	+ test_fw_cuda: 2.26s
	+ test_fw_caffe: 3.82s
	+ test_fw_matlab: 3.83s
	+ test_fw_julia_cpu: 21.64s
	+ test_fw_julia_gpu: 5.39s ??? (check ger vs badd; do test with direct ccalls)
	+ who is allocating 35MB?
	+ elapsed time: 5.395230859 seconds (35 MB allocated, 0.06% gc time in 1 pauses with 0 full sweep)

2015-02-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	Possible design changes:
	+ Take training options out of Layer and pass them as options to layer update.
	+ That could support independent layer options but not w vs b.
	+ Group parameter and its diff in a blob like caffe: l->w->data, l->w->grad?
	x Make w and b two elements of an array: l->w[0,1]->ndims,dims,data,diff,diff1,diff2?
	x x and y have data,diff but no diff1 diff2.
	x But x has xmask, xones; we could use tmp1 and tmp2 as common names.
	+ Each w and b could have its own update options?
	+ Update can take each w, b individually, i.e. blob / options.
	x So can forward and back, take matrices instead of layers, but that's pushing it.
	+ To simplify memory management rely on xcols being correct in forw/drop and centralize alloc changes.

	+ figure out cuda rand to implement reproducible dropout.
	+ test dropout: figure out matlab seed, caffe layer.

2015-02-17  Deniz Yuret  <dyuret@ku.edu.tr>

	* DONE:
	+ juliagpu implementation:
	+ inplace or devectorize macros adapted for gpu.
	x need to solve collections of options to c.
	x need to solve collections of arrays to c.
	+ there should be a generic julia implementation.
	+ the gpu support should be activated from the main script.

	+ speed test?
	+ momentum kernel, shrink code.
	x cpu/gpu blobs like caffe?  main goal: generic readable code.

	+ implement cuda/predict.
	+ implement cuda/backprop.
	+ implement cuda/train.
	+ implement data.h5 comparison.
	+ implement matlab/predict.
	+ compare cuda/predict to matlab.
	+ implement layer.h5 comparison.
	+ implement matlab/backprop.
	+ compare cuda/backprop to matlab. 
	+ implement matlab/train.
	+ compare cuda/train to matlab. 
	+ implement caffe/predict.
	+ implement caffe/backprop.
	+ implement caffe/train.
	+ compare cuda/predict to caffe.
	+ compare cuda/backprop to caffe. 
	+ compare cuda/train to caffe. 

	train options?
	+ already in file?
	+ take as cmd line opts?
	x try all variations?
	+ we'll need cmd-line opts at least for batch, epoch, etc.
	x (or assume epoch=1 and batch=100?)
	x yeah, simple train interface with train x l1 l2 .. y as well.
	x these are just test scripts after all.
	x maybe just do batch in the future.
	+ julia version:
	x layers fully opaque?
	+ train options?
	+ separate options from weights?
