Thursday, August 16, 2012

A Simple Method to Reduce Systematic Bias in Grades


Grades in universities may suffer inflation over time. Moreover, the grade average of some classes is much lower or higher than would be expected from the students attending the class. Here is a simple (and imperfect) method to mitigate both problems:

First, one needs to decide on the average grade v. The GPA of any student who hasn't completed any classes yet is set to v. To find the grades of students in a class, the average c of the GPAs of the students is computed, and then their final grades in the class are curved so that the average grade for that class becomes c.
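
Below is a minimal Python sketch of the curving step. The additive shift, the 0-4 grade scale, and the function names are illustrative assumptions, not part of the method itself.

    # Sketch of the curving step (assumed: additive curve on a 0-4 scale).
    def curve_class_grades(raw_grades, student_gpas, v):
        """Shift raw grades so their mean equals the mean GPA c of the class."""
        # Students who haven't completed any classes count at the chosen average v.
        gpas = [g if g is not None else v for g in student_gpas]
        c = sum(gpas) / len(gpas)                        # target class average
        shift = c - sum(raw_grades) / len(raw_grades)
        return [grade + shift for grade in raw_grades]

    # A strong cohort keeps a higher curved average than a weak one would,
    # regardless of how harshly the class was graded.
    print(curve_class_grades([2.0, 2.5, 3.0], [3.2, None, 3.6], v=3.0))

An additive shift is only one possible curve; any monotone rescaling whose result averages to c would fit the description above.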

A somewhat orthogonal approach would be to take into account previous raw grades for the same class and teacher.

Additionally, one could try to normalize variance as well.

On Rich-Substrate Books


For digital books, we are no longer bound to text alone. Hypertext is a good start toward making content more digestible. Another step is to define various “views” of the same content. For example, a textbook could have, among others, a "learning" view and a "reviewing" view. In the learning view, definitions are supplemented with comments and more exercises are visible, while in the reviewing view, only what is necessary to refresh the memory is shown.
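
As a rough illustration, here is a minimal Python sketch of such views, assuming the book is stored as a list of blocks tagged with the views in which they appear; the tags and block contents are made up for the example.

    # Content blocks tagged with the views they belong to (illustrative data).
    BOOK = [
        {"views": {"learning", "reviewing"}, "text": "Definition: a group is ..."},
        {"views": {"learning"},              "text": "Comment: intuitively, ..."},
        {"views": {"learning"},              "text": "Exercise: show that ..."},
        {"views": {"reviewing"},             "text": "Summary: the group axioms ..."},
    ]

    def render(book, view):
        """Return only the blocks visible in the requested view."""
        return [block["text"] for block in book if view in block["views"]]

    print(render(BOOK, "reviewing"))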

Monday, August 6, 2012

Evos

Let us define a simple evolutionary scheme, henceforth evos, as made up of the following:

  • a genomic language G.
  • a phenomic space S.
  • a procedure E from G to S, called the egen (for embryogenesis).
  • a machine M from G to G, called the mutator.

All genomes are strings in G. E maps genomes to phenomes. M is the reproduction operator, which usually returns a mutated genome.
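
Here is a minimal Python sketch of an evos, assuming genomes are strings over a fixed alphabet and phenomes are lists of numbers; the particular egen and mutator are illustrative assumptions, chosen only to fix the shape of the interface.

    import random
    from dataclasses import dataclass
    from typing import Callable, List

    Genome = str                 # strings in the genomic language G
    Phenome = List[float]        # points in the phenomic space S

    @dataclass
    class Evos:
        egen: Callable[[Genome], Phenome]     # E : G -> S
        mutator: Callable[[Genome], Genome]   # M : G -> G

    # Example egen: map each character of the genome to a number.
    def simple_egen(genome: Genome) -> Phenome:
        return [float(ord(c)) for c in genome]

    # Example mutator: replace each character with a random letter at a fixed rate.
    def fixed_rate_mutator(genome: Genome, rate: float = 0.05) -> Genome:
        letters = "abcdefghijklmnopqrstuvwxyz"
        return "".join(random.choice(letters) if random.random() < rate else c
                       for c in genome)

    evos = Evos(egen=simple_egen, mutator=fixed_rate_mutator)
    child = evos.mutator("hello")
    print(child, evos.egen(child))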

Given this definition, there are various meaningful ways in which we can compare evos and organize them in hierarchies.

The most obvious way to do this is to look at the subspace of S that is attainable by an evos. If the function defined by E is not surjective, this directly limits the attainable subspace of S. Another possibility is that the iteration of M from a starting genome may never attain certain other genomes. However, I see no good reason to use such schemes; it always seems better to restrict E instead. So we shall focus on schemes where it is possible to attain any genome from any starting genome through an iteration of M (there is a name for such processes, but I don't recall it). This is also why I chose not to include a starting genome (set) in the definition.

For example, if G only contains fixed-length strings, but S is an infinite-dimensional space (as it would be in many cases), it is impossible for E to be surjective. Another possibility would be to fix certain parts of the phenome. In this case, E can only affect some parts of the phenome, and so can't be surjective. A similar possibility is to deliberately restrict E to map only to a certain subspace of interest.

With this way of comparison, an evos is higher in the hierarchy than another if its attainable subspace is larger.

Another, perhaps more interesting way is to look at the expressive power of E. A trivial egen is an identity procedure, so that in such an evos, the genome and the phenome of an individual are conflated. At the other end of this hierarchy are Turing-complete egens.

Similarly, we may also look at the expressive power of the mutator, and get a third way of comparing different evos. At the bottom of this hierarchy (keeping in mind the restriction we put on M), we find mutators that mutate genomes at a fixed rate. Another example is a mutator that mutates at a rate that varies as it moves along the genome, depending on annotations (these annotations would be part of the genome, but ignored by the egen). At the top of the hierarchy are Turing-complete mutators.
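
A minimal sketch of that annotation-driven mutator, assuming the genome alternates payload characters with digit annotations giving a per-site mutation rate; the digits are ignored by the egen. The encoding is an illustrative assumption.

    import random

    LETTERS = "abcdefghijklmnopqrstuvwxyz"

    def annotated_mutator(genome: str) -> str:
        """Mutate each payload character with probability digit/10, keeping annotations."""
        out = []
        for payload, digit in zip(genome[0::2], genome[1::2]):
            rate = int(digit) / 10.0
            out.append(random.choice(LETTERS) if random.random() < rate else payload)
            out.append(digit)
        return "".join(out)

    def annotated_egen(genome: str) -> str:
        """The egen reads only the payload characters and ignores the annotations."""
        return genome[0::2]

    g = "h0e0l9l9o0"     # the two 'l' sites mutate 90% of the time, the rest never
    print(annotated_mutator(g), annotated_egen(g))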



Saturday, August 4, 2012

The Ideal Learner


Given a machine learning problem, i.e. given some training data O of type T, the ideal model M is found as follows:

  • First, a prior P over all (computable) distributions over T is found from past experience and prior knowledge of the problem (which must be independent of O). The prior may be represented as a function that, given a distribution over T, returns its weight. We shall write P(d) to denote the application of this function to a distribution d.
  • Then, the posterior S is derived from O and P by reweighting the distributions over T according to their likelihood given the data. So S(d), where the function S is defined similarly to P, is equal to P(d)*L(d|O) / n, where n is the normalization factor (equal to the sum over all d of P(d)*L(d|O)).

Given a partial test case o (possibly trivial), M posteriorizes each distribution in S according to o. Then a weighted sum of all these posteriors is taken, according to the weights in S. This gives the ideal distribution for o given the knowledge available.

For example, given a classification problem with a training set <x_i, y_i>, where the x_i belong to R^n and the y_i to R, one would first find P over all distributions over R^(n+1). Then one would find S, of the same type as P. Given a test case x, one would restrict all the distributions in S to the points whose first n coordinates are equal to x, then take the weighted sum of these to find the ideal distribution. One could further take the mode of the ideal distribution to get a concrete y.
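
As a toy illustration, here is a Python sketch of the procedure for a prior over a small finite family of distributions (biased coins) instead of all computable distributions; the candidate family and the data are made-up assumptions.

    # Candidate distributions over T = {0, 1}: coins with different biases.
    candidates = [0.1, 0.5, 0.9]                          # P(1) for each candidate d
    P = {d: 1.0 / len(candidates) for d in candidates}    # uniform prior P(d)
    O = [1, 1, 0, 1]                                      # training data

    def posterior(prior, data):
        """S(d) = P(d) * L(d|O) / n, with n the normalization factor."""
        unnormalized = {}
        for d, p in prior.items():
            lik = 1.0
            for o in data:
                lik *= d if o == 1 else (1.0 - d)
            unnormalized[d] = p * lik
        n = sum(unnormalized.values())
        return {d: w / n for d, w in unnormalized.items()}

    S = posterior(P, O)

    # Ideal predictive distribution for the next observation: a weighted sum
    # of the candidates' predictions, weighted by S.
    p_next_is_1 = sum(w * d for d, w in S.items())
    print(S, p_next_is_1)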

Of course, finding M is usually impossible, and even approximating it is usually impractical. In practice, the prior is chosen mostly for convenience reasons. For example, choosing to use certain SVMs along with the way they are computed from training data defines a prior over all distributions over T (assigning probability zero to most of them).

On Machine Emulation Slowdown


The worst-case optimal slowdown for a virtual machine A being emulated on a machine B can be found by optimally emulating B in A, running that emulation on B, and comparing its speed to the speed of B running natively. (If the language of A is not as expressive as the language of B, then emulating B in A is impossible, and the worst-case slowdown doesn't make sense, or can be said to be infinite.)

So any emulator of B written in the language of A gives an upper bound on the worst-case slowdown that necessarily occurs when using the virtual machine A instead of B.

Similarly, when a high-level language A compiles to a low-level language B, an interpreter for B written in A gives an upper bound on the worst-case slowdown induced by programming in A instead of B. For instance, if that interpreter, once compiled, runs B programs at one tenth of their native speed, then the worst-case slowdown is at most 10. (It should be noted, however, that this factor will tend to be much worse than the average slowdown.)

Note: When designing a high-level language A to be compiled to B, one could always add the primitives available in B to A to make the worst-case slowdown equal to 1 (C may be an example of a language that does that, as it is possible to include assembly in a C program). However, the inclusion of B in A should be done very carefully, as it could potentially break the nice properties of A.

Hello World!

:-)