Problem G
Logistic Regression
Implement simple logistic regression in $d$ dimensions:
\[ f(\mathbf{x}, \mathbf{w}) = s(w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_ d x_ d) = s(\mathbf{w}^ T \mathbf{x}), \]where we assume that $x_0 = 1$ and
\[ s(\alpha ) = (1 + \exp (-\alpha ))^{-1} \]is the logistic sigmoid.
In other words, choose the values $w_0$ through $w_ d$ that maximize the likelihood for a set of $(\mathbf{x}, y)$ example pairs, where $y \in \{ -1,1\} $.
Input
Input for your submission is a training set of $n$ examples in $d$ dimensions.
The first line contains two integers $1 \le n \le 10\, 000$ and $1 \le d \le 10$. Following are $n$ lines, each containing $d$ real numbers and an integer. For line $i$ ($1 \le i \le n)$, the first $d$ numbers are $x_{i,1}$ through $x_{i,d}$, and the last is for the corresponding value $y_ i \in \{ -1,1\} $. Each real value has at most $5$ digits past the decimal point and is in the range $[-2 \cdot 10^9, 2 \cdot 10^9]$.
It is guaranteed that no feature is constant and each pair of features is not collinear. Specifically, let $\mu _ j = \frac{1}{n} \sum _{i=1}^ n x_{i,j}$ be the mean of feature $j$. Then it is guaranteed that
\[ 0.001 \le \sum _{i=1}^ n |x_{i,j} - \mu _ j| \](no feature is constant), and also that for any pair of features $1 \le j < k \le d$:
\[ \frac{|\sum _{i=1}^ n (x_{i,j} - \mu _ j)(x_{i,k} - \mu _ k)| }{\sqrt{\sum _{i=1}^ n (x_{i,j} - \mu _ j)^2} \sqrt{\sum _{i=1}^ n (x_{i,k} - \mu _ k)^2} } \le 0.99 \](no pair of features is collinear).
Output
Implement an algorithm to create a logistic regression model that fits the training data. Output $d+1$ real values corresponding to the coefficients $w_0$ to $w_ d$ in the definition of $f(\mathbf{x}, \mathbf{w})$ above.
Note that the sample output shown is actually the test data on which a model would be evaluated (see below). It is in the same format as the input. A reasonable output for the given sample input might be:
-2.61588441789 1.07522029192
which would give a score of about $90.3045$ on the test data, according to the metric below.
Scoring
The output of your submission will be used to construct a logistic regression model to evaluate on another set of data, called the testing set. The test set comes from a distribution similar to the training set, but is not known to your submission.
Assume that the test data has $m$ examples $\{ (\mathbf{x}_1,y_1), \ldots , (\mathbf{x}_ m,y_ m)\} $. The score your model receives for each these $m$ examples is:
\[ \frac{100}{m} \sum _{i=1}^ m \left| (1 - y_ i)/2 - f(\mathbf{x_ i}, \mathbf{w}) \right| \]This measures the average accuracy over the $m$ examples; higher values are better.
Since predicting a constant (e.g. all weights being $0$ would give a constant prediction of $f(\mathbf{x}, \mathbf{0}) = 0.5$) will get you a score of $50$, if your score is not above $50$ for any test case, your submission will be judged as incorrect.
Sample Input 1 | Sample Output 1 |
---|---|
5 1 1 -1 2 1 3 -1 4 1 5 1 |
5 1 0 -1 2 -1 6 1 7 1 8 1 |