# Problem B

Logistic Regression

Implement simple logistic regression in $d$ dimensions:

\[ f(\mathbf{x}, \mathbf{w}) = s(w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_ d x_ d) = s(\mathbf{w}^ T \mathbf{x}), \]where we assume that $x_0 = 1$ and

\[ s(\alpha ) = (1 + \exp (-\alpha ))^{-1} \]is the logistic sigmoid.

In other words, choose the values $w_0$ through $w_ d$ that maximize the likelihood for a set of $(\mathbf{x}, y)$ example pairs, where $y \in \{ -1,1\} $.

## Input

Input for your submission is a training set of $n$ examples in $d$ dimensions.

The first line contains two integers $1 \le n \le 10\, 000$ and $1 \le d \le 10$. Following are $n$ lines, each containing $d$ real numbers and an integer. For line $i$ ($1 \le i \le n)$, the first $d$ numbers are $x_{i,1}$ through $x_{i,d}$, and the last is for the corresponding value $y_ i \in \{ -1,1\} $. Each real value has at most $5$ digits past the decimal point and is in the range $[-2 \cdot 10^9, 2 \cdot 10^9]$.

It is guaranteed that no feature is constant and each pair of features is not collinear. Specifically, let $\mu _ j = \frac{1}{n} \sum _{i=1}^ n x_{i,j}$ be the mean of feature $j$. Then it is guaranteed that

\[ 0.001 \le \sum _{i=1}^ n |x_{i,j} - \mu _ j| \](no feature is constant), and also that for any pair of features $1 \le j < k \le d$:

\[ \frac{|\sum _{i=1}^ n (x_{i,j} - \mu _ j)(x_{i,k} - \mu _ k)| }{\sqrt{\sum _{i=1}^ n (x_{i,j} - \mu _ j)^2} \sqrt{\sum _{i=1}^ n (x_{i,k} - \mu _ k)^2} } \le 0.99 \](no pair of features is collinear).

## Output

Implement an algorithm to create a logistic regression model that fits the training data. Output $d+1$ real values corresponding to the coefficients $w_0$ to $w_ d$ in the definition of $f(\mathbf{x}, \mathbf{w})$ above.

Note that the sample output shown is actually the test data on which a model would be evaluated (see below). It is in the same format as the input. A reasonable output for the given sample input might be:

-2.61588441789 1.07522029192

which would give a score of about $90.3045$ on the test data, according to the metric below.

## Scoring

The output of your submission will be used to construct a logistic regression model to evaluate on another set of data, called the testing set. The test set comes from a distribution similar to the training set, but is not known to your submission.

Assume that the test data has $m$ examples $\{ (\mathbf{x}_1,y_1), \ldots , (\mathbf{x}_ m,y_ m)\} $. The score your model receives for each these $m$ examples is:

\[ \frac{100}{m} \sum _{i=1}^ m \left| (1 - y_ i)/2 - f(\mathbf{x_ i}, \mathbf{w}) \right| \]This measures the average accuracy over the $m$ examples; higher values are better.

Since predicting a constant (e.g. all weights being $0$ would give a constant prediction of $f(\mathbf{x}, \mathbf{0}) = 0.5$) will get you a score of $50$, if your score is not above $50$ for any test case, your submission will be judged as incorrect.

Sample Input 1 | Sample Output 1 |
---|---|

5 1 1 -1 2 1 3 -1 4 1 5 1 |
5 1 0 -1 2 -1 6 1 7 1 8 1 |