Problem G
Logistic Regression

Implement simple logistic regression in $d$ dimensions:

\[ f(\mathbf{x}, \mathbf{w}) = s(w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_ d x_ d) = s(\mathbf{w}^ T \mathbf{x}), \]

where we assume that $x_0 = 1$ and

\[ s(\alpha ) = (1 + \exp (-\alpha ))^{-1} \]

is the logistic sigmoid.

In other words, choose the values $w_0$ through $w_ d$ that maximize the likelihood for a set of $(\mathbf{x}, y)$ example pairs, where $y \in \{ -1,1\} $.

\includegraphics[width=0.5\textwidth ]{simple.png}
Figure 1: Plots of training set (top) and testing set (bottom) for the sample data. For this problem, the training set is given as input, and the testing set is listed as output (but not given to your submission).


Input for your submission is a training set of $n$ examples in $d$ dimensions.

The first line contains two integers $1 \le n \le 10\, 000$ and $1 \le d \le 10$. Following are $n$ lines, each containing $d$ real numbers and an integer. For line $i$ ($1 \le i \le n)$, the first $d$ numbers are $x_{i,1}$ through $x_{i,d}$, and the last is for the corresponding value $y_ i \in \{ -1,1\} $. Each real value has at most $5$ digits past the decimal point and is in the range $[-2 \cdot 10^9, 2 \cdot 10^9]$.

It is guaranteed that no feature is constant and each pair of features is not collinear. Specifically, let $\mu _ j = \frac{1}{n} \sum _{i=1}^ n x_{i,j}$ be the mean of feature $j$. Then it is guaranteed that

\[ 0.001 \le \sum _{i=1}^ n |x_{i,j} - \mu _ j| \]

(no feature is constant), and also that for any pair of features $1 \le j < k \le d$:

\[ \frac{|\sum _{i=1}^ n (x_{i,j} - \mu _ j)(x_{i,k} - \mu _ k)| }{\sqrt{\sum _{i=1}^ n (x_{i,j} - \mu _ j)^2} \sqrt{\sum _{i=1}^ n (x_{i,k} - \mu _ k)^2} } \le 0.99 \]

(no pair of features is collinear).


Implement an algorithm to create a logistic regression model that fits the training data. Output $d+1$ real values corresponding to the coefficients $w_0$ to $w_ d$ in the definition of $f(\mathbf{x}, \mathbf{w})$ above.

Note that the sample output shown is actually the test data on which a model would be evaluated (see below). It is in the same format as the input. A reasonable output for the given sample input might be:

-2.61588441789 1.07522029192

which would give a score of about $90.3045$ on the test data, according to the metric below.


The output of your submission will be used to construct a logistic regression model to evaluate on another set of data, called the testing set. The test set comes from a distribution similar to the training set, but is not known to your submission.

Assume that the test data has $m$ examples $\{ (\mathbf{x}_1,y_1), \ldots , (\mathbf{x}_ m,y_ m)\} $. The score your model receives for each these $m$ examples is:

\[ \frac{100}{m} \sum _{i=1}^ m \left| (1 - y_ i)/2 - f(\mathbf{x_ i}, \mathbf{w}) \right| \]

This measures the average accuracy over the $m$ examples; higher values are better.

Since predicting a constant (e.g. all weights being $0$ would give a constant prediction of $f(\mathbf{x}, \mathbf{0}) = 0.5$) will get you a score of $50$, if your score is not above $50$ for any test case, your submission will be judged as incorrect.

Sample Input 1 Sample Output 1
5 1
1 -1
2 1
3 -1
4 1
5 1
5 1
0 -1
2 -1
6 1
7 1
8 1
CPU Time limit 2 seconds
Memory limit 1024 MB
Statistics Show
Greg Hamerly
Source Baylor University Machine Learning
License Creative Commons License (cc by-sa)

Please log in to submit a solution to this problem

Log in