Problem F
Linear Regression

Implement simple linear regression in $d$ dimensions:

\[ f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_ d x_ d. \]

In other words, choose the values $w_0$ through $w_ d$ that make the best linear fit of a set of $(\mathbf{x}, y)$ example pairs. The error of the prediction $f(\mathbf{x})$ for target value $y$ is $(f(\mathbf{x}) - y)^2$ (i.e. squared error).

\includegraphics[width=0.5\textwidth ]{simple.png}
Figure 1: Plots of training set (top) and testing set (bottom) for the sample data. For this problem, the training set is given as input, and the testing set is listed as output (but not given to your submission).


Input for your submission is a training set of $n$ examples in $d$ dimensions.

The first line contains two integers $1 \le n \le 10\, 000$ and $1 \le d \le 10$. Following are $n$ lines, each containing $d+1$ real numbers. For line $i$ ($1 \le i \le n)$, the first $d$ numbers are $x_{i,1}$ through $x_{i,d}$, and the last is for the corresponding $y_ i$ value. Each real value has at most $5$ digits past the decimal point and is in the range $[-2 \cdot 10^9, 2 \cdot 10^9]$.

It is guaranteed that no feature is constant and each pair of features is not collinear. Specifically, let $\mu _ j = \frac{1}{n} \sum _{i=1}^ n x_{i,j}$ be the mean of feature $j$. Then it is guaranteed that

\[ 0.001 \le \sum _{i=1}^ n |x_{i,j} - \mu _ j| \]

(no feature is constant), and also that for any pair of features $1 \le j < k \le d$:

\[ \frac{|\sum _{i=1}^ n (x_{i,j} - \mu _ j)(x_{i,k} - \mu _ k)| }{\sqrt{\sum _{i=1}^ n (x_{i,j} - \mu _ j)^2} \sqrt{\sum _{i=1}^ n (x_{i,k} - \mu _ k)^2} } \le 0.99 \]

(no pair of features is collinear).


Implement an algorithm to create a linear model that fits the training data. Recommended approaches are gradient descent, least-squares estimation using the Moore-Penrose pseudoinverse, or ridge regression.

Output $d+1$ real values corresponding to the coefficients $w_0$ to $w_ d$ in the definition of $f(\mathbf{x})$ above.

Note that the sample output shown is actually the test data on which a model would be evaluated (see below). It is in the same format as the input. A reasonable output for the given sample input might be:

8.60271 1.19910

which would give a score of about $96.31$ on the test data, according to the metric below.


For each test file, the output of your submission will be used to construct a linear model to evaluate on another set of data, called the testing set. The test set comes from a distribution similar to the training set, but is not known to your submission.

Assume that the test data has $m$ examples $\{ (\mathbf{x}_1,y_1), \ldots , (\mathbf{x}_ m,y_ m)\} $. The score your model receives for each test case is based on the following:

\[ 100 \left(1 - \frac{\sum _{i=1}^ m (f(\mathbf{x}_ i) - y_ i)^2}{\sum _{i=1}^ m (\mu _ y - y_ i)^2} \right) \]

where $\mu _ y = \frac{1}{m} \sum _{i=1}^ m y_ i$ is the average $y$ value. A value close to $100$ is good, as that indicates the linear model $f$ is predicting $y$ well. If you are familiar with the $r^2$ coefficient of determination, this score is equal to $100 r^2$. We clamp the score to be within the range $[0, 100]$.

The overall score for your submission is the average score over all hidden test files. The sample data is excluded from this average.

Sample Input 1 Sample Output 1
5 1
1 10
2 11
3 12
4 13
5 15
5 1
0 9.5
2 10
6 16.5
7 17.5
8 18
CPU Time limit 3 seconds
Memory limit 1024 MB
Statistics Show
Greg Hamerly
Source Baylor University Machine Learning
License Creative Commons License (cc by-sa)

Please log in to submit a solution to this problem

Log in