Problem F
Linear Regression
Implement simple linear regression in $d$ dimensions:
\[ f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_ d x_ d. \]In other words, choose the values $w_0$ through $w_ d$ that make the best linear fit of a set of $(\mathbf{x}, y)$ example pairs. The error of the prediction $f(\mathbf{x})$ for target value $y$ is $(f(\mathbf{x}) - y)^2$ (i.e. squared error).
Input
Input for your submission is a training set of $n$ examples in $d$ dimensions.
The first line contains two integers $1 \le n \le 10\, 000$ and $1 \le d \le 10$. Following are $n$ lines, each containing $d+1$ real numbers. For line $i$ ($1 \le i \le n)$, the first $d$ numbers are $x_{i,1}$ through $x_{i,d}$, and the last is for the corresponding $y_ i$ value. Each real value has at most $5$ digits past the decimal point and is in the range $[-2 \cdot 10^9, 2 \cdot 10^9]$.
It is guaranteed that no feature is constant and each pair of features is not collinear. Specifically, let $\mu _ j = \frac{1}{n} \sum _{i=1}^ n x_{i,j}$ be the mean of feature $j$. Then it is guaranteed that
\[ 0.001 \le \sum _{i=1}^ n |x_{i,j} - \mu _ j| \](no feature is constant), and also that for any pair of features $1 \le j < k \le d$:
\[ \frac{|\sum _{i=1}^ n (x_{i,j} - \mu _ j)(x_{i,k} - \mu _ k)| }{\sqrt{\sum _{i=1}^ n (x_{i,j} - \mu _ j)^2} \sqrt{\sum _{i=1}^ n (x_{i,k} - \mu _ k)^2} } \le 0.99 \](no pair of features is collinear).
Output
Implement an algorithm to create a linear model that fits the training data. Recommended approaches are gradient descent, least-squares estimation using the Moore-Penrose pseudoinverse, or ridge regression.
Output $d+1$ real values corresponding to the coefficients $w_0$ to $w_ d$ in the definition of $f(\mathbf{x})$ above.
Note that the sample output shown is actually the test data on which a model would be evaluated (see below). It is in the same format as the input. A reasonable output for the given sample input might be:
8.60271 1.19910
which would give a score of about $96.31$ on the test data, according to the metric below.
Scoring
For each test file, the output of your submission will be used to construct a linear model to evaluate on another set of data, called the testing set. The test set comes from a distribution similar to the training set, but is not known to your submission.
Assume that the test data has $m$ examples $\{ (\mathbf{x}_1,y_1), \ldots , (\mathbf{x}_ m,y_ m)\} $. The score your model receives for each test case is based on the following:
\[ 100 \left(1 - \frac{\sum _{i=1}^ m (f(\mathbf{x}_ i) - y_ i)^2}{\sum _{i=1}^ m (\mu _ y - y_ i)^2} \right) \]where $\mu _ y = \frac{1}{m} \sum _{i=1}^ m y_ i$ is the average $y$ value. A value close to $100$ is good, as that indicates the linear model $f$ is predicting $y$ well. If you are familiar with the $r^2$ coefficient of determination, this score is equal to $100 r^2$. We clamp the score to be within the range $[0, 100]$.
The overall score for your submission is the average score over all hidden test files. The sample data is excluded from this average.
Sample Input 1 | Sample Output 1 |
---|---|
5 1 1 10 2 11 3 12 4 13 5 15 |
5 1 0 9.5 2 10 6 16.5 7 17.5 8 18 |