STAT217
Worksheet #6
Question 1
You are given the following data of shipping distance and the time it takes for a shipment to travel that distance by courier.
| Distance in km | 825 | 215 | 1070 | 550 | 480 | 920 | 1350 | 325 | 670 | 1215 |
| Time in days | 3.5 | 1 | 4 | 2 | 1 | 3 | 4.5 | 1.5 | 3 | 5 |
a) Construct a model in which the shipping time depends on the distance. State what the model is. (time = 0.1181 + 0.0036*distance)
b) What percentage of the variation in shipping time is explained by the model? (90.05%)
c) Test the hypothesis that the model is significant at the 5% level of significance. (F = 72.3959; conclude model is significant)
d) Construct a 95% confidence interval for the slope coefficient. If this interval were used to test the hypothesis in part c, why would the same conclusion be reached? (0.0026 < β1 < 0.0046; hypothesized slope of zero is not in the interval)
e) If the distance is 500 km, how many days should you expect the shipping time to be? Round to 2 decimals. (1.91 days)
f) If the distance is 500 km, within what range will the average shipping time fall 95% of the time? Round to 2 decimals. (1.48 to 2.34 days)
g) For a particular shipment that travels 500 km, within what range will the shipping time fall 95% of the time? Round to 2 decimals. (0.72 to 3.10 days)
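The calculations behind parts a) to g) can be sketched in Python using only the standard library. This is a minimal illustration rather than the method used to produce the key: the variable names are made up for this sketch, and the critical value 2.306 is assumed from a t-table (upper 2.5% point, 8 degrees of freedom) instead of being computed.

```python
import math

distance = [825, 215, 1070, 550, 480, 920, 1350, 325, 670, 1215]  # km
days = [3.5, 1, 4, 2, 1, 3, 4.5, 1.5, 3, 5]

n = len(distance)
x_bar = sum(distance) / n
y_bar = sum(days) / n

# Sums of squares and cross-products about the means
sxx = sum((x - x_bar) ** 2 for x in distance)
syy = sum((y - y_bar) ** 2 for y in days)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(distance, days))

# a) least-squares slope and intercept
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# b), c) coefficient of determination and F statistic (1 and n - 2 df)
ssr = b1 * sxy
sse = syy - ssr
r2 = ssr / syy
mse = sse / (n - 2)
f_stat = ssr / mse

# Assumption: t critical value read from a table (alpha = 0.05, df = 8)
T_CRIT = 2.306

# d) 95% confidence interval for the slope
se_b1 = math.sqrt(mse / sxx)
slope_ci = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

# e) point estimate at 500 km
x0 = 500
y_hat = b0 + b1 * x0

# f) interval for the average time, g) interval for a single shipment
leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
me_mean = T_CRIT * math.sqrt(mse * leverage)
me_single = T_CRIT * math.sqrt(mse * (1 + leverage))
mean_ci = (y_hat - me_mean, y_hat + me_mean)
pred_int = (y_hat - me_single, y_hat + me_single)

print(f"time = {b0:.4f} + {b1:.4f}*distance")
print(f"r2 = {r2:.4f}, F = {f_stat:.4f}")
print(f"slope CI: {slope_ci[0]:.4f} to {slope_ci[1]:.4f}")
print(f"at 500 km: {y_hat:.2f} days")
print(f"mean CI: {mean_ci[0]:.2f} to {mean_ci[1]:.2f}")
print(f"single shipment: {pred_int[0]:.2f} to {pred_int[1]:.2f}")
```

The interval in g) is wider than the one in f) only because of the extra 1 under the square root, which accounts for the scatter of an individual shipment around the average.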
Question 2
Suppose you are given the following information:
size of home (in thousands of square feet)
home price (in thousands of dollars)
| size | 1.82 | 1.59 | 1.57 | 1.81 | 2.01 | 1.57 | 1.87 | 1.82 | 1.59 | 1.95 |
| price | 173.1 | 160 | 164.6 | 183.5 | 194.8 | 166 | 178.7 | 181.5 | 160.5 | 196.5 |
a) Construct a model in which the home price depends on the home size (price = 43.4702 + 75.2556*size)
b) What percentage of the variation in price is explained by the model? (88.52%)
c) Test the hypothesis that the model is significant at the 5% level of significance. (F = 61.6842; conclude model is significant)
d) Construct a 95% confidence interval for the slope coefficient. If this interval were used to test the hypothesis in part c, why would the same conclusion be reached? (53.1597 < β1 < 97.3515; hypothesized slope of zero is not in the interval)
e) If a home has 1500 square feet, what would you expect the price to be? Round to the nearest hundred. ($156,400)
f) If the square footage is 1500 square feet, within what range will the average home price fall 95% of the time? Round to the nearest hundred. ($149,600 to $163,100)
g) If a particular home with 1500 square feet is put up for sale, within what range will the selling price of this home fall 95% of the time? Round to the nearest hundred. ($143,400 to $169,300)
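Parts e) to g) quote the home size in square feet, while the model is fitted with size in thousands of square feet and price in thousands of dollars. The short sketch below shows the unit handling for part e), using the coefficients reported in part a). The function name is hypothetical and introduced only for illustration.

```python
# Coefficients as reported in part a);
# size is in thousands of square feet, price in thousands of dollars.
B0 = 43.4702
B1 = 75.2556

def expected_price_dollars(square_feet):
    """Expected price in dollars, rounded to the nearest hundred."""
    size = square_feet / 1000            # convert to the model's units
    price_thousands = B0 + B1 * size     # model output, in $ thousands
    return round(price_thousands * 1000, -2)

print(expected_price_dollars(1500))
```

Plugging in 1500 directly instead of 1.5 would overstate the price by a factor of roughly a thousand, which is the error this conversion guards against.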
Question 3
You are given the following 11 readings of wood density and stiffness:
| Density | Stiffness | Density | Stiffness |
| 21.7 | 47661 | 15 | 25319 |
| 15.2 | 28028 | 25.6 | 96305 |
| 23.4 | 104170 | 15 | 26222 |
| 15.4 | 25312 | 24.4 | 72594 |
| 14.5 | 22148 | 7 | 5304 |
| 16.7 | 49499 | | |
An initial model was built in which density is used to predict stiffness. Here is the plot of the fitted values against the residuals:

a) Why does a linearizing transformation appear to be in order? (see key)
b) If density is used to predict stiffness, which is the better transformation of y based on r²: natural log or square root? Using the better transformation, create a linear model with the transformed data. (ln stiffness = 7.9059 + 0.145*density)
c) Is the model significant? Test at a 5% level of significance. (F = 95.7331; conclude the model is significant)
d) If a piece of wood has a density of 20, what would you expect the stiffness to be? Round to the nearest hundred. (49,300)
e) Construct a 95% confidence interval for the average stiffness at a density of 20, rounding the limits to the nearest hundred. (40,600 < μy < 59,900)
f) Suppose a square root transformation had been done instead. What would be the 95% confidence interval for the average stiffness at a density of 20, rounding the limits to the nearest hundred? (45,600 < μy < 62,700)
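The comparison in part b) and the back-transformation needed in part d) can be sketched as follows. This is an illustrative sketch, not the procedure used for the key: the helper fit() is a hypothetical name, and the data pairs are read row by row from the table above. Predictions made on the transformed scale are converted back with exp() for the log model and by squaring for the square root model.

```python
import math

# (density, stiffness) pairs read row by row from the table
density = [21.7, 15, 15.2, 25.6, 23.4, 15, 15.4, 24.4, 14.5, 7, 16.7]
stiffness = [47661, 25319, 28028, 96305, 104170, 26222,
             25312, 72594, 22148, 5304, 49499]

def fit(x, y):
    """Return least-squares intercept, slope and r^2 for y regressed on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope, sxy ** 2 / (sxx * syy)

# Fit the same straight-line model to each transformed response
ln_b0, ln_b1, ln_r2 = fit(density, [math.log(s) for s in stiffness])
rt_b0, rt_b1, rt_r2 = fit(density, [math.sqrt(s) for s in stiffness])

print(f"ln model:   {ln_b0:.4f} + {ln_b1:.4f}*density, r2 = {ln_r2:.4f}")
print(f"sqrt model: {rt_b0:.4f} + {rt_b1:.4f}*density, r2 = {rt_r2:.4f}")

# Predict at density 20, then undo the transformation
pred_ln = math.exp(ln_b0 + ln_b1 * 20)
pred_rt = (rt_b0 + rt_b1 * 20) ** 2
print(f"stiffness at density 20: ln model {pred_ln:.0f}, sqrt model {pred_rt:.0f}")
```

The same back-transformation applies to interval limits: the limits are computed on the transformed scale first and only then converted to stiffness units.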
Question 4
Given the following data:
| speed limit | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 |
| accidents | 8 | 30 | 31 | 27 | 56 | 71 | 79 | 97 | 134 |
A linear regression model was built in which the speed limit was used to predict the number of accidents in a five-year period. Here is the plot of the fitted values against the residuals:

Along with the plot points:
| Fitted | 2.888889 | 16.97222 | 31.05556 | 45.13889 | 59.22222 | 73.30556 | 87.38889 | 101.4722 | 115.5556 |
| Residuals | 5.111111 | 13.02778 | -0.05556 | -18.1389 | -3.22222 | -2.30556 | -8.38889 | -4.47222 | 18.44444 |
a) What would be the best course of action? (see key)
Here is the output of the second model:
SUMMARY OUTPUT

Regression Statistics
| Multiple R | 0.9823 |
| R Square | 0.9648 |
| Adjusted R Square | 0.9531 |
| Standard Error | 8.6845 |
| Observations | 9 |

ANOVA
| | df | SS | MS | F | Significance F |
| Regression | 2 | 12419.03175 | 6209.516 | 82.332 | 4.34541E-05 |
| Residual | 6 | 452.5238095 | 75.42063 | | |
| Total | 8 | 12871.55556 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| Intercept | -48.0119 | 8.9920 | -5.3394 | 0.0018 | -70.0147 | -26.0092 |
| speed | 1.4083 | 0.1121 | 12.5613 | 2E-05 | 1.1340 | 1.6827 |
| (speed-70)² | 0.0130 | 0.0049 | 2.6223 | 0.0395 | 0.0009 | 0.0251 |
b) What percentage of the variation in the number of accidents over a 5-year period is explained by the model? (96.48%)
c) If the speed limit is 60 km per hour, what is the expected number of accidents over a 5-year period? Round to the nearest whole number. (38)
d) Is the model significant? Test at a 5% level of significance. (conclude model is significant)
e) Based on a 5% level of significance, why are both independent variables significant? (both p-values < 5%)
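Parts b) to d) can be checked directly from the summary output. The sketch below copies the coefficients and sums of squares from the output and recomputes R Square, Adjusted R Square, the F statistic and the prediction at 60 km per hour. The names are chosen for this illustration only, and the squared term is centred at 70 as in the output.

```python
# Values copied from the summary output of the second model
B0 = -48.0119       # intercept
B_SPEED = 1.4083    # coefficient of speed
B_SQUARED = 0.0130  # coefficient of (speed - 70) squared

SS_REG, SS_RES, SS_TOTAL = 12419.03175, 452.5238095, 12871.55556
DF_REG, DF_RES = 2, 6

def expected_accidents(speed):
    """Expected accidents over 5 years for a given speed limit (km/h)."""
    return B0 + B_SPEED * speed + B_SQUARED * (speed - 70) ** 2

# b) share of variation explained, and its adjusted version
r2 = SS_REG / SS_TOTAL
adj_r2 = 1 - (SS_RES / DF_RES) / (SS_TOTAL / (DF_REG + DF_RES))

# d) F statistic = mean square regression / mean square residual
f_stat = (SS_REG / DF_REG) / (SS_RES / DF_RES)

print(f"R Square = {r2:.4f}, Adjusted R Square = {adj_r2:.4f}")
print(f"F = {f_stat:.3f}")
print(f"expected accidents at 60 km/h: {round(expected_accidents(60))}")
```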
Question 5
The average gas mileage for vehicles (L/100 km) was recorded, along with the engine size (cc), the vehicle weight (kg) and whether the vehicle had been serviced in the past 6 months (1 = yes, 0 = no). Various models were built to determine which of these factors contributed to gas mileage. Here is the computer output of the initial model using all the variables:
Regression model: mileage = -2.5385 + 0.0039size + 0.0003weight - 0.9772service
Percentage of variation in mileage explained by the model: 92.23%
Adjusted for the number of variables: 90.78%
Ho: the model is not significant
Ha: the model is significant
Reject Ho if test statistic > 3.239
Test statistic = 63.329
P-value = 0
Reject Ho
Conclude the model is significant

| | Coefficient | Lower limit (95% CI) | Upper limit (95% CI) | P-value | VIF |
| Intercept | -2.5385 | -5.4566 | 0.3796 | 0.0838 | |
| size | 0.0039 | -0.0072 | 0.0149 | 0.4714 | 91.1945 |
| weight | 0.0003 | -0.0007 | 0.0014 | 0.528 | 91.117 |
| service | -0.9772 | -1.759 | -0.1953 | 0.0175 | 1.0176 |
Next a second model using size and service:
Regression model: mileage = -2.7125 + 0.0072size - 0.9713service
Percentage of variation in mileage explained by the model: 92.03%
Adjusted for the number of variables: 91.09%
Ho: the model is not significant
Ha: the model is significant
Reject Ho if test statistic > 3.592
Test statistic = 98.157
P-value = 0
Reject Ho
Conclude the model is significant

| | Coefficient | Lower limit (95% CI) | Upper limit (95% CI) | P-value | VIF |
| Intercept | -2.7125 | -5.511 | 0.086 | 0.0567 | |
| size | 0.0072 | 0.0061 | 0.0083 | 0 | 1.017 |
| service | -0.9713 | -1.7357 | -0.2068 | 0.0158 | 1.017 |
Finally a third model using weight and service:
Regression model: mileage = -2.2372 + 0.0007weight - 0.9876service
Percentage of variation in mileage explained by the model: 91.97%
Adjusted for the number of variables: 91.02%
Ho: the model is not significant
Ha: the model is significant
Reject Ho if test statistic > 3.592
Test statistic = 97.331
P-value = 0
Reject Ho
Conclude the model is significant

| | Coefficient | Lower limit (95% CI) | Upper limit (95% CI) | P-value | VIF |
| Intercept | -2.2372 | -4.9732 | 0.4988 | 0.1026 | |
| weight | 0.0007 | 0.0006 | 0.0008 | 0 | 1.0161 |
| service | -0.9876 | -1.7547 | -0.2205 | 0.0147 | 1.0161 |
a) Using the criteria of adjusted r², ANOVA p-values, individual t tests (testing at 5%) and presence of multicollinearity, which of the 3 models is the best? (model #2)
b) Using the second model, if an engine is 1500 cc and has been serviced, what would its average mileage be? (7.1162 L/100 km)
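One way to organize the comparison in part a) is to screen each model on its t-test p-values and VIFs, then rank the survivors by adjusted r². The sketch below does this with the figures from the three outputs. The VIF cutoff of 10 is an assumed rule of thumb rather than something stated in the worksheet, and the dictionary layout and function names are hypothetical. All three models pass the ANOVA test, so that criterion does not separate them and is left out.

```python
# Figures copied from the three outputs (predictors only, intercept omitted)
models = {
    1: {"adj_r2": 90.78,
        "p_values": {"size": 0.4714, "weight": 0.528, "service": 0.0175},
        "vif": {"size": 91.1945, "weight": 91.117, "service": 1.0176}},
    2: {"adj_r2": 91.09,
        "p_values": {"size": 0.0, "service": 0.0158},
        "vif": {"size": 1.017, "service": 1.017}},
    3: {"adj_r2": 91.02,
        "p_values": {"weight": 0.0, "service": 0.0147},
        "vif": {"weight": 1.0161, "service": 1.0161}},
}

ALPHA = 0.05
VIF_LIMIT = 10  # assumption: common rule of thumb for multicollinearity

def passes_screen(model):
    """True if every predictor is significant and no VIF is excessive."""
    significant = all(p < ALPHA for p in model["p_values"].values())
    no_collinearity = all(v < VIF_LIMIT for v in model["vif"].values())
    return significant and no_collinearity

candidates = [k for k, m in models.items() if passes_screen(m)]
best = max(candidates, key=lambda k: models[k]["adj_r2"])
print("models passing the screen:", candidates)
print("best by adjusted r2:", best)

# b) prediction from the second model: 1500 cc engine, serviced (service = 1)
mileage = -2.7125 + 0.0072 * 1500 - 0.9713 * 1
print(f"expected mileage: {mileage:.4f} L/100 km")
```

Model 1 fails the screen because size and weight carry nearly the same information, which shows up as very large VIFs and insignificant individual t tests even though the overall model is significant.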