Multiple Linear Regression
Lecture 20
Sections 17.1–17.4, 18.1
• Linear Model and Regression Line
• Interpreting Coefficients
• Correlation Table
• Testing the Full Model
• Testing Individual Predictors
• Indicator Variables
Linear Model and Regression Line
• Linear Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
   • $y$: Actual value of response variable from data
   • $x_1, x_2, \ldots, x_k$: Actual values of $k$ predictor variables from data
   • $\beta_0$: True y-intercept for the line that fits the population
   • $\beta_1, \beta_2, \ldots, \beta_k$: True slopes of the regression line that fits the population (parameters)
   • $\varepsilon$: Error between actual ($y$) and predicted ($\hat{y}$) values
• Regression Line: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
   • $\hat{y}$: Predicted value of response
   • $b_0$: Estimated y-intercept (statistic)
   • $b_1, b_2, \ldots, b_k$: Estimated slopes for the regression line that fits the sample
Note: $b_0, b_1, \ldots, b_k$ are the best estimates for the parameters $\beta_0, \beta_1, \ldots, \beta_k$.
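As a rough illustration of how the estimates $b_0, b_1, \ldots, b_k$ can be obtained from a sample, the sketch below solves the least-squares normal equations in plain Python. The function names `fit_regression` and `predict` are hypothetical, and the data are made up (noise-free, so the fit recovers the coefficients exactly); they are not the pitcher data used in the lecture, and this is only one of several ways software might compute the fit.

```python
def fit_regression(X, y):
    """Least-squares estimates [b0, b1, ..., bk] from the normal equations."""
    n, p = len(X), len(X[0]) + 1
    # Design matrix with a leading column of ones for the intercept.
    A = [[1.0] + [float(v) for v in row] for row in X]
    # Augmented system (A^T A | A^T y).
    M = [
        [sum(A[r][i] * A[r][j] for r in range(n)) for j in range(p)]
        + [sum(A[r][i] * y[r] for r in range(n))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict(b, x):
    """Predicted response: b0 + b1*x1 + ... + bk*xk."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))


# Made-up data generated from y = 100 - 0.2*x1 + 0.1*x2 with no noise.
X = [(25, 10), (30, 5), (35, 12), (28, 8), (40, 3)]
y = [96.0, 94.5, 94.2, 95.2, 92.3]

b = fit_regression(X, y)
print([round(v, 4) for v in b])
print(round(predict(b, (32, 9)), 4))
```

With real data the response will not fall exactly on the fitted surface, so the estimates will differ from the population parameters.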
Example: Setting Up the Regression Model and Line
• Scenario: Use random sample of MLB pitchers' ages and number of wins in 2017 season to predict average fastball velocity
• Question: What is the regression model?
• Answer: ___________________________________________
   • $x_1$ = ________
   • $x_2$ = __________________________
• Question: What is the equation of the regression line?
• Answer: ___________________________________________
Example: Using the Multiple Regression Line
• Scenario: Use random sample of MLB pitchers' ages and number of wins in 2017 season to predict average fastball velocity. Justin Verlander was 34 years old and won 15 games in 2017.
• Question: What is the predicted average fastball velocity for Justin Verlander in 2017?
• Answer: $\hat{y}$ = _________________________________________
   = _________________________________________
   = _________________
Standard Deviation of the Residuals
• Standard Deviation of Residuals: measure of the size of a typical residual
   • In multiple regression, it is calculated as:
     $s_e = \sqrt{\dfrac{\sum e^2}{n - k - 1}}$
     where $k$ is the number of predictor variables in the multiple regression model and $e = y - \hat{y}$.
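The formula for $s_e$ can be sketched in a few lines of Python. The function name `residual_sd` and the numbers below are illustrative assumptions, not values from the lecture's data set.

```python
import math


def residual_sd(y, y_hat, k):
    """Standard deviation of the residuals: sqrt(sum(e^2) / (n - k - 1))."""
    n = len(y)
    if n - k - 1 <= 0:
        raise ValueError("need more observations than k + 1")
    # Residuals e = y - y_hat, squared and summed.
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return math.sqrt(sse / (n - k - 1))


# Made-up actual and predicted values for a model with k = 2 predictors.
y = [10, 12, 9, 15, 11]
y_hat = [11, 11, 10, 14, 12]
s_e = residual_sd(y, y_hat, k=2)
print(round(s_e, 4))
```

Dividing a single residual by `s_e` gives a rough sense of how many standard deviations an observation sits above or below its predicted value.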
Example: Standard Deviation of the Residuals
• Scenario: Justin Verlander's actual average velocity was 95.2 mph and his predicted average velocity was 92.013 mph.
• Question: How unusual is Verlander's observation?
• Answer: _____________________
   • Standard Deviation: __________
   • Residual: $e$ = __________________________________________
   • ________ standard deviations _________ the predicted value
Interpreting Coefficients
• If variables are uncorrelated (or weakly correlated), each slope coefficient gives the increase in the predicted response given a one unit increase in the predictor variable after accounting for all other variables in the model.
• If the variables are more strongly correlated, then the issue of collinearity arises and the interpretation of the slope coefficients is not so straightforward.
   • Collinearity will be discussed in a future class
Example: Interpreting Coefficients
• Scenario: Use random sample of MLB pitchers' ages and number of wins in 2017 season to predict average fastball velocity
• Question: How should the slope coefficients be interpreted?
• Answer:
   • Age: After accounting for the ___________________, for every additional ______ ___________________, his predicted fastball velocity ____________________________.
   • Wins: After accounting for _____, for every additional ______________________ _________, his predicted fastball velocity ______________________________________.
Correlation Tables
• Correlation Table: a table displaying the correlation between each pair of variables
   • Diagonal entries will always be 1 because variables are perfectly correlated with themselves
   • Bottom half generally gets filled in with correlations
   • Top half generally left blank because correlations will be the same as the bottom half
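A correlation table of this shape can be built by computing the Pearson correlation for each pair of variables and keeping only the diagonal and lower half. The sketch below uses hypothetical helper names (`correlation`, `correlation_table`) and made-up values rather than the lecture's data.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def correlation_table(columns):
    """Lower-triangular table: each row holds the correlations with the
    variables listed before it, followed by the diagonal entry."""
    names = list(columns)
    return {
        row: [correlation(columns[row], columns[col]) for col in names[: i + 1]]
        for i, row in enumerate(names)
    }


# Made-up illustrative values, not the pitcher data from the slides.
data = {
    "velocity": [95.1, 93.4, 91.8, 94.0, 90.5, 92.2],
    "age": [24, 29, 33, 27, 36, 31],
    "wins": [12, 8, 10, 15, 6, 9],
}
table = correlation_table(data)
for name, row in table.items():
    print(f"{name:<10}" + "".join(f"{r:>8.3f}" for r in row))
```

The top half is omitted because the correlation of A with B equals the correlation of B with A.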
Example: Correlation Tables
• Scenario: Use random sample of MLB pitchers' ages and number of wins in 2017 season to predict average fastball velocity
• Question: What does the correlation table reveal about the relationship between the variables?
• Answer:
   • Predictor variables (age and wins) are ___________________
   • As a pitcher gets older, their average velocity tends to _____________
   • As a pitcher wins more games, their average velocity tends to ____________

                Avg. FB Vel.   Age      Wins
Avg. FB Vel.    1.00
Age             -0.461         1.00
Wins            0.267          0.000    1.00
Example: Effect of Correlation on Coefficients
• Scenario: Regress average fastball velocity on age and wins individually (simple) and simultaneously (multiple)
• Question: What happens to the slope coefficients when the second variable is included?
• Answer: _________________
   • Being _______________ means the inclusion of age has _____________ on the effect that number of wins has on average fastball velocity

                Avg. FB Vel.   Age      Wins
Avg. FB Vel.    1.00
Age             -0.461         1.00
Wins            0.267          0.000    1.00
Example: Effect of Correlation on Coefficients
• Scenario: Regress a person's height on the height of their mother and father individually (simple) and simultaneously (multiple)
• Question: What happens to the slope coefficients when the second variable is included?
• Answer: ____________________
   • Correlation between predictors is _________
   • Weak correlation means the inclusion of the father's height has a _________________ on the _________ for the mother's height

                Height   Height (Mom)   Height (Dad)
Height          1.00
Height (Mom)    0.437    1.00
Height (Dad)    0.331    0.193          1.00
Testing the Full Model: Hypotheses and Conditions
• Hypotheses:
   • $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \rightarrow y = \beta_0 + \varepsilon$
      • None of the predictors provide any meaningful information about the response
   • $H_A$: At least one $\beta_j \neq 0 \rightarrow y = \beta_0 + \beta_j x_j + \varepsilon$ for at least one $j$
      • At least one predictor yields some useful information about the response
• Assumptions and Conditions:
   • Linearity Condition: Scatterplots of the response against each of the predictors are relatively straight
   • Randomization: Observations comprise a random sample
   • Equal Spread Condition: Standard deviation of residuals constant across all predictor values
   • Nearly Normal: Normal probability plot near diagonal straight line
Example: Testing the Full Model
• Scenario: Use latitude, longitude, annual precipitation, and altitude to predict average July high temperature in 23 U.S. cities
• Question: Are any of the predictors significant?
• Linear Model: ________________________________________________________
   • $x_1$ = __________________
   • $x_2$ = __________________
   • $x_3$ = __________________
   • $x_4$ = __________________
• Hypotheses:
   • $H_0$: ______________________________________
   • $H_A$: ______________________________________
Example: Testing the Full Model
• Question: Are any of the predictors significant?
• Scatterplots of Temperature vs. Predictors:
• Conditions:
   • Linearity: No issues with ____________________________; _______________________ _________________ may each have an influential point, but they probably are __________________________ to impact the regression
   • Randomization: Cities were _________________________ from the U.S.
Example: Testing the Full Model
• Question: Are any of the predictors significant?
• Residual Plot and Normal Probability Plot:
   [Figures: Residual Plot; Normal Probability Plot]
• Conditions:
   • Equal Spread: Residual plot looks ___________ with no ______________________
   • Nearly Normal: Normal probability plot does not deviate significantly from a _________________________
Types of Variation
• Explained Variation: differences in the responses due to the relationship between the predictors and response
   • Also known as sum of squares due to regression (SSR)
• Unexplained Variation: differences in the responses due to natural variability in the population
   • Also known as sum of squares due to error (SSE)
Full Model Test Statistic
• F-Distribution: continuous probability distribution that has the following properties:
   • Unimodal and right-skewed
   • Always non-negative
   • Two parameters for degrees of freedom
      • One for numerator and one for denominator
   • Used to compare the ratio of two sources of variability
• Test Statistic:
   $F_{k,\,n-k-1} = \dfrac{MS_{\text{Regression}}}{MSE} = \dfrac{SSR/k}{SSE/(n-k-1)}$
   where $k$ is the number of predictors and $n$ is the sample size
   • Numerator: explained variation
   • Denominator: unexplained variation
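The ratio in the test statistic can be sketched directly from the two sums of squares. The function name `f_statistic` and the input values are hypothetical and chosen only so the arithmetic is easy to follow; the p-value is left out because the Python standard library has no F-distribution, and in practice it would come from regression output or a function such as Excel's `F.DIST.RT`.

```python
def f_statistic(ssr, sse, n, k):
    """Full-model F statistic and its (numerator, denominator) degrees of freedom."""
    df_num = k          # one degree of freedom per predictor
    df_den = n - k - 1  # observations left after estimating k slopes and an intercept
    msr = ssr / df_num  # mean square due to regression (explained)
    mse = sse / df_den  # mean square due to error (unexplained)
    return msr / mse, (df_num, df_den)


# Hypothetical sums of squares for a model with k = 4 predictors and n = 25.
F, df = f_statistic(ssr=120.0, sse=60.0, n=25, k=4)
print(F, df)
```

A large value of F means the explained variation per predictor is large relative to the unexplained variation per remaining degree of freedom.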
Using Excel
Note: Excel does not create an overall residual plot of
predicted vs. actual values for multiple regression.
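Example: Testing the Full Model
$F = \dfrac{736.73/4}{610.4\ldots/18} = 5.431$ (the final digit of the SSE value is illegible in the source)
[Excel regression output: Summary Statistics; ANOVA Output (Full Model), with the P-Value marked; Coefficient Output (Individual Predictors)]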
Example: Testing the Full Model
• Scenario: Use latitude, longitude, annual precipitation, and altitude to predict average July high temperature in 23 U.S. cities
• Question: Are any of the predictors significant?
• Mechanics:
   • Test Statistic: ____________
   • Degrees of Freedom: _________________
   • P-Value: __________
• Conclusion: With a p-value of _________, there is _____________________ suggesting that __________________________________________ in predicting the _____________________________________________
Testing Individual Predictors
• After testing the full model:
   • If there is evidence of a relationship between at least one individual predictor and the response, we need to figure out which one(s)
   • If there is no evidence of a relationship, no further testing is necessary
• To test individual predictors, run the following test on each:
   • Hypotheses: $H_0: \beta_j = 0$ vs. $H_A: \beta_j \neq 0$
   • Test Statistic: $t_{n-k-1} = \dfrac{b_j - \beta_j}{SE(b_j)}$, with $\beta_j = 0$ under $H_0$
   • Same form as the test statistic for simple linear regression, where $k = 1$ gives $n - 2$ degrees of freedom
   • Output will return a p-value that reflects how strong the evidence is for a relationship between the predictor and response
Example: Testing Individual Predictors
• Question: Which individual predictors are significant?
• Answer:

                 Latitude        Altitude        Longitude       Precipitation
Hypotheses       $H_0$: _______  $H_0$: _______  $H_0$: _______  $H_0$: _______
                 $H_A$: _______  $H_A$: _______  $H_A$: _______  $H_A$: _______
Test Statistic   ___________     ___________     ___________     ___________
P-Value          ___________     ___________     ___________     ___________
Significant?     ___________     ___________     ___________     ___________
Motivation: Indicator Variables
• Scenario: Comparing amount required to recruit talent for concerts against the amount of revenue taken in. Could use two separate regressions to describe the relationship.
   • Small Venues: $\hat{y} = -784.? + ?.03?\,x$
   • Large Venues: $\hat{y} = -1?{,}435 + ?.01?\,x$
   (digits shown as ? are illegible in the source)
• Question: What is the problem with using both talent cost and venue type in a single model?
• Answer: ________________________________
   • Regression is only appropriate for _________________ data
• Question: What is the solution?
• Answer: Create an _________________________ for venue type
Indicator Variable
• Indicator Variable: a variable that takes on values of 0 or 1 to indicate if an observation from a categorical variable falls into a given category
   • Also known as dummy variables.
   • Commonly represented by $D$ to differentiate them from continuous predictors denoted by $x$
• Choose a category for the indicator variable to represent
   • Any observations that fall into the category get coded as 1; all other observations get coded as 0
• For a variable with $k$ categories, $k - 1$ indicator variables are needed
• Baseline Category: the category that is represented by zeros in every indicator variable
Creating Indicator Variables in Excel
• Choose a category to be the baseline
• Use the "IF" function in Excel to assign values to indicator variables
• When running the multiple regression, select only the numeric predictors and the indicator variables, not the text variable
=IF(D17="Large", 1, 0)
• If the venue is "Large", code the indicator variable as 1
• Otherwise, code the indicator variable as 0
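For readers working outside Excel, the same 0/1 coding can be sketched in Python. The function name `indicator` and the sample venue labels are illustrative assumptions.

```python
def indicator(values, category):
    """Code each observation as 1 if it falls in `category`, otherwise 0."""
    return [1 if v == category else 0 for v in values]


# Hypothetical venue labels; "Small" acts as the baseline category.
venues = ["Large", "Small", "Small", "Large"]
d_large = indicator(venues, "Large")
print(d_large)  # [1, 0, 0, 1]
```

Only the numeric column `d_large` would be passed to the regression, mirroring the advice to leave the text variable out.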
Example: Indicator Variables
• Scenario: Comparing amount required to recruit talent for concerts against the amount of revenue taken in
• Question: What is the regression model?
• Answer: _______________________________________
   • $x_1$ = ______________________
   • $D_1$ = ___________________ (1 = __________ venue, 0 = __________ venue)
• Question: What is the equation of the regression line?
• Answer: ________________________________________________________________
Example: Indicator Variables
• Scenario: Comparing amount required to recruit talent for concerts against the amount of revenue taken in
• Question: What is the predicted revenue for a small venue where the talent cost was $20,000?
• Answer:
   • Small venues are coded as ____ → Coefficient for venue type _______________
   • $\hat{y}$ = ________________________________________________
      = ________________________________________________
      = ____________________
Example: Indicator Variables
• Scenario: Comparing amount required to recruit talent for concerts against the amount of revenue taken in
• Question: What is the predicted revenue for a large venue where the talent cost was $20,000?
• Answer:
   • Large venues are coded as ____ → Coefficient for venue type ______________
   • $\hat{y}$ = ________________________________________________
      = ________________________________________________
      = ____________________
Example: Indicator Variables
• Scenario: Comparing amount required to recruit talent for concerts against the amount of revenue taken in
• Question: What is the interpretation of the coefficient for the indicator variable?
• Answer: For the same cost to ____________________, large venues are expected to make _____________________________ compared to small venues.
• Takeaway: An indicator variable by itself _________________________________________ by the ____________________________________________
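To see how an indicator enters a prediction, the sketch below evaluates a regression line of the form $\hat{y} = b_0 + b_1 x_1 + b_2 D_1$ for both venue types. The coefficients are hypothetical placeholders, not the fitted values from the lecture's Excel output, and `predicted_revenue` is an illustrative name.

```python
def predicted_revenue(talent_cost, large, b0, b1, b2):
    """Regression line with one continuous predictor and one 0/1 indicator."""
    return b0 + b1 * talent_cost + b2 * large


# Hypothetical coefficients, chosen only for illustration.
b0, b1, b2 = 1000.0, 2.0, -5000.0

# Gap between large and small venues at two different talent costs.
gaps = []
for cost in (20000, 50000):
    small = predicted_revenue(cost, 0, b0, b1, b2)
    large = predicted_revenue(cost, 1, b0, b1, b2)
    gaps.append(large - small)
print(gaps)
```

In this form of the model, the gap between the two venue types is the same at every talent cost and equals the indicator's coefficient.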
Example: More Than Two Categories
• Scenario: Predict average July high temperature using latitude, altitude, and if city sits on body of water (coast, river/lake, none)
• Question: How can we create indicator variables for a variable with more than two categories?
• Answer: Choose one category to be the _________________________ and create a __________________________________________________________________
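A categorical variable with several levels can be coded with a small helper that builds one 0/1 column for every category except the baseline. This is a minimal sketch: `make_indicators` is a hypothetical name, the sample values are made up, and using "river/lake" as the baseline is an assumption made here for illustration.

```python
def make_indicators(values, baseline):
    """Return one 0/1 column per non-baseline category (k - 1 columns in total)."""
    if baseline not in values:
        raise ValueError("baseline category does not appear in the data")
    categories = sorted(set(values) - {baseline})
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Made-up observations of the water variable.
water = ["coast", "river/lake", "none", "coast"]
columns = make_indicators(water, baseline="river/lake")
print(columns)
```

Observations in the baseline category are the ones with a 0 in every column, so three categories need only two indicator variables.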
Example: More Than Two Categories
• Scenario: Predict average July high temperature using latitude, altitude, and if city sits on body of water (coast, river/lake, none)
• Question: What is the regression model?
• Answer: _____________________________________________________________
   • $x_1$ = _________________
   • $x_2$ = _________________
   • $D_1$ = _______________________
   • $D_2$ = _______________________
• Question: What is the regression line?
• Answer: ________________________________________________________________
Example: More Than Two Categories
• Scenario: Predict average July high temperature using latitude, altitude, and if city sits on body of water (coast, river/lake, none)
• Question: How can we interpret the indicator coefficients?
• Answer:
   • Coast: For the same latitude and altitude, cities on the coast are expected to be ________________________ than those that sit on a river or lake.
   • Landlocked: For the same latitude and altitude, cities that are landlocked are expected to be ___________________________ than those that sit on a river or lake.