Multiple Linear Regression · 2019-01-06 · Using Excel Note : Excel does not create an overall residual plot of predicted vs. actual values for multiple regression. Example: Testing

• Linear Model and Regression Line
• Interpreting Coefficients
• Correlation Table
• Testing the Full Model
• Testing Individual Predictors
• Indicator Variables

Multiple Linear Regression

Lecture 20
Sections 17.1-17.4, 18.1

Linear Model and Regression Line

• Linear Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
  • $y$: Actual value of response variable from data
  • $x_1, x_2, \ldots, x_k$: Actual values of $k$ predictor variables from data
  • $\beta_0$: True y-intercept for the line that fits the population
  • $\beta_1, \beta_2, \ldots, \beta_k$: True slopes of the regression line that fits the population (parameters)
  • $\varepsilon$: Error between actual ($y$) and predicted ($\hat{y}$) values

• Regression Line: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
  • $\hat{y}$: Predicted value of response
  • $b_0$: Estimated y-intercept (statistic)
  • $b_1, b_2, \ldots, b_k$: Estimated slopes for the regression line that fits the sample

Note: $b_0, b_1, \ldots, b_k$ are the best estimates for the parameters $\beta_0, \beta_1, \ldots, \beta_k$.
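As a rough illustration of how the estimates $b_0, b_1, \ldots, b_k$ come out of a sample, the sketch below fits a regression line by least squares with NumPy. The data are synthetic, and the helper names `fit_regression_line` and `predict` are assumptions made for this example; they are not part of the lecture.

```python
import numpy as np

def fit_regression_line(X, y):
    """Return least-squares estimates [b0, b1, ..., bk] for y-hat = b0 + b1*x1 + ... + bk*xk."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept b0.
    design = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return b

def predict(b, x):
    """Predicted response for one observation x = (x1, ..., xk)."""
    return b[0] + float(np.dot(b[1:], x))

# Synthetic, noise-free data generated from y = 5 + 2*x1 - 3*x2 (illustrative only).
X = np.array([[1, 0], [0, 1], [2, 1], [3, 5], [4, 2], [6, 3]], dtype=float)
y = 5 + 2 * X[:, 0] - 3 * X[:, 1]

b = fit_regression_line(X, y)    # estimates of the intercept and two slopes
y_hat = predict(b, [2.0, 2.0])   # predicted response at x1 = 2, x2 = 2
```

Because the synthetic data contain no error term, the estimates recover the generating coefficients; with real data they would only approximate the population parameters.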

Example: Setting Up the Regression Model and Line

• Scenario: Use a random sample of MLB pitchers' ages and number of wins in the 2017 season to predict average fastball velocity

• Question: What is the regression model?

• Answer: ___________________________________________
  • $x_1$ = ________
  • $x_2$ = __________________________

• Question: What is the equation of the regression line?

• Answer: ___________________________________________

Example: Using the Multiple Regression Line

• Scenario: Use a random sample of MLB pitchers' ages and number of wins in the 2017 season to predict average fastball velocity. Justin Verlander was 34 years old and won 15 games in 2017.

• Question: What is the predicted average fastball velocity for Justin Verlander in 2017?

• Answer: $\hat{y}$ = _________________________________________

= _________________________________________

= _________________

Standard Deviation of the Residuals

• Standard Deviation of Residuals: measure of the size of a typical residual
  • In multiple regression, it is calculated as:

$$s_e = \sqrt{\frac{\sum e^2}{n - k - 1}}$$

where $n$ is the sample size, $k$ is the number of predictor variables in the multiple regression model, and $e = y - \hat{y}$.
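The formula above can be sketched in a few lines of Python. The function name `residual_std` and the numbers are made up for illustration; only the formula itself comes from the notes.

```python
import numpy as np

def residual_std(y, y_hat, k):
    """Standard deviation of residuals: sqrt(sum(e^2) / (n - k - 1))."""
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    n = e.size
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return float(np.sqrt(np.sum(e ** 2) / (n - k - 1)))

# Made-up actual and predicted values for a model with k = 2 predictors.
y_actual = [10.0, 12.0, 9.0, 15.0, 11.0]
y_pred = [11.0, 11.0, 10.0, 14.0, 11.0]
s_e = residual_std(y_actual, y_pred, k=2)  # residuals -1, 1, -1, 1, 0; n - k - 1 = 2
```

Dividing by $n - k - 1$ rather than $n$ accounts for the $k + 1$ coefficients estimated from the same data.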

Example: Standard Deviation of the Residuals

• Scenario: Justin Verlander's actual average velocity was 95.2 mph and his predicted average velocity was 92.013 mph.

• Question: How unusual is Verlander's observation?

• Answer: _____________________
  • Standard Deviation: __________
  • Residual: $e$ = __________________________________________
  • ________ standard deviations _________ the predicted value
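One way to judge how unusual an observation is would be to express its residual in units of the standard deviation of the residuals. The sketch below uses the actual and predicted velocities from the scenario; `HYPOTHETICAL_S_E` is a placeholder assumed for this example, not the value from the lecture's regression output.

```python
def residual_in_sd_units(actual, predicted, s_e):
    """Return (residual, residual / s_e)."""
    residual = actual - predicted
    return residual, residual / s_e

# Actual and predicted velocities come from the scenario above.
# HYPOTHETICAL_S_E is a placeholder, not the value from the lecture's output.
HYPOTHETICAL_S_E = 2.0
residual, sd_units = residual_in_sd_units(95.2, 92.013, HYPOTHETICAL_S_E)
direction = "above" if residual > 0 else "below"
```

A positive residual means the actual value sits above the prediction; the size of `sd_units` depends entirely on the real $s_e$, which has to be read from the regression output.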

Interpreting Coefficients

• If variables are uncorrelated (or weakly correlated), each slope coefficient gives the change in the predicted response for a one-unit increase in that predictor variable, after accounting for all other variables in the model.

• If the variables are more strongly correlated, then the issue of collinearity arises and the interpretation of the slope coefficients is not so straightforward.
  • Collinearity will be discussed in a future class

Example: Interpreting Coefficients

• Scenario: Use a random sample of MLB pitchers' ages and number of wins in the 2017 season to predict average fastball velocity

• Question: How should the slope coefficients be interpreted?

• Answer:
  • Age: After accounting for the ___________________, for every additional ______ ___________________, his predicted fastball velocity ____________________________.
  • Wins: After accounting for _____, for every additional ______________________ _________, his predicted fastball velocity ______________________________________.

Correlation Tables

• Correlation Table: a table displaying the correlation between each pair of variables
  • Diagonal entries will always be 1 because variables are perfectly correlated with themselves
  • Bottom half generally gets filled in with correlations
  • Top half is generally left blank because its correlations would be the same as those in the bottom half
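A correlation table with this layout can be produced as in the sketch below, which prints only the diagonal and bottom half. The function name `correlation_table` and the data are invented for illustration.

```python
import numpy as np

def correlation_table(columns):
    """Print the lower triangle of the correlation matrix and return the full matrix.

    `columns` maps a variable name to a sequence of values.
    """
    names = list(columns)
    data = np.array([columns[name] for name in names], dtype=float)
    r = np.corrcoef(data)  # each row of `data` is one variable
    width = max(len(name) for name in names) + 8
    print(" " * width + "".join(f"{name:>{width}}" for name in names))
    for i, name in enumerate(names):
        # Only columns 0..i are printed, so the top half stays blank.
        cells = "".join(f"{r[i, j]:>{width}.3f}" for j in range(i + 1))
        print(f"{name:<{width}}" + cells)
    return r

# Made-up values purely for illustration.
table = correlation_table({
    "y": [3.0, 5.0, 4.0, 8.0, 9.0],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0],
})
```

The returned matrix is symmetric with ones on the diagonal, which is why printing the bottom half loses no information.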

Example: Correlation Tables

• Scenario: Use a random sample of MLB pitchers' ages and number of wins in the 2017 season to predict average fastball velocity

• Question: What does the correlation table reveal about the relationship between the variables?

• Answer:
  • Predictor variables (age and wins) are ___________________
  • As a pitcher gets older, their average velocity tends to _____________
  • As a pitcher wins more games, their average velocity tends to ____________

              Avg. FB Vel.   Age     Wins
Avg. FB Vel.  1.00
Age           -0.461         1.00
Wins          0.267          0.000   1.00

Example: Effect of Correlation on Coefficients

• Scenario: Regress average fastball velocity on age and wins individually (simple) and simultaneously (multiple)

• Question: What happens to the slope coefficients when the second variable is included?

• Answer: _________________
  • Being _______________ means the inclusion of age has _____________ on the effect that number of wins has on average fastball velocity

              Avg. FB Vel.   Age     Wins
Avg. FB Vel.  1.00
Age           -0.461         1.00
Wins          0.267          0.000   1.00

Example: Effect of Correlation on Coefficients

• Scenario: Regress a person's height on the height of their mother and father individually (simple) and simultaneously (multiple)

• Question: What happens to the slope coefficients when the second variable is included?

• Answer: ____________________
  • Correlation between predictors is _________
  • Weak correlation means the inclusion of the father's height has a _________________ on the _________ for the mother's height

              Height   Height (Mom)   Height (Dad)
Height        1.00
Height (Mom)  0.437    1.00
Height (Dad)  0.331    0.193          1.00
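The effect of predictor correlation on slopes can be checked numerically. The sketch below uses a small synthetic design in which `x1` and `x2` have a correlation of exactly 0, then compares the slopes from two simple regressions with the slopes from the multiple regression. All names and numbers are assumptions for illustration, not the pitcher or height data.

```python
import numpy as np

def slopes(X, y):
    """Least-squares slopes (intercept dropped) for the predictors in the columns of X."""
    design = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return b[1:]

# Balanced synthetic design: x1 and x2 have correlation exactly 0.
x1 = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0])
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.5, -0.3, -0.2])
y = 90.0 - 1.5 * x1 + 0.8 * x2 + noise

r_x1_x2 = np.corrcoef(x1, x2)[0, 1]
simple_x1 = slopes(x1.reshape(-1, 1), y)[0]      # y regressed on x1 alone
simple_x2 = slopes(x2.reshape(-1, 1), y)[0]      # y regressed on x2 alone
multiple = slopes(np.column_stack([x1, x2]), y)  # y regressed on both
```

With uncorrelated predictors the simple and multiple slopes agree; as the correlation between predictors grows, the two sets of slopes would generally drift apart.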

Testing the Full Model: Hypotheses and Conditions

• Hypotheses:
  • $H_0$: $\beta_1 = \beta_2 = \cdots = \beta_k = 0$ → $y = \beta_0 + \varepsilon$
    • None of the predictors provides any meaningful information about the response
  • $H_A$: At least one $\beta_j \neq 0$ → $y = \beta_0 + \beta_j x_j + \varepsilon$ for at least one $j$
    • At least one predictor yields some useful information about the response

• Assumptions and Conditions:
  • Linearity Condition: Scatterplots of the response against each of the predictors are relatively straight
  • Randomization: Observations comprise a random sample
  • Equal Spread Condition: Standard deviation of residuals is constant across all predictor values
  • Nearly Normal: Normal probability plot is near a diagonal straight line

Example: Testing the Full Model

• Scenario: Use latitude, longitude, annual precipitation, and altitude to predict average July high temperature in 23 U.S. cities

• Question: Are any of the predictors significant?

• Linear Model: ________________________________________________________
  • $x_1$ = __________________
  • $x_2$ = __________________
  • $x_3$ = __________________
  • $x_4$ = __________________

• Hypotheses:
  • $H_0$: ______________________________________
  • $H_A$: ______________________________________

Example: Testing the Full Model

• Question: Are any of the predictors significant?

• Scatterplots of Temperature vs. Predictors:

• Conditions:
  • Linearity: No issues with ____________________________; _______________________ _________________ may each have an influential point, but they probably are __________________________ to impact the regression
  • Randomization: Cities were _________________________ from the U.S.

Example: Testing the Full Model

• Question: Are any of the predictors significant?

• Residual Plot and Normal Probability Plot:

[Figures: Residual Plot; Normal Probability Plot]

• Conditions:
  • Equal Spread: Residual plot looks ___________ with no ______________________
  • Nearly Normal: Normal probability plot does not deviate significantly from a _________________________

Types of Variation

• Explained Variation: differences in the responses due to the relationship between the predictors and response
  • Also known as sum of squares due to regression (SSR)

• Unexplained Variation: differences in the responses due to natural variability in the population
  • Also known as sum of squares due to error (SSE)

Full Model Test Statistic

• F-Distribution: continuous probability distribution that has the following properties:
  • Unimodal and right-skewed
  • Always non-negative
  • Two parameters for degrees of freedom
    • One for numerator and one for denominator
  • Used to compare the ratio of two sources of variability

• Test Statistic:

$$F_{k,\,n-k-1} = \frac{MS_{\text{Regression}}}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$$

where $k$ is the number of predictors and $n$ is the sample size. The numerator measures explained variation and the denominator measures unexplained variation.
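The test statistic can be computed directly from the two sums of squares. The sketch below is one possible implementation on synthetic data; the function name `full_model_f` and the numbers are assumptions for illustration and do not reproduce the temperature example.

```python
import numpy as np

def full_model_f(X, y):
    """Return (F, df_numerator, df_denominator, SSR, SSE) for the full-model test."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ b
    ssr = float(np.sum((y_hat - y.mean()) ** 2))  # explained variation
    sse = float(np.sum((y - y_hat) ** 2))         # unexplained variation
    f_stat = (ssr / k) / (sse / (n - k - 1))
    return f_stat, k, n - k - 1, ssr, sse

# Synthetic data: n = 8 observations, k = 2 predictors.
X = np.array([[1, 4], [2, 1], [3, 5], [4, 2], [5, 7], [6, 3], [7, 8], [8, 6]], dtype=float)
y = np.array([3.1, 4.9, 6.2, 8.8, 9.7, 12.4, 13.6, 16.1])
f_stat, df_num, df_den, ssr, sse = full_model_f(X, y)
```

A p-value would come from the upper tail of the F distribution with `df_num` and `df_den` degrees of freedom, for example via `scipy.stats.f.sf(f_stat, df_num, df_den)` if SciPy is available.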

Using Excel

Note: Excel does not create an overall residual plot of predicted vs. actual values for multiple regression.

Example: Testing the Full Model

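[Figure: Excel regression output with three labeled regions: Summary Statistics, ANOVA Output (Full Model), and Coefficient Output (Individual Predictors). The annotation computes $F = \frac{SSR/4}{SSE/18}$ from the ANOVA output and points to the P-Value; the numeric values are illegible in the source.]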

Example: Testing the Full Model

• Scenario: Use latitude, longitude, annual precipitation, and altitude to predict average July high temperature in 23 U.S. cities

• Question: Are any of the predictors significant?

• Mechanics:
  • Test Statistic: ____________
  • Degrees of Freedom: _________________
  • P-Value: __________

• Conclusion: With a p-value of _________, there is _____________________ suggesting that __________________________________________ in predicting the _____________________________________________

Testing Individual Predictors

• After testing the full model:
  • If there is evidence of a relationship between at least one individual predictor and the response, we need to figure out which one(s)
  • If there is no evidence of a relationship, no further testing is necessary

• To test individual predictors, run the following test on each:
  • Hypotheses: $H_0$: $\beta_j = 0$ vs. $H_A$: $\beta_j \neq 0$
  • Test Statistic: $t_{n-k-1} = \frac{b_j - \beta_j}{SE(b_j)}$, where $\beta_j = 0$ under $H_0$
    • Same form as the test statistic for simple linear regression, where $k = 1$ gives $n - 2$ degrees of freedom
  • Output will return a p-value that reflects how strong the evidence is for a relationship between the predictor and response
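The individual t statistics can be sketched as follows. The standard errors here use the usual least-squares result $SE(b_j) = s_e \sqrt{[(X^T X)^{-1}]_{jj}}$, which is not spelled out in the notes and is stated as an assumption of this example, along with the function name `coefficient_t_stats` and the synthetic data.

```python
import numpy as np

def coefficient_t_stats(X, y):
    """Return (b, se, t, df) for testing H0: beta_j = 0 on each coefficient."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    e = y - design @ b
    df = n - k - 1
    s_e_sq = np.sum(e ** 2) / df                   # squared std. dev. of residuals
    cov_b = s_e_sq * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov_b))                   # SE(b_j), intercept first
    t = b / se                                     # (b_j - 0) / SE(b_j)
    return b, se, t, df

# Synthetic data: n = 8 observations, k = 2 predictors.
X = np.array([[1, 4], [2, 1], [3, 5], [4, 2], [5, 7], [6, 3], [7, 8], [8, 6]], dtype=float)
y = np.array([3.1, 4.9, 6.2, 8.8, 9.7, 12.4, 13.6, 16.1])
b, se, t, df = coefficient_t_stats(X, y)
```

Each entry of `t` would be compared with a t distribution on `df` degrees of freedom; statistical software such as Excel reports the resulting p-values in its coefficient output.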

Example: Testing Individual Predictors

• Question: Which individual predictors are significant?

• Answer:

                 Latitude       Altitude       Longitude      Precipitation
Hypotheses       H_0: _______   H_0: _______   H_0: _______   H_0: _______
                 H_A: _______   H_A: _______   H_A: _______   H_A: _______
Test Statistic   ___________    ___________    ___________    ___________
P-Value          ___________    ___________    ___________    ___________
Significant?     ___________    ___________    ___________    ___________

Motivation: Indicator Variables

• Scenario: Comparing the amount required to recruit talent for concerts against the amount of revenue taken in. Two separate regressions could be used to describe the relationship.
  • Small Venues: $\hat{y}$ = [intercept and slope illegible in source]
  • Large Venues: $\hat{y}$ = [intercept and slope illegible in source]

• Question: What is the problem with using both talent cost and venue type in a single model?

• Answer: ________________________________
  • Regression is only appropriate for _________________ data

• Question: What is the solution?

• Answer: Create an _________________________ for venue type

Indicator Variable

• Indicator Variable: a variable that takes on values of 0 or 1 to indicate if an observation from a categorical variable falls into a given category
  • Also known as dummy variables
  • Commonly represented by $z$ to differentiate them from continuous predictors denoted by $x$
  • Choose a category for the indicator variable to represent
  • Any observations that fall into the category get coded as 1; all other observations get coded as 0
  • For a variable with $k$ categories, $k - 1$ indicator variables are needed

• Baseline Category: the category that is represented by zeros in every indicator variable

Creating Indicator Variables in Excel

• Choose a category to be the baseline

• Use the "IF" function in Excel to assign values to indicator variables

• When running the multiple regression, select only the numeric predictors and the indicator variables, not the text variable

=IF(D17="Large", 1, 0)

• If the venue is "Large", code the indicator variable as 1

• Otherwise, code the indicator variable as 0
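Outside Excel, the same coding can be sketched in Python. The helper `make_indicators` is a hypothetical name introduced for this example; it builds one 0/1 column for each category other than the baseline, so a variable with $k$ categories yields $k - 1$ indicators.

```python
def make_indicators(values, baseline):
    """Code a categorical variable as 0/1 indicator columns, one per non-baseline category."""
    categories = sorted(set(values))
    if baseline not in categories:
        raise ValueError(f"baseline {baseline!r} not found in data")
    return {
        category: [1 if v == category else 0 for v in values]
        for category in categories
        if category != baseline
    }

# Two categories -> one indicator (mirrors =IF(D17="Large", 1, 0)).
venue = ["Large", "Small", "Small", "Large"]
venue_indicators = make_indicators(venue, baseline="Small")

# Three categories -> two indicators; baseline observations are 0 in both columns.
water = ["coast", "river/lake", "none", "coast"]
water_indicators = make_indicators(water, baseline="river/lake")
```

Only the indicator columns, not the original text column, would then be passed to the regression.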

Example: Indicator Variables

• Scenario: Comparing the amount required to recruit talent for concerts against the amount of revenue taken in

• Question: What is the regression model?

• Answer: _______________________________________
  • $x_1$ = ______________________
  • $z_1$ = ___________________ (1 = __________ venue, 0 = __________ venue)

• Question: What is the equation of the regression line?

• Answer: ________________________________________________________________

Example: Indicator Variables

• Scenario: Comparing the amount required to recruit talent for concerts against the amount of revenue taken in

• Question: What is the predicted revenue for a small venue where the talent cost was $20,000?

• Answer:
  • Small venues are coded as ____ → Coefficient for venue type _______________
  • $\hat{y}$ = ________________________________________________

= ________________________________________________

= ____________________

Example: Indicator Variables

• Scenario: Comparing the amount required to recruit talent for concerts against the amount of revenue taken in

• Question: What is the predicted revenue for a large venue where the talent cost was $20,000?

• Answer:
  • Large venues are coded as ____ → Coefficient for venue type ______________
  • $\hat{y}$ = ________________________________________________

= ________________________________________________

= ____________________

Example: Indicator Variables

• Scenario: Comparing the amount required to recruit talent for concerts against the amount of revenue taken in

• Question: What is the interpretation of the coefficient for the indicator variable?

• Answer: For the same cost to ____________________, large venues are expected to make _____________________________ compared to small venues.

• Takeaway: An indicator variable by itself _________________________________________ by the ____________________________________________
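The role of an indicator variable in prediction can be illustrated with a short sketch. The coefficients `B0`, `B1`, and `B2` below are placeholders chosen only for illustration; they are not the fitted values from the lecture's Excel output.

```python
def predicted_revenue(talent_cost, is_large, b0, b1, b2):
    """y-hat = b0 + b1 * talent_cost + b2 * z, where z = 1 for a large venue and 0 for a small one."""
    z = 1 if is_large else 0
    return b0 + b1 * talent_cost + b2 * z

# Placeholder coefficients chosen only for illustration; not the lecture's fitted values.
B0, B1, B2 = 1000.0, 5.0, 30000.0

small = predicted_revenue(20000, is_large=False, b0=B0, b1=B1, b2=B2)
large = predicted_revenue(20000, is_large=True, b0=B0, b1=B1, b2=B2)
difference = large - small  # equals b2 at any talent cost
```

In a model of this form both venue types share the slope `b1`, so the predicted difference between them is the same at every talent cost.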

Example: More Than Two Categories

• Scenario: Predict average July high temperature using latitude, altitude, and whether the city sits on a body of water (coast, river/lake, none)

• Question: How can we create indicator variables for a variable with more than two categories?

• Answer: Choose one category to be the _________________________ and create a __________________________________________________________________

Example: More Than Two Categories

• Scenario: Predict average July high temperature using latitude, altitude, and whether the city sits on a body of water (coast, river/lake, none)

• Question: What is the regression model?

• Answer: _____________________________________________________________
  • $x_1$ = _________________
  • $x_2$ = _________________
  • $z_1$ = _______________________
  • $z_2$ = _______________________

• Question: What is the regression line?

• Answer: ________________________________________________________________

Example: More Than Two Categories

• Scenario: Predict average July high temperature using latitude, altitude, and whether the city sits on a body of water (coast, river/lake, none)

• Question: How can we interpret the indicator coefficients?

• Answer:
  • Coast: For the same latitude and altitude, cities on the coast are expected to be ________________________ than those that sit on a river or lake.
  • Landlocked: For the same latitude and altitude, cities that are landlocked are expected to be ___________________________ than those that sit on a river or lake.
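To see why each indicator coefficient is read as a difference from the baseline, the sketch below builds synthetic, noise-free temperatures from known offsets and recovers them by least squares. Altitude is omitted to keep the example short, and every number in `TRUE_B` is made up rather than taken from the lecture's data.

```python
import numpy as np

# Synthetic, noise-free temperatures built from known offsets (illustrative only).
latitude = np.array([30.0, 35.0, 40.0, 45.0, 32.0, 38.0, 42.0, 47.0])
water = ["coast", "river/lake", "none", "coast", "none", "river/lake", "coast", "none"]

z_coast = np.array([1.0 if w == "coast" else 0.0 for w in water])
z_none = np.array([1.0 if w == "none" else 0.0 for w in water])  # landlocked
# "river/lake" is the baseline: both indicators are 0 for those cities.

TRUE_B = np.array([120.0, -1.0, -4.0, 3.0])  # intercept, latitude, coast, none
design = np.column_stack([np.ones(len(latitude)), latitude, z_coast, z_none])
temperature = design @ TRUE_B

b, *_ = np.linalg.lstsq(design, temperature, rcond=None)
coast_vs_baseline = b[2]        # coast minus river/lake, at the same latitude
landlocked_vs_baseline = b[3]   # landlocked minus river/lake, at the same latitude
```

Choosing a different baseline would change the individual coefficients but not the predicted temperatures.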