This assignment involves linear regression. The data can be found on Sakai: go to Resources \(\rightarrow\) Datasets \(\rightarrow\) Data Analysis Assignments \(\rightarrow\) Assignment 1. Please type your solutions using R Markdown, LaTeX, or any word processor, but YOU MUST knit or convert the final output file to “.pdf”. Submissions should be made on Gradescope: go to Assignments \(\rightarrow\) Data Analysis Assignment 1.
DO NOT INCLUDE R CODE OR OUTPUT IN YOUR SOLUTIONS/REPORTS. All R code can be included in an appendix, and R outputs should be converted to nicely formatted tables. Feel free to use R tools such as kable (from the knitr package), xtable, stargazer, etc.
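As a rough illustration of turning raw R output into a formatted table, the sketch below fits a model to R's built-in mtcars data (not the assignment data) and passes the coefficient summary to kable. The model, the caption, and the per-column rounding are assumptions chosen only for the example.

```r
# Illustrative only: uses the built-in mtcars data, not the assignment data.
library(knitr)

# A simple example model (hypothetical choice of response and predictor).
fit <- lm(mpg ~ wt, data = mtcars)

# Combine estimates, standard errors, t values, and p-values
# with 95% confidence intervals into one matrix.
coef_table <- cbind(summary(fit)$coefficients, confint(fit))

# Render a formatted table: 2 decimal places everywhere,
# except 4 for the p-value column (column 4).
kable(coef_table,
      digits = c(2, 2, 2, 4, 2, 2),
      caption = "Coefficient estimates for the example model")
```

In an R Markdown report, a chunk like this would typically be run with `echo = FALSE` so that only the table appears in the body, with the code itself reproduced in the appendix.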
Also, you can round ALL numbers/estimates to 2 decimal places (4 decimal places at the most, to avoid exact zeros when possible).
Reminder: You are allowed and even encouraged to talk to each other about general concepts, or to the instructor/TAs. However, the write-ups, solutions, and code MUST be entirely your own work.
Questions 1 and 2 below were taken and adapted from Chapter 7 of Ramsey, F.L. and Schafer, D.W. (2013), “The Statistical Sleuth: A Course in Methods of Data Analysis” (3rd ed.).
Side Note: We will use textbook datasets on some of the data analysis assignments. This is intentional as a way to start with clean and small datasets. For team projects, we will focus a bit more on “messy” datasets.
RESPIRATORY RATES FOR CHILDREN. A high respiratory rate is a potential diagnostic indicator of respiratory infection in children. To judge whether a respiratory rate is truly “high,” however, a physician must have a clear picture of the distribution of “normal” respiratory rates. To this end, Italian researchers measured the respiratory rates of 618 children who were at most 3 years old.
The data for this question can be found in the file “Respiratory.csv” on Sakai.
THE DRAMATIC U.S. PRESIDENTIAL ELECTION OF 2000. The U.S. presidential election of November 7, 2000 was one of the closest in history. As returns were counted on election night, it became clear that the outcome in the state of Florida would determine the next president. At one point in the evening, television networks projected that the state was carried by the Democratic nominee, Al Gore, but a retraction of the projection followed a few hours later. Then, early in the morning of November 8, the networks projected that the Republican nominee, George W. Bush, had carried Florida and won the presidency. Gore called Bush to concede. While en route to his concession speech, though, the Florida count changed rapidly in his favor. The networks once again reversed their projection, and Gore called Bush to retract his concession. When the roughly 6 million Florida votes had been counted, Bush was shown to be leading by only 1,738, and the narrow margin triggered an automatic recount. The recount, completed in the evening of November 9, showed Bush’s lead to be less than 400.
Meanwhile, angry Democratic voters in Palm Beach County complained that a confusing “butterfly” ballot layout caused them to accidentally vote for the Reform Party candidate Pat Buchanan instead of Gore. The ballot, as illustrated below, listed presidential candidates on both a left-hand and a right-hand page.
Voters were to register their vote by punching the circle corresponding to their choice, from the column of circles between the pages. It was suspected that since Bush’s name was listed first on the left-hand page, Bush voters likely selected the first circle. Since Gore’s name was listed second on the left-hand side, many voters, who already knew who they wished to vote for, did not bother examining the right-hand side and consequently selected the second circle in the column, the one actually corresponding to Buchanan. Two pieces of evidence supported this claim: Buchanan had an unusually high percentage of the vote in that county, and an unusually large number of ballots (19,000) were discarded because voters had marked two circles (possibly by inadvertently voting for Buchanan and then trying to correct the mistake by voting for Gore).
The data for this question can be found in the file “Elections.csv” on Sakai.
AIRBNB LISTINGS FOR SEATTLE, WA. AirBnB is an online rental marketplace. The company itself is based in San Francisco, CA, and there are millions of listings in cities across the world. In this problem, you will focus only on data for AirBnB listings in Queen Anne, Seattle, WA. Specifically, you will try to understand how certain factors influence the price of a listing. The data we will use here is a very small subset of the overall available data. For more on the data, or if you are interested in using AirBnB data, see http://insideairbnb.com/get-the-data.html.
The data for this question can be found in the file “Listings_QueenAnne.txt” on Sakai.
Use host_is_superhost, host_identity_verified, room_type, accommodates, bathrooms and bedrooms as predictors. You should start by doing EDA, then model fitting, and model assessment. You should consider transformations if needed.
Code Book
Variable | Description |
---|---|
id | Unique identifier for listings |
host_is_superhost | Whether or not the host is a “superhost,” meaning they satisfy AirBnB’s criteria for high-quality listings, high response rate, and reliability |
host_identity_verified | Whether or not the host has verified their identity with AirBnB |
room_type | Entire home/apt, private room, or shared room |
accommodates | Number of people that the listing can accommodate |
bathrooms | Number of bathrooms in the listing |
bedrooms | Number of bedrooms in the listing |
price | Price of the listing for one night (Use this as the response variable) |
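The EDA, model fitting, and model assessment steps mentioned above could be organized roughly as in the following sketch. It runs on simulated stand-in data that only borrows the code book's variable names; the values, the TRUE/FALSE coding of the host variables, and the log transformation of price are all assumptions made for illustration, not properties of the actual file or a recommended final model.

```r
# Simulated stand-in data using the code book's variable names.
# The values are made up and are NOT the assignment data.
set.seed(1)
n <- 200
listings <- data.frame(
  host_is_superhost      = sample(c(TRUE, FALSE), n, replace = TRUE),
  host_identity_verified = sample(c(TRUE, FALSE), n, replace = TRUE),
  room_type    = factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                               n, replace = TRUE)),
  accommodates = sample(1:8, n, replace = TRUE),
  bathrooms    = sample(c(1, 1.5, 2), n, replace = TRUE),
  bedrooms     = sample(1:4, n, replace = TRUE)
)
# Simulated right-skewed price (assumption: skewness motivates trying a log scale).
listings$price <- exp(4 + 0.1 * listings$accommodates + rnorm(n, sd = 0.3))

# EDA: distribution of the response and its relationship with predictors.
summary(listings$price)
hist(listings$price, main = "Price per night", xlab = "price")
boxplot(price ~ room_type, data = listings)
plot(price ~ accommodates, data = listings)

# Model fitting: one candidate model with a log-transformed response.
fit_log <- lm(log(price) ~ host_is_superhost + host_identity_verified +
                room_type + accommodates + bathrooms + bedrooms,
              data = listings)
summary(fit_log)

# Model assessment: standard residual diagnostics (residuals vs fitted,
# normal Q-Q, scale-location, residuals vs leverage).
par(mfrow = c(2, 2))
plot(fit_log)
```

Whether a transformation is actually needed for the real listings should be judged from the EDA and the residual plots, not taken from this sketch.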
30 points: 10 points for each question