Ch01

Introduction.

Warning: Section 1.4 does not describe the method we will use for accessing R.

  • What are the steps used in a typical data science project?
  • Define what the authors mean by “tidy data”.
  • Transformations
    • There are three types of transformations listed. Give each type.
    • Briefly describe a data set, possibly imaginary, that is not from the book.
    • Using this data set, give an example of each type of transformation.
  • The book mentions both visualization and modeling. Which is more appropriate for the beginning of an investigation, and why?
  • How large a data set does the book consider “small”? How big would a “larger” data set be? Does the book give a clue about the point at which the authors start to consider a data set “Big Data”?
  • Name at least two other programming languages commonly used in data science.
  • What would the authors think about me teaching Python and Hadoop alongside R in this class? Explain their reasoning.
  • In a nutshell, what is the difference between exploration and hypothesis confirmation?

Getting Help

Section 1.6 has some brief information about how to figure out what is going on when something goes wrong. To start, here is some advice:

  • Make sure you can clearly state what the code you are typing is supposed to do. If you’re not sure what the result should look like, get clear on that first.
  • Make a small example instead of getting frustrated with a large one. If you have problems reading a hundred-line data file, does it work when you cut it down to two lines? The book says it beautifully:

Do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand, and the easier it is to fix.

  • Check every step. If one line has a bunch of commands on it, break them down into separate lines and look at the answer after each one.
  • Post your question for the class.
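
The advice above can be sketched in a few lines of code. This is only an illustration, written in Python rather than R, and the same idea carries over to any language. The file contents and the column names (claim_id, age, monthly_benefit) are made up for the example. The point is the shape of the work: cut the data down to two rows, then look at the result after every step instead of chaining everything into one line.

```python
import csv
import io

# Hypothetical stand-in for a large data file: only the header and two rows are kept.
small_sample = io.StringIO(
    "claim_id,age,monthly_benefit\n"
    "1,42,1500\n"
    "2,57,2100\n"
)

# Step 1: read the rows and look at them before doing anything else.
rows = list(csv.DictReader(small_sample))
print(rows)

# Step 2: convert one column to numbers and check the intermediate result.
benefits = [float(row["monthly_benefit"]) for row in rows]
print(benefits)

# Step 3: only now compute the summary you were after.
mean_benefit = sum(benefits) / len(benefits)
print(mean_benefit)
```

If step 1 already prints something unexpected, the problem is in reading the file, not in the arithmetic, and you know where to look.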

Application Question

The Society of Actuaries has a data set of 45 million records about group long-term disability insurance. Each record is an instance of a disability claim, and about 29 pieces of information are associated with each claim. Do you think this qualifies as big data? Explain why or why not. (The data format is in Appendix A of the SOA Data Request.)
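
One way to start on this question is a back-of-the-envelope size estimate. The sketch below, in Python, uses the record and field counts given above. The 8 bytes per field is an assumption, not something taken from the SOA documentation: it is what each value would take if stored as a 64-bit number. Text fields could take more and coded fields less, so treat the result as a rough order of magnitude and compare it with the sizes the book discusses.

```python
# Figures from the question above (the field count is approximate).
records = 45_000_000
fields_per_record = 29

# Assumption (not from the source): each field takes 8 bytes,
# as it would if every value were stored as a 64-bit number.
bytes_per_field = 8

total_values = records * fields_per_record
total_bytes = total_values * bytes_per_field
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_values:,} values, roughly {total_gb:.1f} GB under the 8-byte assumption")
```

Try changing bytes_per_field to see how sensitive the estimate is to that assumption.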

Review Questions

  • When someone talks about “wrangling” data, what do they mean?
Last modified August 18, 2023: 2022-2023 End State (7352e87)