Data science 2018

introduction

My goal here is to provide a structured and (moderately) scaffolded path for an engaged person to acquire skills, vocabulary, and concepts necessary to build and communicate data science projects. As a framework, we’ll step through specific statistical learning techniques.

Stretch goals for the highly-motivated learner would be to work towards
a) designing or leading such projects,
b) working at the cutting edge of data science applications, or
c) building a foundation for doing research in the field (for example, pursue a PhD).

I assume familiarity with linear algebra, calculus, and programming. You’ll need to know how to write and run scripts in R or Python. You should know what the following terms mean (and know the notation that describes them): dot products, eigenvalues and eigenvectors, partial derivatives (okay, those are kind of optional, but you’ll want to know what a derivative is), integrals, and computational complexity. These concepts aren’t always key to doing the work (with modern tools, a decent programmer or business analyst can do some pretty cool data science), but they are key to understanding what you are doing.
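As a quick self-check of those prerequisites, here is a minimal sketch in Python (NumPy assumed; an equivalent exercise in R would serve just as well). If each line makes sense to you, you’re in good shape:

```python
import numpy as np

# Illustrative self-check of the math prerequisites -- not course material,
# just a way to confirm the vocabulary is familiar.

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
v = np.array([1.0, 1.0])

# Dot product: sum of elementwise products.
assert np.dot(v, v) == 2.0

# Eigenvalues and eigenvectors: A @ x equals lam * x for each pair.
eigvals, eigvecs = np.linalg.eig(A)
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)

# Integrals: the numerical integral of f(x) = x on [0, 1] is 1/2.
xs = np.linspace(0.0, 1.0, 10001)
assert abs(np.trapz(xs, xs) - 0.5) < 1e-6
```

If any of these checks is mysterious, that’s a sign to brush up before the corresponding unit, not a reason to skip the course.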

weekly handouts

The journal entry guidelines describe, in general terms, how I’m assessing your write-ups. Also try to follow these principles for developing, documenting, and sharing computational work (items 1-4 are particularly relevant to the work for this course).

resources

  • DataCamp. We will make extensive use of DataCamp for getting up to speed on the tools. There are more than a hundred mini-courses at a variety of levels. It looks like they’ve done a good job organizing material into focused chunks, and they’ve got some instructors whose work I really respect. I’m pretty excited about the opportunity to try their courses out, and I would love to hear your feedback. Contact me directly if you’re formally enrolled in the class and either
    1. need to be re-invited to the group or
    2. want help developing a study-plan that meets your goals.

good data sources

for the course

I’m still considering a few others, such as

  • the US Dept of Transportation flight delay data,
  • a home ownership dataset (Trulia? Zillow? Federal data?), and
  • an image dataset (perhaps one at deeplearning.net?).

If you have a suggestion or an opinion, please let me know. Also, I said that I didn’t want to use Kaggle data because it had been cleaned to the point that much of its meaning was stripped away. After checking out a few of the publicly contributed datasets, I’d say they still look fairly clean, but they seem to have more substance than the often-sterile competition data sets.

others

blogs, news

I follow about 50 data science blogs and news sources. When one jumps out at me as particularly interesting to the class, I’ll try to add it to the list below. I don’t know that these are the best, but I do pay attention to their posts.

  • Kaggle’s blog: Mostly news, but also some decent, quick tutorials.
  • Revolutions: Highlights cool projects and tutorials, as well as data science news.
  • FlowingData: I think Nathan Yau does a great job of producing original analyses while keeping in touch with the cool and innovative dataviz going on.
  • R-bloggers: A blog aggregator and, overall, a mixed bag of fluff, highly technical posts, announcements of R packages, news, cool projects, and more… Most contributors are high-quality and/or interesting most of the time.
  • DataScience+: This is an exception to the “I pay attention to the posts” statement, because I just learned about it (thanks, Jason!). Pretty cool examples and tutorials.
