Core Data Scientist Skills

What are the relevant skills in the arsenal of a data scientist? With new technologies arriving every day, how do you pick and choose the essentials?

A few ideas related to this discussion:

  • Knowing SQL and using a database such as MySQL or PostgreSQL was great until the advent of NoSQL and non-relational databases. MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
  • Knowing a statistics tool such as R is enough for analysis, but to create applications you may need to add Java, Python, and the like to the list.
  • Data now comes as text, URLs, and multimedia, to name a few forms, each with its own paradigms for manipulation.
  • What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
  • OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning / data mining algorithms for company.

Thoughts?

+51
r
May 18 '10 at 19:13
11 answers

To quote from the introduction to Hadley's PhD thesis:

First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future.

Step 1 almost certainly involves data munging, and may involve database access or web scraping. Knowing the people who create the data is also useful. (I file that under "networking".)

Step 2 means visualization / charting skills.

Step 3 means statistics or modeling skills. Since this is a stupidly broad category, the ability to delegate to a modeler is also a useful skill.

The final step is mostly about soft skills such as introspection and management.

The question also mentioned software skills, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.

+22
May 19 '10 at 15:21

Just to throw out some ideas for others to expand on:

At some ridiculously high level of abstraction, all data work involves the following steps:

  • Data collection
  • Data storage / retrieval
  • Data manipulation / synthesis / modeling
  • Reporting results
  • Storytelling

A data scientist should have at least some skill in each of these areas. Depending on specialty, though, you may spend far more time in a narrow range of them.

+20
May 18 '10 at 21:15

JD's answer is great, and for a bit more depth on these ideas, read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:

  • Skill #1: Statistics (Studying)
  • Skill #2: Data Munging (Suffering)
  • Skill #3: Visualization (Storytelling)
+11
May 19 '10 at 10:49

Over at Dataists, this question is addressed in general terms with a nice Venn diagram:

venn diagram

+10
Oct. 15 '10 at 9:45

JD hit the nail on the head: storytelling. He did forget the OTHER important story, though: the story of why you used <insert fancy technique here>. Being able to answer that question is the most important skill you can develop.

The rest is just hammers. Don't get me wrong, something like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers to make something useful.

+5
May 19 '10 at 5:51

I think it's important to have command of a commercial database or two. In the financial world where I consult, I often see DB2 and Oracle on large hardware and SQL Server on distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytical tool.
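A minimal sketch of that last point, assuming Python as the analytical tool and an in-memory SQLite database standing in for a commercial one. The `trades` table, its columns, and the `load_prices` helper are made up for illustration:

```python
import sqlite3
import statistics


def load_prices(conn, symbol):
    """Pull one symbol's prices out of storage with a parameterized query."""
    rows = conn.execute(
        "SELECT price FROM trades WHERE symbol = ? ORDER BY id",
        (symbol,),
    ).fetchall()
    return [price for (price,) in rows]


# Hypothetical data: an in-memory database standing in for DB2, Oracle, or SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, price REAL)")
conn.executemany(
    "INSERT INTO trades (symbol, price) VALUES (?, ?)",
    [("ABC", 10.0), ("ABC", 12.0), ("XYZ", 99.0), ("ABC", 14.0)],
)
conn.commit()

# Extract from storage, then hand the values to the analysis step.
prices = load_prices(conn, "ABC")
mean_price = statistics.mean(prices)
print(prices, mean_price)
```

The filtering and ordering stay in SQL, and only the rows needed for the analysis cross into the analytical tool. Against a real server the connection call would differ, but the query pattern would likely look much the same.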

As for analytical tools, I believe R is becoming more and more important. I also find it very useful to know at least one other statistics package. It can be SAS or SPSS ... it really depends on the company or client you work for and what they expect.

Finally, you can have an incredible command of all these packages and still not be very valuable. It is extremely important to have a fair amount of subject-matter expertise in a specific field, and to be able to communicate both the issues surrounding your analysis and your findings to the relevant users and managers.

+4
May 18 '10 at 22:43

Matrix algebra is my top pick.

+4
May 19 '10 at 1:16
  • The ability to collaborate.

Great science, in almost any discipline, is rarely done by individuals these days.

+4
May 19 '10 at 12:48

There are several computer science topics that are useful to data scientists, many of which have already been mentioned: distributed computing, operating systems, and databases.

Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It is useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you need.
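As a rough illustration of this kind of back-of-the-envelope analysis, here is a minimal Python sketch. The dataset size, the 16 GiB of usable memory per node, and all function names are hypothetical, and the estimate ignores overhead such as replication and working copies:

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Space: memory needed to hold an n_rows x n_cols matrix of 64-bit floats."""
    return n_rows * n_cols * bytes_per_value


def nodes_needed(total_bytes, usable_bytes_per_node):
    """Smallest number of nodes whose combined memory holds total_bytes."""
    return -(-total_bytes // usable_bytes_per_node)  # ceiling division


def pairwise_comparisons(n):
    """Time: number of distinct pairs an all-pairs O(n^2) method must visit."""
    return n * (n - 1) // 2


GIB = 1024 ** 3

# Hypothetical dataset: 100 million rows, 50 numeric columns.
total = dense_matrix_bytes(100_000_000, 50)
print(f"{total / GIB:.1f} GiB in memory")
print(nodes_needed(total, 16 * GIB), "nodes at 16 GiB usable each")

# An all-pairs method on one million records visits roughly 5e11 pairs.
print(pairwise_comparisons(1_000_000), "pairs")
```

Even this crude arithmetic shows quickly whether a method fits on a laptop, needs a cluster, or needs a different algorithm altogether.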

+3
May 19 '10 at 15:16

Patience, both to get results out in a reasonable fashion and to then go back and change them for what was "really" required.

+2
May 19 '10 at 15:18

Study linear algebra through MIT OpenCourseWare 18.06 and supplement your study with the book Introduction to Linear Algebra. Linear algebra is one of the core skill sets in data analytics, in addition to the skills mentioned above.
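To illustrate why linear algebra sits so close to everyday analysis, here is a small Python sketch that fits a straight line by solving the normal equations (X'X) b = X'y directly. The `ols_fit` function and the sample points are made up for illustration, and a real analysis would normally rely on a numerical library rather than hand-rolled algebra:

```python
def ols_fit(xs, ys):
    """Fit y = b0 + b1 * x by solving the 2x2 normal equations (X'X) b = X'y."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy]; solve with Cramer's rule.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x values are all identical; the slope is not identified")
    b0 = (sy * sxx - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


# Made-up points lying exactly on y = 1 + 2x.
b0, b1 = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

The zero-determinant check is the two-column version of a rank-deficient design matrix, which is exactly the kind of failure a linear algebra course teaches you to recognize.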

0
Jan 13 '14 at 7:00


