How to calculate descriptive statistics on a set of vectors of different sizes

For a task, I have a set of vectors of sensor readings, each of a different length. I would like to calculate the same descriptive statistics for each of these vectors. My question is how to store them in R. Using c() simply concatenates everything into one vector. Using list() seems to break functions such as mean(). Is a data frame the right object?

What is the best practice for applying the same function to vectors of different sizes? And assuming the data is on an SQL server, how do I import it?

+4

4 answers

Vectors of different sizes should be combined into a list; a data.frame expects each column to have the same length.
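For illustration, a minimal sketch with two made-up sensor vectors (the names s1 and s2 are hypothetical); mean() on the list itself only returns NA with a warning, but sapply() applies it to each element:

    # Two hypothetical sensor-reading vectors of unequal length
    readings <- list(s1 = c(0.1, 0.5, 0.3),
                     s2 = c(1.2, 1.1))
    # mean(readings) returns NA with a warning; apply it per element instead
    sapply(readings, mean)
    #   s1   s2
    # 0.30 1.15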

Use lapply to retrieve your data. Then use lapply again to get descriptive statistics.

    x <- lapply(ids, sqlfunction)
    stats <- lapply(x, summary)

where sqlfunction is a function you have written to query your database. You can collapse the stats list into a data.frame by calling do.call(rbind, stats), or by using plyr:

    library(plyr)
    x <- llply(ids, sqlfunction)
    stats <- ldply(x, summary)
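To make that concrete, here is a self-contained sketch with random vectors standing in for the database results (the names and lengths are made up):

    # Made-up stand-ins for what sqlfunction() would return
    x <- list(a = rnorm(10), b = rnorm(25), cc = rnorm(7))
    stats <- lapply(x, summary)
    # One row of Min./1st Qu./Median/Mean/3rd Qu./Max. per vector
    do.call(rbind, stats)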
+7

Most plotting and regression functions expect data in "long" format: numeric values in one column and grouping or covariate values in others. The stack function will accept lists of irregular lengths, and tapply or aggregate will let functions operate within the irregular-length categories of a variable:

    dlist <- list(a=1:2, b=13:15, cc=5:1)
    s.dfrm <- stack(dlist)
    s.dfrm
       values ind
    1       1   a
    2       2   a
    3      13   b
    4      14   b
    5      15   b
    6       5  cc
    7       4  cc
    8       3  cc
    9       2  cc
    10      1  cc
    tapply(s.dfrm$values, s.dfrm$ind, mean)
       a    b   cc
     1.5 14.0  3.0
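For comparison, the aggregate() route mentioned above returns the same means as a data frame rather than a named vector:

    aggregate(values ~ ind, data = s.dfrm, FUN = mean)
    #   ind values
    # 1   a    1.5
    # 2   b   14.0
    # 3  cc    3.0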
+2

"What is the best practice of applying the same function to vectors if they have different sizes? Suppose the data is on an SQL server, how to import it?"

As Shane suggested, lapply is your choice here. Of course, you can use it with custom functions too, if you feel that summary does not provide enough information.

For the SQL part: there are packages for most relational DBMSs: RPostgreSQL, RMySQL, ROracle, and RODBC as a general-purpose option. If you are talking about MS SQL Server, I am not sure there is a dedicated package, but RODBC should do the job. I do not know whether you are tied to MS SQL Server, but if you want to run your own local database for R, MySQL (via RMySQL) is very easy to set up.

In general, with the database packages you use wrappers such as dbListTables or dbReadTable, which simply turn a table into an R data.frame.
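As a sketch of what that looks like with DBI and RMySQL (the database, credentials, and table name here are all assumptions for illustration):

    library(DBI)
    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "sensors",
                     user = "user", password = "secret")
    dbListTables(con)                                # see what is available
    readings <- dbReadTable(con, "sensor_readings")  # whole table as a data.frame
    dbDisconnect(con)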

If you really do not want to connect directly, you could use a .csv export of your database and read.table or read.csv, whichever suits you. But I suggest connecting to the database directly: it is not that hard, even if you have not done it before, and it is more fun.
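If you do go the export route, here is a sketch of that fallback (the file and column names are hypothetical), splitting the long table back into one vector per sensor:

    readings <- read.csv("sensor_export.csv")
    # One vector per sensor, of whatever length that sensor produced
    x <- split(readings$value, readings$sensor_id)
    stats <- lapply(x, summary)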

EDIT: I don't use MS SQL Server myself, but others have done this before; the mailing list archives should help.

+1

I would import this into a data frame, not a list. Each of your individual vectors is probably distinguished by one or more meaningful variables. Say you wanted to keep track of the time the data were collected and the location they were collected from. In a data frame, you would have one column holding all of the readings stacked together, with each original vector differentiated by the values in the time and location columns. To compute statistics on each individual vector, tapply() would then be the tool of choice.

 tapply(df$y, list(df$time, df$location), mean) 

Or perhaps aggregate () will be even better, depending on the number of variables and your future needs.
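For instance, the aggregate() equivalent of the tapply() call above, using the same hypothetical df, returns a data frame with one row per time/location combination:

    aggregate(y ~ time + location, data = df, FUN = mean)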

+1

Source: https://habr.com/ru/post/1315485/