Slashdot Mirror


Ask Slashdot: Statistical Analysis Packages For Libraries?

HolyLime writes "I'm a librarian in a small academic library. Increasingly the administration is asking our department to collect data on various aspects of our activities: classes taught, students helped, circulation, collection development, and so on. This is generating a large stream of data that is making it difficult, and time consuming, to qualitatively analyze. For anything complicated, I currently use Excel, or an analogous spreadsheet program. I am aware of statistical analysis programs, like SPSS or SAS. Can anyone give me recommendations for statistical analysis programs? I also place emphasis on anything that is open source and easy to implement since it will allow me to bypass the convoluted purchase approval process."

36 of 146 comments

  1. R or WEKA ... Wait, What Exactly Are You Doing? by eldavojohn · · Score: 5, Informative

    R is my personal favorite but you're going to have to get down and dirty with some high level programming (scripting). Check out the data import package (you would probably export your spreadsheets to flat txt files and import although the functionality is ever increasing). There's no user interface in this suggestion ... what there is, however, is a massive collection of packages for statistical analysis. Very well maintained, constantly updated and ever expanding.

    The other suggestion has a better GUI but is really heavyweight. WEKA has helped me time and time again perform advanced statistical calculations on data sets and it's in Java so runs on just about anything. Their interface occasionally improves too, they now have an explorer that I use to prep data and remove outliers/null data (don't worry, this isn't climate data). It's well documented.

    These (probably) require an intermediate data transformation step but are open source and extensively supported. Any examples of what you wanted to do? Simple stuff like standard deviation or complex stuff like principal component analysis (PCA)? I guess if it was just simple stuff, that'd be built into Excel, right? Maybe your problems are simple enough to just need a good macro writer to tackle? Whatever happens, good luck!

    --
    My work here is dung.
    1. Re:R or WEKA ... Wait, What Exactly Are You Doing? by logical_failure · · Score: 3, Informative

      Came here for the mention of R, and leave satisfied. R is an excellent choice.

      --
      Sock Puppets: damn_registrars=pudge_confirmer=jimmy_slimmy=raiigunner=cml4524=a_klavan=red4men=ronpaulisanidiot
    2. Re:R or WEKA ... Wait, What Exactly Are You Doing? by Anonymous Coward · · Score: 2, Insightful

      Ease of deployment does NOT leave out open source. Ease of deployment simply depends on how the package was programmed. Many closed-source packages are just as hard to deploy as open source ones. Saying you must buy software is useless without actually giving information on the general field of statistical analysis, like various options and comparisons between closed source and open.

      As for support: it's true open source generally has limited support, but often that's enough, since many closed-source vendors also provide limited or slow support unless you're one of their larger customers. Also, sometimes you can find companies to pay for support when dealing with open source software. Simply speaking, support varies GREATLY depending on the software in question, be it open or closed source. Note, however, he did NOT mention anything about his requirements in terms of software support, so whether the issue of support exists is up in the air.

      This is also a library, probably with limited funds. Organizations like these can take an insane amount of time before software is approved, not even factoring in that the software can easily be rejected. While it is true open source doesn't mean free, it might as well in the majority of cases. Most open source projects release the binaries for free (companies like cedega are in the extreme minority where they hide access to the source and charge for the binary). If an open source product can meet his requirements, why shouldn't he go with it? Both open and closed source take time to deploy, and the time spent trying to get closed software approved could instead be spent deploying open source software.

      *Note, I don't advocate open vs closed source. I'm just speaking for this specific case. If closed software fits and is less hassle, go for it. If open source fits and is easier to get in place (be it deployment or approval), go for that instead. It really depends on the requirements. Software packages are tools; use the one that fits best for your needs.

    3. Re:R or WEKA ... Wait, What Exactly Are You Doing? by Alan+Shutko · · Score: 4, Insightful

      In this case, you're not quite correct. The head of our statisticians wants to get R in here to supplement SAS (which we pay a lot of money for) because it is both good software, and also being used heavily for research. As he put it "If we started using R, we could start using new tools as soon as we read the paper, since most of the researchers are using R."

    4. Re:R or WEKA ... Wait, What Exactly Are You Doing? by Warlord88 · · Score: 3, Informative

      Why do you think R is not easy to implement? My company has been using SAS for a long time and we are finally making the change to R. As far as OP's requirements are concerned, I think R is way superior to SAS or SPSS because of its free, modular nature. It is clean, simple and suitable for a wide range of users. The commercial packages are filled with way too much business lingo garbage for me.

      I personally think commercial support is overrated. I can install software on my own. I know how to browse through manuals and other information to find what I need. For a package like R, I almost always get any questions answered within a few hours at most on online forums. So what exactly do I get from commercial support for my money? But, if OP needs commercial support, there is an enterprise edition of R by Revolution Analytics located here: http://www.revolutionanalytics.com/products/revolution-enterprise.php. Might be worth looking into.

      Bottom line: R all the way.

    5. Re:R or WEKA ... Wait, What Exactly Are You Doing? by Anonymous Coward · · Score: 2, Interesting

      He said he wants something that is easy to implement, and only reason he is going with open source is because then he doesn't have to ask for purchase approval. Which IMO is a really stupid reason and will hurt in the long run - it's insane to take worse software just because you don't want to ask your boss if it's okay to buy this one.

      Horse shit. I've seen projects die because they couldn't get software through the approval process. Better to try 10 apps that are free and run in userspace (so no need to get IT involved for an Administrator install) than to wait for management approvals, budget cycles, and IT support, and never get the project done. If I'd done that on the job, I'd have been fired for taking too long to do my work.

      I also resent the implication that "free" means "worse."

      Sorry to burst your bubble, but if you want good support and easy implementation, you have to look for normal paid-for solutions. Besides, open source is not a synonym for free. This is especially true with specialized software or something you want good support for. Open source just means you get the code as well, so you can implement your own additions (without use of plugins) or change it.

      I'm guessing you haven't used R. Not only is there a thorough user manual, but there are books from most major statistical and instructional groups on how to use R, AND the R-help mailing list has answered every R question I've ever had, AND there are local R user groups where you can get support similar to how LUGs work.

      But unless you get a product from a company that is spending money to develop it, you never get good software and good support. No one can make both because everything in this world costs money, and developers have to live too. Open source and free software model works well for the likes of Google and Firefox because the developments get paid by money made with advertising. Statistical analysis software, and other specialized software is a different matter.

      Please shut up. If your assumption were true, R would not exist. R exists, so you're just an asshat.

      My advice to the original poster: Use R if you have any familiarity with programming. Any higher level math/stat course OR experience with basic programming will let you get started in R. If you've been doing this all in Excel already, you're probably ready to hop into R. If you're still uncomfortable, I'm sure one of the people who value your academic library could help out.

    6. Re:R or WEKA ... Wait, What Exactly Are You Doing? by Anonymous Coward · · Score: 2, Insightful

      I was at a "Large Data Sets" conference where there was an awkward pissing contest over who had the biggest data set. Then it became a question of whether you had to time-adjust the size of a data set, since a megabyte data set used to be huge. Then someone pointed out that large is relative; what is a large data set for a stats student (or librarian) is trivial for people working on the largest of the day, but it is still large for that person. I don't know what the OP is analyzing, but for them, this is large AND it fits in Excel. (And, since an Excel sheet expanded from 2^24 to 2^34 cells, it now can hold a fairly large amount).

      TL;DR: "Large" is a matter of perspective, so don't think Excel makes it a small data set.

    7. Re:R or WEKA ... Wait, What Exactly Are You Doing? by kiwigrant · · Score: 2, Informative

      Try SOFA (http://www.sofastatistics.com/) alongside R. SOFA (Statistics Open For All) focuses on making some of the most important statistical tests easy to use and understand. It also has attractive charting and report tables. There are also videos, on-line documentation, and direct support from the developer. Disclosure #1 - I am the lead developer of SOFA. #2 I already posted accidentally as AC

    8. Re:R or WEKA ... Wait, What Exactly Are You Doing? by demonbug · · Score: 2

      I second R, and would also suggest adding in R Commander. Adds a fairly usable GUI simplifying lots of common tasks, while maintaining the flexibility of R.

  2. This may be a bad idea by bluefoxlucid · · Score: 2

    I find that libraries carry a lot of common information and not so much uncommon information. This sort of muckery seems to encourage concentration of information into a smaller and smaller realm, constantly sorting out first the never-used, then the minimally-used, to maximize volume of return but minimize the use of the library as a haven for obscure and long-forgotten knowledge. Effectively, like burning some books while not burning other books--removes knowledge.

    As with all things, there must be balance. A library where you don't increase holding of more useful texts is less immediately useful; although if you removed all the most used texts, you would have an interesting outcome... the obscure and oft-overlooked need retention, too.

    1. Re:This may be a bad idea by Galaga88 · · Score: 2

      Libraries don't necessarily enjoy removing materials from the collection, but the two main reasons to do so are to make sure we have current/accurate materials and make room in our always limited shelf space. (The first is of presumably higher importance in an academic library.)

      Unless libraries can get an unlimited budget for expansion of their physical space or off-site archives, weeding materials will be a necessary evil.

  3. SAGE by MetalliQaZ · · Score: 2

    Sage (formerly SAGE?) is an open source mathematical package that includes statistical functions. I wanted to add that to the usual mentions of R, etc.

    However, are you sure this is what you want? It sounds to me like your real problem is that you have too much data to store. If you're currently using Excel to process your data, and it has been working except that you are running out of space, perhaps what you really need is a database, like Access. If you want OSS, you can probably try LibreOffice, or engage a local student to design a web based system based on MySQL.

    --
    "Here Lies Philip J. Fry, named for his uncle, to carry on his spirit"
  4. A good database? by Anonymous Coward · · Score: 2, Interesting

    Hear me out. We deal with about 3 million data-producing elements and track in real-time to near-real-time. We ingest everything into MySQL (via macros, scripts, tools, etc.) and normalize the data on the way in. For analysis we simply query. Those queries may have their outcome displayed in a simple report generator, or (more often than not) via HTML5 Canvas graphs/charts, Cacti graphs, etc. What we're doing doesn't lend itself well to a SAS type solution. If you could use SAS for what you're doing, this probably wouldn't work for you.
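
    As a rough sketch of the ingest-then-query pattern described above, here is a small version that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table name, column names, and sample rows are all made up for illustration; a real setup would load exported files and would likely look different.

```python
import sqlite3

# Hypothetical activity records: (day, kind of activity, count).
events = [
    ("2011-09-03", "circulation", 41),
    ("2011-09-17", "circulation", 38),
    ("2011-09-17", "classes", 2),
    ("2011-10-01", "circulation", 52),
    ("2011-10-08", "classes", 3),
]


def load_and_summarize(rows):
    """Load rows into an in-memory database and total them per month and kind."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE activity (day TEXT, kind TEXT, count INTEGER)")
    conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)
    # The first seven characters of an ISO date are the year and month.
    query = """
        SELECT substr(day, 1, 7) AS month, kind, SUM(count) AS total
        FROM activity
        GROUP BY substr(day, 1, 7), kind
        ORDER BY month, kind
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


summary = load_and_summarize(events)
for month, kind, total in summary:
    print(month, kind, total)
```

    Once the data is in a table, each new report is just another query, which is the appeal of this approach over a growing pile of spreadsheets.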

  5. PSPP by Geste · · Score: 5, Informative

    Look at the free SPSS work-alike PSPP. http://www.gnu.org/software/pspp/ Sounds like R might be a bit much for your needs.

  6. PowerPivot maybe by AaronLS · · Score: 2

    Depending on the type of "analysis" you might be better off with something like PowerPivot. There's a lot that you can probably glean from your data without doing sophisticated statistics, but instead using PowerPivot to slice/dice/summarize/chart your data in different ways. It is easiest to use if you structure your data in a data warehouse/star schema fashion.

  7. What output do they want and what answer? by vlm · · Score: 3, Informative

    Blue skying the toolset is not gonna work. Find out what output they want, then figure out what tools can generate that output.
    If the most important thing is inserting pretty graphs into newsletters, that's one thing.
    If the most important thing is hard core data warehousing analysis (for a library?) that's another thing.

    The other thing is what answer do they want? They're just looking for data to back up an unpopular decision or glorify themselves demonstrating their amazing management talents. So figure out what that is (by asking them?) and help them get the data they want. Don't give them a graph of declining circulation if they're trying to emphasize their brilliant leadership. Don't give them a graph of increasing student help, if they're trying to justify downsizing.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
  8. Stick with Excel by syousef · · Score: 2, Insightful

    Seriously, stick with Excel. You and anyone who comes after you would need to learn whatever statistical package you introduce. That is either overkill for the kind of data you're collecting and analysing, or it's a full time job requiring specialist knowledge for which they should be hiring someone else.

    Excel has a few bugs but for the most part it's very capable. Ensure you run the service packs and can install the addons that come with it (analysis pack). Get them to send you on advanced short courses for Excel and Statistics. If there isn't that kind of commitment there's no room for any statistical package.

    Almost all Ask Slashdot stories that are work related can be answered the same way - bad idea: you're already out of your depth, and if you can't be bothered to google for the information, the project is doomed.

    --
    These posts express my own personal views, not those of my employer
  9. R and Python (Rpy2) by mpetch · · Score: 3, Interesting

    I have grown accustomed to doing statistical analysis with Python and R together, via rpy2: http://rpy.sourceforge.net/rpy2

  10. Go Ahead and List Them Then by eldavojohn · · Score: 4, Interesting

    I also place emphasis on anything that is open source and easy to implement since it will allow me to bypass the convoluted purchase approval process.

    Sorry to burst your bubble, but if you want good support and easy implementation, you have to look for normal paid-for solutions. Besides, open source is not a synonym for free. This is especially true with specialized software or something you want good support for. Open source just means you get the code as well, so you can implement your own additions (without use of plugins) or change it.

    Your point may be valid. But what would really help your validity is mentioning some proprietary products that beat R and WEKA at their own game. Sure, I've used Matlab and it can't be beat in some respects and is heavily supported. But to suggest that just because it effortlessly interfaces with Excel spreadsheets when the person could get by with a simple export in Excel to run their R script on the resulting files? Not worth the cash, in my opinion. I don't go out and buy every piece of software to evaluate it, though. I'm aware of Matlab and Mathematica and have used them quite a bit ... but I still prefer R and WEKA. So, CmdrPony, go ahead and list all the proprietary point-and-click-omg-it-just-works software for our friend here. We're all waiting.

    But unless you get a product from a company that is spending money to develop it, you never get good software and good support.

    Say, friendo, have you ever heard of Linux? Eclipse? Audacity? PostgreSQL? VLC?

    No one can make both because everything in this world costs money, and developers have to live too. Open source and free software model works well for the likes of Google and Firefox because the developments get paid by money made with advertising. Statistical analysis software, and other specialized software is a different matter.

    Can you tell me what advertising model is employed to funnel money through Firefox into Google? I mean, Google makes a competing product called Chrome -- the rendering engines are even different! What in the world are you free basing?

    --
    My work here is dung.
    1. Re:Go Ahead and List Them Then by DeadDecoy · · Score: 2

      Stata is another option and it isn't too expensive. I find it more usable than R with regard to the basic tests. And it somewhat supports copy-paste to and from Excel.

    2. Re:Go Ahead and List Them Then by peter+in+mn · · Score: 4, Informative

      One major advantage of R is that it's the standard teaching package for undergraduate statistics. That means that the stats department (or math department, if the school is too small to have a separate stats dept) will have people who can show you how to do stuff. That is, support is available, locally, for free. Also, there are teaching texts that start simple and build up to as complicated as you want. A saved R script is a reasonable way to automate the report preparation process. You can collect data in Excel, dump it to tab-delimited text, read it into R and generate a pile of pretty graphs over and over again every month. But writing the script requires a fair amount of study, and being able to talk to someone who uses it a lot will make you much happier.
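
      To give a feel for the export-then-script loop described above, here is a minimal sketch. It is written in Python rather than R purely for illustration, it only totals the numbers rather than drawing graphs, and the column names (month, desk, questions) and figures are hypothetical. The inline string stands in for a file saved from Excel as tab-delimited text.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a tab-delimited export from a spreadsheet (hypothetical columns).
EXPORT = (
    "month\tdesk\tquestions\n"
    "2011-09\treference\t120\n"
    "2011-09\tcirculation\t310\n"
    "2011-10\treference\t145\n"
    "2011-10\tcirculation\t290\n"
)


def monthly_totals(handle):
    """Sum the 'questions' column for each month in a tab-delimited file."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[row["month"]] += int(row["questions"])
    return dict(totals)


totals = monthly_totals(io.StringIO(EXPORT))
for month in sorted(totals):
    print(f"{month}: {totals[month]} questions")
```

      With a real export you would pass an opened file instead of the StringIO object, and rerun the same script every month.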

  11. Blog and Book for SAS to R by eldavojohn · · Score: 2

    Anyone with decent recommendations, aside from R's own website, where to do a quickstart when you're a SAS geek?

    This blog explains some of the stuff you do in R and as he does it, he compares it to SAS.

    Example:

    Unlike SAS, which has DATA and PROC steps, R has data structures (vectors, matrices, arrays, dataframes) that you can operate on through functions that perform statistical analyses and create graphs. In this way, R is similar to PROC IML.

    And here's an entire book on the topic (although it may be difficult to find)!

    --
    My work here is dung.
  12. Maybe a slightly different tool by LWATCDR · · Score: 3, Interesting

    It almost seems like you are not doing statistics as much as creating reports from data.
    Maybe you should be using a database instead of a spreadsheet or a statistics program.
    The Uber geek way would be to set up a LAMP server and create a web-based system.
    The more convenient way would be something like Access.
    You can then use Excel to manipulate the data as needed or the database program.

    In the end, if you know Excel you may want to stick with it. I see people use Excel for databases all the time. Drives me a bit nuts, but sometimes whatever works is just fine.

    --
    See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    1. Re:Maybe a slightly different tool by Anonymous Coward · · Score: 2, Informative

      Agreed. Access is a sh*tty database but you seem to be saying that volume is your problem not functionality. However if you've got an Excel license you've probably got an Access license already and Access will allow you to re-use a lot of what you've put together in Excel while handling the volume of data better.

      Unfortunately I also agree with the other posters, if you're after more relevant advice you really need to give a bit more background on:
        - your skill set (Excel user/VBA hacker/Stats major/Hardcore programmer)
        - what do you mean by 'statistical analysis'? This is too broad a description
        - the data you're using (volumes, sources, complexity)

      Another option if volume is your only problem is to not use all the data. Take a random sample and work from that - this is common practice even for people/orgs with high end stats packages.
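
      A minimal sketch of the sampling idea, assuming the data is just a long list of numbers. The loan lengths here are synthetic and the function name is made up; the fixed seed is only there so the sample is reproducible.

```python
import random
import statistics


def sample_mean(values, k, seed=0):
    """Estimate the mean from a simple random sample of k values."""
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    sample = rng.sample(values, k)  # sampling without replacement
    return statistics.mean(sample)


# Synthetic data: 10,000 loan lengths between 1 and 29 days.
loan_days = [(i * 7) % 29 + 1 for i in range(10_000)]

full = statistics.mean(loan_days)
estimate = sample_mean(loan_days, 500)
print(f"full data mean: {full:.2f}, sample estimate: {estimate:.2f}")
```

      The estimate from 500 rows will not match the full mean exactly, but it is usually close enough for the kind of summary reporting discussed here.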

  13. Do NOT stick with Excel by Anonymous Coward · · Score: 5, Informative

    Excel and other spreadsheets suck at stats:

    * Burns, P. (2005). Spreadsheet Addiction.
    * Cryer, J. (2001). Problems with using Microsoft Excel for Statistics. PDF
    * Pottel, H. (n.d.). Statistical flaws in Excel. PDF
    * Practical Stats (n.d.), Is Microsoft Excel an Adequate Statistics Package?
    * Heiser, D. (2008). Errors, faults and fixes for Excel statistical functions and routines

    For a more comprehensive and technical discussion, see the papers by Yu (2008); Yalta (2008); and McCullough & Heiser in Computational Statistics and Data Analysis 52(10).

  14. A suggestion... by esme · · Score: 2

    I suggest you post your question to the code4lib mailing list. It's going to get you much more informed and practical advice. You might even find some people who already have a good workflow who will share their tools.

    -Esme

  15. R works with both PostgreSQL and MySQL by G3ckoG33k · · Score: 2
  16. What is your Integrated Library System? by Anonymous Coward · · Score: 2, Insightful

    What is your ILS? Depending on what it is, you may already have access to just about all of what you need there along with Excel. Atriuum from Booksys has wonderful features like you are asking about, record tracking, and it exports to Excel very well. Voyager from Ex Libris had wonderful integration with Access and my boss could pull out some amazing statistics with it.

    If you don't have an ILS then seriously look at Atriuum as they are great for the smaller libraries.

    lordjim AT gmail DOT com

  17. Two tools I made for this... by njvack · · Score: 2

    OK, this is a horribly shameless self-plug, but hey, it's directly relevant. I started two projects aimed at tracking reference statistics: Libstats, which is PHP-based and open-source. I'm also one of the founders of Gimlet, which is hosted and closed-source, but provides a similar workflow.

    If you're looking to spend some time delving in code, Libstats is looking for maintainers -- I'm no longer working in libraries, so it's largely orphaned.

  18. I find that ... by PPH · · Score: 2

    ... rand() serves most of my statistical needs.

    --
    Have gnu, will travel.
  19. Try the JMP demo by jollespm · · Score: 2

    I use and like JMP from SAS. They offer a free 30 day demo and I think it does a good job at data visualization and statistical modeling, or as they call it, discovery. It will interface with SAS, R, Excel along with various database packages for additional capability that may not exist in the core product. I found it pretty easy to pick up with a fairly active user base to help get started.

  20. R with RKWard by binarstu · · Score: 4, Informative

    I will echo the support for the open-source statistics package R. R is incredibly powerful, and in the natural sciences it is fast becoming the standard statistics software.

    I will also echo the sentiment that, by itself, R is fairly low-level and typically requires at least some simple programming to get what you want.

    However, there is a very nice graphical front end for R called RKWard (http://rkward.sourceforge.net/). With RKWard, importing and exporting data, running basic analyses on it (descriptive statistics, linear regression, t-tests, etc.), and producing basic graphs is very straightforward and does not require detailed knowledge of the R language. Plus, RKWard is also a nice development environment for writing R code, so if you want to take your project further, you can easily do so. So, I'd recommend giving RKWard + R a look.

  21. Find out the real need and focus on that by fredrikv · · Score: 2

    It seems to me that all you need is descriptive statistics (change from last month, mean, min, max, etc and probably graphing). Using a general spreadsheet application like Excel or Calc will do the job just fine. Remember that Excel is designed to support business calculations and what you are asked to provide is exactly that! Using a dedicated statistics software for this task (in your environment) is a waste of resources. Full stop.

    However, the solution may not be straight-forward to solve in Excel or any other program. In my experience there are two main reasons:

    1. The request for data is unclear.
    Why do they "increasingly want data on various aspects of our activities"? It could be that the data you have provided so far has not provided support to decisions. Are the questions they really want answered possible to support with the data you can provide? Meet up with the actual decision makers or at least someone who knows what the statistics are actually used for and ask them WHY they need it. Is it used to support resourcing? Is it used to describe changes? Not even a university administration creates statistics for no reason. Most likely, what they really want to know is a handful of numbers like "change from last month", "overall sum", "hours spent on teaching vs information searches".

    Do this with an open mind. You will probably learn that many of the imperfections you see in the details are less important to them. When you know their true needs, suggest a package of data, graphs, free-text report or whatever is suitable. If some parts are easy to provide, be clear about that. If something is more difficult to produce, tell them that it is possible but time-consuming and costly. Get their buy-in before you spend time on producing the output.

    2. The raw data is not optimally formatted for the calculations
    First of all, if raw data quality can be improved, do that first. Update forms used for feedback, ask for output in a specific format etc. Then arrange the data and calculations in Excel to make it flexible and easy to read and troubleshoot. The trick is to structure your data and calculations in Excel in a way that is easy to follow visually and logically. In my experience it is very useful to use different tabs for data entry, data analysis and presentation.

    It seems from your examples that your input will come from a variety of sources, both manually entered and output from other systems. To get it into Excel, create separate source data tabs where you can enter or paste your raw data. For each source data tab, create a "clean up and calculate" tab where you rearrange source data and make most of the calculations. If raw data is very far from optimal or calculations are complex you may want to use several tabs or even several workbooks for this. Then create presentation tabs where you present the results from calculations in a useful format.

    I'm convinced you are suffering from both these problems. Attack them in numeric order and you are well on your way. And by all means, sign up for a course in advanced Excel that is suitable for your application. Best of luck!
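
    For reference, the descriptive statistics mentioned at the top of this comment (change from last month, mean, min, max, overall sum) amount to very little computation, whatever tool does it. Here is a small sketch in Python; the monthly figures and the function name are invented for illustration.

```python
import statistics

# Hypothetical monthly counts of students helped.
monthly = {"2011-07": 80, "2011-08": 95, "2011-09": 140, "2011-10": 126}


def describe(series):
    """Return basic descriptive statistics for a month -> count mapping."""
    months = sorted(series)  # ISO-style month keys sort chronologically
    values = [series[m] for m in months]
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "total": sum(values),
        "change_from_last_month": values[-1] - values[-2],
    }


report = describe(monthly)
for name, value in report.items():
    print(f"{name}: {value}")
```

    The same handful of numbers can be produced with AVERAGE, MIN, MAX and SUM in a spreadsheet, which is the point being made above.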

  22. R, Octave, Matlab by Virtucon · · Score: 2

    I've used them all and in terms of engineering and academia, MATLAB seems to be where most theoretical prototyping is done. The license costs for academic/student use are reasonable but it's about $2K for a commercial single seat license. Octave is the MATLAB open source alternative and for most basic functions it does well however it doesn't have the extension packages available that MATLAB does.

    My favorite and one I use all the time is "R" because it does have great open source community support and there's not a lot it can't do.

    --
    Harrison's Postulate - "For every action there is an equal and opposite criticism"
  23. Rstudio by rmcd · · Score: 2

    If you do go with R, be sure to check out Rstudio (rstudio.org), which is a very nice front-end for R.

    In response to the posters who tell you that R is low quality because it's open source, I can tell you that's nonsense. I have Stata, Matlab, and R on my machine, and access to SAS on a research server. There are times to use each, but all else equal I use R. It's not trivial to learn, but it's a powerful high-quality piece of software, widely used in the statistics community. Whether it's appropriate for your use depends on you and the task. But it's great software.