Slashdot Mirror


Why Jupyter is Data Scientists' Computational Notebook of Choice (nature.com)

Jeffrey M. Perkel, writing for Nature: Perched atop the Cerro Pachon ridge in the Chilean Andes is a building site that will eventually become the Large Synoptic Survey Telescope (LSST). When it comes online in 2022, the telescope will generate terabytes of data each night as it surveys the southern skies automatically. And to crunch those data, astronomers will use a familiar and increasingly popular tool: the Jupyter notebook. Jupyter is a free, open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. Computational notebooks have been around for decades, but Jupyter in particular has exploded in popularity over the past couple of years. This rapid uptake has been aided by an enthusiastic community of user-developers and a redesigned architecture that allows the notebook to speak dozens of programming languages -- a fact reflected in its name, which was inspired, according to co-founder Fernando Perez, by the programming languages Julia (Ju), Python (Py) and R.

[...] For data scientists, Jupyter has emerged as a de facto standard, says Lorena Barba, a mechanical and aeronautical engineer at George Washington University in Washington DC. Mario Juric, an astronomer at the University of Washington in Seattle who coordinates the LSST's data-management team, says: "I've never seen any migration this fast. It's just amazing." Computational notebooks are essentially laboratory notebooks for scientific computing. Instead of pasting, say, DNA gels alongside lab protocols, researchers embed code, data and text to document their computational methods. The result, says Jupyter co-creator Brian Granger at California Polytechnic State University in San Luis Obispo, is a "computational narrative" -- a document that allows researchers to supplement their code and data with analysis, hypotheses and conjecture. For data scientists, that format can drive exploration.

58 comments

  1. Wasn't this recently discussed? by Anonymous Coward · · Score: 5, Interesting
    1. Re:Wasn't this recently discussed? by Anonymous Coward · · Score: 0

      At Slashdot, you get what you pay for.

      At least it isn't another fanciful article with a scaremonger headline about "climate change" and humans being the worst thing ever.

  2. They're oblivious to who owns their code by Anonymous Coward · · Score: 1

    Especially when that code provides a competitive advantage in grant proposals.

  3. Huge Notebook fan. by 0100010001010011 · · Score: 4, Interesting

    Anecdotal, but I do 90% of my python 'development' in Jupyter Notebooks.

    For work I can make a nice notebook and have it generate a PDF for archiving. It'll output to LaTeX, html, .py and a number of other formats.
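A minimal sketch of that export step, assuming the `jupyter nbconvert` command-line tool is installed (PDF export also needs a working LaTeX toolchain). The helper names `export_command` and `export`, and the file name `analysis.ipynb`, are made up for illustration; the `--to` values are standard nbconvert format names.

```python
import subprocess

# Output formats mentioned above, mapped to nbconvert's --to values.
FORMATS = {"pdf": "pdf", "latex": "latex", "html": "html", "py": "script"}


def export_command(notebook, fmt):
    """Build the nbconvert command line for one export format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return ["jupyter", "nbconvert", "--to", FORMATS[fmt], notebook]


def export(notebook, fmt):
    """Run the export; raises CalledProcessError if nbconvert fails."""
    subprocess.run(export_command(notebook, fmt), check=True)
```

Calling `export("analysis.ipynb", "html")` would then run `jupyter nbconvert --to html analysis.ipynb`.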

    Now you can include multiple languages in the same notebook including R and Matlab, both popular in their own niches of use.

    1. Re:Huge Notebook fan. by Lab+Rat+Jason · · Score: 5, Interesting

      For the past year, I've begun using Jupyter and although I like it, there are some features that really bother me, and worry me when it comes time to create reproducible science.

      1) Jupyter doesn't integrate automatically with any kind of source control software, and in the circles I run in, it is largely ignored. Data scientists act like they've never heard of source control, and what makes it worse, my local university is pumping out student after student where they introduce them to data science with Jupyter, but never bring up the topic of coding standards and recoverability.

      2) Jupyter allows you to execute cells out of order. While this definitely helps speed up development (when you make a mistake, and just want to fix the relevant line and continue, rather than re-loading your entire data set), it presents a unique risk when someone thinks they've discovered something amazing, only to be unable to reproduce it after a restart, or when sharing the notebook with someone else. This can happen when race conditions exist, or when code makes changes to the database, and your out of order execution causes spooky behavior.

      3) Jupyter doesn't encourage enterprise deployment. Too often I see experimental data science done well, but due to the nature of rapid development, nothing is modular, nothing is object oriented, and so if the solution was a one off answer, everything is great, but if the solution is to be made into proper enterprise ready code, the entire notebook must be transcribed into truly disciplined code. (as an aside, this process is massively difficult because data scientists often don't understand the principles of object oriented programming, and the programmer doesn't understand the principles specific to the data science objective the code was written to solve.)

      I expect to use Jupyter a lot more frequently in the coming years, but I fear it will feel like a huge step back in terms of the things that computer scientists have solved, that data scientists are ignoring.

      --
      Which has more power: the hammer, or the anvil?
    2. Re:Huge Notebook fan. by TechyImmigrant · · Score: 2

      I wrote a book in latex, with lots of python code generating data and gnuplot and matplotlib generating pictures. This worked well (try doing a 426 page mathy book in word) and it was all text files, so source control and offsite backup via git worked well.

      Would you recommend Jupyter for that kind of thing? Would the output always look like a paper, or could you make it work for technical book writing to eliminate some of the scripting and hand integration done with raw latex?

      My experiments were not encouraging, but it's entirely possible I was just clueless when trying.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    3. Re:Huge Notebook fan. by 0100010001010011 · · Score: 2

      It looks exactly like it would on paper, but if you plan on printing it I don't know if it handles the page break stuff. It does make a great PDF.

      I know that you can integrate LaTeX templates.

      The nice part about it is you don't have to deal with 99% of LaTeX and can just focus on writing the equations.

      There's a free online way to try it: https://jupyter.org/try

    4. Re:Huge Notebook fan. by Anonymous Coward · · Score: 0

      Windows and Mac only? No thanks.

    5. Re:Huge Notebook fan. by 0100010001010011 · · Score: 1
      1. I've had no problem putting notebooks in version control. There's even a diff/merge tool available now to operate on Notebooks: https://nbdime.readthedocs.io/...
      2. That can be an issue, but in reality you can enforce it by having a git-commit hook that executes the notebooks. I just make it habit to regularly reset the kernel and run all.
      3. I don't know why you think that. I always use Jupyter for my exploratory and early "enterprise" development. I'll start out with some unorganized code and then refine the notebook until I have a class developed, with test cases. Then I'll convert it to a .py. You can develop OOP in Notebooks.

      The other issues you list, that's a training issue. I have the same issues with engineers. Most are mechanical or electrical, and while they are subject matter experts their code leaves a lot to be desired.
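For point 2, a cheaper static check than actually executing the notebook in a hook is to inspect the saved `execution_count` of each code cell. This is only a sketch of that different technique: `executed_in_order` is a hypothetical helper name, and it assumes the nbformat v4 JSON layout, where each code cell stores an `execution_count` that is `null` if the cell was never run. Empty code cells are skipped on the assumption that they are not sent to the kernel.

```python
import json


def _source_text(cell):
    """Cell source may be stored as one string or as a list of lines."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def executed_in_order(notebook_json):
    """Return True if non-empty code cells were run top to bottom as 1, 2, 3, ...

    Cells that were never run (execution_count is null) fail the check,
    because their saved output cannot come from a clean run.
    """
    nb = json.loads(notebook_json)
    counts = [
        cell.get("execution_count")
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code" and _source_text(cell).strip()
    ]
    return counts == list(range(1, len(counts) + 1))
```

A commit hook could read each staged `.ipynb` file, pass its text to this function, and reject the commit when it returns False. It does not prove the results are right, only that the saved state looks like a fresh "restart and run all".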

    6. Re:Huge Notebook fan. by Hognoxious · · Score: 2

      Most are mechanical or electrical, and while they are subject matter experts their code leaves a lot to be desired.

      A bit like programmers, then?

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    7. Re:Huge Notebook fan. by 0100010001010011 · · Score: 1

      They don't have the excuse that they weren't shown a better way.

    8. Re:Huge Notebook fan. by Lab+Rat+Jason · · Score: 2

      I think you misunderstood a few things. On my first point I said there is no INTEGRATION with source control. In Visual Studio, you can commit and check out directly in your IDE. In Eclipse, you can commit and check out directly in your IDE. In Pycharm you can commit and check out directly in your IDE. In Jupyter you can not (at least I'm not aware of how).

      On your second comment, there are many ways to test that the code compiles and runs correctly, but that will not guarantee that it is right. I, like you, tend to frequently reset the kernel to ensure everything is good, but my concern isn't for me so much as it is for a group of scientists, who are scientists first, and data scientists second, and in some cases data scientists first, but computer scientists second. I'd love to see a build server and unit test capability in Jupyter, but I know I'm reaching on that one... asking data scientists to write using object oriented principles and unit tests is akin to asking them to write poetry in Greek.

      On the third point, it is again, an area that I don't have an issue with so much as it is an area that a group of data scientists has an issue with. If I want to promote rapid deployment of data science packages to the organization, I need an easier way to move code from spitball POC to enterprise-ready, load tested, environmentally robust code. I can't begin to tell you how many times I've seen people try to deploy code to a server, only to enter into dependency hell because project A wants a package that is incompatible with project B. Now I've got to circle back and containerize all these piddly applications, and once that's done, I don't have a way to push the containers back to maintainers (because they're scientists dammit! They don't have time to learn about containers!). There aren't enough hours left in my life (and I'm not THAT old) to do all of this myself. I need as many people as possible to up their game doing data science to make any real progress. That means taking content matter experts, and training them with data science skills. Because that's way easier than taking someone with data science skills and teaching them the subject matter. That means I need a development environment that is easy enough for them, and also provides enterprise software development capability.

      Anyway, I guess all I'm saying is that I pray for the day that this ecosystem is as mature as the ones I'm coming from because a lot of the problems that have long been solved in computer science are now being repeated. Those who fail to learn from history are doomed to repeat it.

      --
      Which has more power: the hammer, or the anvil?
    9. Re:Huge Notebook fan. by Anonymous Coward · · Score: 0

      > (as an aside, this process is massively difficult because data scientists often don't understand the principles of object oriented programming, and the programmer doesn't understand the principles specific to the data science objective the code was written to solve.)

      In my experience, data scientists don't understand the concept of FUNCTIONAL programming either. Just getting them to write in terms of functions rather than copying and pasting the same code that does the same thing a million times HAS NOT WORKED IN SEVEN YEARS OF ARGUING IT.
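A small illustration of the difference, using made-up data and a hypothetical helper name `zscore`: the copy-paste style repeats the same logic for every variable, while the function version states the computation once and can be tested on its own.

```python
from statistics import mean, pstdev

temperature = [20.0, 22.0, 24.0]
pressure = [1.0, 3.0, 5.0]

# Copy-paste style: the same logic repeated for each variable.
t_mean, t_sd = mean(temperature), pstdev(temperature)
temperature_z = [(x - t_mean) / t_sd for x in temperature]
p_mean, p_sd = mean(pressure), pstdev(pressure)
pressure_z = [(x - p_mean) / p_sd for x in pressure]


def zscore(values):
    """Standardize values to zero mean and unit (population) std dev."""
    m, sd = mean(values), pstdev(values)
    return [(x - m) / sd for x in values]


# Function style: one definition, reused for every variable.
temperature_z2 = zscore(temperature)
pressure_z2 = zscore(pressure)
```

Both styles produce the same numbers here; the point is that a fix or a change to the standardization only has to be made in one place.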

    10. Re:Huge Notebook fan. by 0100010001010011 · · Score: 2

      You can execute shell commands from within a notebook with !

      !git commit -am "Commit message"

    11. Re:Huge Notebook fan. by Rutulian · · Score: 1

      3) Jupyter doesn't encourage enterprise deployment. Too often I see experimental data science done well, but due to the nature of rapid development, nothing is modular, nothing is object oriented,

      I'm not aware of anything like this that works generically for Jupyter notebooks, but I've been using OpenCPU to provide this functionality for R. RStudio is kind of like Jupyter, but designed more specifically for R, and it has templates for turning R scripts into packages. So you start with R Notebook, modularize into part Notebook / part R script with embedded functions, the Notebook part can be bootstrapped into a basic UI, then wrap up everything in an R package and deploy with OpenCPU. It's not quite seamless, but can probably be made reasonably easy with some unifying glue code to be used across packages.

      I agree with you because I'm in the same basic situation as you: surrounded by subject matter experts who need to learn to do some data science. Getting the tools in their hands to help them along the way is a persistent need. If you don't, they just go back to using Excel (or whatever) and that is truly a nightmare to manage.

    12. Re:Huge Notebook fan. by Anonymous Coward · · Score: 0

      Binder (https://mybinder.org/) is an attempt to go solve some of these problems - encouraging code control (use git) and controlling the rat's nest of python dependencies (build a controlled docker image). The current implementation is quite slow, but it's heading the right direction.

    13. Re:Huge Notebook fan. by Aighearach · · Score: 1

      LOL you almost got me, but since I never believe anything anybody says I looked it up and found out you're wrong.

      YOU'RE ON THE INTERNET, AND YOU'RE WRONG! (That's worse than genocide, BTW)

    14. Re:Huge Notebook fan. by Anonymous Coward · · Score: 0

      huh? it runs on Linux too ... it's on pip

  4. Literate Programming by Anonymous Coward · · Score: 2, Informative

    Knuth tried to teach us to do this decades ago, but nobody listened.

    1. Re:Literate Programming by dfghjk · · Score: 1

      Utterly unrelated with only the slightest of superficial similarity.

      Furthermore, plenty of people "listened" and tools exist today not unlike those Knuth advocated.

      At least you spelled Knuth right.

    2. Re:Literate Programming by Aighearach · · Score: 1

      Knuth tried to teach us to do this decades ago, but nobody listened.

      If that was true, we wouldn't be talking about this implementation!

      It took 30 years after Maxwell published his Equations before people realized he was right, because it implied relativity, which would contradict Newton.

      A few decades of pause in the face of a new idea is nothing to be concerned about.

      That said, IMO it is only useful for casual programming, mostly because the words used in English to describe what literate programming is supposed to be don't really prepare people for the mess of mixed source and documentation that it actually makes up.

  5. Because Jupyter aligned with Mars? by jfdavis668 · · Score: 3, Funny

    Then we will analyze what guides the planets and understand what steers the stars.

    1. Re: Because Jupyter aligned with Mars? by Anonymous Coward · · Score: 0

      Mah toadstool cock is aligned with Uranus.

    2. Re:Because Jupyter aligned with Mars? by mcswell · · Score: 1

      Didn't we do that back in the 1960s? At least I think so; they say that if you remember the 60s, you weren't there.

  6. A rare sort of development in the software world.. by Junta · · Score: 3, Interesting

    Jupyter is something that is relatively unique, useful in its field, and *not* crammed down the throats of people for whom it isn't really relevant.

    I applaud the way that project is executed, adopted, and evangelized as being on point and solidly executed...

    --
    XML is like violence. If it doesn't solve the problem, use more.
  7. Terabytes/day is not much for astronomy by Anonymous Coward · · Score: 2, Interesting

    In the early nineties, my experiments used to generate 5 terabytes of data a day (limited by the ability to store and post-process; the telescopes were capable of generating much more data). In fact, in the late 70s and early 80s, it was common to generate terabytes of data a day in radio astronomy VLBI experiments. These were stored on regular video cassettes (7 GB/tape). A single experiment would use anywhere from 5 to 25 recorders simultaneously, 24 hours a day.

  8. You're Royal Blood by Anonymous Coward · · Score: 0

    Was never meant to operate this SAN.

  9. Couldn't be STUPYDR by Anonymous Coward · · Score: 0

    Stupydr for stupider. Trying to be hip is square. Get with the times.

    1. Re: Couldn't be STUPYDR by Type44Q · · Score: 1

      Replacing an "i" with a "y" is like replacing a "c" with a "k" - very whitetrash.

    2. Re: Couldn't be STUPYDR by Anonymous Coward · · Score: 0

      Konqueror is awesome--I don't care if it's "whitetrash".

    3. Re: Couldn't be STUPYDR by Anonymous Coward · · Score: 0

      Just like "Gaysecks" ...

    4. Re: Couldn't be STUPYDR by Anonymous Coward · · Score: 0

      It's not. The name is a combination of JUlia, PYThon and R.

    5. Re: Couldn't be STUPYDR by epine · · Score: 2

      I get it. You print out the suspect word with your 3D printer, and then you use your sterling hand to trace around the printed artifact for bumps, hollows, ridges, descenders, and cisterns—which you simultaneously compare and contrast to the irregular outlines of your Mario Kart Kamikazi Kukmumbr as braced by your other fleshy mitt.

      Phrenology, with benefits.

    6. Re:Couldn't be STUPYDR by Anonymous Coward · · Score: 0

      I'm sorry, but as a sysadmin setting up and maintaining jupyterhub instances, it's been pretty handy to be able to google jupyter and not have Jupiter related stuff show up. More projects should follow this example. Apache Zeppelin? "Here's a photo of the Hindenburg" Thanks Google.

    7. Re: Couldn't be STUPYDR by Anonymous Coward · · Score: 0

      +1 valid reason.

    8. Re: Couldn't be STUPYDR by mcswell · · Score: 1

      Agreed, why qan't they replase 'c' with 's' or 'q'.

  10. Re:A rare sort of development in the software worl by Anonymous Coward · · Score: 0

    Jupyter is something that is relatively unique, useful in its field, and *not* crammed down the throats of people for whom it isn't really relevant.

    Indeed, this struck me from the article:

    An attendee on a course taught by Perez even created a component to display 3D brain-imaging data. "This is a completely [neuroscience] domain-specific tool, obviously - the Jupyter team has no business writing these things. But we provide the right standards, and then that community in 24 hours can come back and write one," he says.

    Basically, we're in no way qualified to make stuff to display brain imaging, but we'll teach you how you can.

    I applaud the way that project is executed, adopted, and evangelized as being on point and solidly executed...

    Suddenly I'm very intrigued. I've been bashing away with Excel making representations of two years of performance data on our servers, and now I'm wondering how much easier this would have been with Jupyter.

    Sounds quite interesting.

  11. How long? by Anonymous Coward · · Score: 0

    How long until Microsoft/amazon/google/facebook/autodesk/stratasys buy up this Jupyter? Probably as soon as I download it...

  12. For a different perspective by Anonymous Coward · · Score: 0

    My favorite counterpoint to the benefits of notebooks.

  13. Re:A rare sort of development in the software worl by Anonymous Coward · · Score: 1

    If you come from your mindset of replacing Excel, yes it is.

    However, if you come from the mindset of using it as the production code for the backend model for your startup, then no.

    I can see how it works great for prototyping, experimenting, and one-off type work especially when all the users of the tool will be reasonably familiar with it.

    But I wouldn't use it to run production code behind a website. Maybe you'll build your prototype version of the model in there, and then port it to a production framework.

  14. Literate programming by jma05 · · Score: 3, Interesting

    For decades, we talked about Knuth's literate programming. Jupyter is finally an open source tool that made it usable for everyone.
    There is no better way to explain the use of a library than making a Jupyter notebook available.

    Most of my Python use lately is for one-off analytics with heavy libraries. Jupyter suits this workflow very well.
    IPython already has decent hooks for IDEs (PyCharm, Spyder), but I hope this gets even better.

  15. Re:A rare sort of development in the software worl by Anonymous Coward · · Score: 0

    Nowhere in TFA was it suggested that you build your website or start-up company on Jupyter. Just that it's intuitive, easy to use, and powerful, and has a lot of community support and development. Basically the right tool for the job. But no one says it is the best tool for every job.

    It's also not going to run Skynet, so you can cross that off your list.

  16. It's Matlab in python clothing by goombah99 · · Score: 3, Interesting

    I love Jupyter notebooks. But it's matlab. Well a broken inferior matlab. I do like python syntax better than matlab but that's just a sugar.

    The upcoming JupyterLab is a slavish copy of the matlab ide.

    It reminds me of how Linux desktop managers were always copying the last generation of windows.

    I'm not complaining! I use mint and it owes a lot to windows too.

    Mint however is actually superior to windows now.

    But look at something like staroffice libre office. Ow... the pain. It's like a bad ms office 5, except you can only use it if you have thumbtacks in your shoes. They copied everything that was bad just so it was the same.

    Jupyter is really nice and I use it in preference to matlab because it's so portable and I can use other python packages. But unless you used matlab you may not realize it's just a fast follower of ideas already tested out by matlab.

    --
    Some drink at the fountain of knowledge. Others just gargle.
    1. Re:It's Matlab in python clothing by HuguesT · · Score: 1

      Not quite.

      - I'm not aware that you can typeset whole documents, including mathematics, in the Matlab UI. See the nbviewer website.
      - I'm not aware that you can do slideshow presentations in Matlab, using something as simple as markdown.
      - Plots and graphics are not embeddable in the Matlab command line.

      On the other hand you can use Matlab as a GUI builder, which you cannot do as easily with Jupyter. See dashboards in Jupyter.

    2. Re:It's Matlab in python clothing by Anonymous Coward · · Score: 0

      The upcoming JupyterLab is a slavish copy of the matlab ide.

      Numpy was a close copy of matlab in the past (it has diverged somewhat in the meantime). The interface in Jupyter Lab looks like just a generic IDE though, far closer to a blend of Mathematica and MathCAD than Matlab.

      Unless I am missing something as a daily user of Jupyter lab and former Matlab user, how is tabbed environment with a file and kernel listing on the side a clone of Matlab as opposed to something more generic? How is the editable cell format any way similar to Matlab's command line environment?

    3. Re:It's Matlab in python clothing by Anonymous Coward · · Score: 0

      you may not be aware of it, but you can do all those things.

    4. Re:It's Matlab in python clothing by Anonymous Coward · · Score: 0

      Slideshow presentations are the beanie baby of python. Just a fad.

      Why? Well I can give a Powerpoint presentation I made 25 years ago and it still works. You really think anything you create in python notebooks presentation mode will even work a year from now? Fat chance. It's a cute trick and I've used it. But it's just a one trick pony.

    5. Re:It's Matlab in python clothing by Anonymous Coward · · Score: 0

      I love Jupyter notebooks. But it's matlab. Well a broken inferior matlab. I do like python syntax better than matlab but that's just a sugar.

      That is almost correct.

      s/Matlab/Mathematica/p

      Now it's correct.

    6. Re:It's Matlab in python clothing by Anonymous Coward · · Score: 0

      I love Jupyter notebooks. But it's matlab.

      Nah, nothing is as shit as MatLab. They only recently retrofitted a time-series concept into it and it shows - it's bloody terrible. Not only that but everything is expensive base license + toolbox license for even a remotely usable system. Anything licensed is going to be a roadblock to participation and what is really desired - reproducible research. MatLab is a dog. Its niche is in the fluid dynamics and aeronautics fields. For general science it is a busted flush.

  17. It's totally related by Anonymous Coward · · Score: 0

    OP is right. This is just literate programming.

    There's literally nothing novel about it; it's just the right place and right time (and a slew of Slashvertisements don't hurt either).

  18. it is? by iggymanz · · Score: 1

    Sagemath has more libraries (and can deal with jupyter notebooks too)

  19. OMG BUZZWORDS! by Anonymous Coward · · Score: 0

    BLAST PROCESSING HERE WE COME

  20. Re:A rare sort of development in the software worl by Anonymous Coward · · Score: 1

    Having said all that... the order of execution isn't really fixed. Unlike Excel where making any change normally recomputes everything affected, making a change in a notebook doesn't recompute anything until that 'cell' is executed.

    This means the 'execution state' doesn't necessarily match what's in the notebook (it reflects the state of the cells based on the order they were executed in, which isn't guaranteed to be top-to-bottom but rather whatever order you triggered them in.)

    So, if you're a programmer and used to running code in a debugger, you're fine. If you're not.. well, yeah. Potential for confusion. Except programmers normally only go so far with this and then go "oh my program's state no longer matches reality, time to re-launch the process". Newbies won't know about that.
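A toy model of that mismatch, not how Jupyter itself is implemented: `run_cells` is a hypothetical helper that uses Python's built-in `exec` on strings to stand in for a kernel that keeps one shared namespace between cell runs. The three "cells" are made up for illustration.

```python
def run_cells(cells, order):
    """Execute cell sources in the given order against one shared namespace,
    the way a kernel keeps state between cell runs."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace


cells = [
    "x = 1",           # cell 0
    "x = x * 10",      # cell 1
    "result = x + 1",  # cell 2
]

# A clean top-to-bottom run, as after "restart kernel and run all".
clean = run_cells(cells, [0, 1, 2])["result"]

# An interactive session where cell 1 was run twice before cell 2.
interactive = run_cells(cells, [0, 1, 1, 2])["result"]

print(clean, interactive)  # prints "11 101"
```

The notebook text is identical in both cases; only the execution history differs, which is why the saved result may not survive a restart.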

  21. software pirates! by Anonymous Coward · · Score: 0

    Software pirates use R.

    R! R! R!

  22. Re:rare sort of development in the software world by mcswell · · Score: 1

    "But no one says it is the best tool for every job." I wish someone knowledgeable would explain which jobs Jupyter is good for, and which it isn't. I see claims here by one poster, and then a reply below that post saying they're all wrong... Doesn't generate trust, at least not for me.

    For the record, I do a lot of Python programming (including object oriented, which one poster said you couldn't do in J, and the very next poster said you could; case in point). Also XML, LaTeX, and finite state transducers, with Literate Programming thrown in (special purpose XML, the Python code is not done in LP). And all under version control. But it's not clear to me what, if anything, I'd be gaining by putting all?/some? of this in Jupyter.

  23. broken link says it all by Anonymous Coward · · Score: 0

    in the original article
    https://www.nature.com/articles/d41586-018-07196-1

    there is a link to a live jupyter demo
    link is broken

    isn't that modern fadish software ? broken by the time you hear about it ?

  24. It's alright, but it's no Org mode by Anonymous Coward · · Score: 0

    Emacs uber alles

  25. What if data scientists can't really code ? by Anonymous Coward · · Score: 0

    Maybe they just don't know how to write functions and shit and they just copy/paste some hacked code that just works for their cases.