Slashdot Mirror


Book Review: Scaling Apache Solr

First time accepted submitter sobczakt writes: We live in a world flooded by data and information, and we all realize that if we can't find what we're looking for (e.g. a specific document), there's no benefit from all these data stores. When your data sets become enormous or your systems need to process thousands of messages a second, you need an environment that is efficient, tunable and ready for scaling. We all need well-designed search technology. A few days ago, a book called Scaling Apache Solr landed on my desk. The author, Hrishikesh Vijay Karambelkar, has written an extremely useful guide to one of the most popular open-source search platforms, Apache Solr. Solr is a full-text, standalone, Java search engine based on Lucene, another successful Apache project. For people working with Solr, like myself, this book should be on their Christmas shopping list. It's one of the best on this subject. Read below for the rest of sobczakt's review.

Scaling Apache Solr
author: Hrishikesh Vijay Karambelkar
pages: 215
publisher: Packt
rating: 9/10
reviewer: sobczakt
ISBN: 978-1783981748
summary: Get an introduction to the basics of Apache Solr in a step-by-step manner with lots of examples

Karambelkar is an enterprise architect with a long history in both commercial products and open source technology. As he says, he currently spends most of his time solving problems for the software industry and developing the next generation of products.

The book is divided into 10 chapters. The first three are an introduction to Apache Solr and cover its architecture, features, configuration and setup. Chapter One contains many practical use cases for Apache Solr to help beginners understand the topic.

Chapter Four is very interesting and describes common patterns for enterprise search solutions. These patterns focus on data processing/integration and how to meet the requirements of users (interface, relevancy, general experience).

The rest of the book mainly addresses the central topic: distributing search queries and scaling and optimizing a system. The book discusses the core Apache Solr concepts such as replication, fault tolerance and sharding, and illustrates them with helpful examples. The book precisely explains SolrCloud, a bundle of built-in distributed capabilities available since version 4.0.
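
To make the sharding idea concrete, here is a minimal Python sketch. It is not taken from the book and it is not SolrCloud's actual routing code; it only assumes that documents are assigned to shards by hashing their id, and that a query is fanned out to every shard before the partial results are merged. The names `shard_for`, `route` and `search_all` are hypothetical.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index using a stable hash (illustrative only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def route(doc_ids, num_shards):
    """Group document ids by the shard they would be indexed on."""
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


def search_all(shards, predicate):
    """Fan a query out to every shard and merge the partial results."""
    hits = []
    for shard_docs in shards.values():
        hits.extend(d for d in shard_docs if predicate(d))
    return sorted(hits)


# Hypothetical usage: four documents spread over three shards.
shards = route(["doc-1", "doc-2", "doc-3", "doc-4"], 3)
matches = search_all(shards, lambda d: d.endswith("2"))
```

Replication would add copies of each shard on other nodes; the sketch leaves that out to keep the routing and merge steps visible.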

Chapter 8, dedicated to optimization, drew my attention. It is full of useful tips on JVM parameters, data structures and caching layers.

Scaling Apache Solr covers both basic and advanced subjects. The information is well organized, clear and concise. Many of the examples and cases in this book are accessible to beginners. I was pleasantly surprised by the chapter describing integration possibilities. There's some great information about using Solr with Cassandra, the MapReduce paradigm or R (a programming language for computational statistics), although I would have preferred this subject to be covered in more detail. The book has two more advantages: first, it discusses designing an enterprise search system in general terms, and second, it can be treated as an introduction to large-volume data processing.

I should emphasize that many sections related to defining a schema, importing data, running SolrCloud or searching in near real time (NRT) are not just raw documentation; they also include the author's well-judged advice and comments.

Unfortunately, I felt some of the more advanced topics were not described in enough detail, for example index merging, document relevance, or using dynamic fields in the data structure. Moreover, while reading the book, I had a feeling that some parts do not fit the title, such as the section about clustering with Carrot2 or integration with a PHP web portal.

I read this book with pleasure and satisfaction, which is rare for technology publications. For me, as a person who has been working with Solr since version 1.3, it was a great way to review and sort out some of its aspects. On the other hand, I'm pretty sure that people starting out with Apache Solr will take a lot from this book. Although it is mainly focused on advanced problems, it starts with the basics.

Despite some minor imperfections, I recommend this book, especially because it describes a concrete technology in an easy-to-read way and also refers to some general architectural patterns.

You can purchase Scaling Apache Solr from amazon.com. Slashdot welcomes readers' book reviews (sci-fi included) -- to see your own review here, read the book review guidelines, then visit the submission page. If you'd like to see what books we have available from our review library please let us know.

42 comments

  1. Apache what? by ArcadeMan · · Score: 2

    Hrishikesh Vijay Karambelkar, has written an extremely useful guide to one of the most popular open-source search platforms, Apache Solr.

    It's so popular that I never heard of it before today.

    1. Re:Apache what? by Anonymous Coward · · Score: 0

      Since you're a nobody, the fact that you've not heard of it means jack and shit.

    2. Re:Apache what? by ArcadeMan · · Score: 1

      That's a fact, Jack.

    3. Re:Apache what? by Anonymous Coward · · Score: 0

      Okay, you piece of shit.

    4. Re:Apache what? by nahpets77 · · Score: 3, Informative

      I had never heard of it either until I needed to create an internal search engine where I work. After a few days of research, I found that Apache Solr/Lucene is often used for intranet search engines and for e-commerce sites.

    5. Re:Apache what? by Dynedain · · Score: 2

      Have you ever needed to implement site-wide search functionality? (Note: this is not the same thing as a global web search company like Google, Bing, etc.)

      If you have, and it involved anything more than turning on a checkbox in your platform, then you have almost undoubtedly encountered or considered Solr.

      --
      I'm out of my mind right now, but feel free to leave a message.....
    6. Re:Apache what? by sexconker · · Score: 2

      I had never heard of it either until I needed to create an internal search engine where I work. After a few days of research, I found that Apache Solr/Lucene is often used for intranet search engines and for e-commerce sites.

      We use it to parse and index OCRd PDFs for full-text searching.
      My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

    7. Re:Apache what? by nahpets77 · · Score: 1

      What would you use as an alternative to Solr? One of the things I use it for is to index several internal wikis so that we can have a centralized search engine (also, the default search engines suck). In this case, I need to index the content as well. The thing that gave me the most difficulty was tweaking the config to get page rankings "just right".

    8. Re:Apache what? by Shados · · Score: 1

      There are a lot of popular things I've never heard about. The indexing/search space is actually pretty big, because it's one of those things everyone thinks is trivial until you need to actually do it in meaningful ways or at scale. Almost everyone hits a big fat roadblock and starts looking for tools to do it (since it's more or less a solved problem). For the longest time, Lucene was a de facto standard, but it's fairly low level as far as indexing and searching goes, and everyone reinvented the wheel over it.

      So then you got stuff like Solr and other commercial products. For a while, the only meaningful ones were Solr, Fast, Autonomy, Endeca, etc... Still, the field is large enough to have a big mix of both open source AND commercial solutions (the last 2 I mentioned above were often part of multi-million dollar contracts, and not because the VP of IT was a moron...), and even more recently it exploded with more solutions than one would expect.

      It's pretty much a field on its own, so if you've never had that kind of problem or worked for a company that did (and were close enough to see it), you wouldn't have heard of it. Everyone else did though. It's a bit like content management systems (there's more than Wordpress...), ERPs, etc.

    9. Re:Apache what? by Shados · · Score: 2

      It really depends in what industry or subset of an industry you're in... I had to work on implementing something like that once for legal at an extremely large (and famous, or rather, infamous) company. Lawyers needed to run full searches against all our documents very very quickly to go through the bazillion lawsuit threats we were getting on a daily basis to figure out if they had some weight or not. That very much required full text search.

    10. Re:Apache what? by sexconker · · Score: 1

      Yes, if you have to support it, you have to support it. I would steer clear from Solr based on my experience with it, however.
      We only added it on because the shit we use integrates well with it. It was a "why not?" that works well enough to not be ripped out, but I wouldn't do it again unless I had to.

    11. Re:Apache what? by nbauman · · Score: 1

      My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

      Lawyers use it. Magazines use it. Lots of people use it.

    12. Re:Apache what? by K.+S.+Kyosuke · · Score: 1

      In that case, I must be you if you're the only one using it. :D Nope, the problem of full-text search is that it doesn't go deep enough. Especially in technical fields, it would be great to have search engines with a greater level of text comprehension than generic FTS (for example, what if the text uses a synonym instead of exactly the term you're looking for?)

      --
      Ezekiel 23:20
    13. Re:Apache what? by Anonymous Coward · · Score: 0

      There's Holmes, if you're brave enough. A little bit daunting but probably hard to beat on really large document collections.

    14. Re:Apache what? by sexconker · · Score: 1

      My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

      Lawyers use it. Magazines use it. Lots of people use it.

      Lawyers use it because they have to - there is no alternative to search shit short of hiring monkeys to manually type up mountains of old documents. Often, those monkeys would have to be legally privileged to look at the documents, so it's not something you can shunt off to cheap labor / Mechanical Turk. OCR sucks. Solr sucks. Mixing the two is a big ol' suck fest.

      Magazines use it because... they're stupid? There's no need to OCR a massive backlog of shit. For old shit that may not be digital, you can go ahead and hire a monkey to type it in. You're still left with Solr sucking, but on top of that much of a magazine's content is so heavily formatted/styled/image-based that a Solr index would not suit it well.

      If you NEED a fulltext index, there are plenty of alternatives, some mentioned by others in the comments on this article. I can only speak to OCR sucking, Solr's indexer sucking, and Solr's search giving me way too many things for it to be useful.

    15. Re:Apache what? by beholder · · Score: 2

      Hrishikesh Vijay Karambelkar, has written an extremely useful guide to one of the most popular open-source search platforms, Apache Solr.

      It's so popular that I never heard of it before today.

      That's the fun part about the IT world: even products as popular as Lucene/Solr, backed by companies with billion-dollar investments (e.g. Cloudera), may not be known by everybody.

      The good thing is that when you actually need to build a search for something, we have a solution for you. Built, tested and iterated on by the hundreds of full-time and hobby-time developers working on it while you are doing other - I am sure exciting - things in your own corner of the software universe.

      P.s. This is not a comment on the book itself. I reserve my opinion on that.

    16. Re:Apache what? by nbauman · · Score: 1

      My advice: Don't use Solr. Don't use OCRd PDFs. Don't support full-text searching, because no one fucking uses it. We get thousands of searches against title, keywords, dates, and other meta shit every day in our internal application. The only full-text searches performed are by me when I'm testing shit.

      Lawyers use it. Magazines use it. Lots of people use it.

      Lawyers use it because they have to - there is no alternative to search shit short of hiring monkeys to manually type up mountains of old documents. Often, those monkeys would have to be legally privileged to look at the documents, so it's not something you can shunt off to cheap labor / Mechanical Turk. OCR sucks. Solr sucks. Mixing the two is a big ol' suck fest.

      Magazines use it because... they're stupid? There's no need to OCR a massive backlog of shit. For old shit that may not be digital, you can go ahead and hire a monkey to type it in. You're still left with Solr sucking, but on top of that much of a magazine's content is so heavily formatted/styled/image-based that a Solr index would not suit it well.

      If you NEED a fulltext index, there are plenty of alternatives, some mentioned by others in the comments on this article. I can only speak to OCR sucking, Solr's indexer sucking, and Solr's search giving me way too many things for it to be useful.

      What are some fulltext indexed open-source alternatives to Solr?

  2. Web scaling by Anonymous Coward · · Score: 1

    Doesn't one just use MongoDB and Solr automatically becomes web scale?

    1. Re:Web scaling by LordThyGod · · Score: 1

      Doesn't one just use MongoDB and Solr automatically becomes web scale?

      Speaking of web scale, what's wrong with Google? Google custom search is really easy to set up (free, or paid and de-branded), and just works really well. The only use case I can see for something like this would be for stuff that can't go "web scale" because it's private. What else are people using it for?

  3. I personally love ElasticSearch by Anonymous Coward · · Score: 3, Informative

    It's also based on Lucene, and has an easier setup and administration interface.

    1. Re:I personally love ElasticSearch by drpimp · · Score: 2

      And REST (ES) vs query string (Solr) is nice too!

      --
      -- Brought to you by Carl's JR
    2. Re:I personally love ElasticSearch by Anonymous Coward · · Score: 0

      And most importantly, since we're on the topic of scaling, ES is built to be distributed from the ground up. Solr can scale quite nicely, but if you want to distribute Solr you pay the price of losing a lot of functionality that doesn't work distributed.

      I used to really love Solr, and I still do, but since working with ES I can't see ever using Solr again. ES's aggregation API is the sexiest piece of tech I've seen in a long time, and is miles ahead of Solr's facets.

  4. Outdated book by Anonymous Coward · · Score: 1

    Meanwhile all those actually using Solr/Lucene and who care about scaling have already moved to Elastic Search and don't need this book.

    1. Re:Outdated book by Anonymous Coward · · Score: 0

      Well at least Solr use a decent consensus algorithm for their cluster and not something homemade like Elasticsearch, http://aphyr.com/posts/317-call-me-maybe-elasticsearchhttp://books.slashdot.org/story/14/10/13/1257212/book-review-scaling-apache-solr#

    2. Re:Outdated book by Anonymous Coward · · Score: 0

      sorry for messed up url, this is the correct one:
      http://aphyr.com/posts/317-call-me-maybe-elasticsearch

    3. Re:Outdated book by Anonymous Coward · · Score: 0

      Note: Even author of that article still uses Elasticsearch over Solr

  5. Apache Solr by Anonymous Coward · · Score: 0

    It's a Java distributed search platform using Java servlets for full-text searching. It's pretty interesting stuff

  6. Oh really? by Anonymous Coward · · Score: 0

    Karambelkar is an enterprise architect with a long history in both commercial products and open source technology. As he says, he currently spends most of his time solving problems for the software industry and developing the next generation of products.

    Care to give us examples? I've Googled his name and even after going through 10 pages of links, I've yet to see a single product he's architected, any open source project he seems to be affiliated with, or a single problem he has solved. All I see are links about this book, many of which are spam sites. For someone with such a claimed long history, it's amazing how none of it is indexed by Google.

    1. Re:Oh really? by Anonymous Coward · · Score: 0

      Shush. It's bookwrap talk. Basically hyperbole and lies written by some copyeditor in his lunch hour.

  7. i work at a funded startup by Anonymous Coward · · Score: 0

    and we switched from solr to elastic search.

    that is all i have to say on the matter.

  8. Fuck Java!! by Anonymous Coward · · Score: 0

    Is this just more bloated, Java "enterprise" shit?

  9. I'd like to buy a vowel by Anonymous Coward · · Score: 0

    I have no idea what "Solr" is, but ... they couldn't come up with a better name!?!

  10. Not a computer problem by bluefoxlucid · · Score: 3, Interesting

    Searching and indexing information isn't a computer problem. We can already find information in massive databases--MongoDB and PostgreSQL handle that well.

    It's tagging information that's difficult. Contextual full-text searches often fail to find relevant context. Google does an okay job until you're looking for something specific. General queries about melting arctic ice sheets or the spread of Ebola find something relevant; but try finding the particular documents covering the timeline Wikipedia gave for Thomas Duncan's infection, and each of the things the nurse said. You'll find all kinds of shit repeated in the media, but not how they originated. Some of the things in there are notoriously hard to find at all.

    I've thought about how to structure a Project Management Information System for searching and retrieving important data. Work performance information, lessons learned, projects related to a topic themselves. This steps beyond multi-criteria search to multi-dimensional search: I want to find all Lessons Learned about building bridges; I want to find all Programming projects which implemented MongoDB and pull all Work Performance Information and Lessons Learned about Schema Development; etc. I need to know about specific things, but only in specific contexts.

    For this to work well, people need to tag and describe the project properly. The Project Overview must carry ample wording for full-text search; but should also be tagged for explicit keywords, such that I can eschew full-text search and say "find these keywords". It would help if project managers marked projects as similar to other projects, and tagged those similarities (why is it similar?). A human can highlight what particular attributes are strongly relevant, rather than allowing the computer to notice what's related.

    With so much information, searching requires this human action to improve the results. It may also be enhanced by individualized human action: which humans produce which tags and relationships? Which humans do you feel provide useful tagging and relationships? What particular relationships do *you* find important? What relationships do you want to add yourself? This will allow an individual human to tailor the search to his own experiences and needs.

    On top of that, such things require memory: a human must remember certain things to know what to search for. I remember working on a project where... ...and so this becomes relevant to this search, and let me find similar things.

    Computer searching is a crude form of human memory: human memory is associative, and computer searching is keyword-driven. Humans need to use their own memories, to tell the computer how they see things, and then to tell the computer how they think about what they want to know--what it's related to, what it's similar to, who they think knows best about it--and have the computer use all that information to retrieve a data set. To do that, humans must manually remember in the computer and in their brains.

    The holy grail of searching is a strong AI that takes an abstract question, considers what you mean by its experience with you and its database of every other experience, pulls up everything relevant, decides what you would want to see, and discards the rest. Such a machine is largely doing your job: it's thinking for you, deciding what you'll remember, and making your decisions by occluding information which would affect your decisions. Anything less is a tool, and faulty, and requires your expertise to leverage properly.

    1. Re:Not a computer problem by xxxJonBoyxxx · · Score: 1

      Since you're down the tag route already, you might want to research how "faceting" works in search (think of the categories you see down the left-hand nav on some sites after a search... but generated dynamically).
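
      A rough sketch of what that dynamic generation amounts to, in Python. This is not Solr's implementation, only an assumption-laden illustration: documents are plain dicts, `tags` is a made-up field, and `facet_counts` is a hypothetical helper that counts field values over the current result set so they can be shown as clickable categories.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how often each value of `field` occurs in the matched documents."""
    counts = Counter()
    for doc in docs:
        for value in doc.get(field, []):
            counts[value] += 1
    return counts.most_common()


# Made-up documents standing in for an index.
docs = [
    {"title": "Bridge retrofit", "tags": ["lessons-learned", "bridges"]},
    {"title": "Schema redesign", "tags": ["lessons-learned", "mongodb"]},
    {"title": "Footbridge", "tags": ["bridges", "lessons-learned"]},
]

# Run the "search" first, then facet over the hits only.
hits = [d for d in docs if "bridge" in d["title"].lower()]
facets = facet_counts(hits, "tags")
```

      The point is that the categories are computed from whatever matched the query, so they change with every search instead of being a fixed menu.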

    2. Re:Not a computer problem by Archtech · · Score: 1

      "Computer searching is a crude form of human memory: human memory is associative, and computer searching is keyword-driven".

      Computer searching is completely different from human memory (to the extent that we really should use different words for them): for a start, human memory is associative, and computer searching is keyword-driven. More to the point, human memory is inextricably tied up with all our senses and the ways in which the brain remembers them, whereas computer searching consists of running algorithms on successive sets of bits until an algorithm is satisfied.

      FTFY.

      --
      I am sure that there are many other solipsists out there.
    3. Re:Not a computer problem by bluefoxlucid · · Score: 1

      Keyword-driven searching is associative, but only in a minimal form. Humans remember things by remembering other things; they expect to find things in a computer by remembering something about the thing they want to find, and then entering it into the computer.

      In human memory, this brings up every association, categorized, detailed, and sorted by relative strength of association and frequency of use. On computers, we can track frequency of use automatically; strength of association is not automatic because computers don't learn and analyze context.

      In this way, a computer search mimics human memory: it allows a memory of one thing to associate to another. It's not a complete implementation, and can't be fully completed automatically. Further, even a complete system cannot be operated without human memory; a human, having improved his memory with a superior filing system, will be able to use the best technical search engine better than a dumb human.

  11. Apache what? by Anonymous Coward · · Score: 0

    And just how many open-source search platforms have you worked with? If your answer is anything other than 0 and you haven't heard of Apache Solr, you are full of crap, and if your answer is 0, then why the heck did you bother to post here in the first place, as this clearly has nothing to do with you?

  12. Not a computer problem by nbauman · · Score: 2

    I wrote a few stories about this. http://www.nasw.org/users/nbau...

    The best search engine I've ever seen is PubMed http://www.ncbi.nlm.nih.gov/pu... They structure information better than anybody else. But it requires a librarian to look at every document and code it according to a fairly elaborate coding scheme, the MESH headings, which basically requires a degree in library science and a good medical education to do well.

  13. What a surprise by BlackPignouf · · Score: 1

    What a surprise! A Slashdot Book Review with 9/10 rating.
    https://www.google.com/?q=site...'
    You might want to normalize the ratings in your book reviews.

    1. Re: What a surprise by Anonymous Coward · · Score: 0

      Bigger surprise... it's from Packt Press. Statistically speaking, Packt Press books are overrepresented on Slashdot. Any guess as to where the money flows?

  14. Re:Illiterate Indian Authors by Anonymous Coward · · Score: 0

    That's stereotypically racist on so many levels.