Book Review: Scaling Apache Solr
First time accepted submitter sobczakt writes We live in a world flooded by data and information, and we all realize that if we can't find what we're looking for (e.g. a specific document), there's no benefit from all these data stores. When your data sets become enormous or your systems need to process thousands of messages a second, you need an environment that is efficient, tunable and ready for scaling. We all need well-designed search technology. A few days ago, a book called Scaling Apache Solr landed on my desk. The author, Hrishikesh Vijay Karambelkar, has written an extremely useful guide to one of the most popular open-source search platforms, Apache Solr. Solr is a full-text, standalone, Java search engine based on Lucene, another successful Apache project. For people working with Solr, like myself, this book should be on their Christmas shopping list. It's one of the best on this subject. Read below for the rest of sobczakt's review.
Scaling Apache Solr
author: Hrishikesh Vijay Karambelkar
pages: 215
publisher: Packt
rating: 9/10
reviewer: sobczakt
ISBN: 978-1783981748
summary: Get an introduction to the basics of Apache Solr in a step-by-step manner with lots of examples
Karambelkar is an enterprise architect with a long history in both commercial products and open source technology. As he says, he currently spends most of his time solving problems for the software industry and developing the next generation of products.
The book is divided into 10 chapters. The first three introduce Apache Solr and cover its architecture, features, configuration and setup. Chapter One contains many practical use cases for Apache Solr to help beginners understand the topic.
Chapter Four is very interesting and describes common patterns for enterprise search solutions. These patterns focus on data processing and integration, and on how to meet users' requirements (interface, relevancy, general experience).
The rest of the book mainly addresses the central topic: distributing search queries and scaling and optimizing a system. The book discusses Apache Solr concepts such as replication, fault tolerance and sharding, and illustrates them with helpful examples. It also gives a precise explanation of SolrCloud, a bundle of built-in distributed capabilities available from version 4.0.
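For readers new to the idea, here is a minimal sketch of what sharding and a distributed query look like in principle. It is not taken from the book and it is not how Solr is implemented: the function names (`shard_for`, `build_shards`, `distributed_search`), the MD5-based routing and the term-count scoring are all assumptions chosen to keep the example short.

```python
import hashlib
import heapq


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard by hashing its id (illustrative scheme only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def build_shards(docs: dict, num_shards: int) -> list:
    """Split a {doc_id: text} collection into num_shards independent indexes."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search_shard(shard: dict, term: str, k: int) -> list:
    """Return the shard's local top-k hits as (score, doc_id) pairs.

    The score is a toy one: how many times the term occurs in the text.
    """
    scored = (
        (text.lower().split().count(term), doc_id)
        for doc_id, text in shard.items()
    )
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def distributed_search(shards: list, term: str, k: int = 3) -> list:
    """Query every shard, then merge the partial results into a global top-k."""
    term = term.lower()
    partial = [hit for shard in shards for hit in search_shard(shard, term, k)]
    return [doc_id for _score, doc_id in heapq.nlargest(k, partial)]


if __name__ == "__main__":
    docs = {
        "a": "solr solr solr",
        "b": "solr lucene",
        "c": "lucene only",
        "d": "solr solr",
    }
    shards = build_shards(docs, num_shards=2)
    print(distributed_search(shards, "solr", k=2))
```

Because each shard returns its own top k and the coordinator merges them, the global top k is the same no matter how the documents were distributed. Replication and fault tolerance would add copies of each shard, which this sketch leaves out.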
Chapter 8, dedicated to optimization, drew my attention. It is full of useful tips on JVM parameters and on manipulating data structures and caching layers.
Scaling Apache Solr covers both basic and advanced subjects. The information is well organized, clear and concise. Beginners can absorb many of the examples and cases in this book. I was pleasantly surprised by the chapter describing integration possibilities. There's some great information about using Solr with Cassandra, the MapReduce paradigm or R (a programming language for computational statistics), although I would have preferred this subject to be covered in more detail. The book has two more advantages: first, it discusses designing an enterprise search system in general terms, and second, it can be treated as an introduction to large-volume data processing.
I need to emphasize that many sections on defining a schema, importing data, running SolrCloud or searching in near real time (NRT) are not just raw documentation; they also include the author's well-judged advice and comments.
Unfortunately, I felt some of the more advanced topics were not described in enough detail, for example index merging, document relevance and using dynamic fields in a data structure. Moreover, while reading the book, I had a feeling that some parts do not fit the title, such as the section about clustering with Carrot2 or integration with a PHP web portal.
I can say that I read this book with pleasure and satisfaction, which is rare for technology publications. For me, as a person who has been working with Solr since version 1.3, it was a great way to review and sort out some of its aspects. On the other hand, I'm pretty sure that people starting out with Apache Solr will get a lot from this book. Although it is mainly focused on advanced problems, it starts with the basics.
Despite some small imperfections, I recommend this book, especially because it describes a concrete technology in an easy-to-read way and also refers to some general architectural patterns.
You can purchase Scaling Apache Solr from amazon.com. Slashdot welcomes readers' book reviews (sci-fi included) -- to see your own review here, read the book review guidelines, then visit the submission page. If you'd like to see what books we have available from our review library please let us know.
It's so popular that I never heard of it before today.
Doesn't one just use MongoDB and Solr automatically becomes web scale?
It's also based on Lucene, and has an easier setup and administration interface.
Meanwhile, all those actually using Solr/Lucene who care about scaling have already moved to Elasticsearch and don't need this book.
It's a Java distributed search platform using Java servlets for full-text searching. It's pretty interesting stuff.
Karambelkar is an enterprise architect with a long history in both commercial products and open source technology. As he says, he currently spends most of his time solving problems for the software industry and developing the next generation of products.
Care to give us examples? I've Googled his name and even after going through 10 pages of links, I've yet to see a single product he's architected, any open source project he seems to be affiliated with, or a single problem he has solved. All I see is links about this book, many of which are spam sites. For someone with such a claimed long history, it's amazing how none of it is indexed by Google.
and we switched from solr to elastic search.
that is all i have to say on the matter.
Is this just more bloated, Java "enterprise" shit?
I have no idea what "Solr" is, but ... they couldn't come up with a better name!?!
Searching and indexing information isn't a computer problem. We can already find information in massive databases--MongoDB and PostgreSQL handle that well.
It's tagging information that's difficult. Contextual full-text searches often fail to find relevant context. Google does an okay job until you're looking for something specific. Searches for general information like melting arctic ice sheets or the spread of Ebola find something relevant; but try finding the particular documents covering the timeline Wikipedia gave for Thomas Duncan's infection, and each of the things the nurse said. You'll find all kinds of shit repeated in the media, but not how they originated. Some of the things in there are notoriously hard to find at all.
I've thought about how to structure a Project Management Information System for searching and retrieving important data. Work performance information, lessons learned, projects related to a topic themselves. This steps beyond multi-criteria search to multi-dimensional search: I want to find all Lessons Learned about building bridges; I want to find all Programming projects which implemented MongoDB and pull all Work Performance Information and Lessons Learned about Schema Development; etc. I need to know about specific things, but only in specific contexts.
For this to work well, people need to tag and describe the project properly. The Project Overview must carry ample wording for full-text search; but should also be tagged for explicit keywords, such that I can eschew full-text search and say "find these keywords". It would help if project managers marked projects as similar to other projects, and tagged those similarities (why is it similar?). A human can highlight what particular attributes are strongly relevant, rather than allowing the computer to notice what's related.
With so much information, searching requires this human action to improve the results. It may also be enhanced by individualized human action: which humans produce which tags and relationships? Which humans do you feel provide useful tagging and relationships? What particular relationships do *you* find important? What relationships do you want to add yourself? This will allow an individual human to tailor the search to his own experiences and needs.
On top of that, such things require memory: a human must remember certain things to know what to search for. I remember working on a project where... ...and so this becomes relevant to this search, and let me find similar things.
Computer searching is a crude form of human memory: human memory is associative, and computer searching is keyword-driven. Humans need to use their own memories, to tell the computer how they see things, and then to tell the computer how they think about what they want to know--what it's related to, what it's similar to, who they think knows best about it--and have the computer use all that information to retrieve a data set. To do that, humans must manually remember in the computer and in their brains.
The holy grail of searching is a strong AI that takes an abstract question, considers what you mean by its experience with you and its database of every other experience, pulls up everything relevant, decides what you would want to see, and discards the rest. Such a machine is largely doing your job: it's thinking for you, deciding what you'll remember, and making your decisions by occluding information which would affect your decisions. Anything less is a tool, and faulty, and requires your expertise to leverage properly.
And just how many open-source search platforms have you worked with? If your answer is anything other than 0 and you haven't heard of Apache Solr, you are full of crap, and if you answer 0, then why the heck did you bother to post here in the first place, as this clearly has nothing to do with you?
I wrote a few stories about this. http://www.nasw.org/users/nbau...
The best search engine I've ever seen is PubMed http://www.ncbi.nlm.nih.gov/pu... They structure information better than anybody else. But it requires a librarian to look at every document and code it according to a fairly elaborate coding scheme, the MeSH headings, which basically requires a degree in library science and a good medical education to do well.
What a surprise! A Slashdot Book Review with 9/10 rating.
https://www.google.com/?q=site...'
You might want to normalize the ratings in your book reviews.
That's stereotypically racist on so many levels.