Forums | Mahara Community

Support /
User search is slow, Solr dosen't appear to work


John Madden's profile picture
Posts: 10

17 May 2010, 4:25 PM

Looking through the SQL being executed for user searches, I see some glaring problems, at least if you're running postgresql.  The code uses things like "ILIKE '%query%'," which can't use  indexes, leaving all user searches to sequential scans of rather large tables.  We're running with about 600,000 users (and about to deploy on 6/1) and we're finding user searches to be too slow to be of any use.

I went to try the solr indexer, thinking "oh, internal must be for small sites" but I can't get it to work at all -- it doesn't even appear to start making requests and data changes such as a name change don't seem to fire off requests.  Solr-based searches result in missing-file errors.

Is there any hope for large sites?

 

Richard Mansfield's profile picture
Posts: 808

17 May 2010, 6:38 PM

John,

I'm afraid the Solr plugin hasn't been kept up to date.  I know I haven't tested it for over two years now (Solr 1.2 was what we used then), and I doubt any of the other developers have used it recently either.

We'd like to update Mahara for recent versions of solr, or even just add a site option to make sql searches use exact matches only, but this will only happen if time permits.  Patches welcome of course!

R.

 

John Madden's profile picture
Posts: 10

18 May 2010, 8:03 AM

I'll see what I can do patch-wise as far as replacing ILIKE with =. 

The other problem we're seeing is that as more keywords are added to the search it gets less specific, which is counter-intuitive.  I think the search terms should be ANDs by default instead of ORs.  Search for "John Smith" and you get all of the John's and all of the Smiths and you get them in alphabetical order rather than order of relevance.  Re-writing the SQL anyway so I'll take care of both.

How do you do patches around here?  We maintain our own bazaar branch.

John Madden's profile picture
Posts: 10

18 May 2010, 1:14 PM

First stab at a patch is below.  I took out the JOIN, which is expensive, and due to how I'm handling the parameters (not looking at the artefact table), no longer needed.  This uses straight 'equal' clauses.  Still missing are intelligent hit rankings, such as preferring that both firstname and lastname match.  Three new indexes are needed:

create index usr_lname_lower_idx on usr (lower(lastname));
create index usr_fname_lower_idx on usr (lower(firstname));
create index usr_prefname_lower_idx on usr (lower(preferredname));

--- search/internal/lib.php.orig    2010-05-18 10:45:15.000000000 -0400
+++ search/internal/lib.php    2010-05-18 14:09:53.000000000 -0400
@@ -90,8 +90,7 @@
         $sql = '
             SELECT
                 COUNT(DISTINCT u.id)
-            FROM {artefact} a
-                INNER JOIN {usr} u ON u.id = a.owner';
+            FROM {usr} u';
         if (isset($data['group'])) {
             $groupadminsql = '';
             if (isset($data['includeadmins']) and !$data['includeadmins']) {
@@ -103,18 +102,21 @@
             $sql .= $groupjoin;
         }
         $querydata = split(' ', preg_replace('/\s\s+/', ' ', strtolower(trim($query_string))));
-        $namesql = "(u.preferredname $ilike '%' || ? || '%')
-                    OR ((u.preferredname IS NULL OR u.preferredname = '')
-                        AND (u.firstname $ilike '%' || ? || '%' OR u.lastname $ilike '%' || ? || '%'))
-                    OR (a.artefacttype IN $fieldlist
-                        AND ( a.title $ilike '%' || ? || '%'))";
-        $namesql = join('
-                    OR ', array_fill(0, count($querydata), $namesql));
+        $namesql = "lower(u.preferredname) = lower(?)
+                    OR lower(u.username) = lower(?)
+                    OR (lower(u.firstname) = lower(?) OR lower(u.lastname) = lower(?))";
         $values = array();
-        foreach ($querydata as $w) {
-            $values = array_pad($values, count($values) + 4, $w);
+        $values[0] = $values[1] = strtolower(trim($query_string));
+        $values[2] = array_shift($querydata);
+        while($x = array_shift($querydata))
+        {
+          $values[3] .= $x . " ";
+        }
+        if(!$values[3])
+        {
+          $values[3] = $values[0];
         }
-
+        
         $where = '
             WHERE
                 u.id <> 0 AND u.active = 1 AND u.deleted = 0
@@ -142,8 +144,7 @@
             $sql = '
             SELECT DISTINCT ON (' . $data['orderby'] . ')
                 u.id, u.username, u.firstname, u.lastname, u.preferredname, u.email, u.staff
-            FROM {artefact} a
-                INNER JOIN {usr} u ON u.id = a.owner';
+            FROM {usr} u';
         }
         else if (is_mysql()) {
             // @todo This is quite possibly not correct. See the postgres

John Madden's profile picture
Posts: 10

18 May 2010, 1:26 PM

Oh, and I added an OR to go against username.  I think the next step would be to do more intelligent things like search with AND's to make searches with more keywords more restrictive instead of less restrictive; likely a project for another day though.

 

Richard Mansfield's profile picture
Posts: 808

18 May 2010, 4:42 PM

Thanks John,

I think if we're going to apply this to core we'd need a completely new query to do the exact name-only searches, in a separate function.  If we do it that way it will be much easier to leave the existing behaviour intact for small dbs, and add another checkbox in the site options page to turn the fast searching on/off.  Obviously we'd need to add the db indexes for everyone.

François Marier's profile picture
Posts: 411

18 May 2010, 5:34 PM

Hi John,

I agree with you that search terms should probably be ANDed by default, that's what Google does too.

As far as patches are concerned, feel free to send a patch series via email (francois@mahara.org) or to publish your bzr branch somewhere (e.g. launchpad) where we can pull from it.

Cheers,

Francois

7 results