// srs6ldap.cc notes
// july02, dgg

/*
  srs6ldap.cc
  SRS backend for LDAP directory service of biodata objects.
  d.gilbert, july 2002
*/

/***************

** speed concerns:
   # query = [lib=genbank-des:human]
   # result set size=1219158, output limit=5
   srs is very fast in returning no. matches to query, and a few matches
   (above is <1 second), but returning full id list is time consuming
   -- revise this to return at 1st query, not full list but
      result set stats ( size, lib names, urls to full set and lib subsets,
      and/or set split by parts, ? sample record info)

** need to start free()ing memory allocated here; long lists cause slapd failure

** need to revise all these global vars for use w/ multi-threaded calls
   -- put in structure to pass to/from slapd caller

** dang, slapd is doing filter translations against the bioseq schema:
   asked for:  (&(lib=swissprot)(org=drosophila*))
   got filter: (&(lib=swissprot)(spp=drosophila*))

>> need some backend configuration for ldap>srs field names

for ldap use, need
  -- mapping from srs fields to ldap fields (like srs views, but not slow)
     for each databank
  -- get databank LIB from query or esp. from retrieved entries,
     compose dn: from ID, LIB and if needed config dn extras
  -- need equivalent of databank lib info (getz -libs)
     ldapsearch url/?base? should return service info
     ldapsearch url/?one? should return databank libs (& their attributes)
     >> implies lib should be attribute of all records ?
     >> ? does ldap allow search on fields in dn ? .. e.g.
        dn: id=xxx,lib=genbank,o=bions
        ldapsearch url/?sub?(&(objectclass=biosequence) \
          (|(lib=genbank)(lib=swissprot)) \
          (des=*enzyme*))
  -- *want* options to return all data of record, but not by default
     only if explicitly asked for: ? as attribute,
     possibly different formats (fasta, genbank, ...)
     ascii or binary object, possibly gzipped ?, but not by default.
     ?? use attribute ref/alias to url for complete data
  -- srs views methods would be handy but cant seem to get them working (academic distrib.
     lacks enough info for use)
  -- use slapd/back-passwd/ as template for srs backend
     (simple, only searches passwd using system calls - very similar set of
     options to srs uses)
     - revise to match bioseq.schema
     - handle bioseq output formats, attributes how?
       by attribute request, or objectClass search?
     - add search of databank attributes w/o retrieval, e.g. databank formats
       (as objectClass - inherited from superclass?)
       e.g. genbank BioseqDatabank object has several output formats,
       -- is this good ldap filter?
       (& (&(objectclass=BioseqDatabank)(name=genbank)(content=fasta))
          (&(objectclass=BioseqEntry)(des=kinase))) )
     - ? use mol attr for dna/aa sequence type (as per refseq):
       -- need to add fixed value to amino or dna databanks that lack (swissprot)

************/

/*************
compiling

-- use with c-perl regex substitution package:
   pcre-3.9.tar.gz by Philip Hazel
     ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.gz
   pcrs-0.0.1-src.tar.gz by Andreas S. Oesterhelt, LGPL
     - Written and Copyright (C) 2000, 2001

set sx=/bio/mb/srs61
CC -w -g -I$sx/src -I./ -L$sx/bin/solaris -lsrs \
  -o srs6ldap srs6ldap.cc \
  $sx/bin/solaris/libsrs.a $sx/demoo/libpcrs.a $sx/demoo/libpcre.a \
  $sx/bin/solaris/libctreestd.a $sx/bin/solaris/libgd.a $sx/bin/solaris/Monitor.o

compile as backend (back-srs) for openldap slapd
see also back-srs/config.c, external.h,init.c, and esp.
search.c as revised from back-passwd/

# must use -DSRS6LDAP_NOMAIN to keep out main()
CC -w -g -I$sx/src -I./ -L$sx/bin/solaris -lsrs -DSRS6LDAP_NOMAIN=1 -c srs6ldap.cc

# this is a muddle - as slapd is C code and has a complex configure/make system
# need to add above SRS, pcre/pcrs libs to it for building

slapd/back-srs Makefile parts:
  SRCS = search.c config.c init.c srs6ldap.cc
  OBJS = search.lo config.lo init.lo srs6ldap.o Monitor.o
  XLIBS = libsrs.a libpcrs.a libpcre.a libctreestd.a libgd.a
  AC_DEFS += -I$(SRS6)/src/
  AC_LIBS += -lsrs -lpcre -lpcrs
  XINCPATH += -I$(SRS6)/src

slapd/ Makefile parts, to link w/ CC libs for srs6 : -lCstd -lCrun
  AC_LIBS += -lCstd -lCrun

# to make stand-alone (slapd/back-shell usable)
CC -w -g -I$sx/src -I./ -L. -lsrs -o srs6ldap srs6ldap.cc \
  libsrs.a libpcrs.a libpcre.a libctreestd.a libgd.a Monitor.o

slapd/backend.c:
  #ifdef SLAPD_SRS
  #include "back-srs/external.h"
  #endif
  #if defined(SLAPD_SRS) && !defined(SLAPD_SRS_DYNAMIC)
    {"srs", srs_back_initialize},
  #endif

../../include/portable.h
  #define SLAPD_SRS 1

#--------------
srs5 - wont work w/o major code changes - srs6 has PROGv , other mods

slapd shell call for slapd.conf
-- dont need, just use below command line in .conf file
  #!/bin/sh
  echo "srs6ldif call" >> log.srsslapd
  date >> log.srsslapd
  exec ./srs6ldif -s -l log.srsslapd -b 'srv=srs'
#--------------

*************/

/**************
TESTS ---------------------------------

basedn: srv=srs,o=IUBio
  ?? drop o=IUBio for simplicity, compat. w/ ldap://eugenes.org/srv=srsgnomap

'(& (|(objectClass=*)(objectClass=Feature))(start<=10044371)(stop>=9955629))'
'(&(objectClass=*)(lib=genbank)(des=kinase))'
'(&(objectclass=*)(lib=genbank)(des=kinase))'

this works for multiple libraries:
ldap://iubio.bio.indiana.edu:3895/srv=srs?id,acc?sub?(&(objectClass=*)\
(lib={swissprot trembl})(des=kinesin))?sizelimit=500
     ^^^^ regularize this for ldap syntax?
     (|(lib=lib1)(lib=lib2))

// limit 500
machome% /usr/bin/time java ldapsearch \
  'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?(&(lib={swissprot refseq trembl})(des=kinase*))?sizelimit=500' \
  > ! srsq.out
  8.31 real 0.93 user 0.26 sys

// all 21320 results: -- too slow !! - compare to wgetz
machome% /usr/bin/time java ldapsearch \
  'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?(&(lib={swissprot refseq trembl})(des=kinase*))?sizelimit=50000' \
  > ! srsq.out
  235.70 real 6.67 user 5.07 sys

// try all data
/usr/bin/time java ldapsearch \
  'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(lib={swissprot refseq trembl})(des=kinase*))?sizelimit=50' \
  > ! srsq.out

oat% /usr/bin/time java ldapsearch \
? 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(lib={swissprot refseq trembl})(des=kinase*))?sizelimit=50' \
? > ! srsq.out
  limit=50
    real 2.3 user 0.7 sys 0.1
  limit=500
    real 9.5 user 1.7 sys 0.2
  limit=50000 (all 21320)
    real 5:05.4 user 33.0 sys 3.8
  -- with only 'dn' field:
    real 4:10.0 user 8.2 sys 2.5
  ^^^ slowness seems to be # of entries, not size of entry
  ^^^ is it shell backend slowness, or slapd/ldap slowness?
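
The tests in these notes pair flat ldap filters with getz queries, e.g.
(&(lib={swissprot refseq})(des=kinesin)(des=Ubiquitous)) goes with
[lib={swissprot refseq}-des:kinesin]&[lib-des:Ubiquitous], and (bpe=>N) goes
with [lib-bpe#N:]. Below is a minimal sketch of that translation. It is not
the srs6ldap.cc code: the names parseFlatAndFilter and toGetzQuery are
hypothetical, only one flat (&...) group is handled, and the list of skipped
control attributes (objectclass, fmt, start, count) is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

using Term = std::pair<std::string, std::string>;

// Lower-case an attribute name (filters here mix objectClass and objectclass).
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Split a flat filter "(&(a=b)(c=d))" into attr/value pairs.
// Nested and/or/not groups are not handled in this sketch.
std::vector<Term> parseFlatAndFilter(const std::string& filter) {
    std::vector<Term> terms;
    std::size_t pos = 0;
    while ((pos = filter.find('(', pos)) != std::string::npos) {
        std::size_t close = filter.find(')', pos);
        if (close == std::string::npos) break;
        std::string body = filter.substr(pos + 1, close - pos - 1);
        std::size_t eq = body.find('=');
        // Skip the "(&" opener: its body still contains an inner '('.
        if (eq != std::string::npos && body.find('(') == std::string::npos)
            terms.emplace_back(body.substr(0, eq), body.substr(eq + 1));
        ++pos;
    }
    return terms;
}

// Build a getz-style query string from a flat AND filter.
std::string toGetzQuery(const std::string& filter) {
    std::string lib;
    std::vector<Term> fields;
    for (const Term& t : parseFlatAndFilter(filter)) {
        std::string attr = toLower(t.first);
        if (attr == "lib") lib = t.second;
        else if (attr == "objectclass" || attr == "fmt" ||
                 attr == "start" || attr == "count") continue;  // assumed control terms
        else fields.emplace_back(attr, t.second);
    }
    if (lib.empty()) return "";
    // Accept "{a b}" or "a b"; several libs are written back with braces.
    if (lib.size() > 1 && lib.front() == '{' && lib.back() == '}')
        lib = lib.substr(1, lib.size() - 2);
    if (lib.find(' ') != std::string::npos) lib = "{" + lib + "}";

    std::string query;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        const std::string& v = fields[i].second;
        std::string term;
        if (!v.empty() && v[0] == '>')      term = fields[i].first + "#" + v.substr(1) + ":";
        else if (!v.empty() && v[0] == '<') term = fields[i].first + "#:" + v.substr(1);
        else                                term = fields[i].first + ":" + v;
        query += (i == 0 ? "[lib=" + lib + "-" : std::string("&[lib-")) + term + "]";
    }
    return query;
}
```

The ldap>srs field name mapping (org vs. spp) is left out here; it would be a
lookup on the attribute name before the term is written.
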
MORE SPEED TESTS:

- getz alone, no fields (id list)
  /usr/bin/time getz '[lib={swissprot refseq trembl}-des:kinase]' | wc
  real 18.6 user 16.3 sys 0.6
  ^^^ this is how fast it should be (at least for id list only :)
  20073 entries

- getz alone (retrieve a field)
  *** HA, the time cost is w/ SRS (opening/processing each entry)
  /usr/bin/time getz '[lib={swissprot refseq trembl}-des:kinase]' -f id | wc
  real 3:17.2 user 2:18.2 sys 4.4

  /usr/bin/time getz '[lib={swissprot refseq trembl}-des:kinase]' -e > out.getz
  oat% /usr/bin/time getz '[lib={swissprot refseq trembl}-des:kinase]' -e > out.getz
  real 4:57.7 user 3:51.7 sys 9.8

- srs6ldap with no ldap transport:
  /usr/bin/time ./srs6ldap -d 1 -f '(&(lib={swissprot refseq trembl})(des=kinase))' -a 'id' -m 50000 > /dev/null
  # query = [lib={swissprot refseq trembl}-des:kinase]
  # result set size=20073
  limit=50000 (all 20073)
  real 3:16.6 user 2:18.4 sys 5.2

- srs6ldap, no transport, dn only (no attrib/data reading)
  >> same speed as getz (or faster)
  /usr/bin/time ./srs6ldap -d 1 -g '[lib={swissprot refseq trembl}-des:kinase]' -a dn -m 50000 > /dev/null
  # query = [lib={swissprot refseq trembl}-des:kinase]
  # result set size=20073
  limit=50000 (all 20073)
  real 17.3 user 14.0 sys 0.6

-- srs6ldap, transported oat -> oat, all attribs (long)
  --! ldap died - out of memory?!
  -probably, got lots of mem leaks in this hack
  /usr/bin/time java -cp $el ldapsearch \
    'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(lib={swissprot refseq trembl})(des=kinase*))?sizelimit=50000' > /dev/null
  limit=50000 (all 20073)
  real 4:16.1 user 18.6 sys 1.1

  /usr/bin/time java -cp $el ldapsearch 'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?(&(lib={swissprot refseq trembl})(des=kinase*))?sizelimit=5000' > /dev/null
  limit=5000
  real 8.6 user 2.8 sys 0.4

  /usr/bin/time java -cp $el ldapsearch 'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?(&(lib={swissprot refseq trembl})(des=kinase*))?sizelimit=50000' | wc
  Search failed: javax.naming.CommunicationException: Request: 2cancelled
  real 20.7 user 5.3 sys 1.2
  24056 24059 605868

  /usr/bin/time java -cp $el ldapsearch \
    'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?\
    (&(lib={swissprot refseq trembl})(des=kinase))\
    ?sizelimit=10000' | grep -c 'dn:'
  real 15.5 user 4.8 sys 0.8
  10000

  oat% /usr/bin/time java -cp $el ldapsearch \
    'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?(&(lib={swissprot refseq trembl})(des=kinase))?sizelimit=10000' | grep -c 'dn:'
  real 15.3 user 4.6 sys 0.9
  10000

  /usr/bin/time java -cp $el ldapsearch \
    'ldap://iubio.bio.indiana.edu:3895/srv=srs?*?sub?(&(lib=swissprot refseq trembl)(des=kinase))?sizelimit=100000' | grep -c 'dn:'
  real 29.6 user 7.9 sys 1.8
  20073

--- bigger result set test
  /usr/bin/time getz '[genbank-org:drosophila]' | wc
  real 3:44.1 user 3:35.5 sys 6.0
  340794 340794 5792964

  /usr/bin/time java -cp $el ldapsearch 'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?\
    (&(lib=genbank)(org=drosophila))?sizelimit=900000' | grep -c 'dn:'
  ^^ fails, ldap changes 'org' field to 'spp' based on bioseq.schema alias
  [ changed schema to put org as primary name ]

  # query = [lib=genbank-des:human]
  # result set size=1219158, output limit=5
  # note: srs is very fast in returning no.
matches to query, and 1st few (above is <1 second) # but returning full id list is time consuming oat% java ldapsearch 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*) (lib=trembl swissprot genbank refseq)(org=homo sapiens))' Search URL: ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*)(lib=trembl swissprot genbank refseq)(org=homo sapiens)) dn: resultset=Qi4,srv=srs {fmt=fmt: native, fasta, url=url: srv=srs??sub?(&(objectClass=BioseqEntry)(lib=trembl swissprot genbank refseq)(org=homo sapiens)), srv=srs??sub?(&(objectClass=BioseqRecord)(lib=trembl swissprot genbank refseq)(org=homo sapiens)), count=count: 5711344, objectclass=objectClass: BioseqSet, extensibleObject, query=query: (&(objectClass=*)(lib=trembl swissprot genbank refseq)(org=homo sapiens)), lib=lib: GENBANK, SWISSPROT, TREMBL, REFSEQ, id=id: Qi4} oat% java ldapsearch 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*) (lib=trembl swissprot refseq)(org=homo sapiens))' Search URL: ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*)(lib=trembl swissprot refseq)(org=homo sapiens)) dn: resultset=Qi5,srv=srs {fmt=fmt: native, fasta, url=url: srv=srs??sub?(&(objectClass=BioseqEntry)(lib=trembl swissprot refseq)(org=homo sapiens)), srv=srs??sub?(&(objectClass=BioseqRecord)(lib=trembl swissprot refseq)(org=homo sapiens)), count=count: 78148, objectclass=objectClass: BioseqSet, extensibleObject, query=query: (&(objectClass=*)(lib=trembl swissprot refseq)(org=homo sapiens)), lib=lib: SWISSPROT, TREMBL, REFSEQ, id=id: Qi5} /usr/bin/time getz '[genbank-des:human]' -c 1219158 real 2.2 user 0.6 sys 0.2 /usr/bin/time getz '[genbank-des:human]' | wc real 14:24.8 user 13:20.2 sys 24.4 1219158 records /usr/bin/time getz '[genbank-des:human]' -e > ! 
outgb.getz oat% /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?*?sub?(&(objectClass=*)(lib=genbank)(des=human))' resultset=Qi8: null:null:{fmt=fmt: native, fasta, url=url: srv=srs??sub?(&(objectClass=BioseqEntry)(lib=genbank)(des=human)), srv=srs??sub?(&(objectClass=BioseqRecord)(lib=genbank)(des=human) ), count=count: 1219158, objectclass=objectClass: BioseqSet, extensibleObject, query=query: (&(objectClass=*)(lib=genbank)(des=human)), lib=lib: GENBANK, id=id: Qi8} real 1.4 user 0.6 sys 0.1 /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?(&(objectClass=BioseqEntry)(lib=genbank)(des=human))' \ > ! outgb.ldap real 18:28.4 user 6:33.2 sys 2:39.2 1219158 recs /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?*?sub?(&(objectClass=BioseqRecord)(lib=genbank)(des=human))' \ > ! outgb.ldap java.lang.OutOfMemoryError .. out of mem at 1086547 recs , 3.3 GB (! most of them) real 31:58.5 user 21:58.0 sys 3:53.1 /usr/bin/time java -Xms90M -Xss30M -Xmx280M bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?*?sub?(&(objectClass=BioseqRecord)(lib=genbank)(des=human))' \ > ! outgb.ldap java.lang.OutOfMemoryError 1088120 recs real 32:48.1 user 24:04.4 sys 3:36.1 split 1219158 to 600000 and 619158 /usr/bin/time java -Xms90M -Xmx280M bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?*?sub?(&(objectClass=BioseqRecord)(lib=genbank)(des=human)(start=1)(count=600000))'\ > ! outgb1.ldap real 19:51.0 user 12:55.3 sys 2:02.0 /usr/bin/time java -Xms90M -Xmx280M bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?*?sub?(&(objectClass=BioseqRecord)(lib=genbank)(des=human)(start=600000)(count=619158))'\ > ! outgb2.ldap java.lang.OutOfMemoryError ^^^ maybe this is a record size memory bug ? 
488944 records (== 1088944 index) real 14:15.3 user 11:30.8 sys 1:49.9 /usr/bin/time java -Xms90M -Xmx280M bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?*?sub?(&(objectClass=BioseqRecord)(lib=genbank)(des=human)(start=1086547))' \ > ! outgb3.ldap Exception in thread "main" java.lang.OutOfMemoryError real 1:56.3 user 1:49.0 sys 11.1 /usr/bin/time ./srs6ldap -d 2 \ -f '(&(lib=genbank)(des=human)(objectclass=BioseqRecord)(start=1086547))' \ -b 'srv=srs' -a '*' -m 99 > ! outgb3.seq died at 1086547 + 3012 = 1089559 ## CCCCCCC try C client ldapsearch C C C 11) getz '[genbank-des:human]' -e /usr/bin/time /usr/local/bin/ldapsearch -x -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 999999999 \ '(&(objectClass=BioseqRecord)(lib=genbank)(des=human))' \ > ! outgb.ldapc # dn: id=AF037599,lib=GENBANKRELEASE,srv=srs # objectClass: BioseqRecord # objectClass: extensibleObject # fmt: native # seq:: TE9DVVMgICAgICAgQUYwMzc1OTkgICAgICAgICAgICAgICAgMTMzMyBicCAgICBETkEgICAg # IGxpbmVhciAgIEJDVCAxNy1KVUwtMTk5OApERUZJTklUSU9OICBFaHJsaWNoaWEgc3AuIE5DSC0xI ^^^ ldapsearch-c did this to seq ! probably cause of newlines in seq data #! fixed to avoid encoding data: -ttt patch to ldapsearch.c /usr/bin/time ./bioldapsearch -x -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 999999999 -LLL -ttt \ '(&(objectClass=BioseqRecord)(lib=genbank)(des=human))' \ > ! outgb.ldapc real 32:32.7 user 6:31.0 sys 2:04.0 1219158 records ; 4680319099 bytes (4.68 GB) 10) getz '[genbank-des:human]' /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 999999999 -LLL -ttt \ '(&(objectclass=BioseqEntry)(lib=genbank)(des=human))' \ dn > ! outgb1.ldapc real 18:45.5 user 1:39.1 sys 39.5 /usr/bin/time ncftpput iubio.bio.indiana.edu Incoming/ outgb.ldapc outgb.ldapc: 2145288192 bytes 9.14 MB/s time: command terminated abnormally. -- died at 2.1 GB (filesize limit !?) - ncftp or wuftpd error? real 3:44.9 user 0.8 sys 35.1 == ~ 8 min. 
ftp ---- 7) getz '[genbank-org:drosophila]' -e -- sep 02 /usr/bin/time ./bioldapsearch -x -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 999999999 -LLL -ttt \ '(&(objectclass=BioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))' \ > & ! tout.ldapq7 & real 8:26.9 user 1:28.2 sys 38.5 506 secs ; 340794 records ; 1.45 GB ; 1456424336 bytes == /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 999999999 -LLL -ttt \ '(&(objectclass=BioseqEntry)(lib=genbank)(org=drosophila))' \ dn > ! outdr1.ldapc real 4:50.6 user 26.7 sys 10.3 340794 recs /usr/bin/time ncftpput iubio.bio.indiana.edu Incoming/ outdr.ldapc outdr.ldapc: 1.36 GB 8.62 MB/s real 2:42.5 user 0.5 sys 23.9 -- oct 02 /usr/bin/time ./bioldapsearch -x -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 999999999 -LLL -ttt \ '(&(objectclass=BioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))' \ > & ! tout.ldapq7 & real 9:22.1 user 1:26.6 sys 45.7 562 secs; 332474 records ; 1.73 GB ; 1731821877 Oct 14 14:45 tout.ldapq7 15) getz '[lib=gnomapld-org:worm]&[lib-chr:X]' -c 29371 recs # "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/features?start=1;stop=17749735;ref=CHROMOSOME_X" \ /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 999999999 -LLL -ttt \ '(&(objectclass=BioseqRecord)(lib=gnomapld)(org=worm)(chr=X))' \ > ! outgn.ldapc real 28.2 user 6.2 sys 1.3 29371 recs /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 99999999 -LLL -ttt \ '(&(objectclass=BioseqRecord)(lib=gnomapld)(org=worm)(chr=X)(ftk=inverted_repeat))' \ > ! outgn.ldapc real 3.6 user 0.8 sys 0.1 /usr/bin/time ./srs6ldap -d 2 \ -f '(&(objectclass=BioseqRecord)(lib=gnomapld)(org=worm)(chr=X))' \ -b 'srv=srs' -a '*' -m 9 > ! 
outgn.ldapc /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 99999999 -LLL -ttt \ '(&(objectclass=*)(lib=gnomapld)(org=mosquito)(chr=2L)(bpe=>38320970)(bpb=<38321085))' \ /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 99999999 -LLL -ttt \ '(&(objectclass=BioseqEntry)(lib=gnomapld)(org=mosquito)(chr=2L)(bpe=>38320970)(bpb=<38321085))' \ dn: id=id=bp1-47863460,ft=source,chr=2L,spp=mosquito,lib=GNOMAPLD,srv=srs dn: id=id=bp38320346-38328460,ft=CDS,chr=2L,spp=mosquito,lib=GNOMAPLD,srv=srs dn: id=id=AGgn0004351,ft=gene,chr=2L,spp=mosquito,lib=GNOMAPLD,srv=srs real 8.6 user 0.0 sys 0.0 /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 99999999 -LLL -ttt \ '(&(objectclass=BioseqRecord)(lib=gnomapld)(org=mosquito)(chr=2L)(bpe=>38320970)(bpb=<38321085))' \ real 8.7 user 0.0 sys 0.0 /usr/bin/time getz \ '[lib=gnomapld-org:mosquito]&[lib-chr:2L]&[lib-bpe#38320970:]&[lib-bpb#:38321085]' GNOMAPLD:id=bp1-47863460,ft=source,chr=2L,spp=mosquito GNOMAPLD:id=bp38320346-38328460,ft=CDS,chr=2L,spp=mosquito GNOMAPLD:id=AGgn0004351,ft=gene,chr=2L,spp=mosquito real 8.8 user 8.6 sys 0.1 /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 99999999 -LLL \ '(&(objectclass=BioseqRecord)(lib=gnomapld)(org=fly)(chr=2R)(bpe=>7226327)(bpb=<7226281))' \ /usr/bin/time lynx -source \ 'http://oat.bio.indiana.edu:7180/cgi-bin/das/fly/features?start=7226281;stop=7226327;ref=2R' real 0.9 user 0.0 sys 0.0 oat.% /usr/bin/time lynx -source 'http://oat.bio.indiana.edu:7180/cgi-bin/das/fly/features?start=6226281;stop=9226327;ref=2R' | grep -c 'FEAT' real 1:13.7 user 0.1 sys 0.3 28169 (all exons) oat.% /usr/bin/time getz '[lib=gnomapld-org:fly]&[lib-chr:2R]&[lib-bpe#6226281:]&[lib-bpb#:9226327]' -c 1959 (no exons) real 9.9 user 9.4 sys 0.1 oat.% /usr/bin/time lynx -source 
'http://oat.bio.indiana.edu:7180/cgi-bin/das/fly/features?start=6226281;stop=9226327;ref=2R;type=gene' | grep -c 'FEAT' real 3.7 user 0.0 sys 0.0 417 (why fewer than gnomapld ?) ------- /usr/bin/time java -cp $el ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?\ (&(lib=genbank)(des=human))?sizelimit=9999999' | grep -c 'dn:' real 28:05.2 user 7:13.2 sys 1:50.4 1219158 records ------- !! raw data get (~asksrs equiv, no icarus or entries) /usr/bin/time ./srs6ldap -d 0 -r -m 99999 -g \ '[libs={swissprot refseq trembl}-des:kinase*]' 12019 entries ?? missing some real 1:44.0 user 8.0 sys 8.7 test library split query: - 2 swiss, 1 refsq [lib={swissprot refseq}-des:kinesin]&[lib-des:Ubiquitous] java -cp $el ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?dn?sub?\ (&(lib={swissprot refseq})(des=kinesin)(des=Ubiquitous))\ ?sizelimit=10' test at ldap://iubio.bio.indiana.edu:3895/ java ldapsearch 'ldap://iubio.bio.indiana.edu:3895/srv=srs,o=IUBio??sub?(&(objectClass=*)(lib=swissprot)(des=kinesin))?sizelimit=20,deref=always' java -cp /c7/eugenes/ldap/ ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectClass=*)(lib=swissprot)(des=kinesin))?sizelimit=20,deref=always' -- 8 jul 02 --- more tests w/ full data retrieval ./srs6ldap -f '(&(lib=swissprot refseq trembl)(des=kinase)(objectclass=*))' -b 'srv=srs' -a '*' -d 1 # query = [lib={swissprot refseq trembl}-des:kinase] # result set size=20073 # output set limit=10 dn: resultset: Q1 objectClass: BioseqSet objectClass: extensibleObject id: Q1 count: 20073 filter: (&(lib=swissprot refseq trembl)(des=kinase)(objectclass=*)) fmt: native fmt: fasta lib: SWISSPROT lib: TREMBL lib: REFSEQ url: srv=srs??sub?(&(lib=swissprot refseq trembl)(des=kinase)(objectClass=BioseqEntry)) url: srv=srs??sub?(&(lib=swissprot refseq trembl)(des=kinase)(objectClass=BioseqRecord)) /usr/bin/time ./srs6ldap \ -f '(&(lib=swissprot refseq trembl)(des=kinase)(objectclass=BioseqRecord))' \ -b 'srv=srs' -a '*' -m 9999999 > 
! out.seq real 1:51.6 user 22.5 sys 7.6 20073 records /usr/bin/time ./srs6ldap \ -f '(&(lib=swissprot refseq trembl)(des=kinase)(objectclass=BioseqRecord))' \ -p fasta -b 'srv=srs' -a '*' -m 9999999 > ! out.fa real 2:09.5 user 1:03.5 sys 4.9 /usr/bin/time getz '[genbank-org:drosophila]' | wc real 3:44.1 user 3:35.5 sys 6.0 340794 340794 5792964 oat% /usr/bin/time getz '[genbank-org:drosophila]' -c 340794 real 3.1 user 0.4 sys 0.1 oat% /usr/bin/time getz '[genbank-org:drosophila]' -e > out.getz error: insufficient memory - error during malloc, could not allocate "string value" real 52:18.7 user 47:20.2 sys 3:27.1 ^^^ eek, getz failed!, and after taking so long (srs6ldap did ok in 7 mins)! ^^^ got 306032 records ./srs6ldap -f '(&(lib=genbank)(org=drosophila)(objectclass=*))' \ -b 'srv=srs' -a '*' -d 1 # query = [lib={genbank}-org:drosophila] # result set size=340794 # output set limit=10 dn: resultset=Q1,srv=srs objectClass: BioseqSet objectClass: extensibleObject id: Q1 count: 340794 filter: (&(lib=genbank)(org=drosophila)(objectclass=*)) fmt: native fmt: fasta lib: GENBANK url: srv=srs??sub?(&(lib=genbank)(org=drosophila)(objectClass=BioseqEntry)) url: srv=srs??sub?(&(lib=genbank)(org=drosophila)(objectClass=BioseqRecord)) /usr/bin/time ./srs6ldap \ -f '(&(lib=genbank)(org=drosophila)(objectclass=BioseqEntry))' \ -b 'srv=srs' -a 'dn' -m 999999 | grep -c '^dn:' real 3:22.1 user 3:12.6 sys 6.3 340794 /usr/bin/time ./srs6ldap \ -f '(&(lib=genbank)(org=drosophila)(objectclass=BioseqRecord))' \ -b 'srv=srs' -a 'dn' -m 999999 > ! 
out.seq real 7:45.6 user 4:40.7 sys 2:09.9 /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*)(lib=genbank)(org=drosophila))' real 0.7 user 0.4 sys 0.1 12-222-73-134% java ldapsearch 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*)(lib=genbank)(org=homo sapiens))' Search URL: ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*)(lib=genbank)(org=homo sapiens)) dn: resultset=Qi3,srv=srs {fmt=fmt: native, fasta, url=url: srv=srs??sub?(&(objectClass=BioseqEntry)(lib=genbank)(org=homo sapiens)), srv=srs??sub?(&(objectClass=BioseqRecord)(lib=genbank)(org=homo sapiens)), count=count: 5633196, objectclass=objectClass: BioseqSet, extensibleObject, query=query: (&(objectClass=*)(lib=genbank)(org=homo sapiens)), lib=lib: GENBANK, id=id: Qi3} -- direct standalone call oat% /usr/bin/time ./srs6ldap \ ? -f '(&(lib=swissprot refseq trembl)(des=kinase)(objectclass=BioseqRecord))' \ ? -p fasta -b 'srv=srs' -a '*' -m 9999999 > ! out.fa real 1:42.2 user 1:03.3 sys 3.9 -- same client host as server /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=fasta)(lib=swissprot refseq trembl)(des=kinase))?sizelimit=200000' > kinase.fa 110.17 real 18.13 user 10.29 sys /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=swissprot refseq trembl)(des=kinase))?sizelimit=200000' > kinase.out real 1:28.0 user 41.3 sys 5.2 /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=swissprot refseq trembl)(des=kinase))?sizelimit=200000' > ! tout.ldap real 1:43.4 user 39.6 sys 5.4 oat% /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! 
tout.ldap Exception in thread "main" java.lang.OutOfMemoryError <> ^^ this is client error not server >> got Nrecords = 55457 !!!!! real 2:01.0 user 1:49.8 sys 16.0 /usr/bin/time java -Xmx120M ldapsearch \ ? 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ ? (fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! tout.ldap java.lang.OutOfMemoryError <> >> got 186406 records, this time !? what in ldapsearch.java is eating memory ? real 6:22.7 user 5:58.6 sys 46.1 !! STILL OUT OF MEM !! /usr/bin/time perl ldapsearch.pl 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ ? (fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! out.pldap Out of memory during request for 3672 bytes, total sbrk() is 1 332 205 512 bytes! real 11:42.9 user 10:42.4 sys 52.7 ## try w/ debug test, more mem free, more start mem /usr/bin/time java -Xms50M -Xmx180M bioldapsearch debug \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! tout.ldap >> on macosx (java1.3 ) 12-222-73-134% /usr/bin/time java -Xms40M -Xmx180M bioldapsearch debug 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! tout.ldap nrec=0 free Kb=40428 (gc) free Kb=40586 nrec=10000 free Kb=39999 (gc) free Kb=40363 nrec=20000 free Kb=38833 (gc) free Kb=40223 nrec=30000 free Kb=38957 (gc) free Kb=40213 nrec=40000 free Kb=39053 (gc) free Kb=40168 >> on solaris (java1.3 ) oat% /usr/bin/time java -Xms50M -Xmx180M bioldapsearch debug 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! 
tout.ldap nrec=0 free Kb=50856 (gc) free Kb=51053 nrec=10000 free Kb=2621 (gc) free Kb=42866 nrec=20000 free Kb=31199 (gc) free Kb=36139 nrec=30000 free Kb=6944 (gc) free Kb=25907 nrec=40000 free Kb=13666 (gc) free Kb=34900 nrec=50000 free Kb=10807 (gc) free Kb=50977 nrec=60000 free Kb=41339 (gc) free Kb=69079 nrec=70000 free Kb=64913 (gc) free Kb=81137 nrec=80000 free Kb=38961 (gc) free Kb=54767 nrec=90000 free Kb=34414 (gc) free Kb=35535 Exception in thread "main" java.lang.OutOfMemoryError <> ## try solaris set stacksize ? /usr/bin/time java -Xms50M -Xss30M -Xmx180M bioldapsearch debug \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000'\ > ! tout.ldap # on machome macosx: (cable modem) /usr/bin/time java -Xms50M bioldapsearch debug \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=swissprot refseq trembl)(des=kinase))?sizelimit=900000' > ! tout.ldap 448.05 real 44.46 user 8.70 sys # on iubio solaris: /usr/bin/time java -Xms50M bioldapsearch debug \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=swissprot refseq trembl)(des=kinase))?sizelimit=900000' > ! tout.ldap nrec=0 free Kb=50856 (gc) free Kb=51053 nrec=10000 free Kb=39621 (gc) free Kb=50604 nrec=20000 free Kb=45572 (gc) free Kb=48567 nrec=20073 free Kb=44608 (gc) free Kb=50991 real 1:45.8 user 38.1 sys 4.1 105 sec ## try solaris set stacksize ? oat% /usr/bin/time java -Xms50M -Xmx180M bioldapsearch debug \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! tout.ldap nrec=0 free Kb=50856 (gc) free Kb=51053 nrec=10000 free Kb=2621 (gc) free Kb=42866 nrec=20000 free Kb=31199 (gc) free Kb=36139 nrec=30000 free Kb=6944 (gc) free Kb=25907 << looks like j1.4 fixed this mem bug .. 
nrec=90000 free Kb=34414 (gc) free Kb=35535 Exception in thread "main" java.lang.OutOfMemoryError <> sep02 test: oat.% /usr/bin/time java -Xms50M bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=native)(lib=swissprot refseq trembl)(des=kinase))?sizelimit=900000' > ! tout.ldap real 1:45.9 user 29.3 sys 4.6 /usr/bin/time java -Xms50M bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)(lib=refseq)(des=kinase))?sizelimit=2' /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 2 \ '(&(objectclass=BinaryBioseqRecord)(lib=refseq)(des=kinase))' -LLL -ttt -ttt << trick flag (dgg added) for verbatim output of ldapsearch ## output w/o -ttt, w/ it becomes binary asis # NM_000167,lib=REFSEQ,srv=srs dn: id=NM_000167,lib=REFSEQ,srv=srs objectClass: BioseqRecord objectClass: extensibleObject fmt: native bseq:: eJy1WF13m0gSffev6OOnzDmyApKsD+f4AQPCRAgjwDjaF592CyPWCLTQSkb767eqGmkijx3 vzJ7VTKIW6q6uuvfW7Va8O/M+Yurlzx81TdOHI3b66mnDMXva4nAT+ga+F3mZ8hoGQeiyvn5xZ8YX PVh8ZtlT13dj985n7LbaVKzh2zwtG5YVe5HWVcFe8pI3KfvkzH7rULzumWGadhTRmj+SOEvssH320 9OuzpjjXg0utYGmDc9m9vLhLrSogu5ZdHcfmrZKer3b8LJ7xthd6Bi+G81P8zn7uUB798LrfSX5Fz ZPJf93BQNzXdUrjo/Mmpc5jZK0lulTTWN7J9MirRpZbfIvJ9HmfLPhRa7mrNMaR0Gdb7hMG4gGq+t 6nZf5F0wI3lc8pWHVPQvtqR3aPtU= 3) ldap /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -LLL -ttt -z 900000 \ '(&(objectclass=*)(lib=swissprot refseq trembl)(des=kinase))' \ > ! tout.ldap /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -LLL -ttt -z 900000 \ '(&(objectclass=BioseqRecord)(lib=swissprot refseq trembl)(des=kinase))' > ! 
tout.ldap real 1:24.2 user 4.6 sys 1.7 84 secs ; 20073 recs, 67 MBytes ## from dghome over cable net (~ 1MBs speed) dghome2% ping oat 16% packet loss round-trip min/avg/max = 69.831/175.491/341.415 ms /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -LLL -ttt -z 900000 \ '(&(objectclass=BioseqRecord)(lib=swissprot refseq trembl)(des=kinase))' >&! tout.ldapq3 & 257.83 real 2.12 user 5.37 sys 258 secs ; 20087 recs ; 66MB ; 66327992 Oct 14 15:42 tout.ldapq3 /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -LLL -ttt -z 900000 \ '(&(objectclass=BinaryBioseqRecord)(lib=swissprot refseq trembl)(des=kinase))' >&! tout.bldapq3 & 119.25 real 1.42 user 1.35 sys 119 secs ; 20087 recs ; 7MB !! ; 7540212 Oct 14 15:48 tout.bldapq3 --- binldap q 3) /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -LLL -ttt -z 900000 \ '(&(objectclass=BinaryBioseqRecord)(lib=swissprot refseq trembl)(des=kinase))' > ! tout.bldap real 1:47.8 user 4.5 sys 0.9 107 secs ; 20073 recs ; 7.6 MBytes Wow! - NOTE that BinaryBioseqRecord ldap output is SMALLER by 50% than gzipped BioseqRec out -rw-r--r-- 1 gilbertd staff 7618507 Sep 17 17:37 tout.bldap << binary out (same query) -rw-r--r-- 1 gilbertd staff 14812877 Sep 17 17:41 tout.ldap.gz << text out binldap q 7) /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -LLL -ttt -z 900000 \ '(&(objectclass=*)(lib=genbank)(org=drosophila))' count: 332474 /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -LLL -ttt -z 999999999 \ '(&(objectclass=BinaryBioseqRecord)(lib=genbank)(org=drosophila))' \ >& ! tout.bldapq7 & real 38:17.7 user 1:15.4 sys 16.8 2297 secs ; 330509 recs ; 199 MBytes ! ^^ why only 330509 recs ?? - also takes much longer than non-bin ldap ? gzipping? ^^ slapd server died near end... 
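
Big result sets in these tests overran the java and perl clients and at times
slapd itself; the notes get around this by splitting a query with
(start=..)(count=..) terms added to the filter. A minimal sketch of generating
those page filters from the result set count (the count attribute of the
BioseqSet entry) follows. The name pagedFilters is hypothetical, and
firstIndex is an assumption: the notes do not say whether start is 0- or
1-based.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Build one filter per page by adding (start=..)(count=..) terms right after
// the "(&" of a flat AND filter, in the same order as the paged queries in
// these notes. Returns an empty list for filters that do not start with "(&".
std::vector<std::string> pagedFilters(const std::string& filter,
                                      long total, long pageSize,
                                      long firstIndex = 1) {
    std::vector<std::string> pages;
    if (pageSize <= 0 || filter.compare(0, 2, "(&") != 0) return pages;
    for (long done = 0; done < total; done += pageSize) {
        // The last page is shortened so no request runs past the result set.
        long n = std::min(pageSize, total - done);
        pages.push_back("(&(start=" + std::to_string(firstIndex + done) +
                        ")(count=" + std::to_string(n) + ")" +
                        filter.substr(2));
    }
    return pages;
}
```

Stopping at the known total also avoids asking for a page beyond the end of
the set, which a server may answer with a no-matches error.
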

/usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \
  -b srv=srs -s sub -LLL -ttt -z 999999999 \
  '(&(objectclass=BinaryBioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))' \
  >& ! tout.bldapq7 &
real 38:34.4 user 1:14.3 sys 18.2
2314 secs ; 332474 recs ; 200Mb ; 200648238 Oct 14 15:33 tout.bldapq7

## dghome test
dghome binldap q7)
/usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \
  -b srv=srs -s sub -LLL -ttt -z 999999999 \
  '(&(objectclass=BinaryBioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))' \
  >& ! tout.bldapq7 &
2639.49 real 21.65 user 21.79 sys
2639 secs ; 332474 recs ; 200Mb ; 200648242 Oct 14 16:48 tout.bldapq7

dghome ldap q7)
/usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \
  -b srv=srs -s sub -LLL -ttt -z 999999999 \
  '(&(objectclass=BioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))' \
  >& ! tout.ldapq7 &

-----------------
soap q 3) ...
## srs6 SOAP service test (soap.cgi system call to srs6ldap2.cc)
r2 => ['','','sub', '(&(objectClass=BioseqRecord)(lib=swissprot refseq trembl)(des=kinase))',''],

oat.% /usr/bin/time ./srs6soap.pl > ! tout.soap
real 4:21.8 user 0.7 sys 0.1
No result received - server possibly down.
Exact errors: SOAP-ENV:Client:Out of memory during "large" request for 134221824 bytes,
total sbrk() is 2034454472 bytes at /export/home/gilbertd/perl/SOAP/Lite.pm line 1109.

## need soap->proxy('http://localhost/', timeout => 500);
oat.% /usr/bin/time ./srs6soap.pl > ! tout.soap
500 read timeout at ./srs6soap.pl line 54
real 3:00.8 user 0.7 sys 0.1

Wed Oct 9 17:47:54 EST 2002
query 3)
# test paged-soap retrieval - 1000 records/call
#(&(objectClass=BioseqRecord)(lib=swissprot refseq trembl)(des=kinase))'],
oat.% /usr/bin/time ./srs*soap.pl > tout.pagesoap
real 6:39.1 user 1:33.7 sys 4.9
# pagedSearch = 22000,2000,(&(start=22000)(count=2000)(objectClass=BioseqRecord)(lib=swissprot refseq trembl)(des=kinase))
No result received - server possibly down.
Exact errors: SOAP-ENV:Server:No matches to search(??sub?(&(start=22000)(count=2000)(objectClass=BioseqRecord)(lib=swissprot refseq trembl)(des=kinase))?) oat.% fgrep -c ' BioseqEntry id' tout.pagesoap 20087 oat.% fgrep ' BioseqEntry id' tout.pagesoap | sort | uniq -c | wc 19779 ^^ missing some output limit=20087 (was 20073 recs) -- got 308 duplicates - why? - ll from SWISSNEW -- is an SRS or data error; same in tout.ldap soap query (7) q5 => ['','','sub','(&(objectClass=*)(lib=genbank)(org=drosophila))',''], q7 => ['','','sub','(&(objectClass=BioseqRecord)(lib=genbank)(org=drosophila))',''], -- 332474 recs now (was 340794 ??) /usr/bin/time ./srs6soap.pl -q q7 >& tout.soapq7 & -- -- Oct 14 2002 srs6ldap3/srs6ldap_jni/SRSjni/SRSsoapserver/SRSsoapclient cd /c7/eugenes/ldap/srsjni2 setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:`pwd` source glue.env # set classpaths java -Xms30MB -Xmx90MB iubio.srs.SRSsoapserver >& ! log.soapserver & using tuned srs6ldap - java JNI - GLUE server/client (lots of work...) [STARTUP] GLUE standard 3.2.2 (c) 2001-2002 The Mind Electric [STARTUP] http server started on http://129.79.225.149:8004/glue (srs6ldap writes to tmp file, glue server passes file thru http stream to client, which writes to tmp file then passes filehandle to program) q 3) /usr/bin/time java SRSsoapclient -w file://srsglue.wsdl q3 > & ! tout.srsglueq3 & real 4:59.7 user 6.0 sys 3.3 real 4:55.7 user 5.4 sys 3.2 == 299 sec q 7) (1.5 GB data, makes 3 file copies - 1 at server, one at client, then to stdout) /usr/bin/time java SRSsoapclient -w file://srsglue.wsdl q7 > & ! tout.srsglueq7 & -- failed on 2nd/3rd try - at what level? srs6ldap/srs6ldap_jni on server during read of tmp file to send to client? 
(seems to have finished srs search > file)

oat.% more tout.srsglueq7
# app.queryLDAP (null,null,sub,(&(objectClass=BioseqRecord)(lib=genbank)(org=drosophila)),null)
# nResults=332474
Exception in thread "main" java.rmi.ConnectException: url = http://129.79.225.149:8004/glue/srsglue
  at electric.soap.http.SOAPHTTPReference.invoke(Unknown Source)
  at electric.soap.SOAPReference.invoke(Unknown Source)
  at electric.soap.SOAPReference.invoke(Unknown Source)
  at electric.proxy.handler.Proxy.invoke(Unknown Source)
  at electric.util.proxy.proxy3.InvocationAdaptor.invoke(Unknown Source)
  at $Proxy0.getResultsAsAttachment(Unknown Source)
  at SRSsoapclient.<init>(SRSsoapclient.java:107)
  at SRSsoapclient.main(SRSsoapclient.java:28)
real 1:18:16.3 user 3.9 sys 0.3

-----
# test solaris j2se v1.4 ! seems to work - ALMOST (still fails near end) ! (no mem leak as 1.3)
/usr/bin/time java -Xms50M -Xss30M -Xmx180M bioldapsearch debug \
  'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! tout.ldap
nrec=0 free Kb=50818 (gc) free Kb=51064
nrec=10000 free Kb=40499 (gc) free Kb=50664
nrec=20000 free Kb=48029 (gc) free Kb=50104
nrec=30000 free Kb=44526 (gc) free Kb=50631
..
nrec=90000 free Kb=45615 (gc) free Kb=49462
nrec=100000 free Kb=42700 (gc) free Kb=48028
..
nrec=190000 free Kb=44545 (gc) free Kb=49704 nrec=200000 free Kb=40049 (gc) free Kb=48868 nrec=210000 free Kb=39801 (gc) free Kb=48991 nrec=220000 free Kb=9568 (gc) free Kb=48764 nrec=230000 free Kb=31208 (gc) free Kb=49303 nrec=240000 free Kb=5866 (gc) free Kb=46502 << nrec=250000 free Kb=29755 (gc) free Kb=42001 nrec=260000 free Kb=38322 (gc) free Kb=44350 nrec=270000 free Kb=25731 (gc) free Kb=40101 nrec=280000 free Kb=16388 (gc) free Kb=35341 nrec=290000 free Kb=9964 (gc) free Kb=32644 << nrec=300000 free Kb=11420 (gc) free Kb=25790 Exception in thread "main" java.lang.OutOfMemoryError real 7:45.2 user 6:50.7 sys 1:05.0 try -Xms90M -Xss30M -Xmx280M /usr/bin/time java -Xms90M -Xss30M -Xmx280M bioldapsearch debug \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)(fmt=native)(lib=genbank)(org=drosophila))?sizelimit=9000000' > ! tout.ldap nrec=0 free Kb=91650 (gc) free Kb=91832 nrec=10000 free Kb=78309 (gc) free Kb=87739 .. nrec=300000 free Kb=80112 (gc) free Kb=88907 nrec=310000 free Kb=56634 (gc) free Kb=111361 << mem bump nrec=320000 free Kb=66525 (gc) free Kb=179799 nrec=330000 free Kb=151231 (gc) free Kb=239457 << mem bump nrec=340000 free Kb=218721 (gc) free Kb=261075 nrec=340794 free Kb=244037 (gc) free Kb=191133 real 10:34.3 user 9:06.3 sys 1:17.3 -- largest seq is ~ 500 KB -- returns 1.4 GB data 1460512006 Jul 14 18:27 tout.ldap -- over (slowish) network --- (cable modem) /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (lib=trembl)(org=homo sapiens)(fmt=fasta))?sizelimit=2000' > trembl.out 8.26 real 2.95 user 1.11 sys /usr/bin/time java ldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=BioseqRecord)\ (fmt=fasta)(lib=swissprot refseq trembl)(des=kinase))?sizelimit=200000' > kinase.fa 110.17 real 18.13 user 10.29 sys [ ! same speed as client-server on same host; not much diff from standalone call ! 
] -- test LDAP GNOMAPLD (ldif data) Search URL: ldap://iubio.bio.indiana.edu:3895/lib=gnomapld,srv=srs??base?(&(objectclass=*))?sizelimit=900 : null:null:{libgroup=libgroup: Genome, name=name: GNOMAPLD, link=link: GNOMAP_UP, objectclass=objectClass: BioseqDatabank, extensibleObject, dat=dat: 14-Jul-2002, field=field: AllText, all, ID, id, Class, cla, Organism, org, Chromosome, chr, FtKey, ftk, Word, wrd, Name, nam, ChrStart, bpb, ChrStop, bpe, ChrSize, bpl, ChrLocation, map, ParentId, pid, id=id: GNOMAPLD} /usr/bin/time java bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs??sub?(&(objectclass=*)(lib=gnomapld)(org=man))?sizelimit=900' resultset=Qi15: null:null:{fmt=fmt: native, fasta, url=url: srv=srs??sub?(&(objectClass=BioseqEntry)(lib=gnomapld)(org=man)), srv=srs??sub?(&(objectClass=BioseqRecord)(lib=gnomapld)(org=man)) , count=count: 259048, objectclass=objectClass: BioseqSet, extensibleObject, query=query: (&(objectClass=*)(lib=gnomapld)(org=man)), lib=lib: GNOMAPLD, id=id: Qi15} /usr/bin/time java bioldapsearch \ 'ldap://iubio.bio.indiana.edu:3895/srv=srs?*,+?sub?(&(objectclass=BioseqEntry)(lib=gnomapld)(org=man))?sizelimit=9' ## BioseqRecord fails for lib=gnomapld - various reasons (Big/little FIP; no seq record; wrong data file - ask for man, got fly ?? -- wgetz tests --- Here are some srs6 wgetz parameters which produce different output. The '-vn' parameter is 'view number', I believe. Note that the '-id+...' is a session ID number, not needed. 
>> plain data record: +-vn+2 http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?[swissprot-id:'PAPA_CARPA']+-vn+2 >> hyperlinked 'plain' format with page wrapper: +-vn+2+-page+EntryPage http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?[swissprot-id:'PAPA_CARPA']+-vn+2+-page+EntryPage >> header information only: +-vn+3+-page+EntryPage http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?[swissprot-id:'PAPA_CARPA']+-vn+3+-page+EntryPage http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?[swissprot-des:kinesin]+-vn+2 /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?[lib={swissprot refseq trembl}-des:kinesin]+-vn+2' \ > ! wgetz.out /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?%5Blib%3D%7Bswissprot%7D-des%3Akinase%5D%0A+-vn+1' \ > ! wgetz.out urlesc: %5Blib%3D%7Bswissprot%20refseq%20trembl%7D-des%3Akinesin%5D%0A 2) /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?%5Blib%3D%7Bswissprot%20refseq%20trembl%7D-des%3Akinase%5D%0A+-vn+1' \ > ! out1s.wgetz real 19.2 user 0.0 sys 0.0 3) /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?%5Blib%3D%7Bswissprot%20refseq%20trembl%7D-des%3Akinase%5D%0A+-vn+2' \ > ! out3s.wgetz real 5:09.9 user 0.4 sys 1.8 ^^^^ slower - compared to same w/ ldap, 3x slower 6) /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?%5Blib%3D%7Bgenbank%7D-org%3Adrosophila%5D%0A+-vn+1' \ > ! out6s.wgetz real 4:04.5 user 0.0 sys 0.0 /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?%5Blib%3D%7Bgenbank%7D-org%3Adrosophila%5D%0A+-vn+2' \ > ! tout.wgetz real 1:00:53.8 user 10.3 sys 36.2 SRS error -- got 310365 records of 340794 10) /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?%5Blib%3D%7Bgenbank%7D-des%3Ahuman%5D%0A+-vn+1' \ > ! 
out1.wgetz real 14:22.3 user 0.0 sys 0.0 11) getz '[genbank-des:human]' -e /usr/bin/time getz '[genbank-des:human]' -e > ! outgb11.getz error: insufficient memory - error during malloc, could not allocate "string value" real 53:54.4 user 48:13.4 sys 3:58.8 3234 secs ^^^ got 317736 records of 1219158 (0.26) est time? 12,438 secs 3234 /usr/bin/time lynx -source \ 'http://iubio.bio.indiana.edu/srs6bin/cgi-bin/wgetz?%5Blib%3D%7Bgenbank%7D-des%3Ahuman%5D%0A+-vn+2' \ > ! out11.wgetz lynx -source \ 'http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+LibInfo+-newId+-lib+MEDLINE2002+-ascii' http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+databanks+-newId ------ DAS tests ---------- /usr/bin/time java ldapsearch \ 'ldap://eugenes.org:3892/spp=worm,srv=das,o=euGenes??one' chr=CHROMOSOME_I: null:null:{bases=bases: 15080475, stop=stop: 15080475, strand=strand: +, objectclass=objectClass: Chromosome, start=start: 1, id=id: CHROMOSOME_I} chr=CHROMOSOME_II: null:null:{bases=bases: 15174006, stop=stop: 15174006, strand=strand: +, objectclass=objectClass: Chromosome, start=start: 1, id=id: CHROMOSOME_II} chr=CHROMOSOME_III: null:null:{bases=bases: 13855084, stop=stop: 13855084, strand=strand: +, objectclass=objectClass: Chromosome, start=start: 1, id=id: CHROMOSOME_III} chr=CHROMOSOME_IV: null:null:{bases=bases: 17493985, stop=stop: 17493985, strand=strand: +, objectclass=objectClass: Chromosome, start=start: 1, id=id: CHROMOSOME_IV} chr=CHROMOSOME_V: null:null:{bases=bases: 20916337, stop=stop: 20916337, strand=strand: +, objectclass=objectClass: Chromosome, start=start: 1, id=id: CHROMOSOME_V} chr=CHROMOSOME_X: null:null:{bases=bases: 17749735, stop=stop: 17749735, strand=strand: +, objectclass=objectClass: Chromosome, start=start: 1, id=id: CHROMOSOME_X} /usr/bin/time java ldapsearch \ 'ldap://eugenes.org:3892/chr=CHROMOSOME_X,spp=worm,srv=das,o=euGenes\ ??sub?(&(objectClass=Feature)(|(ft=gene)(ft=CDS)(ft=exon))(start=>100000)(stop=<150000))' /usr/bin/time java ldapsearch \ 
'ldap://eugenes.org:3892/chr=CHROMOSOME_X,spp=worm,srv=das,o=euGenes??sub?(&(objectClass=Feature)(start=>100000)(stop=<120000))'\ | grep -c 'id=' 'ldap://eugenes.org:3892/chr=CHROMOSOME_X,spp=worm,srv=das,o=euGenes??sub?(&(|(objectClass=Feature)(objectclass=NA-sequence))(start=>100000)(stop=<120000))' # 10% of X csome (17749735) /usr/bin/time java ldapsearch \ 'ldap://eugenes.org:3892/chr=CHROMOSOME_X,spp=worm,srv=das,o=euGenes??sub?(&(objectClass=Feature)(start=>1)(stop=<1774973))' \ > ! outdas.ldap 13596 records (!? not same as web-das !) real 3:32.9 user 13.0 sys 1.6 # 1% of X csome /usr/bin/time java ldapsearch \ 'ldap://eugenes.org:3892/chr=CHROMOSOME_X,spp=worm,srv=das,o=euGenes??sub?(&(objectClass=Feature)(start=1)(stop=177497))' \ > ! outdas.ldap real 30.6 user 3.2 sys 0.4 --- Web DAS ----- (! note: lynx dump doesnt include XML parsing time by client to split out records and fields ) /usr/bin/time lynx -source \ 'http://eugenes.org:7180/cgi-bin/das/dsn' /usr/bin/time lynx -source \ 'http://eugenes.org:7180/cgi-bin/das/worm/entry_points' /usr/bin/time lynx -source \ "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/features?start=10000;stop=110000;type=partial_gene;ref=CHROMOSOME_V" \ | grep -c 'FEATURE id=' # total X= 17749735 bases /usr/bin/time lynx -source \ "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/features?start=1;stop=1774973;ref=CHROMOSOME_X" \ > ! outw.das /usr/bin/time lynx -source \ "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/types?start=1;stop=1774973;ref=CHROMOSOME_X" \ /usr/bin/time getz '[lib=gnomapld-org:worm]&[lib-chr:X]&[lib-bpe#:1774973]' "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/types?start=1;stop=1774973;ref=CHROMOSOME_X" \ /usr/bin/time lynx -source \ "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/features?start=1;stop=17749735;ref=CHROMOSOME_X" \ > ! 
outw.das real 24:11.6 user 3.0 sys 9.4 /usr/bin/time lynx -source \ "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/features?start=1;stop=177497;ref=CHROMOSOME_X" \ > ! outw.das real 20.8 user 0.0 sys 0.1 1% = 7558 feats; real 20.3 secs 10% = real 2:18.1 user 0.2 sys 0.7 recs=55249 100% = real 24:11.6 13) /usr/bin/time getz '[lib=gnomapld-org:worm]&[lib-chr:X]&[lib-ftk:inverted_repeat]' -c 3250 real 0.6 /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 99999999 -LLL -ttt \ '(&(objectclass=BioseqRecord)(lib=gnomapld)(org=worm)(chr=X)(ftk=inverted_repeat))' \ > ! outgn.ldapc real 3.6 user 0.8 sys 0.1 /usr/bin/time lynx -source \ "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/features?type=repeat:inverted;ref=CHROMOSOME_X" \ > ! outw.das real 26.2 user 0.0 sys 0.1 14) all features from bp 1000000 to bp 2000000 on Chr X, spp=worm /usr/bin/time lynx -source \ "http://oat.bio.indiana.edu:7180/cgi-bin/das/worm/features?start=1000000;stop=2000000;ref=CHROMOSOME_X" \ > ! outw.das real 56.1 user 0.1 sys 0.2 /usr/bin/time ./bioldapsearch -H 'ldap://iubio.bio.indiana.edu:3895/' \ -b srv=srs -s sub -z 9999999 -LLL -ttt \ '(&(objectclass=BioseqRecord)(lib=gnomapld)(org=worm)(chr=X)(bpb=>1000000)(bpe=<2000000))' \ > ! 
outgn.ldapc
nrecs = 1734
real 11.6 user 0.2 sys 0.1

CHROMOSOME_X
***************/

Query: Sequences from SwissProt, TrEMBL, RefSeq matching description "kinase"
Result: 20,073 records (at IUBio SRS databanks, June 2002)
LDAP clients: one page of Java or Perl code (Java is faster here; ~ same speed as C client)

Test cases:
  1) Summary result with count:    getz -c  or ldap attribute=*, objectClass=*
  2) All entries with only ID:     getz     or ldap attribute=dn, objectClass=BioseqEntry
  3) All entries with full record: getz -e  or ldap attribute=*, objectClass=BioseqRecord

queries:
  1-3) SRS getz:    [lib={swissprot trembl refseq}-des:kinase]
  1-3) LDAP filter: (&(lib=swissprot refseq trembl)(des=kinase))
  5)  getz '[genbank-org:drosophila]' -c    ... 340,794 records
  6)  getz '[genbank-org:drosophila]'       (ID only)
  7)  getz '[genbank-org:drosophila]' -e    (all)
  9)  getz '[genbank-des:human]' -c         .. 1,219,158 records
  10) getz '[genbank-des:human]'
  11) getz '[genbank-des:human]' -e > ! getz.out

Time (seconds)
  Query   getz      ldap    wgetz    soap      ftp
          local     net     net      net       net (no query)
   1)     0.5       0.5     --
   2)     18        17      19
   3)     297       84      309      300       8      (66 MB data ; 20,000 recs)
   5)     0.6       0.7     --
   6)     224       202     244
   7)     3138*     506     3653*    >4680+    162    (1500 MB data ; 340,000 recs)
   9)     1.3       1.4     --
  10)     864       1108    862
  11)     ~12000*   1952    *        --        520    (4700 MB data ; 1,200,000 records)

  * failed to get all records (memory overflow error by getz or wgetz at ~ 300,000 records)
  + soap server fails repeatedly w/ large data size (can fix?)

-- BinaryBioseqRecord over ldap (local net)
  3) binldap   107 secs ; 20073 recs ; 7.6 MBytes
  7) binldap   2314 secs ; 332474 recs ; 200 MB ;

-- BioseqRecord over ldap (local net)
  3) ldap      84 secs ; 20073 recs ; 67 MB - sep02
  7) ldap      506 secs ; 340794 records ; 1.45 GB - sep02
               562 secs ; 332474 records ; 1.73 GB - oct02

-- Over slow network (dghome ; ~ 1MBs)
  3) ldap      258 secs ; 20087 recs ; 66 MB
  7) ldap
  3) binldap   119 secs ; 20087 recs ; 7.6 MB
  7) binldap   (!
near same as local net)
               2639 secs ; 332474 recs ; 200 MB
  3) soap/glue

-- Web-XML/soap service timing - perl CGI and client (? test java/glue server/client?)
  3ws) 6:39 == 399 sec -- perl server/client (memory failure...)
  3) glue using tuned srs6ldap - java JNI - GLUE server/client (lots of work...)
       4:59.7 == 299 sec
  7) glue
       more than 1:18:00 == 4680 sec
       -- failed on 2nd/3rd try - at what level? srs6ldap/srs6ldap_jni on server
          during read of tmp file to send to client?
          (seems to have finished srs search > file)

-- Web-DAS .. [!! this data is not quite comparable for Web-DAS and ldapsrs ]
                                                        Web-DAS           ldapsrs
  13) all inverted repeats on chr=X, spp=worm           26                3.6
  14) all features from bp 1000000 to bp 2000000
      on chr=X, spp=worm                                56 (22K recs!)    14 (2k recs!)

  7:45  = 465
  52:18 = 3138
  14:24 = 864

[ turn above table to chart ]

  11) java ldsearch .. out of mem at 1086547 recs , 3.3 GB (! most of them)
      c ldsearch .. 32:32.7 = 1952

[?? add also chart of gnomap-srsldap vs das *** ?? ]