Changes to freeWAIS-0.2-sf-beta-04
August 1994 .. Jun'95
Don Gilbert, gilbertd@bio.indiana.edu

Source:

I have made my working version of freewais-sf available (from the
patch level 2b4). It includes various corrections and enhancements to
freewais-sf. These changes are as yet mostly undocumented. As soon as
time permits I'll provide documentation and patches to current
freewais-sf. The version below is what I am using successfully now.
Use it or not at your own discretion.

  Fixed stoplist code that got broken somewhere along the way.
  Added code for biology database uses (flybase).
  Added "Date" field usage.
  Tried to isolate query user interface stuff (symbols).
  Isolated regexp code.
  Portable index files (file of filenames fix).
  Numeric range searches -- not yet working?
  CSV and TSV data formats with auto-field indexing.

See contributions/ask-waissf.c for a waissearch variant useful for
gopher and maybe html servers.

Date field usage (small hack)
------------------------------
Added code to create a numeric index for the date, as defined in a
field section of the index.fmt file. This lets one search on "date="
fields, as in

  query: date>940731
  returns all records later than 31 July 1994

To use, make these additions to your "index.fmt" files:

  (a) include a date description, as described for prior versions of
      fwais-sf

        .... as before ...
        /^\*H / /%d %3s %d/ day month string year /^\*H [^ ]/

  (b) and define a numeric field for date in this form (note: the odd
      capitalization of "DaTe=" is required; it is written into the
      source code that way)

        /^DaTe=/ date /^DaTe=[^ ]/ 8 TEXT LOCAL /^/

Query user interface
--------------------
The query user interface uses special symbols that can conflict with
biological database uses: *, (), = and maybe a few others. I tried to
put all "query user interface" symbols (*, and, or, not, (), =) into a
single piece of changeable code, but failed (can't get the yacc .y
stuff to read C headers). See ir/query_symbol.h and query_y.y for
changes.

Isolated regexp code.
--------------------
Needed to put regexp.h into a .c wrapper file, since the AT&T-delivered
header contains a lot of C code, and including it in multiple files
yields linker errors.

Portable index files
--------------------
The naming of files in waisindex has been hard coded in such a way
that data files could not be moved from one folder to another without
requiring reindexing. I've revised the file name handling so that now
indexes and data files are fully portable. They can be moved among
file systems, and even to different operating systems. SunOS-built
indices work perfectly on Macintosh computers, etc.

The new methods rely on two index files, .fnn and .fni, which replace
the file of data filenames, .fn. The .fnn file of filenames can be
edited with a text editor to put in new data file paths for the
indices. The .fni file of filename indices is then updated the next
time a search is performed on that index (a current bug makes the
first such search fail as the .fni is updated, then subsequent ones
succeed). This method does not lose any of the index lookup speed
that the prior hardcoded .fn system had. The new or old method can be
selected at run time.

Numeric range searches
----------------------
?? is this working ?

  Index record with   number_field= 123-456
  Search with query   number_field= 345

Result *should* match such a record as 123-456 (the query value is in
the numeric range specified by 123-456). I haven't had time to debug
this code. It looks like it isn't yet working.

Comma-separated-values and Tab-sep-values fields format
--------------------------------------------------------
Handles the common spreadsheet/data file formats known as CSV and TSV,
where each record is one line and the field data in that line is
delimited by commas or tabs. The fields .fmt file is used with special
field regular expression tags: /^field1 /, /^field2 /, etc.

[more .. fill in..]
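
As an illustration of the one-record-per-line layout described above, here is a minimal sketch in C of splitting a CSV or TSV line into fields and naming them field1, field2, and so on. It is not taken from the freewais-sf sources: the function names split_fields() and field_name() are hypothetical, and the sketch assumes fields are never quoted and never contain the delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of freewais-sf.
 * Splits one record line in place on a single delimiter character
 * (',' for CSV, '\t' for TSV) and stores a pointer to each field.
 * Assumptions: no quoted fields, no escaped delimiters.
 * A trailing end-of-line is stripped first.
 * Returns the number of fields stored (at most max_fields). */
size_t split_fields(char *line, char delim, char **fields, size_t max_fields)
{
    size_t n = 0;
    char *p;

    assert(line != NULL && fields != NULL);

    line[strcspn(line, "\r\n")] = '\0';    /* drop end-of-line */
    if (max_fields == 0)
        return 0;

    fields[n++] = line;
    for (p = line; *p != '\0'; p++) {
        if (*p != delim)
            continue;
        if (n == max_fields)
            break;        /* out of slots: the rest stays in the last field */
        *p = '\0';        /* terminate the current field */
        fields[n++] = p + 1;
    }
    return n;
}

/* Hypothetical helper: writes the 1-based name "field1", "field2", ...
 * for the zero-based column number given in index. */
int field_name(size_t index, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "field%lu", (unsigned long)(index + 1));
}
```

Calling split_fields() on a tab-separated line with three columns returns 3, and field_name(0, ...) through field_name(2, ...) give the names to pair with the columns. Splitting in place avoids copying, at the cost of modifying the caller's buffer.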
Miscellaneous corrections
-------------------------
stoplist feature was not working -- needed patches in a few places

"literal phrase searches" ! WAS NOT WORKING with stemming !
  -- stemming is/was incompatible with literal phrase searches
  (indexing with stemming on chops real words into de-suffixed stem
  words, so when literal phrases are searched, they don't match)
  ... fixed it, in waissearch/sersrch.c (no need to re-index)

check delimiter (BIO) usage -- seems to be ignoring all symbols
  -- this was a failure in some FIELD addition not using the
  WordDelimiters...

ircfiles.c: added flygenes_field_separator()
  -- to preprocess some fields in this data (convert odd data to true
  numeric types)

waisindex.c: new index type flygenefields

correction/change to field indexing method so that /^exp/ can be
properly used as an end-of-field expression even if /exp/ occurs
within the first line of the index expression. E.g.,

  start-field: /^SomeWord /
  Skip-past:   /^SomeWord /
  end-field:   /^[A-Z]/

This would fail to index to the next line starting with [A-Z] in the
unpatched version, because the regexp parser treats the text after the
"skip-past" portion as the start of a new line.

added fixes for -DBIO uses.

?? auto-include field "date" when fields_date_function is called ??
  field_y.y: add code to DATE to make "date" field
  irtfiles.c: field_index_text_file() -- add check for "date" field,
  and do date_function()
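
To make the numeric date index from "Date field usage" more concrete, here is a minimal sketch in C of turning a "day month-name year" date into a number that a query like date>940731 can be compared against. This is not the date_function() in the freewais-sf sources. The names date_to_yymmdd() and date_is_later() are hypothetical, and the YYMMDD encoding is an assumption inferred from the query example, not a statement of how the index stores dates.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the freewais-sf date_function().
 * Parses text such as "31 Jul 1994" with the scanf pattern
 * "%d %3s %d" (day, three-letter month name, year) and returns
 * the date as a YYMMDD number, e.g. 940731.
 * Returns -1 if the text cannot be parsed. */
long date_to_yymmdd(const char *text)
{
    static const char *const months[12] = {
        "jan", "feb", "mar", "apr", "may", "jun",
        "jul", "aug", "sep", "oct", "nov", "dec"
    };
    char mon[4];                 /* 3 letters plus terminating NUL */
    int day, year, month = 0, i;

    assert(text != NULL);

    if (sscanf(text, "%d %3s %d", &day, mon, &year) != 3)
        return -1;

    /* Compare month names without regard to case. */
    for (i = 0; mon[i] != '\0'; i++)
        mon[i] = (char)tolower((unsigned char)mon[i]);

    for (i = 0; i < 12; i++) {
        if (strcmp(mon, months[i]) == 0) {
            month = i + 1;
            break;
        }
    }
    if (month == 0 || day < 1 || day > 31 || year < 0)
        return -1;

    /* Two-digit year, as assumed from the query example. */
    return (year % 100) * 10000L + month * 100L + day;
}

/* Hypothetical helper: the comparison behind a "date>N" query. */
int date_is_later(long record_date, long query_date)
{
    return record_date > query_date;
}
```

With this encoding a plain numeric comparison orders dates correctly within one century. Because only two year digits are kept, dates from 2000 on would sort before 1999 in this sketch.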