RePEc::Index DEVELOPMENT LOG AND IDEAS


Starts sometime in 2002.

Hmm. I just had an idea about partial updates.

When there is a conflict, or a suspicion that there may be a double
handle, we may leave some checks and processing for a later time.
If the questions are not resolved during the update itself, we try to
resolve them afterwards (for instance at the close() stage).


I still haven't found a solution for RePEc Index's handle/template
history tracking.  I wanted to have a separate database for
handle/template history tracking.  It could be all-inclusive, while
the main database would only contain the recently approved records.

So the history would keep track of such events as:

 (a)  finding a template in a data file at an update
 (b)  disappearance of a template from a file
 (c)  disappearance of a file in which the template was

 (d)  noticing a template checksum change

 (e)  finding a handle conflict between different files
 (f)  finding a repeated handle in a single file
 (g)  finding a handle conflict cleared (resolved)


* UPDATE from 2003-05-02 00:28

 (h)  record processed

* UPDATE END

Some of the listed facts aren't simple to establish.  The last four
types of facts, (d), (e), (f) and (g), can be deduced from certain
combinations of the basic facts (a), (b), (c).  Let's call them
derivative facts.  Now let's be more specific about what these facts
are.

 (a)  the template with [checksum] read from [file] at [position],
      during an update session at [start time]

 (b)  [file] no longer has the template

 (c)  [file] no longer exists


 (d)  the template's content changed, new [checksum]
      (in [file], [position], [update session start time])


 (e)  the template is present in several different files:
      ( ([file], [position]), ... )
 
 (f)  the template is present twice or more in the same [file]:
      ( [position], ... )
 
###  now that looks redundant: (e) and (f) are too closely related,
###  and the separation is unnatural: one template can (although the
###  chances are small) appear twice in one file and once more in
###  some other file.  Although the processing will differ in a
###  subtle way, (e) is to be joined with (f).  Or, more precisely,
###  (f) will be a special case of the more general (e).

 (g)  conflict cleared, [update session time]


* UPDATE from 2003-05-02 00:28

 (h)  the record had undergone full processing in an application
      framework, [update session time]

* UPDATE END


Now on to the derivative facts.

Assume we are at an (a) fact, with its [file], [position] and
[checksum].  Of course, we record the fact.  Then we do some
analysis.

First, we need to check whether there is a conflict somewhere (fact
(e/f)).  We may *suspect* a conflict if there was an (a) fact with
some other file *recently*.  We may *definitely* *establish* a
conflict if there was an (a) fact with another [file, position] pair
in the *same* update session.

All this is getting so difficult that I'm getting worried.  Wouldn't
it be better to find other, simpler ways to track the handles?  Or
maybe I need to break the task into smaller ones.  Like some function
which will take a handle's history and then return some data
structure reflecting the status of the handle.

* * *

If there is a conflict currently, then an (a) fact can't clear it.
It can only add to the conflicting positions.

If there is no conflict, we need to check any recent (a) facts not
devalued by an appropriate (b) or (c) fact.  This means we had better
keep a list of files/positions/checksums/last-seen update session
times in which the template was seen lately and should be expected.
As soon as we get a (b) or (c) fact, we remove the file from the list
appropriately.  (hahaha.)  Then it's a bit easier.

This may be a hash with filenames as keys, and positions and update
session times as members of the value arrays.  Easy to check, easy to
delete, easy to update.
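
A minimal Perl sketch of that bookkeeping (the subroutine names here
are illustrative only, not the future interface):

```perl
use strict;
use warnings;

# where the template was last seen:
# filename => { position => [ checksum, update session time ] }
my %present;

# an (a) fact: the template was seen
sub seen {
    my ( $file, $pos, $checksum, $session_time ) = @_;
    $present{$file}{$pos} = [ $checksum, $session_time ];
}

# a (b) fact: the template is gone from that position of the file
sub lost_template {
    my ( $file, $pos ) = @_;
    delete $present{$file}{$pos};
    delete $present{$file} unless %{ $present{$file} || {} };
}

# a (c) fact: the whole file is gone
sub lost_file {
    my ($file) = @_;
    delete $present{$file};
}
```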


So, to finalize.  I need a class which will be responsible for
tracking a handle's history and status.  Objects of the class will be
saved to a database (Berkeley) and restored from it.  An object will
receive the basic fact messages: (a), (b), (c); those will be
recorded in the handle's history without changes.  On some occasions
the objects themselves will issue & record derivative messages: (d),
(e/f), (g).  An object will also, of course, present the history and
the status to the general public.  Objects of this class also need
the ability to request that the processor check this or that template
in this or that file.  Or maybe just a file.  So, it must be able to
put a file into the update processing queue.  That's all, I think.
Now I only need to decide upon the data structures and the class's
interface (and name).


First, the data structures.  The history is a list of event
descriptions.  An event description consists of: the update session
time, the event type, and the data associated with the event,
depending on its type.

The update session time is an integer number of seconds since the
epoch.  For the event type we may choose between two alternative
strategies: use letters, from a to g as specified above, or use plain
English words/phrases.

  (a) - present
  (b) - lost template
  (c) - lost file

  (d) - change

  (e,f) - conflict

  (g) - clear

* UPDATE from 2003-05-02 00:28

  (h) - processed

* UPDATE END


?

The structure: array of events.  Each event is an array.

history => 

(a)
 [ UPDATE_SESSION_TIME, "present", "dir/filename", POSITION_NUMBER,
                                                                CHECKSUM, DATATYPE ]

(b)
 [ UPDATE_SESSION_TIME, "lost template", "dir/filename", POSITION_NUMBER, CHECKSUM ]

(c)
 [ UPDATE_SESSION_TIME, "lost file", "dir/filename" ]

(d)
 [ UPDATE_SESSION_TIME, "change", "dir/filename", POSITION_NUMBER, CHECKSUM ]

(e-f)
 [ UPDATE_SESSION_TIME, "conflict", 
			["dir1/filename1", POSITION_NUMBER, CHECKSUM ],
			["dir2/filename2", POSITION_NUMBER, CHECKSUM ],
			...
			]

(g)
 [ UPDATE_SESSION_TIME, "clear"	]

* UPDATE from 2003-05-02 00:28

(h)  record processed
 [ UPDATE_SESSION_TIME, "processed"	]

* UPDATE END
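
As a Perl sketch, pushing such events onto the history list (the
filename and checksum are the sample values used elsewhere in this
log; the datatype value here is made up):

```perl
use strict;
use warnings;

my @history;

# an (a) "present" event, with the datatype as the last field
push @history,
  [ 1045569700, 'present', 'iku/kokoko/nierwp0051.rdf', 0,
    '7y5huGoLrnu4t7YcZUWKTw', 'ReDIF-Paper' ];

# a (g) "clear" event: nothing but the session time and the name
push @history, [ 1045569866, 'clear' ];

# the event type is always the second element
my @clears = grep { $_->[1] eq 'clear' } @history;
```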


Status structure?

handle history object structure:
 (updated 2003-02-20 18:49)

 
{ 
  present => 
       [ 
	   [FILE, POSITION, CHECKSUM, UPDATE_SESSION_TIME],
	   ...
       ],

  handle => $handle,
  session_time => $seconds,

	       
  type   => $datatype,  # not present if there is a conflict

  conflict =>  # not present if there's no conflict
	   {
	     'directory/filename.ext' => {
                                          FILEPOS => [
                                                       'CHECKSUM',
                                                       SESSIONTIME
                                                     ]
                                            },
             'iku/kokoko/nierwp0051.rdf' => {
                                              '0' => [
                                                       '7y5huGoLrnu4t7YcZUWKTw',
                                                       1045569866
                                                     ]
                                            }
           },

  last_changed => UPDATE_SESSION_TIME,


  history => [
	       [
                 SESSION_TIME,
                 EVENT_NAME,
                 EVENT_DATA1,
                 EVENT_DATA2,
		 ...
               ],
               [
                 1045569700,
                 'present',
                 'iku/kokoko/nierwp0051.rdf',
                 0,
                 '7y5huGoLrnu4t7YcZUWKTw'
               ],
               [
                 1045569700,
                 'change'
               ],

             ],


* UPDATE from 2003-05-02 00:28

  last_processed => UPDATE_SESSION_TIME,

* UPDATE END

}


Class name: RePEc::Index::History::Handle?

(updated 2003-02-20 18:17: DATATYPE in present())

Class interface.  Methods:

  new ( HANDLE, OBJECT_DATA )

Constructor.  OBJECT_DATA is a previously created object restored
from the database (Berkeley, supposedly).  If it is undef, a new
empty object is created.

  update_session_time( UPDATE_SESSION_TIME )

A setter method for all the event-methods that follow.

  present( FILE, POSITION, CHECKSUM, DATATYPE )

(a) event

  lost_template( FILE, POSITION )

(b) event

  lost_file( FILE )

(c) event


Accessor-methods:

  handle ();

Returns the handle.  XXX is this useful?

  conflict ();

Returns a list of [FILE, POSITION, CHECKSUM] items if the handle is
used in several templates, in several file/position pairs.  Returns
undef otherwise.

  present ();

Returns a list of [FILE, POSITION, CHECKSUM] items, representing the
points where the handle is present (to the best of our knowledge).

  history ();

Returns the history array as described above.

  status ();

Returns the status hash as described above.

  last_changed ();

Seconds since the epoch at the most recently noticed template
change.

  last_checked ();

Seconds since the epoch of the most recent update session in which
this template was found present.  XXX Ambiguity: seconds since the
most recent update session when the template was found present, or
seconds of the latest history record, no matter what that record is?
Both seem reasonable, but they are different.  A template may
disappear; that will be noticed, and that will be the date of the
last history record...  What is this method for at all?  Maybe for
tracking old (outdated) handles.  Then either way it will do the
job.
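
To see whether this interface feels right, a toy stub (a sketch only,
not the real class; the handle and file names are made up).  Note
that present() doubles as the (a) event method and the accessor,
dispatched here on the argument count:

```perl
use strict;
use warnings;

package RePEc::Index::History::Handle;    # a toy stub, not the real class

sub new {
    my ( $class, $handle, $data ) = @_;
    # OBJECT_DATA restored from the database, or a new empty object
    my $self = $data || { handle => $handle, present => [], history => [] };
    return bless $self, $class;
}

sub update_session_time { $_[0]{session_time} = $_[1] }

# the (a) event method and the accessor share a name;
# with arguments it records the event, without it reports
sub present {
    my $self = shift;
    if (@_) {
        my ( $file, $pos, $checksum, $datatype ) = @_;
        push @{ $self->{history} },
          [ $self->{session_time}, 'present', $file, $pos, $checksum, $datatype ];
        push @{ $self->{present} },
          [ $file, $pos, $checksum, $self->{session_time} ];
        return;
    }
    return @{ $self->{present} };
}

sub handle  { $_[0]{handle} }
sub history { $_[0]{history} }

package main;

my $h = RePEc::Index::History::Handle->new( 'RePEc:fgh:wpaper:0001', undef );
$h->update_session_time(1045569700);
$h->present( 'fgh/wp/wp0001.rdf', 0, 'abc123', 'ReDIF-Paper' );
```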


One of the ways forward is to think about how exactly the stuff
described above (RePEc::Index::History::Handle) will be used and by
whom.

Another very important issue.  One of the ideas was that we will
have partial updates.  Then, in the case of a suspected conflict, we
should enqueue the suspected files for a later check.  Then
..::History::Handle must have a way to request that, because a
conflict can only be suspected inside it.  But no, let's go another
way.  A handle's record, after an (a) event, may include some
element, hmm... also_check_file (or similar), which will list a
filename.  And then the Update class will check for this element
after each (a) event.  That's not a very elegant design, but the best
I can think of for now.



So let's look at the Update class and see how History::Handle fits in
there.  Do I need to create another class?

Yes, but later.  

Now I need to decide how the internal processing will work in the
History::Handle class.  With the idea of keeping a status hash,
things should have become simpler, but I'm not sure about that.

I have status hash and I have history list.  A "present" (a) event
comes.  What do I do?  First, and that's without worries, I push the
event into the history list.

Then I need to check for the conflicts.  

First I go through the status->present items and classify each of
them (there might be many) as either "old" or "now", based on the
update session time of those present items:

  "old" -- comes from a previous session

  "now" -- was updated in the current update session

The items are reviewed to find matches in the filename.  If the same
filename is in "old", we update that item with the current one.  If
it's a different filename, we suspect a conflict: we include it in a
list to be checked for conflicts later.

If the same filename is in "now", it is the conflict of a repeated
handle in one and the same file.  If it's a different filename, it is
the different-files conflict.
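
The old/now classification could be sketched like this
(classify_present is an illustrative name, not a planned method):

```perl
use strict;
use warnings;

# split status->{present} items into "old" (previous sessions)
# and "now" (the current update session)
sub classify_present {
    my ( $present, $current_session ) = @_;
    my ( @old, @now );
    for my $item (@$present) {    # [ FILE, POSITION, CHECKSUM, SESSION_TIME ]
        if ( $item->[3] == $current_session ) { push @now, $item }
        else                                  { push @old, $item }
    }
    return ( \@old, \@now );
}

my ( $old, $now ) = classify_present(
    [
        [ 'a/x.rdf', 0, 'c1', 1045569700 ],    # from a previous session
        [ 'b/y.rdf', 3, 'c2', 1045569866 ],    # from the current one
    ],
    1045569866
);
```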


Simple, huh?

An implication: I must ensure that each file is processed no more than
once in an update session.


Problems: 

1) how exactly will "old" be defined?  There is a connection to
RePEc::Index's "TOO_OLD_IS", but that is not necessarily the same
thing.

In principle, this is something that should be defined in the
environment of RePEc-Index.  But I don't want to break the
self-containment of the History::Handle class.  So a package variable
(which can be set from outside) is the way forward.

2) should "old" items be deleted or their files should be rechecked?

The safest way to solve it is to request that the file be
re-checked.  A little performance penalty, but I'll sleep well.

3) if there is an in-file conflict (one handle appearing twice or
   more in one file), then this kind of conflict can be cleared by
   reading only this file.  This leads to the idea that there must be
   special processing on reaching a file's end...  I had that in
   Find-In-RePEc.

One of the ways is to request this file for a later re-check, which
should be interpreted by the managing authority as a signal to issue
a special message as soon as the file ends.  This could be a method
like nothing_else() or whatever.


Need to simplify this!









So decisions:

1) create a package variable TOO_OLD_IS with a default value of
   60*60*24*4, i.e. 4 days.

2) The status->present records of "old" status with another
   filename will cause that file to be rechecked.  So what's the
   difference from "recent" then?  No difference.  Let's forget
   about all this.




Now, in process_template we will do:

  if the template's handle is within the archive's limits XXX ???

  load record from data file (single general datafile)

  create a History::Handle object

  pass an (a) event to it (method: present())

  check for conflicting files (potentially) and queue those files
  for processing

  store the object to the datafile
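
A sketch of that flow, with a plain hash standing in for the Berkeley
database and the archive-limits check stubbed out (all names here are
illustrative):

```perl
use strict;
use warnings;

my %db;       # handle => history data; stands in for the general datafile
my @queue;    # files queued for a later (re)check

sub process_template {
    my ( $handle, $file, $pos, $checksum, $session_time ) = @_;

    # is the handle within the archive's limits?  (stubbed)
    return unless $handle =~ /^RePEc:/;

    # load the record, or create a fresh history object
    my $history = $db{$handle} ||= { present => [], history => [] };

    # pass the (a) event to it
    push @{ $history->{history} },
      [ $session_time, 'present', $file, $pos, $checksum ];

    # queue potentially conflicting files for processing
    push @queue, grep { $_ ne $file } map { $_->[0] } @{ $history->{present} };

    push @{ $history->{present} }, [ $file, $pos, $checksum, $session_time ];

    # store the object back
    $db{$handle} = $history;
}

process_template( 'RePEc:fgh:wpaper:0001', 'fgh/wp/a.rdf', 0, 'c1', 1045569700 );
```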



2002-12-11 22:37

Now that I have the History::Handle class working (as some simple
tests show), I'm thinking hard about what to do next.  Things look
strange now.  I don't see a need for separate per-archive databases
any more.

The template data probably should be stored in the History::Handle
object.  


When processing/checking, I have to check that the handle used is
within some sensible limits: i.e. in the archive of the file and in
some known series.  Only after that do I check for conflicts, and
here I use History::Handle.

And then (if there's no conflict) I can store the template data in the
object.  For everyone's pleasure.  Any questions?

First, the issue is that conflicts would usually be cleared or
established only after some processing.

Second, I should integrate template data into History::Handle class
object, blocking it on a conflict and releasing it on a clear event.



2002-12-20 11:56

Now I plan to involve the Events module in the Index work.  The
handle history will be triggered by Events, and it, in its turn, will
trigger some events.  What I finally need to get is a template/handle
repository, used in accordance with the history.




The problem I face now: a template's handle, before it is handed
over to the history or the data collector, has to be basically
checked.  Primarily, it has to be checked that it belongs to the
archive of the directory it is in.  That's RePEc's rule: if your file
is in remo/fgh/ or below it, all your handles must start with
"RePEc:fgh:".  But that's a RePEc rule.  All RePEc handles start
with 'RePEc:', but there exist RePEc-like collections, e.g. ReLIS:
same rules, another prefix.

To abstract out the primary handle-checking rules I introduce the
concept of a collection.  A collection will represent some group of
data files monitored by RePEc-Index.  A collection has a home
directory, a type and a handle prefix.
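
A standalone sketch of that rule (the prefix would come from the
collection's configuration; the sub and file names are illustrative):

```perl
use strict;
use warnings;

# a file in <archive>/... may only carry handles
# starting with "<prefix>:<archive>:"
sub check_id {
    my ( $prefix, $relative_file, $id ) = @_;
    my ($archive) = split m{/}, $relative_file;    # first dir under the home
    return $id =~ /^\Q$prefix\E:\Q$archive\E:/ ? 1 : 0;
}
```

A ReLIS-like collection would simply pass a different prefix.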

An update session will always work in the context of some collection.


The collection has to provide means for:

- parsing a data file into records (perl data structure)
- getting a record's handle, given a record
- checking the record's handle


methods:

 $collection = RePEc::Index::Collection::RePEc -> new ( PATH, AUTHORITY );

 $collection -> open_data_file ( $filename ) 
   or warn "can't open the file";

 ( $id, $record ) = $collection -> get_next_record ( );

 my $result = $collection -> check_id( $id );  # in the context of the last opened file



Ok.  Now (2002-12-27 01:17) I have one sample collection class
(RePEc::Index::Collection::RePEc) and even a tool to read the
collections configuration from a file (in a simple three-part line
syntax: "dir type prefix").  An ::Update needs a collection object
to use.  It should get one from RePEc::Index, I presume.  Either way,
I need to read the configuration somehow.  And, since I have created
a configuration file (for the collections conf), I might want to add
some more parameters there.  But no.  Not this time, at least.

I might put the collections file into the data dir and read it from
there.  I might call it "collections".  Then I should check that
the collections configuration is valid: each prefix must be unique.
Then I should ensure there is a subdirectory for each prefix under
the data directory.  All related data files should be put under that
directory.
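
Reading the three-part syntax with the prefix-uniqueness check might
look like this (a sketch; the directories in the example are made up):

```perl
use strict;
use warnings;

# parse "dir type prefix" lines; each prefix must be unique
sub parse_collections {
    my ($text) = @_;
    my %by_prefix;
    for my $line ( split /\n/, $text ) {
        next if $line =~ /^\s*(?:#|$)/;    # skip blanks and comments
        my ( $dir, $type, $prefix ) = split ' ', $line;
        die "duplicate collection prefix '$prefix'\n" if $by_prefix{$prefix};
        $by_prefix{$prefix} = { dir => $dir, type => $type };
    }
    return \%by_prefix;
}

my $collections = parse_collections(<<'CONF');
# dir          type   prefix
/data/remo     RePEc  RePEc
/data/relis    RePEc  ReLIS
CONF
```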



Now I have tweaked RePEc::Index::Config to read the collections
config from the "collections" file.  But now I have to do something
with the RePEc::Index module, because it relies heavily on the
$DATA_DIR variable, which should, in theory, depend on the collection
chosen.

The problem is that I want to create (and use) a new object-oriented
interface to RePEc::Index and retain the old, procedure-based one.  I
need to make up my mind.  At least on this issue.


Basically, the data-storage/retrieval functions should be moved to a
new module, say RePEc::Index::Storage or something.  Then the main
interface functions (lookup...) would need to be adapted to the
change.  The storage functions could stay as they are.


The storage functions are at a low level of abstraction.  They are
the tools that the higher levels use.  So I probably should remove
DATA_DIR from there and use absolute paths instead.

In principle, we can forget about the look-up functions for a while,
because they are the last thing to worry about.  First I need Update
to work smoothly.

2003-01-03 14:44

The RePEc::Index::Update class seems to be working.  Now, I think,
it is time to connect the Update class with the Handle History class.
The Events module should make this easy.  Issues may arise in the
process, but I hope to resolve them along the way.

One issue I already see is the processing queue.  There is no such
thing now, but there should be.  The Handle History class requires
one.

So I add this to the Handle History class. 


2003-01-30 18:29  

I am creating the RePEc::Index::History::Handle2 class.  Entitled:
the persistence wrapper and Events glue around
RePEc::Index::History::Handle.  It handles the events generated by
the Index::Update module.

I have a feeling that some important details have been forgotten,
but that might be just because I haven't worked with the code for a
long while (almost a month).

* * *

<emotions> Why is that?  It had only just started to look pretty
simple, easy and clean, and I run into conceptual framework problems
again.  </emotions>

Now it turns out that History::Handle requires more data than I have
at the Index::Update level.  It requires the ridiculous file position
at the "lost template" event (and it's not far from asking me about
the checksum, as a comment proclaims...).

It seems to assume that I use the handle's history in tracking the
handle's history.  Which is a wicked loop, I think.


The options I have to solve the problem: 

1) accept that wicked loop as a right thing

2) remove the dependency on this data in History::Handle class, which
   might affect its accuracy

3) add tracking this additional data at the Index::Update level, in
   by-file records


2003-02-06 13:48

The whole problem is even uglier than it appears, because the
template positions are ReDIF-specific and other format parsers are
not likely to support that notion.  That's why I don't want to spend
much time and effort on this.  And I don't want to create complex
algorithms to deal with the problem, 'cause it will be a burden, a
waste of processing power in the case of a non-ReDIF parser.


So this makes me lean towards option 2 of those listed right above.
That is: remove the dependency on this data in the History::Handle
class.

So, I did it.  Now History::Handle uses its own structures to
resolve the questions...  Its accuracy didn't suffer, at least
theoretically.  I'm almost happy.  But I haven't yet finished the
History::Handle2 class...


2003-02-06 18:00  

Now it's time for testing.  I need to see that the whole thing works.

Well, on the surface it looks like it's working, after fixing a bug
or two.  But I need deeper tools to see it really working.

For example, I need a tool to check (visualize) the handle history 
record, by the handle value.

Also I need to extend the scheme of events and history status
written above to include the record's datatype.  This also involves
Collection::RePEc: it's the parser that should return the record
type.


Another thing TO DO is to improve the compactness of the Handle
history records.  Probably $self->{status} is an unnecessary
structure.  Its elements can live at the top level of the object
itself.


2003-02-18 13:39 After fixing a bug which I didn't describe last
time I worked on this (blame on me again), things look generally OK.
What's really ugly is that conflict and probably other secondary
events repeat themselves even though they tell nothing new.  The
history will get bloated quickly.

Fixed a few problems, removed redundant conflict history recording,
simplified history object structure, did some testing.

 TO DO:
 ~~~~~

ToDo: document new history object structure (with {status}
substructure removed)
 - done! (2003-02-20 18:29)

ToDo: record type recording in history
 - done! (2003-02-20 19:03)

ToDo: RePEc::Index::Storage - as an option, there must be BerkeleyDB.
 - done! (2003-02-20 19:46) RePEc::Index::Storage and its competitive
   parts: RePEc::Index::Storage::Berkeley and RePEc::Index::Storage::AnyDBM

ToDo: Store the record itself somewhere, if there's no conflict.

ToDo: RePEc::Index - must be a reading interface to the handle history
and the record itself.

ToDo: (to check) What interface do I need for good integration with
ARDB?  Probably Events at the History::Handle class.

ToDo: Documentation


2003-02-21 14:08 

Now I'm running a test R::I update of the local fraction of RePEc
collection, and I'm terrified: the history database grows huge at this
first update.

Should I separate the database into archive sections as it was before?
Should I simplify history object structure?
Should I try it on full RePEc and see how big it gets?

Also: shouldn't I treat separately the all/ directory in RePEc-type
collections?


(update from 2003-02-24 18:30:) While the above questions stay open,
I've changed the Update module to treat directory names differently.
Now the root directory is "", all the rest are relative to it and
have no leading "/" char.  All other directory entries have a
trailing "/" char, and that differentiates them from datafiles.
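
The convention in a nutshell (a sketch):

```perl
use strict;
use warnings;

# root is "", other directories end with "/", datafiles do not
sub is_directory {
    my ($entry) = @_;
    return $entry eq '' || $entry =~ m{/$};
}
```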


2003-02-24 18:59
Current TO DO list:

  Make the record storage
  - Store the record itself somewhere, if there's no conflict.

  - 2003-02-26 11:53: Theoretically done, need testing.  For testing I
    need to do the next point.

  Update RePEc/Index.pm
  - RePEc::Index - must be a reading interface to the handle history
    data and the record data itself.  Also files listing may be needed.

  - 2003-02-26 16:46: theoretically done, needs testing.  It is very
    simple.

  - 2003-02-26 17:25: did a simple test file: t/index.t and it runs fine
   

  Check the concepts
  - (to check) What interface do I need for good integration with
    ARDB?  Probably Events at the History::Handle class.

  - 2003-02-26 17:26: well, probably yes: I need Events generated in
    the History::Handle class.


  Update a single file - Update module must be able to update a single
  file given.

  Documentation
  - use atrain.  In a separate file write about: general usage,
    installation, the modules, data structures



2003-02-26 14:29

As an idea, I thought: what if I make records_db optional? 

If there's no records_db for a collection, then the collection's class
can retrieve the record by its filename, position and id.  Ha?

A clever idea, but we need more configuration for it, don't we?  And
that configuration will have to be checked: 1) at update time, in
Handle2.pm, and 2) at the data-retrieval interface, in Index.pm.

Also I need to teach the RePEc collection class to do that kind of
retrieval.

Maybe for the next version?



2003-02-27 17:26: 

Thinking about integration with ARDB I wrote:

  Check the concepts
  - (to check) What interface do I need for good integration with
    ARDB?  Probably Events at the History::Handle class.

  - 2003-02-26 17:26: well, probably yes: I need Events generated in
    the History::Handle class.


Oh, do I?  Can't I do the same thing with ARDB that I did with
records_db, i.e. treat it at the History::Handle2 class?  At least,
if I do generate Events' events, I can do it there.

On the other hand, why separate?

It's so natural to have Events' events where they happen: ie in
the History::Handle class.

Let's think about which events we are actually talking about:
RECORD::GOOD, RECORD::BAD, RECORD::LOST.  As far as I can tell, this
is it.  RECORD::GOOD fires at the present event if there's no
conflict.  RECORD::BAD fires at the present event if a conflict has
happened.  RECORD::LOST fires if the record disappeared.

Probably I need RePEc::Index::ARDB interface module, which will
transfer the events.

I can't really explain why I don't want to touch History::Handle.
Performance?  Yes, but I don't think the hit will be significant.
Additional complexity?  Yes.

Let's do this.  History::Handle will "use Events" and will generate
those events.  RePEc::Index::ARDB will handle them by directly
sending requests to ARDB.


Another thought breaks in: do I really need to push every "present"
record into ARDB?  I could push only the new ones, and then push the
"changed" ones.  Ha?  This makes a lot of sense, at least if ARDB's
configuration hasn't changed since the last update.

This also makes perfect sense for record_db.

Then I need these events: RECORD::NEW, RECORD::OLD, RECORD::CHANGED.


2003-02-28 15:04

Yes, I think this is right.  I need History::Handle to generate
events, which I'll handle in the RePEc::Index::ARDB module. 

I also need ARDB to let me know the date of last configuration
modification.  I will use it to treat old records wisely.

  
2003-03-01 23:10

Finally I have thought about conflict clearance.  The thing is: I
save to record_db only if there's no conflict, but that's a mistake.
A valid record will be missing from the database, if its conflict was
cleared, until I happen to see the first file again.  So I had better
save to the database every time a new template comes in or changes.
And then I check the history for conflicts before I give anyone
access to a record.  Is there another way?

Hmm, I also could store a separate list of conflicting handles, keep
that in sync with all conflict/clear events and check it on user
get_record() requests.  

Alternatively, upon a conflict clear event, I could go check handle's
{present}, and then load that (file, pos) combination into record_db.
But this is stupid.



2003-03-02 15:40

I failed to log quite a lot of development.  I have added the
following events (generated) to RePEc::Index::History::Handle:

    RePEc::Index::RECORD::NEW
    RePEc::Index::RECORD::OLD
    RePEc::Index::RECORD::CHANGED
    RePEc::Index::RECORD::CONFLICT
    RePEc::Index::RECORD::CLEAR

All these events carry ( RECORD_ID, HISTORY_OBJECT ) as their
parameters.


Then I added handling of these events to
RePEc::Index::History::Handle2.  I moved record_db support into these
handlers, and I created conflict_db, which will keep track of all
conflict/clear events.  This means that if a handle is present in
conflict_db, it is blocked.  If the handle is not present, there's no
conflict.

Now I also think of making RePEc::Index::History::Handle2 a general
facility for storing/deleting records from anywhere.  I generate these
events: 

    RePEc::Index::RECORD::STORE( ID, TYPE, RECORD, [LAST_CHANGED] )
    RePEc::Index::RECORD::DELETE( ID, TYPE );

The problem of conflicts creates a desire for RECORD::BLOCK and
RECORD::UNBLOCK events, but I feel this is stupid, because it is a
road towards duplicating RePEc-Index's functionality.

Problem is that I don't want to break the idea of ARDB's independence
from RePEc::Index.  

But I can do it another way.  Since I have history and record
databases, then on a conflict I can do DELETE, on a clear I can use
RePEc::Index and then do STORE.  What do you think?

To make this even simpler, I could also enhance record_db to store
both the record and its type.  Then I don't need to touch the history
at all; I can do the DELETE and STORE just with the record_db.

Yes, I start doing this now.

Ha, that's ready.


2003-03-03 13:17

Moving ahead, I added some nice command-line options to ri_update.pl,
fixed some problems in Update.pm and added an adequate process_file()
function.

Now ri_update.pl works much more nicely.  And RePEc::Index::ARDB
seems to work well too, but I haven't yet added ARDB's specific
stuff.


2003-03-03 18:07  I added a "process_this" method to the Update
class.  It takes a file/directory name and processes it accordingly.

Fixed some bugs in History::Handle2.

Now I have realized: the concept of RECORD::STORE, as implemented
now, has a flaw.  The flaw is that the record's last change time is
supposed to determine whether the actual record storage operation is
necessary.

Instead, time of the last STORE operation needs to be taken into
account.

The only way I can now see to accomplish this is by creating this
kind of event in the handle history class.  Of course, I could also
store it separately, but that would be stupid.

I could create event RePEc::Index::RECORD::STORED or STORED_OK, with (
ID, TYPE, RECORD, STORAGE );


The issue is rooted deep down in the storage logic.  The logic (the
configuration, the rules) may change, and then we have to replace the
old stored data with new data.

The simplest way out is to assume that if the user makes a
significant change to the storage rules, he runs a force-update on
the whole collection.  Otherwise, old records will not be re-stored.
The force-update here is not RePEc-Index's force-update.  It is a
different thing, although RePEc-Index's force-update is required too.

I think this is the right thing to do.

I have to ensure RePEc::Index does its job at get_record request: it
has to check the conflict db.


2003-04-17 14:33

Now, as ARDB stores records in its ObjectDB, RePEc-Index turns into
"update agent" for ARDB.



Yes, I now can see RePEc-Index as a part of ARDB.



2003-05-01 21:33


... For that I need to make RePEc-Index more configurable at run
time.  I think Template-Toolkit style will work fine: I make all the
main components configurable at creation time by a common
configuration hash.

But for now only a few things need to be made configurable: the
database directory names (where to read the database files from and
where to write them) and the collections.  Also some processing
options, like not creating the records db (because in ARDB I already
have ObjectDB).

Anything else?


Should I make RePEc-Index fully object-oriented?  Probably not.


Instead, I should let both RePEc::Index module and
RePEc::Index::Update accept a conf hash.  

The conf hash:

  home_dir 
  collections_conf
  data_dir
  config_name
  processed_considered_old
  

2003-05-02 00:25

Event renaming: 'RePEc::Index::RECORD::STORE' is now
'RePEc::Index::RECORD::PROCESSED'.

Now I should adapt History::Handle to that event, and adapt
Handle2.pm to check $history -> last_processed() and call external
"processors" on this.

Then -- the configuration.



2003-05-02 23:44

I haven't yet tested it in any way, but theoretically it looks
pretty good now.

The RI configuration can be changed at run time by a
RePEc::Index::Config::home_dir( DIR ) call.  RePEc::Index::Handle2
will call Events->RePEc::Index::RECORD::PROCESS when necessary,
respecting the $Update->{TOO_OLD_IS} setting.  The record history
will have a last_processed() accessor method.

The only thing left is clever collection resolver for ARDB to use.

In ARDB we don't have the collection name, we only have the id of the
record...



2003-06-24 11:58

Oh, it's been so long since I last updated this DEVLOG.  It is a
shame, really.  So much has been done here...  Not really anything
revolutionary, but a lot of cleaning and straightening.

Among other things, I have recently installed the thing on
netec.mcc.ac.uk, along with the ARDB thing, and it worked (after the
first few attempts ran into some severe problems, which I had to fix).

It has even filled the Fire (Find in RePEc) database.

So now in the configuration I have a way to set some collection's
options and a list of processors (per individual collection, again).  

Options and processors for a collection are in the fourth parameter in
the collection configuration line.  It is a comma-separated list, each
part being an individual setting, like "records_db=1" or
"proc=ARDB::RI".

Now I have finally reached an internal consensus about records_db.  I
was uncertain until recently: make it an integral part of the system
or make it a switchable option.  Now it is a switchable option.
Successful processing by the third-party modules will be slightly more
clean and efficient if it is present, but it is not necessary for the
system as a whole.




2003-08-06 22:58

Before I start using RePEc-Index on xerces.openlib.org I want to have
a simple and solid solution for starting and stopping the control
daemon, as well as for starting and stopping the Berkeley DB RPC
server for the short-ids database.


I want to have configuration for the whole system in one place, but I
do not yet have a solution for it.

For starting RePEc-Index daemon I need: 

 - socket filename for daemon to listen

 - pid filename for daemon to create

 - lib dir for daemon to add to the perl lib path (optional)

 - RePEc-Index data home dir or ACIS home dir (then RI data home dir
   will be $acishome/RI)

 - log filename for daemon to write to


If I have ACIS home dir, then I can safely compute lib dir, data home
dir, pid filename and log filename.


  acishome/
     lib/
     bin/
     RI/ 
        collections
	data/
	backup/
        daemon.log
        daemon.pid

     SID/
        ...data
	daemon.log
	daemon.pid
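
A sketch of deriving the daemon's paths from a single ACIS home dir,
following the layout above.  The filenames come from the tree; the
function name itself is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: given the ACIS home dir, compute the lib dir,
# the RI data home dir, and the daemon's log and pid filenames.
sub ri_paths {
    my ($acishome) = @_;
    my $ri = "$acishome/RI";
    return {
        lib_dir  => "$acishome/lib",
        home_dir => $ri,
        log_file => "$ri/daemon.log",
        pid_file => "$ri/daemon.pid",
    };
}

my $p = ri_paths( '/opt/acis' );
print "$p->{pid_file}\n";   # prints: /opt/acis/RI/daemon.pid
```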






2003-08-09 00:52

Now let's think about packaging and installing stuff.

I could create package directory structure like this:

ACIS/

   ... all the ARDB package content, plus:
   sql_helper
   RePEc-Index-X.XX
   ReDIF-Perl-X.XX
   Events-X.XX
   BerkeleyDB-X.XX
   db-X.XX.XX

   

INSTALLATION



During installation, I could run
db-X.XX.XX/build_unix/configure --enable-rpc --prefix=$acishome.

Then I could write the BerkeleyDB-X.XX/config.in file as:

  echo INCLUDE=$acishome/include
  echo LIB=$acishome/lib
  echo 'DBNAME= -ldb-4.0'
   




2005-03-01 12:03

RePEc-Index and the RI update daemon need concurrency and
better recoverability.  BerkeleyDB is probably good enough
to provide both if used properly, but it would require some
careful work.

Number one task is to separate storage-related code into
abstract calls to pluggable modules.  


Interface of the Storage class:

new( DATADIR ); class method

initialize();

get_table( NAME );

  Returns a table object to be used in subsequent requests
  to the get/put/del_record() methods.

start_transaction();

  Start a new DB transaction; returns a transaction object
  (handle) TXN.

abort_transaction( TXN );

  Abort transaction.

commit_transaction( TXN );

  Commit a transaction.

get_record( TXN,    TABLE, ID );

  As part of transaction TXN, get record ID from TABLE.

get_record_ro( TABLE, ID );

  Get record ID from TABLE outside of any transaction,
  without intent to modify and save it later.

put_record( TXN,    TABLE, ID, REC );
del_record( TXN,    TABLE, ID );
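
To pin down the call semantics, here is a toy in-memory mock of the
Storage interface above.  Transactions here just buffer writes until
commit; a real module would wrap BerkeleyDB transactions.  The class
name and the internal data layout are made up for illustration:

```perl
use strict;
use warnings;

package MockStorage;

sub new        { my ($class, $datadir) = @_;
                 return bless { datadir => $datadir, tables => {} }, $class; }
sub initialize { }
sub get_table  { my ($self, $name) = @_;
                 $self->{tables}{$name} ||= {};  return $name; }

sub start_transaction { return { writes => [], deletes => [] }; }
sub abort_transaction { }   # buffered changes are simply dropped
sub commit_transaction {
    my ($self, $txn) = @_;
    $self->{tables}{ $_->[0] }{ $_->[1] } = $_->[2] for @{ $txn->{writes} };
    delete $self->{tables}{ $_->[0] }{ $_->[1] }    for @{ $txn->{deletes} };
}

sub get_record    { my ($self, $txn, $table, $id) = @_;
                    return $self->{tables}{$table}{$id}; }
sub get_record_ro { my ($self, $table, $id) = @_;
                    return $self->{tables}{$table}{$id}; }
sub put_record    { my ($self, $txn, $table, $id, $rec) = @_;
                    push @{ $txn->{writes} }, [ $table, $id, $rec ]; }
sub del_record    { my ($self, $txn, $table, $id) = @_;
                    push @{ $txn->{deletes} }, [ $table, $id ]; }

package main;

my $st  = MockStorage->new( '/tmp/data' );
my $tab = $st->get_table( 'records' );
my $txn = $st->start_transaction();
$st->put_record( $txn, $tab, 'aa:bb', { checksum => 'x' } );
$st->commit_transaction( $txn );
print $st->get_record_ro( $tab, 'aa:bb' )->{checksum}, "\n";   # prints: x
```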





2005-03-09 12:03

Performance tests, comparing transaction-based processing
and non-transaction-based.

Running on my local RePEc data copy without transactions
takes:


ivan@zetta:~/dev/Index/Index> time perl -Ilocal/playground/lib -Ilib -I/opt/ARDB/lib reindex.pl RePEc / > play_notxn
Sys: No such file or directory at lib/RePEc/Index/Log.pm line 27, <FILE> line 3.
Sys: No such file or directory at lib/RePEc/Index/Log.pm line 27, <FILE> line 3.
Sys: No such file or directory at lib/RePEc/Index/Log.pm line 27, <FILE> line 3.
Sys: No such file or directory at lib/RePEc/Index/Log.pm line 27, <FILE> line 3.
Sys: No such file or directory at lib/RePEc/Index/Update.pm line 510, <FILE> line 3.
Sys: No such file or directory at lib/RePEc/Index/Log.pm line 27, <FILE> line 7.
perl -Ilocal/playground/lib -Ilib -I/opt/ARDB/lib reindex.pl
RePEc / >  
 69.32s user 2.01s system 74% cpu 1:36.22 total

ivan@zetta:~/dev/Index/Index> /bin/rm -fr local/playground/data/*; echo

ivan@zetta:~/dev/Index/Index> time perl -Ilocal/playground/lib -Ilib -I/opt/ARDB/lib reindex.pl RePEc / > play_txn
Sys: No such file or directory at lib/RePEc/Index/Log.pm line 27, <FILE> line 3.
Sys: No such file or directory at lib/RePEc/Index/Log.pm line 27, <FILE> line 3.
perl -Ilocal/playground/lib -Ilib -I/opt/ARDB/lib reindex.pl
RePEc / >  
 73.62s user 2.62s system 46% cpu 2:44.73 total




2005-03-11 09:06

To do:

  - RI::Update::Client shall fork into background if there's
    no daemon to accept its request, and then after a while
    try a couple more times.

    ...Done

    - Interesting how the SQL connection in the parent
      process would behave after a fork...

      ...Do not see anything strange

  - control_daemon shall have a configuration.  Parameters:
    maxthreads

    ...for the start we will make it a command-line
    parameter, because there's no more parameters to use now
    (yet).

  - control_daemon shall run database recovery on start and
    fail otherwise
    ...done
    
  - make control_daemon redirect its output into a log by
    itself (with autoflush(1))
    ...done

  - replace /bin/rid.start and /bin/rid.stop with /bin/rid
  
  - looks like rid.start removes pid if start failed: fix

  - add /bin/rid backup function

  - add /bin/rid onboot function (clear pid files)

  - make some locking on the part of the collection that is
    being processed.  Try to avoid a chance of parallel
    updates running at the same time.  Bad (strange, at
    least) things can happen if a file has been processed by
    a small update session and now is about to be processed
    as part of a big update (started a while ago,
    i.e. having an older SESSION timestamp).

    Think more on this.  

  - Avoid several overlapping requests within a single
    second (with equal SESSION timestamp).

  - make a nice read-only interface to the RePEc-Index
    database


The multi-threaded control_daemon shall fork a child for
each update request.  A forked child will know its "channel"
number, so it will open a numbered log file for its output.
Once finished, it will notify the parent of the processing
result.

To manage it in a good way, the daemon will keep this
information about its children:

  - channel number
  - pid
  - request details:
     - collection being updated
     - path being updated
     - force parameter

If the daemon gets TERM signal, it shall propagate it to all
its children, wait till all of them exit, and then exit.

Probably, it shall do the same in case of the KILL signal.


Now there shall be some way for a child to notify the daemon
about having finished its work.  One way is to connect to
the daemon through the network socket, just as UpdateClient
does.  Then child could send appropriate message to notify
the daemon.

Another way is to monitor child processes with the waitpid(-1,
WNOHANG) function: every once in a while, cycle through the
child processes, testing whether each is still there with
waitpid( $kid, WNOHANG ).
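
A sketch of that waitpid()-based monitoring: the daemon periodically
reaps any finished children and forgets their requests.  The
%children bookkeeping hash follows the "channel number / pid / request
details" list above; the subroutine name is made up:

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# pid => { channel => ..., collection => ..., path => ..., force => ... }
my %children;

sub reap_children {
    while ( 1 ) {
        my $kid = waitpid( -1, WNOHANG );
        last if $kid <= 0;          # 0: children still running; -1: none left
        my $req = delete $children{$kid};
        # ... record the processing result for this request ...
    }
}

# In the daemon's main loop, call reap_children() every once in a while.
reap_children();
print "no children left\n";
```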

 


 - priority request option (nice(2) or setpriority(2))

 - per collection processing module shall have a way to
   complain that it can't work (because some other part of
   the environment or system isn't ready).
   
 

More real performance tests with transactions:

reindex.pl RePEc / > play_txn
76.74s user 3.63s system 38% cpu 3:29.90 total

Repeat: 
reindex.pl RePEc / > play_txn2
6.80s user 1.02s system 17% cpu 44.778 total

Repeat with force:
reindex.pl -F RePEc / > play_txn2
74.78s user 2.81s system 41% cpu 3:05.06 total



No transactions:

reindex.pl RePEc / > play_notxn
70.49s user 1.74s system 86% cpu 1:23.63 total

Repeat: 

reindex.pl RePEc / > play_notxn2
5.14s user 0.68s system 74% cpu 7.819 total

reindex.pl -F RePEc / > play_notxn3
70.68s user 1.73s system 85% cpu 1:25.13 total




2005-03-15 10:51

Plan:

  - make some locking on the part of the collection that is
    being processed.  Try to avoid a chance of parallel
    updates running at the same time.  Bad (strange, at
    least) things can happen if a file has been processed by
    a small update session and now is about to be processed
    as part of a big update (started a while ago,
    i.e. having an older SESSION timestamp).

    Think more on this.  

  - Avoid several overlapping requests within a single
    second (with equal SESSION timestamp).

  - add /bin/rid backup function

  - add /bin/rid onboot function (clear pid files)

  - make a nice read-only interface to the RePEc-Index
    database

  - Try and test performance with DB_TXN_NOSYNC and
    DB_TXN_WRITE_NOSYNC flags (on the environment)

  - priority request option (nice(2) or setpriority(2))

  - per collection processing module shall have a way to
    complain that it can't work (because some other part of
    the environment or system isn't ready).
   




2005-03-17 19:43

On the present() method of the R::I::History::Handle class and
its dependency on the single-threadedness of RI updates.

Handle2 class calls this method every time a record is read
from a data file.  It is called on a record history object.

It shall:

 - save an appropriate mark in {history}

 - keep {present} part of the object up-to-date (update
   it with most recent filename, position, checksum, etc.)

 - call conflict_event() method if there was no conflict
   before, but there is now.

 - instruct Update object to read some specific files to
   check for possible conflicts, if necessary.

 - Given absence of conflicts, detect record changes.  Issue
   one of the RECORD::OLD, RECORD::NEW, RECORD::CHANGED
   events.  Compare past and present checksum for that.

Additionally:

 - check the {present} list for possible "ghost" files and
   clear them (possibly clearing a conflict)


Now, in re-writing this method, I shall stick to the idea that
session time will be taken at the moment I start parsing the
data file.  This differs from the original point of view, where
session time was taken at session-creation time.


A conflict is a situation when, after updating the {present} list
of a record, there is more than one item in this list such that
each of those items either has a timestamp no less than our
current session time, or has an earlier timestamp but its
corresponding file hasn't changed since, so an update probably
won't change the situation.
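
The conflict test could be sketched like this.  The {present} items
are assumed to carry {time} and {file} fields, and the
file-changed-since check is passed in as a stand-in for a real mtime
lookup; none of these names are the actual implementation:

```perl
use strict;
use warnings;

# A sighting "won't go away by itself" if it is at least as fresh as
# the session, or if its file hasn't changed since it was recorded.
# More than one such sighting means a conflict.
sub is_conflict {
    my ($present, $session_time, $file_changed_since) = @_;
    my $live = 0;
    for my $item ( @$present ) {
        if ( $item->{time} >= $session_time
             or not $file_changed_since->( $item->{file}, $item->{time} ) ) {
            $live++;
        }
    }
    return $live > 1;
}

# Two fresh sightings in different files: a conflict.
my $present = [ { file => 'a.rdf', time => 100 },
                { file => 'b.rdf', time => 100 } ];
print is_conflict( $present, 100, sub { 0 } ) ? "conflict\n" : "ok\n";
# prints: conflict
```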




2005-03-19 19:55

The time has come for some nonsense-elimination work.  The
History::Handle2 and History::Handle classes must be joined
into one module!


new_rec_history( $id )

event_record()
event_rec_present()

found_rec_change()
found_id_conflict()
id_conflict_cleared()

save_and_process_record()
clear_record()

event_record_old()

event_record_disappear()




2005-03-21 15:00

This is current workplan:

  - make a nice read-only interface to the RePEc-Index
    database

  - add /bin/rid backup function

  - add /bin/rid onboot function (clear pid files)

  - Try and test performance with DB_TXN_NOSYNC and
    DB_TXN_WRITE_NOSYNC flags (on the environment)

  - priority request option (nice(2) or setpriority(2))

  - per collection processing module shall have a way to
    complain that it can't work (because some other part of
    the environment or system isn't ready).
   

2005-03-22 11:09

RePEc::Index::Reader is the simple read-only interface to
the RePEc-Index data.  (See above list.)



2007-04-17 15:07

Added the configuration.pl parameter $max_record_history_items to
limit the growth of the record history logs.


