Topic: Plausibility check: too many false positives

Offline Jo301

« on: July 28, 2024, 18:39:56 »
Hi,

more than 200 problems are reported by the plausibility check for my converted GEDCOM 5.5.1 file.
They all follow the same pattern, for any INDI record:

e.g.
...
1 DEAT
2 DATE AFT 1945
1 RESI
2 DATE ABT 1946
2 PLAC XYZ, pow. Opole, , , ,
2 ADDR 15
2 SOUR @S284@
3 PAGE 99
...

The check reports:
Residence (about 1946) after Death (after 1945) of ...  (I6372) ...

The background is that I found a list of inhabitants which was created between 1945 and 1947 and extended in several steps.
But in the turmoil after the Second World War, no entry was dated.

So I can say that a person resided there around 1946 (imprecise information) and died after 1945 (also imprecise).
The earliest year in which the person can have died is 1945 (interim result of the research).

This also applies to other sources.
A person with OCCU ... at Easter 1805 can have died in 1805 at the earliest, so AFT 1804 is correct.

OCCU / RESI ... are facts (individual attributes) and DEAT is an event; the GEDCOM standard (here 5.5.1) distinguishes between the two.

In addition, GEDCOM has no ">=", and FROM - TO is not allowed for events.

I think the two dates are logically consistent with each other, so the report is a false positive.
In other words, the dates are neither mutually exclusive nor contradictory.

My problem is that the many reports of this kind mask other findings that really do need to be verified.

In principle, no fact or individual attribute with a date is allowed:
- before birth (exact date) and
- after death (exact date).

That means that no check should be performed for
- a missing date,
- an ABT date,
- an AFT date (death),
- a BEF date (birth).

It could be a good idea to encapsulate the checks in functions like IsDeathApplicable() or IsBirtApplicable().
The handling of dates where only the year is known could then be implemented there (birt.year --> 01.01.YYYY and death.year --> 31.12.YYYY).
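
The proposal above could be sketched roughly as follows. This is only an illustration in Python, not Ancestris code: the names exact_date, is_death_applicable, is_birth_applicable and after_death are hypothetical, and as a simplifying assumption the sketch skips every qualified date (ABT, AFT, BEF, BET, ...), which is stricter than the list above.

```python
import calendar
from datetime import date

# Hypothetical helper names; an illustration only, not Ancestris code.
QUALIFIERS = {"ABT", "CAL", "EST", "BEF", "AFT", "BET", "FROM", "TO", "INT"}
MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def exact_date(value, end_of_period):
    """Return a date for an unqualified GEDCOM date, else None.

    Year-only and month-year values are expanded to the first or last
    day of the period, depending on end_of_period.
    """
    parts = (value or "").upper().split()
    if not parts or parts[0] in QUALIFIERS:
        return None  # missing or imprecise: no check possible
    try:
        if len(parts) == 1:  # e.g. "1945"
            year = int(parts[0])
            return date(year, 12, 31) if end_of_period else date(year, 1, 1)
        if len(parts) == 2:  # e.g. "MAY 1945"
            month, year = MONTHS[parts[0]], int(parts[1])
            last_day = calendar.monthrange(year, month)[1]
            return date(year, month, last_day if end_of_period else 1)
        if len(parts) == 3:  # e.g. "12 MAY 1945"
            return date(int(parts[2]), MONTHS[parts[1]], int(parts[0]))
    except (KeyError, ValueError):
        pass  # unknown month name or impossible date
    return None


def is_death_applicable(death_value):
    # A year-only death is read as 31.12.YYYY (latest possible day).
    return exact_date(death_value, end_of_period=True) is not None


def is_birth_applicable(birth_value):
    # A year-only birth is read as 01.01.YYYY (earliest possible day).
    return exact_date(birth_value, end_of_period=False) is not None


def after_death(fact_value, death_value):
    """Report only when both dates are exact and the fact starts after death."""
    death = exact_date(death_value, end_of_period=True)
    fact = exact_date(fact_value, end_of_period=False)
    if death is None or fact is None:
        return False
    return fact > death
```

With these assumptions, after_death("ABT 1946", "AFT 1945") returns False because neither date is exact, while after_death("1946", "1945") returns True.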

Please change the plausibility check so that the false positives described above no longer appear.
(None of my other GEDCOM apps reports a warning or an error.)

Jo301

My config:
Ancestris-Version :  13.0.12817
Java:  21.0.3+7-LTS-152 - C:\Program Files\Java\jdk-21
System:  Windows 11 - 10.0

Offline Zurga

« Reply #1 on: July 28, 2024, 22:34:54 »
Quote from: Jo301
> OCCU / RESI ... are facts (individual attributes) and DEAT is an event, which is distinguished in GEDCOM standard (here 5.5.1).
>
> And in addition: a ">=" is not possible in GEDCOM and a FROM - TO is not allowed for events.

Could you point out where the specification says this?
All I can see is:
n OCCU <Occupation>
+1 <INDIVIDUAL_EVENT_DETAIL>
n RESI
+1 <INDIVIDUAL_EVENT_DETAIL>
n DEAT [Y|NULL]
+1 <INDIVIDUAL_EVENT_DETAIL>

So if I understand correctly, a fact and an event have the same INDIVIDUAL_EVENT_DETAIL substructure.
So where is the restriction on the date?

Zurga

Offline Jo301

« Reply #2 on: July 29, 2024, 09:54:29 »
Quote from: Zurga
> Quote from: Jo301
> > OCCU / RESI ... are facts (individual attributes) and DEAT is an event, which is distinguished in GEDCOM standard (here 5.5.1).
> >
> > And in addition: a ">=" is not possible in GEDCOM and a FROM - TO is not allowed for events.
>
> Could you point out where the specification says this?

It is explained in GEDCOM v7 (and also in earlier versions), see:
3.3.1. Events / 3.3.1.1 Individual Events

"As a general rule, events are things that happen on a specific date. ...
Resist the temptation to use a ‘FROM date TO date ’ form in an event structure. If the subject of your recording occurred over a period of time, then it is probably not an event, but rather an attribute or fact.
...

3.3.2. Attributes / 3.3.2.1. Individual Attributes
Unlike events, the presence of an attribute is sufficient to assert the attribute applied to the individual, regardless of the attribute’s substructures and payload."


Life begins at BIRT and ends at DEAT, ideally with an exact date (also in the GEDCOM file).
If no exact date is available, only a limited set of keywords is defined, such as BEF, AFT, BET ... AND, ABT, EST, CAL ...

It must be taken into account that imprecise dates should not be compared with precise dates, or only to a limited extent.
And "AFT" includes the entire future thereafter. This is the real issue at stake.

Therefore you need to decide whether a given combination of dates (precise / imprecise) can be compared at all.
This results in 4 main cases (each of the two dates is either precise or imprecise), which differ further in the details (BEF ..., or when a year starts or ends).
This logic should be encapsulated.
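
One way to encapsulate the four cases is to turn every date into a range of possible years and to report a problem only when the two ranges cannot overlap. The following Python sketch illustrates that idea and does not describe how Ancestris works. The names year_bounds and contradicts_death are hypothetical, the tolerance of 2 years for ABT / EST / CAL is an arbitrary assumption, AFT and BEF are treated as including the named year, and years with fewer than three digits are ignored.

```python
import math
import re

# Hypothetical names; a sketch of interval comparison, not Ancestris code.


def year_bounds(value, tolerance=2):
    """Map a GEDCOM date value to (earliest_year, latest_year), or None."""
    years = [int(y) for y in re.findall(r"\b\d{3,4}\b", value or "")]
    if not years:
        return None
    keyword = value.split()[0].upper()
    if keyword == "AFT":
        # Assumption: "AFT 1945" may still fall within 1945 itself.
        return (years[0], math.inf)
    if keyword == "BEF":
        return (-math.inf, years[0])
    if keyword in ("ABT", "EST", "CAL"):
        return (years[0] - tolerance, years[0] + tolerance)
    if keyword in ("BET", "FROM") and len(years) == 2:
        return (years[0], years[1])
    return (years[0], years[0])  # treated as a precise year


def contradicts_death(fact_value, death_value, tolerance=2):
    """True only if the fact cannot possibly lie before or at the death."""
    fact = year_bounds(fact_value, tolerance)
    death = year_bounds(death_value, tolerance)
    if fact is None or death is None:
        return False  # without a date there is nothing to compare
    # Even the earliest possible year of the fact is after the
    # latest possible year of the death.
    return fact[0] > death[1]
```

Under these assumptions, contradicts_death("ABT 1946", "AFT 1945") returns False, because AFT leaves the death open-ended, while contradicts_death("1950", "1945") returns True.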

No matter how it is implemented, the false positives should be avoided.

Quote from: Zurga
> All I can see is:
> ...
> So if I understand correctly, a fact and an event have the same INDIVIDUAL_EVENT_DETAIL substructure.
> So where is the restriction on the date?

As always, it depends on how events and facts differ in meaning, as described in the specification.
Apples and pears have the same substructure as fruits, so yes, it's possible to compare them.  ;)

Jo

Offline Zurga

« Reply #3 on: July 29, 2024, 20:10:41 »
Again, where does the GEDCOM 5.5.1 specification say that you can't use the FROM / TO structure in an event?
The line you mention says that it is not good practice; it does not forbid it.

And the GEDCOM standard doesn't forbid creating several BIRT or DEAT tags.
I know users who record many possible deaths and births before narrowing down the possibilities.

Zurga

Offline Jo301

« Reply #4 on: July 29, 2024, 20:58:32 »
Quote from: Zurga
> Again, where does the GEDCOM 5.5.1 specification say that you can't use the FROM / TO structure in an event?
> The line you mention says that it is not good practice; it does not forbid it.

OK, you are right, I did not use the correct word. I should have said: it's not recommended (also in 5.5.1).
Look for: "Resist the temptation to use a ‘FROM date TO date’ form in an event structure."

I know that some users use FROM / TO for events, but it is not correct!
Usually it is accepted by GEDCOM apps.

Quote from: Zurga
> And the GEDCOM standard doesn't forbid creating several BIRT or DEAT tags.

But there are other apps that report it in their plausibility check.

But let's get to the real problem: what about the reported false positives?

I have described the circumstances clearly.

Jo301

Offline Zurga

« Reply #5 on: July 29, 2024, 23:10:43 »
Quote from: Jo301
> But let's get to the real problem: what about the reported false positives?

Just to know: should software report a possible error in this case:
1 DEAT
2 DATE AFT 945
1 RESI
2 DATE ABT 1946

Zurga

Offline Jo301

« Reply #6 on: July 30, 2024, 07:40:31 »
Quote from: Zurga
> Just to know: should software report a possible error in this case:
> 1 DEAT
> 2 DATE AFT 1945
> 1 RESI
> 2 DATE ABT 1946

My clear answer is NO. There is no problem with "Logical sequence of events (238)".

DEAT after 1945 and
RESI about 1946

are not contradictory and cannot be determined more precisely.
If GEDCOM had a way of specifying ‘greater than or equal’ (>=), this would not be a problem.
In program logic, ‘after’ also includes any later year.

How else should this be described?

What do others think about this?

Offline Zurga

« Reply #7 on: July 30, 2024, 07:56:45 »
I did not write 1945; I deliberately wrote 945.
When you quote another message, please don't modify it. That is not respectful.

My question stands: should software report a possible error when you enter 945 instead of 1945?

By the way, you can always add a "_VALID" tag to ignore the error.

Zurga
« Last Edit: July 30, 2024, 09:00:58 by Zurga »

Offline Jo301

« Reply #8 on: July 30, 2024, 18:34:37 »
Quote from: Zurga
> I did not write 1945; I deliberately wrote 945.
> When you quote another message, please don't modify it. That is not respectful.

Sorry, I thought that was a typo.

Quote from: Zurga
> My question stands: should software report a possible error when you enter 945 instead of 1945?

You mean that the gap between the years may be far too large. OK, and you changed it afterwards.

The problem you mean is not a problem if you check against the maximum possible lifespan (or against CHR). Alternatively, an earliest (and latest) valid year could be defined. For beginners it could be 1850, 1700 and so on. But these would be additional parameters that need to be maintained. In my opinion, creating more and more parameters is not a good solution.
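
For illustration, the "earliest and latest valid year" variant mentioned above could look like the following Python sketch. It is not an existing Ancestris option: the function name years_outside_range and the parameters min_year and max_year are hypothetical, and the default of 1700 is just one of the example values from this post. It also shows the drawback: the limits are extra settings that somebody has to maintain.

```python
import re
from datetime import date

# Hypothetical function and parameter names; not an existing Ancestris setting.


def years_outside_range(value, min_year=1700, max_year=None):
    """Return the 3- or 4-digit years in a GEDCOM date value outside the range."""
    if max_year is None:
        max_year = date.today().year  # nothing should be dated in the future
    years = [int(y) for y in re.findall(r"\b\d{3,4}\b", value)]
    return [y for y in years if not min_year <= y <= max_year]
```

With the default limits, years_outside_range("AFT 945") returns [945], so the suspicious year is reported regardless of the AFT keyword, while years_outside_range("AFT 1945") returns an empty list.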

Please remember: in reality you can't find every problem.

Quote from: Zurga
> By the way, you can always add a "_VALID" tag to ignore the error.

Hmmm... Ancestris is supposed to be 100% GEDCOM compatible, if I understood it correctly.
So the use of any _TAG should be avoided (no _VALID, no _ASSO and so on).

I don't want that because the maintenance is cumbersome.
But we want to remain compatible, don't we?

If you accepted that events and attributes are different things, the solution would be easy.
It's not a question of having the same substructure.

Events and attributes with a date can only be compared with each other to a limited extent, as already mentioned.

Currently, ignoring this leads to far too many false positives, each of which requires a _VALID. And you obviously don't have a GEDCOM-compatible solution for this, do you?

The current check is well-intentioned, but not really good.

I suggest the KISS principle ("keep it simple, stupid").

Anything else drags a long tail of unnecessary things behind it.
Checking against fuzzy information, which cannot be avoided in research, produces a large number of unnecessary reports.

This problem occurs only with Ancestris. And remember again: you can't find every problem.
The user also has a responsibility and does not need to be babysat.  ;)

Offline Zurga

« Reply #9 on: July 30, 2024, 21:48:02 »
Quote from: Jo301
> Hmmm... Ancestris is supposed to be 100% GEDCOM compatible, if I understood it correctly.
> So the use of any _TAG should be avoided (no _VALID, no _ASSO and so on).

Again I disagree.
_ASSO mimics an existing tag (and you expect other software to understand it) and adds meaningful information by storing relations between entities.
_VALID doesn't mimic anything; we don't expect other software to understand it, and it doesn't add meaningful information.
If all other software has no problem with AFT / BEF dates, _VALID is not needed there, so this is fine with respect to the specification.

Zurga

Offline Jo301

« Reply #10 on: July 31, 2024, 19:10:06 »
Quote from: Zurga
> _VALID doesn't mimic anything; we don't expect other software to understand it, and it doesn't add meaningful information.

You are correct, ‘no meaningful information’: _VALID is an ‘appendix’ that would not be needed if you didn't make things so complicated.

Quote from: Zurga
> If all other software has no problem with AFT / BEF dates, _VALID is not needed there, so this is fine with respect to the specification.

Does that mean that only Ancestris needs it?

The main issue is the ‘false positives’; please do not distract from it.

It's just a matter of common sense and logic; forget the incomplete specification for a moment.

One thing is the specification, which offers no help with interpretation in this case; the other is the implementation (after interpretation), which must be measured against reality.

If the implementation can be improved and simplified for the user, this should be done.
It is a common use case that dates are sometimes vague and (have to) overlap.

And where there's a will, there's a way (as they say here).

Is there any chance that this will get onto your roadmap?

Jo301