It is no secret that DMOZ faced a number of problems in recent years. Unfortunately, those problems were clearly reflected in the quality of the DMOZ resources, right up to the point where it was shut down on March 17, 2017.
As mentioned on the "MISSION" page, the purpose of DIRpopulus is to keep alive (and fix) the relevant DMOZ data and add new, high quality web resources to it.
In order to achieve the above, the DMOZ RDF files (from 2017-03-13) needed to be transformed into an easy-to-work-with format/structure. This step was mandatory in order to analyze the data and identify the problems, but also to create an efficient way to serve this data to people.
Once the new, optimized data structure was in place, the actual problem identification process started. Let's see what we're dealing with...
1. Data volume & quality
The initial data imported from the DMOZ RDFs contained a total of 3,572,034 resources in 787,596 categories.
1.1. Existing and non-existing resources
It's been known for years that various resources listed on DMOZ no longer existed online (broken links). These no-longer-existing resources need to be identified and un-linked.
To achieve this, a specially designed application will check -- on a regular basis -- if the resource is available online. If the application reports that the resource can NOT be found three times consecutively at the location to which the resource URL is pointing, the resource will be un-linked. The system will allow a number of hours between the checks and at least one day (24 hours) between the first and the third check.
The un-linked resources will, however, be kept in the database for historical purposes. It's only the link that will be removed.
The fact that a resource is unavailable for a couple of days does not necessarily mean that it is gone for good. For this reason, any resource can be re-linked at a later time if it is re-suggested by a person. If that happens, the resource will be added to the review queue and will be properly reviewed by several people.
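To make the un-linking rule more concrete, here is a minimal sketch of how it could be tracked per resource. The `LinkState` record and the `record_check` function are hypothetical names, not part of the actual application, and the sketch reads "at least one day between the first and the third check" as "at least 24 hours between the first and the latest failure of the current streak". The actual online check is left out on purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

REQUIRED_FAILURES = 3           # from the text: not found three times in a row
MIN_SPAN = timedelta(hours=24)  # from the text: at least 24 hours, first to third check


@dataclass
class LinkState:
    """Hypothetical per-resource record; the real schema is not described."""
    linked: bool = True
    # Timestamps of the current run of consecutive failed checks.
    failed_checks: list = field(default_factory=list)


def record_check(state: LinkState, found: bool, checked_at: datetime) -> LinkState:
    """Update the state with the result of one availability check."""
    if found:
        # A successful check ends the failure streak. It does not re-link
        # anything: re-linking only happens through a new suggestion.
        state.failed_checks.clear()
        return state

    state.failed_checks.append(checked_at)
    streak = state.failed_checks
    long_enough = streak[-1] - streak[0] >= MIN_SPAN
    if len(streak) >= REQUIRED_FAILURES and long_enough:
        state.linked = False  # the record itself stays in the database
    return state
```

With checks at 08:00, 20:00 and 08:00 the next day, the resource stays linked after the first two failures and is un-linked by the third.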
1.2. Duplicate resources
The data analysis showed a large number (330,425) of resources showing up more than once. And this is just the beginning... The number represents only the resources whose exact URL shows up twice or more.
By digging a little further we can see that there is an even larger (much larger) number of resources that link to one or more sub-pages of the same website. While this makes a lot of sense for giant websites such as Google or Facebook, this particular aspect still needs to be addressed.
Checking these resources is a priority as there is a high possibility that they are duplicates.
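Both kinds of possible duplicates can be found programmatically. The sketch below is only an illustration, not the code used for the analysis: the function names are made up, the URLs in the usage example are placeholders, and treating "www.example.com" and "example.com" as the same website is an assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def exact_duplicates(urls):
    """Map each URL that appears more than once to its number of occurrences."""
    counts = defaultdict(int)
    for url in urls:
        counts[url] += 1
    return {url: n for url, n in counts.items() if n > 1}


def same_site_groups(urls):
    """Group distinct URLs by host name; keep hosts listed with 2+ different URLs."""
    by_host = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname or ""
        if not host:
            continue  # no recognizable host (e.g. missing scheme); skip it
        if host.startswith("www."):
            host = host[4:]  # assumption: "www." does not make it a different site
        by_host[host].add(url)
    return {host: sorted(group) for host, group in by_host.items() if len(group) > 1}


urls = [
    "http://example.com/",
    "http://example.com/",
    "http://www.example.com/about",
    "http://example.org/",
]
print(exact_duplicates(urls))   # URLs listed twice or more
print(same_site_groups(urls))   # different pages of the same website
```

The second group is only a list of candidates. As said above, several pages of a giant website can be perfectly legitimate, so these still need a human review.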
The review system will be built in such a way that it will prioritize the review of certain resources (especially possible duplicates, since they can easily be identified programmatically). More about this later on in the "2.2 Web resource review" section.
1.3. Inappropriate categories
Some of the resources don't belong to the category they were placed in at DMOZ. These resources will be reviewed and placed in the correct category if one exists. New categories will be created whenever necessary.
Although there may be a way to programmatically categorize the existing resources, the results will simply not be reliable, at least not from a human-edited web directory perspective. In conclusion, the category of each resource will need to be reviewed by people. Again, more about this in the "2.2 Web resource review" section.
2. Resource suggestion, review and approval
The DMOZ resource review and approval process was simply not right, to say the least... While there is no reason to hide the status of a submitted web resource from the submitter, the system must be built in a way that both allows and requires the proper review of resources.
The system will allow anyone to suggest web sites/pages (web resources) to the DIRpopulus web directory AND to review them.
Obviously, it wouldn't make much sense to have the person who suggests a web resource review that very same resource, so people will NOT be able to review the resources they suggest.
It will be the system (application) that will provide the resources to be reviewed by people, based on various criteria.
Here is the detailed description of "THE SYSTEM".
2.1. Suggesting a web resource
As mentioned above, anyone can suggest one or more resources to DIRpopulus. You automatically become an editor by suggesting a web resource.
Suggesting a web resource to DIRpopulus is free; it doesn't cost you a cent! However...
Help is needed to review all the DMOZ resources (as well as the newly submitted ones), so in order to suggest a resource you will be required to review 10 (ten) existing resources, including old (DMOZ) and new (DIRpopulus) ones.
The ratio of old to new resources to be reviewed will follow the share of newly suggested resources in the DIRpopulus database. As long as the database contains 90% or more DMOZ resources, an editor will need to review 9 DMOZ resources and 1 DIRpopulus resource (9/1 ratio) for each suggested resource. Once the newly suggested resources pass 10% of our database (which means the DMOZ resources now represent less than 90%), the ratio changes to 8/2, and so on.
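As a rough illustration, the ratio could be computed as below. The function name `review_ratio` is hypothetical, and two details are assumptions because the text does not settle them: the steps continue in 10% increments ("and so on" is read as 7/3 below 80%, 6/4 below 70%, etc.), and at least one DIRpopulus resource is always included.

```python
REVIEWS_PER_SUGGESTION = 10  # from the text: ten reviews per suggested resource
MAX_DMOZ_REVIEWS = 9         # from the text: the starting ratio is 9/1


def review_ratio(dmoz_count, new_count):
    """Return (DMOZ reviews, DIRpopulus reviews) required for one suggestion."""
    total = dmoz_count + new_count
    if total <= 0:
        raise ValueError("the database contains no resources")
    # Integer arithmetic avoids rounding surprises at the 90%, 80%, ... boundaries.
    dmoz_reviews = (REVIEWS_PER_SUGGESTION * dmoz_count) // total
    dmoz_reviews = min(MAX_DMOZ_REVIEWS, dmoz_reviews)
    return dmoz_reviews, REVIEWS_PER_SUGGESTION - dmoz_reviews


print(review_ratio(3572034, 0))  # only DMOZ data: (9, 1)
print(review_ratio(90, 10))      # exactly 90% DMOZ: still (9, 1)
print(review_ratio(89, 11))      # DMOZ below 90%: (8, 2)
```

Whether the ratio should keep dropping all the way down once DMOZ resources become a small minority is not specified, so the lower end of this sketch is open for discussion.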
Whenever you suggest a resource the system will check and inform you about:
- duplicate URL - if the very same URL exists in the database it would be senseless to re-suggest it
- duplicate description - if the same or very similar description exists, the system will inform you and you'll be required to change the description
- other URLs from the same website - there may be several pages of a website having the same or a very similar subject. Only one page/subject will be accepted
If everything checks out, the web resource will be submitted to the review queue.
Each suggested resource will have to be reviewed by at least 3 different editors.
2.2. Web resource review
In order to achieve the goal of being a reliable human-edited web directory, we need to make sure the review of every resource (those from DMOZ as well as the new ones) is carried out in a way that ensures the resource is actually meaningful and useful for the DIRpopulus community.
After filling out the info regarding the resource you wish to suggest, you will be redirected to a section of the application (questionnaire) where the DIRpopulus system will try to figure out which resources you are qualified to review. The questionnaire will include questions about the languages you speak, the web directory categories you're interested in, whether you've been a DMOZ editor or not, etc.
After you complete the questionnaire, you'll be able to start reviewing web resources. You need to review at least 10 for your suggestion to be accepted, but you're welcome to review more.
The resources to be reviewed will be provided by the system based on the results of the questionnaire and will include a certain number of DMOZ resources and a certain number of DIRpopulus resources based on the calculated ratio mentioned above under 2.1.
The DMOZ resources you'll review will always contain (as long as available) some resources that have been identified by the system as non-existing resources (see 1.1. above), some duplicate resources (see 1.2. above), as well as others, based on the questionnaire results.
Unlike the DMOZ system -- where you didn't know anything about your suggested resource -- DIRpopulus will keep you up to date with the status of your suggested web resource.
You will know:
- whether the resource made it to our database
- how many times it was reviewed
- who reviewed it (usernames)
- when it was reviewed
- whether it is approved or not
- if not approved, the reason why
2.3. Web resource approval
It takes 3 consecutive approvals (for category, title and description) from 3 different editors for a suggested resource to be approved and appear in the DIRpopulus directory.
A resource reviewed by an editor will not be presented again for review to that same editor. In other words, an editor can review a resource only once, so please be careful before you complete your review!
If the resource has 3 consecutive disapprovals, it is automatically marked as "not appropriate" and the submitter is informed via email so that he/she can take the necessary actions, which will include removing the resource from the approval queue.
If a resource has been approved by two editors, then disapproved by one editor and then approved by two editors again, that resource cannot make it to the DIRpopulus directory. It needs at least 3 consecutive approvals, regardless of the number of approvals/disapprovals that took place before.
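The approval rule is easy to get wrong, so here is a small sketch of it. The function name and the status strings are hypothetical, and each review is simplified to a single yes/no that covers category, title and description together.

```python
REQUIRED_STREAK = 3  # from the text: three consecutive approvals or disapprovals


def review_outcome(reviews):
    """Decide the status of a resource from its reviews, oldest first.

    `reviews` is a list of (editor, approved) pairs.
    """
    seen_editors = set()
    streak_value = None  # verdict shared by the current run of reviews
    streak_len = 0
    for editor, approved in reviews:
        if editor in seen_editors:
            # An editor can review a given resource only once.
            raise ValueError(f"{editor} already reviewed this resource")
        seen_editors.add(editor)
        if approved == streak_value:
            streak_len += 1
        else:
            streak_value, streak_len = approved, 1
        if streak_len == REQUIRED_STREAK:
            return "approved" if approved else "not appropriate"
    return "pending"


# The example from the text: two approvals, one disapproval, two approvals.
history = [("ann", True), ("bob", True), ("cat", False), ("dan", True), ("eve", True)]
print(review_outcome(history))                    # pending
print(review_outcome(history + [("fay", True)]))  # approved
```

Because every review comes from a different editor, three consecutive approvals automatically come from three different editors.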
This is not an exhaustive list of problems. It's just the list of the problems related to the DMOZ data that need to be addressed ASAP in order to turn that data into useful resources again.
Other problems may be added to this list once the fixing process for the above has been started and works properly.
The suggested solutions are just that: suggested solutions. So let's discuss them and find the best ones to make the most useful human-edited web directory available!