Designing E-Learning 3.0 in gRSShopper - 10

Harvesting, Again

I'm having issues with the harvester again. There are two distinct problems:
- The harvester isn't harvesting
- When I do harvest, it's downloading things that were downloaded before

I'm going to deal with the second thing first.

I think the problem is that the previous things were deleted. Because we don't want to fill up our server with old harvested stuff that we'll never use, gRSShopper deletes 'stale' links. I think it's deleting these too early, then harvesting them again (which means they keep showing up in the reader as 'new' over and over again).

So I'm opening up the cron_tasks code in admin.cgi to see if I can't fix this. There should be a log of what's happening, but the log isn't writing properly. It seems to me that the cron_tasks code is long overdue for an overhaul. But one thing at a time.

Can I at least get it to email me?

I put the handy-dandy &send_email() function at the head of the cron_tasks subroutine to see if it will send me an email. If this doesn't work, I'll have to edit 'Cron Jobs' in cPanel and remove the /dev/null directive that diverts output to the garbage can instead of to my email address (in other words, removing it means I get email every time cron runs, which is every minute).

And... that's what I have to do because send_email() isn't working. *sigh*

Figure 122 - Cron Commands in Reclaim, one (downes.ca) with the /dev/null command, and the other (el30.mooc.ca) without the /dev/null command. I will get email from el30.mooc.ca
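For reference, the two crontab entries look roughly like this (the paths and arguments here are illustrative, not my actual entries):

```
# Output discarded - no email is sent:
* * * * * perl /home/downes/cgi-bin/admin.cgi task=cron > /dev/null 2>&1

# Output mailed to the account's email address on every run:
* * * * * perl /home/el30/cgi-bin/admin.cgi task=cron
```

The `> /dev/null 2>&1` at the end throws away both standard output and standard error; cron only sends mail when a job produces output, so removing it turns the email firehose on.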
OK, I don't know why send_email() isn't working, but I have a steady stream of email coming from cron. Next step: make it so that email contains the log report.

The way I have it set up is that there is a variable, $loglevel, which can be a number between 0 and 10, and a variable $log that contains the log contents. If a message's level is at or below $loglevel, the message is added to the log.

So I'll set $loglevel=10 and put a 'print $log' line at the bottom of cron_tasks, and it should include the log in the email it sends me every minute.

(Time passes, during which I have supper)

Time passes as I work the problem; random comments follow:

- So I've found the first problem just by looking at the messages. The harvester is getting a "Not Allowed" response from a blog. Annoying, but it happens. As a result, though, it's not updating the 'last harvested' time. This means the same blog is always next in line to be harvested, because harvesting is done one feed at a time, always taking the feed with the earliest 'last harvested' time.

- So I'll edit the harvester, to make sure it updates this. I should also set an error counter, so that after a few failed harvests, it changes the status and stops trying to harvest. I need a new 'status' colour... hmmm....

- Oh! It wasn't even that! I had disabled the harvesting from queue for some reason (probably related to testing) and never enabled it again. So let me do that and see what happens...

- Nope, that didn't fix it. Looks like I'm passing values incorrectly from harvest_queue() to harvest_feed().

- I took a few minutes to comment out the references to the Facebook OAuth module (which is now deprecated) and also to remove a 'defined' operator (which is also deprecated).

- Also, it wasn't printing to the log because I had commented out the code that actually prints to the log (because I was afraid of really big logfiles).

- OK, I think I've addressed the duplicate link issue by commenting out the 'delete link' commands. There's still something in there to fix properly.
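The fix for the 'Not Allowed' loop can be sketched like this (this is a sketch only - the field names, the error threshold, and the helper functions fetch_feed() and process_items() are assumptions, not gRSShopper's actual code):

```perl
# Sketch: always update 'last harvested', and count consecutive failures
sub harvest_feed {
    my ($feed) = @_;

    # Update even on failure, so this feed goes to the back of the queue
    # instead of being retried forever
    $feed->{feed_last_harvested} = time;

    my $response = fetch_feed($feed->{feed_url});   # hypothetical fetch helper
    if (!$response->{success}) {
        $feed->{feed_error_count}++;
        if ($feed->{feed_error_count} >= 5) {
            $feed->{feed_status} = 'error';          # stop harvesting after repeated failures
        }
        return;
    }

    $feed->{feed_error_count} = 0;                   # reset the counter on success
    process_items($response->{content});             # hypothetical item handler
}
```

The key point is the first line of the body: the 'last harvested' timestamp is written unconditionally, before we know whether the fetch succeeded.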

Which means it should be both harvesting and not storing duplicates. But it's still pretty messy in there.

(Time passes... the harvesting is working fine... new issues emerge).

Pub Date and MailChimp

This is actually a problem on my main site, but will become a problem for the E-Learning 3.0 course as well.

I haven't talked about it yet, but I'm going to need to make the site work with MailChimp. I've been doing the MailChimp emails by hand ever since I moved to Reclaim, because I can't load the Reclaim server with a mass emailing (also, they can't guarantee my emails won't be flagged as spam, which is an even greater problem with a shared host).

There is an API but that will be an all-day job. There's also a way to do it by RSS. It turns out, though, that MailChimp is pretty particular in how it handles RSS feeds. It loads new posts automatically into the newsletter, but only posts with a pub date of today.

Up until now, the 'publicationDate' RSS tag has been throwaway. It didn't matter what the date was, so long as it was correctly formatted. But now it matters.

I *had* been using 'post_crdate' to define the publicationDate element, converting the epoch time in which it is stored to the RFC 822 format required by the RSS standard. My &autodates() function does this, and you create the date with a date tag, like this:

Figure 123 - Date format
However, if you create the post ahead of time and schedule it for later (which was the whole purpose of pub_date in the first place) then the publicationDate won't match the pub_date, and worse, MailChimp won't include it in the RSS-based email. I sent a couple of one-item emails before I figured it out.

So, ok, I should just change the input time to pub_date, right? Like this:

Figure 124 - Date format with pub_date
However, pub_date is stored in the datepicker format, which looks like this: 11/15/2018. autodates() will throw an error on that invalid input. So I coded a new parameter into the date function specifying an input type, for which I've defined only one possibility so far: input=date, for cases where the input is in the datepicker format. So now I have:

Figure 125 - Date format with input parameter

and this works perfectly.

You might ask: why don't I standardize my database input to always use the datetime format, or at the very least epoch time? And yes, I should. But it wouldn't actually solve any problems. The computer's time is always in epoch, the RSS is always in RFC 822, the datepicker is always in datepicker format, and no matter what I do, I'm going to be converting date formats.
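The datepicker-to-RFC-822 conversion itself is simple enough to sketch in a few lines of Perl. This is an illustration, not the actual &autodates() code, and support for the %z timezone specifier varies by platform:

```perl
use POSIX qw(strftime mktime);

# Convert a datepicker-format date (mm/dd/yyyy) to an RFC 822 date string
sub datepicker_to_rfc822 {
    my ($datestr) = @_;                         # e.g. "11/15/2018"
    my ($mon, $day, $year) = split '/', $datestr;
    # mktime takes (sec, min, hour, mday, mon-1, year-1900); use noon local time
    my $epoch = mktime(0, 0, 12, $day, $mon - 1, $year - 1900);
    return strftime("%a, %d %b %Y %H:%M:%S %z", localtime($epoch));
}

print datepicker_to_rfc822("11/15/2018"), "\n";
# Something like "Thu, 15 Nov 2018 12:00:00 -0500", depending on time zone
```

Internally the conversion still goes through epoch time, which is the point of the paragraph above: epoch ends up being the common interchange format no matter what the input looks like.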

Logs

Back to the whole question of publishing pages. Autopublish has not been working in the E-Learning 3.0 course, and it's only by a miracle (so I discovered after the fact) that I have published and archived copies of my OLDaily newsletter.

When I set up the installer for Reclaim, I defined the directory locations as relative locations. Effectively, that means './' for the CGI directory ($Site->{st_cgif} in the code) and '../' for the page directory ($Site->{st_urlf}). All of that works fine if the script is running in the CGI directory, but what if it's not?

In the case of cron, the script doesn't start running in the CGI directory, as we would expect, but it starts running in the user directory (we ran into this when the harvester itself wasn't running because of this). It also impacts page-writes. And it generates the unhelpful 'permission denied' error report.

Now cron jobs are supposed to report their output to a log. But they haven't been (because of this very problem). So it has been a major pain to track stuff like this down.

So, first things first, get the logs running properly.

Any time I want to log something, I use the function &log_cron(level,message). The level runs from 0 (most important) to 10 (least important). There's a global variable $Site->{log_level} I can set; a message is actually written to the log report only if its level is at or below the value of $Site->{log_level}. I need to make this something the user can set in the admin pages (right now it's just hard-coded).
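The logging logic can be sketched like this (a sketch under assumptions - the real &log_cron() in gRSShopper may differ in its filename, formatting, and details):

```perl
sub log_cron {
    my ($level, $message) = @_;

    # 0 = most important, 10 = least; skip anything above the configured threshold
    return if $level > $Site->{log_level};

    my $timestamp = scalar localtime;
    open my $fh, '>>', $Site->{data_dir} . 'cron.log'
        or do { warn "Cannot open log: $!"; return; };
    print $fh "[$timestamp] ($level) $message\n";
    close $fh;
}
```

Note that the log path hangs off $Site->{data_dir}, which is exactly why getting that directory defined correctly (below) matters.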

The problem is resolved by defining $Site->{data_dir} and always using that. Did that in the gRSShopper::Site package.

And while I was at it, I defined $Site->{cgif} and $Site->{urlf} relative to $Site->{data_dir}, which should solve the page publishing problem. Now that I have a functioning log, I can confirm that.

(A couple of hours later)

OK. As an aside, it's really hard to get a script to define its working directories properly when it can be started from various locations. But the logic now is:
- if it is started as a cgi process, the directories are defined relative to the script location
- if it's started as a cron process or command line, the directories are defined relative to the multisite.txt location, as defined in the script arguments
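The two cases above can be sketched in Perl like this (names and details are assumed for illustration; the actual gRSShopper::Site code may differ):

```perl
use Cwd qw(abs_path);
use File::Basename qw(dirname);

if ($ENV{GATEWAY_INTERFACE}) {
    # Started as a CGI process: the web server sets GATEWAY_INTERFACE,
    # and directories are defined relative to the script's own location
    $Site->{data_dir} = dirname(abs_path($0)) . '/';
} else {
    # Started from cron or the command line: take the multisite.txt
    # location from the script arguments and resolve relative to that
    my ($multisite) = @ARGV;
    $Site->{data_dir} = dirname(abs_path($multisite)) . '/';
}

# Everything else hangs off data_dir, so the CGI and page directories
# no longer depend on wherever the process happened to start
$Site->{cgif} = $Site->{data_dir};
$Site->{urlf} = $Site->{data_dir} . '../';
```
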

*That* should make everything publish properly.

Update: looks like it is.

I'll post this article now - it's a bit short but actually represents several days' effort. I'm hoping to move more swiftly next week.






