Harry Stevens

Stories and graphics at The Washington Post. Arlington, VA.

Last updated on 28 April, 2017 | Originally published on Medium

How I built a Twitter bot that tweets 100-year-old New York Times articles

It was with great interest that I learned of RT 1917, a project that thrusts us into the Russian revolution as its protagonists lived it, one hundred years ago to the day. It’s the next best thing to an actual time machine: a bunch of Twitter accounts purporting to be the revolution’s major players, live tweeting their “experiences.”

Lenin is there, having just arrived in Petrograd from exile, busy releasing his April Theses to the eager proletariat one tweet at a time. So are the former czar and czarina, Nicholas Romanov and Alexandra Fedorovna, trapped in Tsarskoe Selo and watching in horror as their usurpers giddily dismantle everything they’ve ever cared about. And Stalin, still largely unknown to the world, is plotting somewhere in the background, or else endlessly doodling wolf heads in his notebooks.

Several people, not content merely to watch the events play out, have created their own Twitter accounts as part of the #1917CROWD. Among the growing cast of characters, which now numbers in the several dozens, are a Russian journalist in Paris, the French Envoy to the Russian Empire, a Bolshevik metalworker, and the King of England.

I, too, wanted to play a part in the drama. But I have neither the knowledge of history nor the time to create a character of my own and respond to events as people tweet about them (I don’t even do this very well in the present). Instead, I thought I’d write some software to do it for me. What follows is an account of how I did it.

Step 1: Get New York Times articles about Russia from 100 years ago today.

The New York Times has an Article Search API that lets you access all its articles from September 18, 1851, until today. All you have to do is sign up for an API Key and then write some code to get the articles. For this sort of thing, I’m most comfortable with Node.js, so that’s what I’m going to use. To set up my Node project, I open my Terminal and type:

mkdir nyt-1917
cd nyt-1917
npm init

Then I press enter until it creates my npm project (npm init -y skips the questions entirely). Right off the bat, I also need to install the request module.

npm install request --save

I also want to make my JavaScript file.

touch index.js

Now I’ll open up Sublime Text and write a little code to get some articles.

var request = require("request");

request.get({
  url: "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  qs: {
    "api-key": your_api_key
  }
}, function(err, response, body) {
  body = JSON.parse(body);
  console.log(body);
});

This will return the most recent articles. But we want articles from 100 years ago today, so we'll get today's date and subtract 100 years from it. JavaScript has a native Date object, but it's awkward to work with, so we'll use Moment.js to handle dates.

npm install moment --save

And then use it to get the date, 100 years ago.

var moment = require("moment");
var date = moment().subtract(100, "years"); // "today", but actually 100 years ago
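For comparison, here's a sketch of the same computation with only the native Date object — doable, but clunkier (the function name is mine, just for illustration):

```javascript
// "100 years ago today", formatted as YYYYMMDD, with no Moment.js.
function hundredYearsAgoYYYYMMDD(now) {
  var d = new Date(now.getTime());
  d.setFullYear(d.getFullYear() - 100);
  var pad = function(n) { return (n < 10 ? "0" : "") + n; };
  return "" + d.getFullYear() + pad(d.getMonth() + 1) + pad(d.getDate());
}

// e.g. April 28, 2017 becomes "19170428"
```

Moment's one-liner above is clearly the nicer option.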

Wasn’t that easy? Now we can add a few properties to our request to get the articles from 100 years ago today.

request.get({
  url: "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  qs: {
    "api-key": your_api_key,
    "begin_date": date.format("YYYYMMDD"),
    "end_date": date.format("YYYYMMDD")
  }
}, function(err, response, body) {
  body = JSON.parse(body);
  console.log(body);
});

This will return all sorts of articles, many of which have nothing to do with Russia or the revolution. So we'll add a property to filter the results by topic.

var query = "russia OR lenin OR trotsky OR germany OR czar OR socialism";

This will get us stories about Russia and Russians, and we’ll also include stories about Germany, which are likely to be of interest to the folks back home in Petrograd because the two countries are currently at war. Add the query to the qs object.

request.get({
  url: "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  qs: {
    "api-key": your_api_key,
    "fq": query,
    "begin_date": date.format("YYYYMMDD"),
    "end_date": date.format("YYYYMMDD")
  }
}, function(err, response, body) {
  body = JSON.parse(body);
  console.log(body);
});

Now we encounter an interesting thing about the Article Search API: it only returns 10 hits at a time. We can loop through the results 10 at a time by adding another property to the qs object called page, but first we need to know how many pages we’re looping through, which means we need to know how many total articles match our query. Fortunately, the API returns that information.

var hits = JSON.parse(body).response.meta.hits;
var pages = Math.ceil(hits / 10);
console.log("Found " + hits + " articles on " + pages + " pages of results.");
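As a quick sanity check on that arithmetic — pageCount here is a hypothetical helper for illustration, not part of the bot:

```javascript
// The API returns 10 docs per page, so the page count is hits / 10,
// rounded up.
function pageCount(hits) {
  return Math.ceil(hits / 10);
}
```

So 57 hits span 6 pages of results, indexed 0 through 5.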

Now we know how many pages there are. So, after our initial request, we can write a function that takes as its argument the page number and loops through the pages from 0 (the page index begins at 0, like an array) to the final page, making a request for the articles from each page as it goes.

function makeRequest(page) {
  request.get({
    url: "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    qs: {
      "api-key": your_api_key,
      "fq": query,
      "begin_date": date.format("YYYYMMDD"),
      "end_date": date.format("YYYYMMDD"),
      "page": page
    }
  }, function(err, response, body) {
    body = JSON.parse(body);
    console.log(body);
  });
}

Now, we could simply loop through those pages with a for loop. But to avoid putting unnecessary pressure on the New York Times' servers, we'll "throttle" the requests, spacing them out every ten seconds. For this, I've used an extension to the Underscore.js library, so we'll install Underscore first.

npm install underscore --save

Then we’ll add our rate limiting function. I didn’t write this function; this guy did.

var _ = require("underscore");

// underscore rateLimit function
_.rateLimit = function(func, rate, async) {
  var queue = [];
  var timeOutRef = false;
  var currentlyEmptyingQueue = false;

  var emptyQueue = function() {
    if (queue.length) {
      currentlyEmptyingQueue = true;
      _.delay(function() {
        if (async) {
          _.defer(function() { queue.shift().call(); });
        } else {
          queue.shift().call();
        }
        emptyQueue();
      }, rate);
    } else {
      currentlyEmptyingQueue = false;
    }
  };

  return function() {
    var args = _.map(arguments, function(e) { return e; }); // get arguments into an array
    queue.push(_.bind.apply(this, [func, this].concat(args))); // use apply so we can pass in arguments as parameters rather than an array
    if (!currentlyEmptyingQueue) { emptyQueue(); }
  };
};

And now we’ll loop through the pages, making a request every ten seconds.

// a rate-limited version of the request, which runs
// every 10 seconds
var makeRequest_limited = _.rateLimit(makeRequest, 10000);

// loop through the pages
for (var i = 0; i < pages; i++) {
  makeRequest_limited(i);
}

So we have our articles. Next, let’s turn them into tweets.

Step 2: Turn the articles into tweets

Within the request, we’ve parsed the body that the API returns by writing the line body = JSON.parse(body). But to get the actual articles, we’ll have to go a little deeper.

var docs = body.response.docs; // the actual articles

We can loop through these and create tweets out of them. Here’s an example that includes a hashtag at the end.

// turn the filtered articles into tweets
docs.forEach((d, i) => {

  // an empty object to store data
  var obj = {};

  // build the tweet from the headline and a hashtag
  obj.headline = d.headline.main;
  var tweet_start = obj.headline;
  var tweet_end = "#1917LIVE";

  // the 1 accounts for the space between the tweet and the end
  var end_len = tweet_end.length + 1;

  // if the headline is too long, cut it off at the last word
  // boundary and add an ellipsis
  if (tweet_start.length > (140 - end_len - 3)) {
    tweet_start = tweet_start.substr(0, (140 - end_len - 3));
    var li = tweet_start.lastIndexOf(" ");
    tweet_start = tweet_start.substr(0, li) + "...";
  }

  obj.tweet = tweet_start.toTitleCase() + " " + tweet_end;
});
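The truncation is the fiddliest part of that loop, so here's the same idea factored into a standalone helper (the names are mine, for illustration) that's easy to test in isolation:

```javascript
// Build a tweet from a headline and a fixed ending, truncating the
// headline at a word boundary with "..." if the total would exceed 140.
function buildTweet(headline, tweetEnd) {
  var endLen = tweetEnd.length + 1; // +1 for the separating space
  var start = headline;
  if (start.length > 140 - endLen - 3) {
    start = start.substr(0, 140 - endLen - 3);
    start = start.substr(0, start.lastIndexOf(" ")) + "...";
  }
  return start + " " + tweetEnd;
}

// buildTweet("SHORT HEADLINE", "#1917LIVE")
// → "SHORT HEADLINE #1917LIVE"
```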

I’ve added a .toTitleCase() function to the tweet_start string so it’s not completely capitalized. This is my function (Manas Sharma helped me with the regular expressions).

String.prototype.toTitleCase = function() {
  var x = this;

  // words that should stay lowercase mid-title
  var smalls = [];
  ["A", "An", "The"].forEach(function(d) { smalls.push(d); }); // articles
  ["And", "But", "Or", "Nor", "So", "Yet"].forEach(function(d) { smalls.push(d); }); // conjunctions
  ["As", "At", "Atop", "By", "Into", "It", "In", "For", "From", "Of", "Onto", "On", "Out", "Over", "Per", "To", "Unto", "Up", "Upon", "With"].forEach(function(d) { smalls.push(d); }); // prepositions

  // reverse the string so a letter followed by a space marks the first
  // letter of each word, uppercase those, then reverse back
  x = x.split("").reverse().join("") + " ";
  x = x.replace(/['"]?[a-z]['"]?(?= )/g, function(match) { return match.toUpperCase(); });
  x = x.split("").slice(0, -1).reverse().join("");

  // lowercase the small words
  x = x.replace(/ .*?(?= )/g, function(match) {
    if (smalls.indexOf(match.substr(1)) !== -1) {
      return match.toLowerCase();
    }
    return match;
  });

  // small words at the start of sentences should be capitals;
  // also handles sentences that end with an abbreviation
  x = x.replace(/(([^\.]\w\. )|(\.[\w]*?\.\. )).*?(?=[ \.])/g, function(match) {
    var word = match.split(" ")[1];
    var letters = word.split("");
    letters[0] = letters[0].toUpperCase();
    word = letters.join("");
    if (smalls.indexOf(word) !== -1) {
      return match.split(" ")[0] + " " + word;
    }
    return match;
  });

  // capitalize the first word after a colon
  x = x.replace(/: .*?(?= )/g, function(match) {
    var first_letter = match.match(/\b[a-z]/);
    return match.replace(first_letter[0], first_letter[0].toUpperCase());
  });

  return x;
};

The New York Times Article Search API also returns keywords for each article. The keywords often include famous people, so we can check to see if any of the #1917CROWD are subjects of the article. If they are, we’ll mention them at the end of the tweet.

var people = [
  { name: "WILSON, WOODROW", handle: "POTUS28_1917" },
  { name: "NICHOLAS II., CZAR OF RUSSIA", handle: "NicholasII_1917" },
  { name: "NICHOLAS II.,", handle: "NicholasII_1917" },
  { name: "ALEXANDRA FEODOROVNA, CZARINA OF RUSSIA", handle: "EmpressAlix1917" },
  { name: "KERENSKY, ALEXANDER F.", handle: "Kerensky_1917" },
  { name: "WILLIAM II., EMPEROR OF GERMANY", handle: "Kaiser_1917" },
  { name: "TROTZKY, LEON", handle: "LeoTrotsky_1917" },
  { name: "ALFONSO XIII., KING OF SPAIN", handle: "AlfonsoXIII1917" },
  { name: "LENIN, NIKOLAI", handle: "VLenin_1917" },
  { name: "GEORGE V., KING OF ENGLAND", handle: "GeorgeV_1917" },
  { name: "MCADOO, WILLIAM GIBBS", handle: "WillMcAdoo_1917" },
  { name: "MILUKOFF, PAUL N.", handle: "Milyukov_1917" },
  { name: "LUXEMBURG, ROSA", handle: "luxemburgquotes" },
  { name: "KORNILOFF", handle: "GenKornilov1917" },
  { name: "GUCHKOFF, ALEXANDER J.", handle: "Guchkov_1917" },
  { name: "BRUSILOFF, ALEXIS", handle: "GenBrusilov1917" },
  { name: "BRUSILOFF, ALEXEI A.", handle: "GenBrusilov1917" },
  { name: "LVOFF, GEORGE E.", handle: "PrinceLvov_1917" },
  { name: "MOLOTOFF, VIACHESLAV MICHAELOVICH", handle: "Molotov_1917" },
  { name: "RODZIANKO, MICHAEL", handle: "MRodzianko_1917" },
  { name: "ALEXEIEFF, MICHAEL V.", handle: "GenAlexeev_1917" },
  { name: "FREUD, SIGMUND", handle: "SigmundFreud_BP" },
  { name: "CLEMENCEAU, GEORGES", handle: "Clemenceau_BP" },
  { name: "BERTIE, FRANCIS LEVESON", handle: "WW1Bertie" },
  { name: "KROPOTKIN , PETER", handle: "PKropotkin_1917" },
  { name: "KROPOTKIN, PETER", handle: "PKropotkin_1917" },
  { name: "KROPOTKIN, PETER ALEXEIEVITCH", handle: "PKropotkin_1917" },
  { name: "KROPOTKIN , PETER ALEXEIVICH", handle: "PKropotkin_1917" },
  { name: "POINCARE, RAYMOND", handle: "RPoincare_1917" },
  { name: "BUBLIKOFF, ALEXANDER ALEXANDROVITCH", handle: "Bublikov_1917" },
  { name: "TSERETELLI", handle: "Mensheviks_1917" }
];

These are the people I’ve found so far, but there may be more. Let’s write some code that goes in the docs loop to check if any of these are present in the keywords. If they are, we’ll add them to the end of the tweet.

// figure out if any of the 1917crowd are mentioned
var persons = d.keywords.filter(function(key) {
  return key.name == "persons";
}).map(function(person) {
  return person.value;
});

var lookup = people.map(function(p) {
  return p.name;
});

var mentions = _.intersection(persons, lookup).map(function(p) {
  return "@" + _.where(people, { name: p })[0].handle;
});

// if any are, add them to the end of the tweet
if (mentions.length > 0) {
  tweet_end = tweet_end + " " + mentions.join(" ");
}
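If you'd rather not lean on Underscore for the set logic, the same lookup can be sketched with plain Array methods (this is an illustrative alternative, not the code the bot uses; the two sample entries are drawn from the people list above):

```javascript
// Map the API's "persons" keywords to @handles using plain Array methods.
var crowd = [
  { name: "LENIN, NIKOLAI", handle: "VLenin_1917" },
  { name: "TROTZKY, LEON", handle: "LeoTrotsky_1917" }
];

function mentionsFor(keywords) {
  return keywords
    .filter(function(k) { return k.name === "persons"; })
    .map(function(k) {
      var match = crowd.filter(function(p) { return p.name === k.value; })[0];
      return match ? "@" + match.handle : null;
    })
    .filter(function(m) { return m !== null; });
}
```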

We also want to attach an image of the article itself to the tweet. For this we’ll use a module called pdf-image and a module called cheerio that basically lets us use jQuery in a Node.js environment.

npm install pdf-image --save
npm install cheerio --save

Add them to the project, preferably up at the top. You'll also need to require http and fs, which don't need to be installed because they come with Node.js (fs is what we'll use to write the downloaded PDF to disk).

var cheerio = require("cheerio"),
    fs = require("fs"),
    http = require("http"),
    PDFImage = require("pdf-image").PDFImage;

And implement this by requesting the URL that hosts the PDF of the article, downloading that PDF to a temp directory, and converting that PDF file into an image. Again, we do this within the docs loop.

obj.url = "http://query.nytimes.com/mem/archive-free/pdf?res=" + d.web_url.split("res=")[1];

// some variables for creating a unique pdf file name
obj.date = d.pub_date.split("T")[0];
obj.page = page + 1;
obj.page_index = i + 1;

// this is the pdf file name
obj.pdf_file_name = "temp/" + obj.date + "_" + obj.page + "_" + obj.page_index + ".pdf";

// a stream to write the downloaded pdf file to
var file = fs.createWriteStream(obj.pdf_file_name);

// get the html of the nyt page
request(obj.url, function(error, response, body) {
  if (!error && response.statusCode == 200) {
    // load cheerio
    var $ = cheerio.load(body);

    // find the pdf url in the response
    var pdf = $("iframe").attr("src");

    // download the pdf
    http.get(pdf, function(response) {

      // pipe the response to the file
      var stream = response.pipe(file);

      // when it's done, we'll convert the pdf to an image
      stream.on("finish", function() {
        // convert to an image with a white background
        var pdfImage = new PDFImage(obj.pdf_file_name, {
          convertOptions: {
            "-background": "white",
            "-flatten": ""
          }
        });
        pdfImage.convertPage(0).then(function(imagePath) {
          console.log(imagePath);
        });
      });
    });
  } else {
    console.log("Error getting PDF");
    console.log(error);
  }
});

Now that we’ve written some nice tweets and created images to go with them, it’s time to send them out into the world.

Step 3: Send the tweet

To gain access to Twitter’s API, you’ll need to create a Twitter app and follow the steps until you have a consumer key, consumer secret, access token, and access token secret.

The next part’s fairly simple, thanks to the excellent Twit module. Go ahead and install that first.

npm install twit --save

Now require the module and create a Twit instance using your Twitter app credentials.

var Twit = require("twit");

// a new Twit instance
var T = new Twit({
  consumer_key: your_consumer_key,
  consumer_secret: your_consumer_secret,
  access_token: your_access_token,
  access_token_secret: your_access_token_secret
});
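One suggestion that pairs well with the Heroku deployment in Step 4: keep the four credentials out of your source code by reading them from environment variables (Heroku exposes its config vars through process.env). The variable names below are my own convention, not something Twit or Heroku requires:

```javascript
// Build the Twit config from environment variables so no secrets
// get committed to the repo. The variable names are arbitrary.
function twitterConfig(env) {
  return {
    consumer_key: env.TWITTER_CONSUMER_KEY,
    consumer_secret: env.TWITTER_CONSUMER_SECRET,
    access_token: env.TWITTER_ACCESS_TOKEN,
    access_token_secret: env.TWITTER_ACCESS_TOKEN_SECRET
  };
}

// var T = new Twit(twitterConfig(process.env));
```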

And then, at the end of the docs loop, add some code.

pdfImage.convertPage(0).then(function(imagePath) {

  // read the image and encode it for upload
  var b64content = fs.readFileSync(imagePath, { encoding: "base64" });

  // first we must post the media to Twitter
  T.post("media/upload", { media_data: b64content }, function(err, data, response) {

    // now we can assign alt text to the media
    // for use by screen readers and
    // other text-based presentations and interpreters
    var mediaIdStr = data.media_id_string;
    var altText = obj.headline;
    var meta_params = { media_id: mediaIdStr, alt_text: { text: altText } };

    // create the tweet's metadata, which contains the image
    T.post("media/metadata/create", meta_params, function(err, data, response) {
      if (!err) {
        // now we can reference the media and post a tweet
        // (the media will attach to the tweet)
        var params = { status: obj.tweet, media_ids: [mediaIdStr] };

        // post the tweet
        T.post("statuses/update", params, function(err, data, response) {
          if (!err) {
            // log each tweet so we know it was sent
            console.log(data.text);
            console.log(" ");
          } else {
            console.log(err.message);
          }
        });
      }
    });
  });
});

Step 4: Automatically run the code every hour

Now, if we run node index.js, we’ll tweet out everything at once. But that’s not how real newspapers use Twitter. Rather, they tweet their stories out over the course of the day. And that’s what we want our bot to do.

For this task, we’ll need to put our code on some server and run it every hour, only sending out 1/24th of the tweets each time. To do this, we’ll be using Heroku. If you’re new to Heroku, you should create an account and go through the tutorial on how to deploy a Node.js app, as that will give you a basic understanding of what comes next.

We are going to be using an addon called Scheduler. If you'd like, you can read a simple tutorial on how to use Scheduler here. Now, in your terminal, you must log in to Heroku.

heroku login

Create your app.

heroku create nyt-1917

Deploy your app.

git push heroku master

Alright, let’s now add the Scheduler addon.

heroku addons:add scheduler

To use Scheduler, you will need to provide a valid credit card. Don’t worry, you won’t be charged unless you go over a certain usage limit. I’ve been running my bot for a while, and my estimated monthly cost, which Heroku provides, is $0.00. In other words, free. Follow the link in your terminal to add your credit card details.

Once you’ve provided your credit card details, you’ll need to create a directory called bin and put a file in it called tweet (or any other name you’d like to use to describe the process you want to schedule). Note that there is no file extension.

mkdir bin
touch bin/tweet

Open up that tweet file and add a line at the top called a shebang.

#!/usr/bin/env node

And copy everything from index.js into the tweet file. Now, there’s just one more thing we must add to the code before we start up the Scheduler, because we need to decide which tweets to send out at any given hour of the day. To do that, we’ll first get the current hour. Obviously, this will change depending on when the app runs. Somewhere at the top of your code, make a variable for the current hour.

var current_hour = moment().format("H");

This will return the current hour in 24-hour time (as a string, which is fine, since we'll compare it with loose equality). We just need to assign an hour to every tweet, and when we run the code, we'll only send out the tweets whose hours match the current_hour. Within the docs loop, add some code.

obj.page = page + 1;
obj.page_index = i + 1;
obj.tweet_number = (page * 10) + obj.page_index;

// calculate the hour of the day that the tweet
// should go out, based on the total number of tweets
obj.hour_of_tweet = Math.round((obj.tweet_number * 24) / hits);
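To see how that formula spreads the tweets over the day, here it is as a standalone function (the name is mine, for illustration) with a couple of worked values:

```javascript
// tweet_number runs from 1 to hits; the result is the hour slot
// the tweet is assigned to.
function hourOfTweet(tweetNumber, hits) {
  return Math.round((tweetNumber * 24) / hits);
}

// With 48 matching articles, two tweets land in each hour slot:
// hourOfTweet(23, 48) and hourOfTweet(24, 48) both give 12.
```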

Before you actually post the tweet, you can write a conditional statement to check whether the obj.hour_of_tweet matches the current_hour.

if (obj.hour_of_tweet == current_hour) {
  // here is where you put the code that
  // 1. downloads the pdf
  // 2. converts it to an image
  // 3. posts the tweet
  // (see above for instructions on how to do this)
}

Log in to Heroku, select your app, and, in the top left, under “Installed add-ons”, select Heroku Scheduler. Click “Add new job”, where it says rake do_something type tweet, and set the “FREQUENCY” to “Hourly”.

That’s it. Here’s the bot’s Twitter page, so you can see it in action, mixing it up with all those famous Russians from 100 years ago. And here’s the GitHub repo with all the code.