Crossword Chronicles Issue 1 of...probably 1

teejaydub
May 3, 2020
6 min read

What's fifty one letters long and an overall terrible non-joke?

That lede.

Crosswords! Nothing makes you feel quite as inept, or inspires such deep frustration, as a good crossword. They can make you feel out of touch with pop culture, current events, historical events, and even challenge what you believed to be your solid understanding of the English language. Sometimes, though, you get through one clue and the answers start to cascade and fall into place. Then it's kinda worth it.

After I made many attempts to solve crossword puzzles, I started to dabble in making crossword puzzles.

It's illegible by choice

Now I know you might be saying:

"but terrence there's that online tool that teachers use to generate crosswords how is this even a hobby?"

First off, kind of aggressive. Secondly, whereas those internet crossword generators have few rules, the crosswords you would find in a newspaper such as the New York Times have a very rigid set of rules:

15x15 squares (21x21 on Sunday)
Rotational symmetry (spin 'em 180 degrees and the black spaces appear in the same spots)
Clues must match the type and tense of the answer (e.g. "South FL electric co." would lead to the abbreviated answer "FPL")
Three letter minimum word length

These rules, among several others, make puzzle creation a fun challenge.
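Two of those rules are easy to check by machine. Here's a minimal sketch, assuming the grid is stored as a list of equal-length strings where "#" is a black square and "." is a white one. That representation and both function names are my own invention for illustration, not anything the newspapers use.

```python
def has_rotational_symmetry(grid):
    """Return True if the black squares land in the same spots after a 180-degree turn."""
    height = len(grid)
    return all(
        (grid[r][c] == "#") == (grid[height - 1 - r][len(grid[r]) - 1 - c] == "#")
        for r in range(height)
        for c in range(len(grid[r]))
    )


def shortest_entry(grid):
    """Return the length of the shortest across or down run of white squares."""
    rows = ["".join(row) for row in grid]
    cols = ["".join(col) for col in zip(*grid)]
    runs = [run for line in rows + cols for run in line.split("#") if run]
    return min(len(run) for run in runs)


# A tiny 5x5 example grid (real puzzles would be 15x15 or 21x21).
grid = [
    "...##",
    ".....",
    ".....",
    ".....",
    "##...",
]
print(has_rotational_symmetry(grid))  # True
print(shortest_entry(grid) >= 3)      # True
```

A grid passes these two rules when the first function returns True and the second returns at least 3.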

I started creating them using pen and paper, with the more-than-occasional assistance of a website intended to help solve (read: cheat) crossword puzzles. The site allows you to select a word length and fill in a few slots with letters, and it will spit out a list of potential words that fit that pattern. It is an invaluable assistant when creating crosswords.

I was going to use 'Harambe' as the example word in the image but this tool didn't even know His name. I have a new motivation to continue developing what surely must be a superior tool.
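The pattern lookup that site performs can be approximated in a few lines. This is only a sketch of the idea, not how that website actually works: I'm assuming "?" marks an unknown letter, and the word list here is a made-up stand-in for a real dictionary.

```python
import re


def find_matches(pattern, words):
    """Return the words that fit a pattern such as 'C??SS', where '?' is any letter."""
    regex = re.compile(
        "".join("[A-Z]" if ch == "?" else re.escape(ch) for ch in pattern.upper())
    )
    # fullmatch enforces the word length as well as the known letters.
    return [word for word in words if regex.fullmatch(word.upper())]


# Hypothetical mini word list, just for the demo.
WORDS = ["CROSS", "CLASS", "CHESS", "CLUES", "HARAMBE"]
print(find_matches("C??SS", WORDS))    # ['CROSS', 'CLASS', 'CHESS']
print(find_matches("H?R?MBE", WORDS))  # ['HARAMBE']
```

Compiling the pattern once and scanning the list is fine for a prototype; a real assistant with hundreds of thousands of entries would probably want to bucket words by length first.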

After writing it out on paper, I occasionally transferred the results to Excel to format them nicely. It's not tremendously convenient, so I'm mildly determined to make the whole process better: a desktop tool for creating crosswords, with a built-in assistant that generates potential words or phrases to fit a space on the board. Such tools exist, but they are either tremendously outdated or just plain expensive. Are the days of my livelihood spent on this project more valuable than the cost of the existing expensive product? Yeah, probably. But I'm learning along the waaaaay and that has some added value, right?

Before I can make the entire tool, I first want to prototype some ideas for the assistant due to its importance as a feature of the overall tool. This post will focus on the research and initial testing to determine what it takes to make a somewhat comprehensive database for the word generation assistant to utilize.

How much data is too much data?

The first stage of this crossword maker starts with research into how I will create the word generation assistant. There are a few things that I need to keep in mind. Mainly, crossword puzzle answers don't have to be dictionary words. They can be celebrity names, famous places, historical events, movie titles, book titles, quotes, idioms, proverbs, aphorisms, catchphrases, exclamations, and similes too.

Dictionary words should be easy to get a hold of in a digital format, preferably CSV. The others will prove to be more difficult. If only there were an online encyclopedia that had millions of pages on places, people, media, and historical events.

If their donation campaign was going on right now I would have seriously thought about the possibility of considering to make a pledge, maybe.

Yes, Wikipedia. And they have a place where you can download all of the pages! Literal petabytes of data! Fortunately, we don't need the whole page, just the titles. Wikipedia offers the titles as a single downloadable file, and it clocks in at around three hundred megabytes.

Notepad++ reports that this text file has fifteen MILLION lines, one line for each page title. The entire English dictionary has two orders of magnitude fewer words (that's without conjugation variations; counting those, it clocks in closer to 500k words). That's more data than I need. Looking at some of the page titles, I see a tremendous number of entries that are irrelevant or useless for a helpful crossword dictionary. These include:

Hundreds of pages on individual air traffic controller codes
Area codes
Historical Members of the Queen's Privy Council for Canada (1867-1899)
Historical Members of the Queen's Privy Council for Canada (1910-1940)
Hundreds of twitter handles
This chemical: 1-organyl-2-arachidonoyl-sn-glycero-3-phosphocholine:1-organyl-2-lyso-sn-glycero-3-phosphoethanolamine_arachidonoyltransferase_(CoA-independent),
1974_Arizona_gubernatorial_election
1975_German_motorcycle_Grand_Prix, you get the idea.

My guess is that millions of those pages get little to no visits at all. If I could determine which pages are actually visited, I could probably narrow down the list to contain more relevant page titles, and in turn, a more useful crossword dictionary reference.

It's a popularity contest

Wikipedia gives users access to a web API that lets you make HTTP requests about a page, including how popular that page was over a given period, be it months or days. The idea is this: ask Wikipedia how many people visited a page, and if the count is above my threshold of "popularity", capture that page title and save it in a separate text document called PopularOnes. Rinse, wash, repeat x 15 million. After a couple of hours, I knocked out a Python script that does exactly this.
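For the curious, here is roughly what such a script could look like. This is a reconstruction for illustration, not my actual script: it assumes the Wikimedia REST pageviews endpoint (per-article, monthly granularity), and the function names, the output file name, and the User-Agent string are all placeholders. The threshold and date range are left as parameters.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title, start, end, project="en.wikipedia"):
    """Build the monthly per-article pageviews URL for one page title."""
    article = urllib.parse.quote(title, safe="")
    return f"{BASE}/{project}/all-access/user/{article}/monthly/{start}/{end}"


def total_views(payload):
    """Sum the 'views' field over every item in a decoded API response."""
    return sum(item["views"] for item in payload.get("items", []))


def fetch_views(title, start, end):
    """Ask the API for one title's view count (needs network access)."""
    request = urllib.request.Request(
        pageviews_url(title, start, end),
        # Placeholder User-Agent; a real run should identify itself properly.
        headers={"User-Agent": "crossword-title-filter/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as response:
        return total_views(json.load(response))


def keep_popular(titles, threshold, start, end, out_path="PopularOnes.txt"):
    """Append every title at or above the threshold to the output file."""
    with open(out_path, "a", encoding="utf-8") as out:
        for title in titles:
            try:
                views = fetch_views(title, start, end)
            except urllib.error.HTTPError:
                continue  # assumption: an HTTP error means no data for that title
            if views >= threshold:
                out.write(title + "\n")
```

Dates are passed as timestamp strings such as "2019010100", and titles are expected in the underscore form used by the title dump, e.g. "Cary_Grant".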

How do I decide what is "popular"? Well, first let's sample some data. How many visitors did Cary Grant's page get in January of 2019? Look at that, hundreds of thousands. Parkinson's Disease? 100k. After looking at a few more statistics, I start looking at the other end of the spectrum - things that shouldn't show up in a crossword puzzle.

The balance was struck somewhere between the drummer of Vampire Weekend and the lead singer of Depeche Mode. Sorry, Chris Tomson, I've decided that only frontmen of English electronic music bands from the '80s can be popular enough to appear in crosswords. You're welcome, Mr. Gahan.

As it turns out, according to my algorithm, not all People Are People.

The threshold I've set for "popular" is 2500 views per month. What's good is that once I've processed all of the data, I can take a finer-toothed comb to it and distill only the page titles with, for example, 50k views per month without having to run over the entire 15 million list again. I've chosen January 2019 to be the sampling date for popularity. I don't think this will be a huge issue unless there is a tremendously seasonal page that is left alone in the cold month of January. While I could have averaged page popularity over the span of a year or multiple years, I was worried about the additional required processing and requesting time since each month is appended to the initial request. Speaking of time…
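That second pass only works if the view count is stored next to each title. Here's a minimal sketch under that assumption, with one "title, tab, views" record per line. The record format, the function name, and the sample numbers are all invented for the demo.

```python
def refilter(lines, min_views):
    """Keep titles whose recorded monthly views meet a stricter cutoff."""
    kept = []
    for line in lines:
        # Split on the last tab so a title containing tabs can't confuse the parse.
        title, _, views = line.rstrip("\n").rpartition("\t")
        if title and views.isdigit() and int(views) >= min_views:
            kept.append(title)
    return kept


# Made-up records standing in for the saved first-pass output.
sample = ["Title_A\t2600", "Title_B\t75000", "Title_C\t50000", "malformed line"]
print(refilter(sample, 50_000))  # ['Title_B', 'Title_C']
```

Since this only reads a local file, tightening the cutoff takes seconds instead of another round of API requests.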

How long will this take?

Wikipedia has imposed limits on how many requests you can make to their servers to get this information: 100 requests per second. At first glance that seems like a great number. In practice, though, I am only able to achieve closer to 20 requests per second. Doing the math, we're looking at 208 hours of the script running without interruption to roll through all of the data.
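The math, spelled out, using the 15 million titles and the two request rates mentioned above (the function name is just for illustration):

```python
TOTAL_TITLES = 15_000_000


def hours_needed(titles, requests_per_second):
    """One request per title, converted from seconds to hours."""
    return titles / requests_per_second / 3600


print(round(hours_needed(TOTAL_TITLES, 20), 1))   # 208.3 hours at the rate I actually get
print(round(hours_needed(TOTAL_TITLES, 100), 1))  # 41.7 hours at the official limit
```

So even at the full 100 requests per second, this would still be a multi-day job.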

Looking on the bright side, this process only has to run once. I didn't do much to get around this limit beyond asking my friends for help. I packaged up the Python script and its requisite plugins, and handed each of them about 1.5 million page titles to process. Thank you, Jay, Angel, and Anthony.
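Splitting the work into batches is the easy part. A sketch of one way to do it, assuming the titles are read from an iterable such as an open file; the helper name is hypothetical.

```python
from itertools import islice


def chunked(titles, size):
    """Yield consecutive lists of at most `size` titles."""
    iterator = iter(titles)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Small demo; the real batches would be about 1.5 million titles each.
print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because it pulls from an iterator, this never needs the whole 15-million-line file in memory at once; each batch can be written to its own file and handed off.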

How's it looking so far?

So far, so good! I've gone through about a million lines, and we're seeing about a 30x reduction in Wikipedia page titles. In other words, at the end of this process, the 15 million page titles that currently exist on Wikipedia will be distilled into approximately 500k popular-ish titles. When the assistant has been coded, it will be able to leverage the 500k Wiki titles, an English dictionary resource of approximately 400k words, as well as a list of idioms, similes, aphorisms, quotes, proverbs, and catchphrases.

There will still need to be some additional filtering of extraneous page titles from the Wiki sources. I also want to make it easy for the user of the tool to add their own dictionaries of words, since I was advised not to include Urban Dictionary as a crossword resource.
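I haven't settled on the filtering rules yet, so treat the following as one possible sketch rather than the plan. It assumes that titles containing digits are junk, that a trailing parenthetical is a disambiguator, and that accents can be flattened to plain letters. All three are guesses on my part, and the function name is hypothetical.

```python
import re
import unicodedata


def to_entry(title):
    """Turn a page title into a grid entry, or None if it looks unusable."""
    title = re.sub(r"_\(.*\)$", "", title)  # drop a trailing "(disambiguator)"
    if re.search(r"\d", title):             # assumption: skip titles with digits
        return None
    # Flatten accented letters to their plain ASCII base letters.
    plain = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    entry = re.sub(r"[^A-Za-z]", "", plain).upper()
    return entry if len(entry) >= 3 else None  # three-letter minimum from the rules


print(to_entry("Cary_Grant"))                           # CARYGRANT
print(to_entry("1974_Arizona_gubernatorial_election"))  # None
print(to_entry("Mercury_(planet)"))                     # MERCURY
```

The same function could be run over a user-supplied word list, so custom dictionaries end up in the same uppercase, letters-only form as everything else.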

What's next?

Well, I'm imagining the desktop tool allowing for the creation of crossword puzzles from black tile placement to clue creation and the ability to export and print the generated crossword. Further goals may see the creation of a mobile app, at least for the assistant functionality, and the potential to turn the tool into a full-fledged way to experience and solve other people's crossword puzzles on the PC.

A quick search on Steam reveals that there are zero serious crossword puzzle games, which is a clear indication either that there is zero interest or that there is a potentially untapped market for it. Probably the former, but I'd have to do more research before jumping to a hasty conclusion.

teejaydub's blog