
I for one welcome our GPT overlords co-workers

Yes, yes, I promised I’d focus less on AI after a flurry of posts late last year. 
But honestly, I can’t–I think it is that significant. 

As we put together the program for Code in June, and Summit later this year, submission after submission across dev, product, design and content is focussed on the impact of AI on all these areas of practice.

Of course, when something is suddenly everywhere there are, among the thoughtful and useful responses, the crypto-bros, hot takers, listicle writers and assorted grifters. Gotta go somewhere now the Web3 ship has run aground, I guess.

But there really is a “there” there. This, as I’ve mentioned quite a few times, is not something I simply talk about; it’s something I’ve been not just exploring and experimenting with, but increasingly implementing day to day.

I’ve written quite a lot on the various AI technologies (I’m just going to use the AI shorthand, ok?) we’re using with Conffab, but increasingly I’m also using AI day to day while programming (yes, I still do that!).

Here’s an experience from this week that might give a sense of how this can be useful.

One of the significant workflow bottlenecks in adding presentations to Conffab is adding what we call ‘accessible slides’ to a video. In short, we take every slide in a presentation, convert it to structured HTML, add text descriptions for images, charts, graphs and so on, and determine where in the presentation each slide first appears. Ultimately, for every presentation we end up with markup like this for each slide:

<section data-starttime='00:06:32'>
  <h4>Documentation should not be sprinkles.</h4>
  <p class="description">background image of sprinkles</p>
</section>

So how do we do this?

Well, the obvious answer is you ask the speaker for their slide deck. You then <waves magic wand> convert it to HTML. Manually go through the presentation to see where each slide starts and mark up that starting point. Et voilà.

As you can imagine, it’s a lot more complicated than that.

First, despite everyone’s best intentions, you won’t get the deck from every speaker for all sorts of reasons.

And some presentations don’t even have decks–live-coded presentations, for example. (I’ll leave my thoughts about live coding as a presentation technique for elsewhere.)

And decks come in so many different formats–PowerPoint, Keynote, Google Slides, PDF, any number of HTML formats (including speakers’ own bespoke HTML, something I’ve done myself), Markdown.

Even getting from any of these to the text of a slide deck is a lot of work. Over the years I’ve developed a whole heap of scripts and techniques, and there’s still almost always a partly manual process.

My instinct was that there was a better way.

We’d actually started using OCR on screenshots of a presentation (macOS, for example, can now OCR any text onscreen, but we use a tool called TextSniper that, in our experience, makes the process cleaner).

But, I thought to myself, what if we could take the video of the presentation and:

  1. identify where each new slide starts
  2. take a screenshot at that point
  3. OCR the screenshot
  4. format that text as HTML, automatically timestamped, since we know when in the video we took the screenshot?

This is what I spent a couple of days last week doing, while recuperating from Covid (very mild, doing ok, thanks for asking).

But it was a tale of two halves.

I began very old school, using search engines, with terms like ‘detecting scenes in videos’. Which led me to FFmpeg (an open source video framework that probably powers over 50% of the video apps in the world, like Blender, VLC and HandBrake, and that handles video in Chrome and in Firefox on Linux, among much else).

And so for a day or two I used good old fashioned search engines, Stack Overflow, and the articles and so on I found, to fashion a version 0.1. Almost exactly as I have done countless times over the last 20 years or more.

Basically, it could do scene detection (there’s an FFmpeg filter for that) and extract screenshots as PNGs, each named with the timestamp of the slide it captures.
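
For the curious, that first pass boils down to something like this: a minimal sketch, assuming a video called talk.mp4 and a scene-change threshold of 0.1 (both placeholders, and the threshold in particular needs tuning per talk).

#!/usr/bin/env bash
# Detect slide changes with FFmpeg's scene detection, keeping one PNG
# per detected change. select='gt(scene,0.1)' drops every frame whose
# scene-change score falls below the threshold, and showinfo logs the
# timestamp (pts_time) of each frame that survives.
mkdir -p frames
ffmpeg -i talk.mp4 \
  -vf "select='gt(scene,0.1)',showinfo" \
  -vsync vfr frames/%04d.png 2> showinfo.log
# Pull the kept frames' timestamps (in seconds) out of the log.
grep -o 'pts_time:[0-9.]*' showinfo.log | cut -d: -f2 > timestamps.txt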

Toward the end of the process I thought ‘hmmm, maybe I could now see how ChatGPT goes about this?’.
And so I duly asked it to do more or less what I had been doing for the last day or so.
And in about 15s, it replicated all that work, very similarly to what I had spent the last day doing.

Plus I got it to do the really painful things I’ve had to do in all sorts of programming languages (converting times formatted as hh:mm:ss to hh-mm-ss, adding 3s to a time formatted as hh:mm:ss), all the stuff I’m just too old and lazy to do myself anymore.

In the past I would have looked up the regular expression, or gone hunting for some code in whatever language I was using at the time–nope, ChatGPT just did it in 5s.
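
In Bash, for instance, that sort of time fiddling comes out something like this (a sketch of the kind of thing it generated for me, not its exact output):

# Convert hh:mm:ss to hh-mm-ss (handy for filenames):
ts="00:06:32"
echo "${ts//:/-}"    # prints 00-06-32
# Add 3 seconds to an hh:mm:ss time: convert to seconds, add, then
# format back. The 10# prefix stops leading zeros reading as octal.
secs=$(( 10#${ts:0:2} * 3600 + 10#${ts:3:2} * 60 + 10#${ts:6:2} + 3 ))
printf '%02d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
# prints 00:06:35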

At first I felt a bit like I was cheating… but I got over that. Sort of. It’s a strange feeling (more on that a bit later).

As I started to ‘trust’ ChatGPT, I started leaning more and more heavily on it.

It certainly makes mistakes–I’ll run a script and get an error, go back to the chat and say ‘I got this error’. It apologises, then fixes it, and the fix almost always works.

All in

I’d not yet got to the OCR part. That’s a bit of a step up from giving me a regex or a code snippet for a pretty straightforward problem. I decided I’d upgrade to GPT-4 (it’s $20 a month; ChatGPT is free), and asked:

can you create a bash script to take every png img file in a folder, do OCR on the image for text it contains and append the text to an HTML file in the format “<section data-starttime='00:25:58'>\n ” then the OCR content then “\n</section>”

It suggested using Tesseract (good old fashioned search would have helped me find that relatively quickly too) and told me how to install it on my Mac (which I probably would have worked out pretty quickly using brew, but now it was just a copy and paste away, and that worked).

And from that prompt it just wrote the entire script. I ran it, and got a couple of errors–which it then corrected. I ran it again, and it just worked.
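
I won’t reproduce its script verbatim, but the shape of it was roughly this: a sketch assuming Tesseract is installed (brew install tesseract), and that each PNG is named for its hh-mm-ss timestamp, as above.

#!/usr/bin/env bash
# OCR every PNG in the current folder, appending the text to an HTML
# file as one timestamped <section> per slide.
out="slides.html"
: > "$out"    # start with an empty file
for img in *.png; do
  # Recover hh:mm:ss from an hh-mm-ss filename like 00-25-58.png.
  stamp="${img%.png}"
  stamp="${stamp//-/:}"
  # With "stdout" as the output base, tesseract prints the recognised
  # text instead of writing a file.
  text=$(tesseract "$img" stdout 2>/dev/null)
  printf "<section data-starttime='%s'>\n%s\n</section>\n" \
    "$stamp" "$text" >> "$out"
done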

All this took maybe 10 mins. Others could probably do it in a similar amount of time without ChatGPT, but I certainly couldn’t. I hadn’t even really heard of Tesseract 10 mins before; now I had OCR’d dozens of slides from a video.

And I suspect it would have taken way more than 10 mins to resolve the errors by finding some reference via a search engine, going to whatever Stack Overflow question addressed it, wading through the answers, and trying a few suggestions out.

As yet, this hasn’t replaced all the work we’d been doing manually, but it has replaced 75% of it, including the most exacting, tedious and error prone part: time-coding the slides.

More ambitious

Simon Willison recently wrote:

The thing I’m most excited about in our weird new AI-enhanced reality is the way it allows me to be more ambitious with my projects.

As an experienced developer, ChatGPT (and GitHub Copilot) save me an enormous amount of “figuring things out” time. For everything from writing a for loop in Bash to remembering how to make a cross-domain CORS request in JavaScript—I don’t need to even look things up any more, I can just prompt it and get the right answer 80% of the time.

This doesn’t just make me more productive: it lowers my bar for when a project is worth investing time in at all.

In the past I’ve had plenty of ideas for projects which I’ve ruled out because they would take a day—or days—of work to get to a point where they’re useful. I have enough other stuff to build already!

But if ChatGPT can drop that down to an hour or less, those projects can suddenly become viable.

Which means I’m building all sorts of weird and interesting little things that previously I wouldn’t have invested the time in.

AI-enhanced development makes me more ambitious with my projects

Others I’ve seen have expressed similar ideas. One significant barrier to starting any project is the sense that it will be hours before you’ve even gauged its feasibility and got a proper sense of how much work it is really going to be. ChatGPT, I’ve found, can get you to the point of being able to make that decision much more quickly (plus, by drastically reducing the boilerplate and the mundane aspects of coding, it lets you keep your energy for, and focus on, the more subtle aspects: the what instead of the how).

Am I redundant now?

I was relaying this story to a friend who is not a developer (though they’re a scientist who has done a lot of programming), who observed:

Demonstrates also how much knowledge you need to ask the right question and identify the limits

Which is a very astute observation–a lot of this is about intuition, and the years (in my case decades) of experience, knowledge and abstractions I’ve developed that give me a good sense of what it is doing, and to an extent why.

You need that knowledge first to even conceive of a strategy that might address the problem you’re trying to solve, then to assess the feasibility of a suggested solution, and of course to understand what the code it produces is doing.

It’s really about learning to trust the AI, something that happened slowly over the time I spent using it (I mean, just executing `brew install arbitrary-binary`…), and something that goes hand in hand with understanding. Otherwise code is little more than magical incantation: recite the special words, and something magical happens.

The takeaway? If you’ve not already, start exploring how ChatGPT can augment your work–not just as a developer. Develop intuitions about what it is, and isn’t, capable of. That only comes through use.

As Allen Downey recently tweeted:

everyone who writes code should spend the next month doing professional development on writing code with LLM-assist. This is how code will be written from now on

Allen Downey

And we’ll cover this all much more at Code, Summit and elsewhere in the coming months.
