What Should a robots.txt File Contain?

Knowing what your robots.txt file contains is helpful in many ways, and here is what it needs to have.

Video Transcript:

Dave: Do you have anything on Curious Ants on that?

David: I don’t. Yeah. That’s a great question. I should add something. I’ll add that to the list of things to add. But it really depends on… A good robots.txt depends on the platform you’re using. So, number one, with WordPress, the only thing we really want to exclude Google from seeing is wp-admin. Old-school robots.txt files used to prevent Google from seeing the JavaScript and the CSS, which is not a best practice at this point because Google uses them to see how the site renders on mobile, to see if it passes the mobile-friendly test, and for page speed things like Core Web Vitals. So, it needs to see all that. So, if you are blocking Google from wp-content, or even wp-includes, you would potentially be hurting Google’s ability to see all that it needs to see to evaluate the quality of a website.
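To make the WordPress point concrete, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt content, the `example.com` domain, and the theme and script paths are all hypothetical, and Python’s parser is only an approximation of how Google itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical minimal robots.txt for a WordPress site:
# only the admin area is excluded, CSS and JavaScript stay crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com and these paths are placeholders for illustration.
admin_ok = parser.can_fetch("Googlebot", "https://example.com/wp-admin/options.php")
css_ok = parser.can_fetch("Googlebot", "https://example.com/wp-content/themes/demo/style.css")
js_ok = parser.can_fetch("Googlebot", "https://example.com/wp-includes/js/jquery/jquery.min.js")

print(admin_ok, css_ok, js_ok)  # False True True
```

With this file, the admin area is off limits while the theme CSS and core JavaScript remain available for rendering checks.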

Dave: Alright. So, let me bring up…

David: I will add, while you’re looking at that, remember that robots.txt is a suggestion to Google that it respects, typically. If it thinks something is excluded that shouldn’t be, it will warn you in the Search Console notifications. But it won’t keep nefarious or ill-intentioned spiders away from things you might not want them to see. It doesn’t password-protect files, right? I think I’ve mentioned this before. Don’t count on your robots.txt file to hide a CSV of credit card data. You’re basically telling malicious spiders where to get that stuff. It’s not password protected. It is a list of pages you don’t want spiders to go to. Bad spiders will try to go to them. So, don’t put anything sensitive behind it thinking you’re protected.
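As a rough illustration of why robots.txt is not a security measure: the file is public, so anyone can read the disallowed paths straight out of it. The sketch below uses a hypothetical helper, `disallowed_paths`, and a made-up file path to show how little effort that takes.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow path from robots.txt text."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths


# Hypothetical file: the second rule advertises exactly what it tries to hide.
SAMPLE = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private/customer-export.csv
"""

found = disallowed_paths(SAMPLE)
print(found)  # ['/wp-admin/', '/private/customer-export.csv']
```

Anything that truly needs protecting belongs behind authentication, not in a Disallow line.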

Dave: Okay. So, what have we got in it? We’ve got user agent. Can I just put what we have in chat?

David: Yeah.

Dave: Okay. That’s easier. Okay. So, here we go. Okay. Excuse the formatting, but let’s put it in chat.

David: Yeah. Good. Okay. Not every spider respects the allow command. So, for some of them, it doesn’t really do anything, but that’s okay. User-agent asterisk means that for every user agent, the following rules apply. And then disallow admin. Great. Disallow trackback… Okay. This is actually pretty old WordPress. I don’t think trackback even exists in WordPress installs anymore.

Dave: I thought you could still do those, but maybe…

David: I don’t think xml-rpc exists in WordPress anymore, does it? You’re a developer. So again, that’s just telling Google not to go there or the bots not to go there, but it doesn’t prevent them from going.

Dave: Correct.

David: What I like about this is, especially, the sitemap listing. Listing the XML sitemap in the robots.txt file. That’s super important. Number one, remember when Google visits your site, it doesn’t visit every page of your site every time it visits. Right? However, every time it visits your site, it will look at the robots.txt file. So, if you’ve got your XML sitemap listed, it will look at that. Now, what it will do is it will then determine whether anything has changed in your sitemap since the last time it came for a visit. It’ll prioritize the new things.
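For reference, listing a sitemap is a single `Sitemap:` line in the file. The sketch below, again using Python’s standard `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later), reads that line back out. The file content and the `example.com` URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with the XML sitemap listed at the end.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['https://example.com/sitemap.xml']
```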

Dave: So, what you’re saying is having the sitemap in the robots.txt will make it easier for them?

David: It will make it quicker to see the stuff that’s changed. Now, again, it doesn’t mean Google comes to your site every day. Sometimes it comes once a week. I’ve had one client where it would come once a day or more, but for most clients, it comes every few days or every week if it’s a really new site. But this helps Google to say, hey, spend your time on these pages because they’re new, which is really what we want Google to look at, right? The new stuff that has just changed. Otherwise, it will just kind of start at the beginning and crawl through. But even with the XML sitemap, it still won’t crawl the whole site. It will download the whole sitemap, but it won’t crawl the whole site. And I don’t know what the user agent SeekportBot is.

Dave: Yeah, that was something I think I brought up. I don’t know if it was last week or the week before. The hosting company said, hey, this looks like this bot, this generic search engine thing, is doing a little bit too much. They say that they will respect that, so they recommend putting that in there.

David: Okay. Yeah. So, this is a crawl delay. And that just keeps the server from being overloaded. But because there is no disallow directive for that bot, it is allowed to access everything.
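A small sketch of that last point, using Python’s standard `urllib.robotparser`. The bot name `ExampleBot`, the 10-second delay, and the `example.com` URL are placeholders, since the actual file from the chat is not shown here. Under Python’s parser, a bot with its own group follows only that group, so a group with a crawl delay and no disallow leaves everything open to it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: general rules for everyone, plus a group that
# only slows one bot down ("ExampleBot" and 10 seconds are made up).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

User-agent: ExampleBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

admin_url = "https://example.com/wp-admin/"

delay = parser.crawl_delay("ExampleBot")                 # delay from its own group
bot_admin_ok = parser.can_fetch("ExampleBot", admin_url)  # own group has no Disallow
other_admin_ok = parser.can_fetch("OtherBot", admin_url)  # falls back to the * group

print(delay, bot_admin_ok, other_admin_ok)  # 10 True False
```

If the slowed-down bot should also stay out of the admin area, the disallow line would need to be repeated inside its own group.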

Dave: Sure. And I think that’s fine.

David: Yeah, right. Yeah, actually, so great. I think this is fine. There’s nothing in there that would be a red flag to me.

Dave: Okay, good.

David: The classic mistake is accidentally having a robots.txt that disallows everything. Right. Which is a good developer practice while a site is still in development. It’s actually kind of an old-school developer practice. But oftentimes, developers forget to remove that or edit it once they launch. Number one on my launch checklist is to check that.
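That launch check can be sketched in a few lines of Python with the standard `urllib.robotparser`. The function name `blocks_everything` and both sample files are hypothetical. It only tests whether the root path is blocked, which is enough to catch a blanket `Disallow: /` left over from development.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text blocks the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical development-site file that was never updated at launch.
STAGING = "User-agent: *\nDisallow: /\n"
# Hypothetical file for the live site.
LIVE = "User-agent: *\nDisallow: /wp-admin/\n"

staging_blocked = blocks_everything(STAGING)
live_blocked = blocks_everything(LIVE)
print(staging_blocked, live_blocked)  # True False
```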

Dave: Yeah, for sure. Okay. Thank you for that. Awesome.
