On Tuesday, June 28, 2022, Google updated its Googlebot documentation to clarify that Googlebot can only "see" the first 15 megabytes (MB) of certain file types when fetching them. The limit has been in place for many years; it was only recently added to the documentation to help people who are debugging. The limit applies only to the initial request made by Googlebot, not to the resources referenced within the page. For example, if an HTML page references a JavaScript file, Googlebot can still fetch and see that JavaScript file.
Most likely, the 15 MB limit won't have much of an impact on your site, since very few pages on the internet are larger than that. However, if you do happen to own an HTML page that's over 15 MB, you could try moving some inline scripts and CSS to external files.
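To judge how much an oversized page would shrink if its inline scripts and CSS were moved to external files, it can help to measure how many bytes those inline blocks take up. The following is a minimal sketch using Python's standard `html.parser` module. The class and function names are hypothetical, and it only counts the text inside `<style>` tags and `<script>` tags that have no `src` attribute.

```python
from html.parser import HTMLParser


class InlineByteCounter(HTMLParser):
    """Tally the bytes of inline <script> and <style> bodies in an HTML document."""

    def __init__(self):
        super().__init__()
        self.inline_bytes = 0
        self._inside = None  # name of the inline tag currently open, if any

    def handle_starttag(self, tag, attrs):
        # A <script src="..."> is already external, so only count scripts without src.
        if tag == "style" or (tag == "script" and "src" not in dict(attrs)):
            self._inside = tag

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        # Text outside inline <script>/<style> blocks is ignored.
        if self._inside:
            self.inline_bytes += len(data.encode("utf-8"))


def inline_bytes(html: str) -> int:
    """Return the number of UTF-8 bytes spent on inline script and style content."""
    counter = InlineByteCounter()
    counter.feed(html)
    counter.close()
    return counter.inline_bytes
```

If the result is a large share of the page's total size, externalizing those blocks is likely to be the quickest way to get the HTML itself under the limit.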
Googlebot drops any content after the first 15 MB and forwards only the first 15 MB to indexing. The limit applies to fetches made by Googlebot (Googlebot Smartphone and Googlebot Desktop) for file types supported by Google Search.
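The cutoff can be pictured as a simple slice of the response body. The sketch below is only an illustration, not Googlebot's actual implementation. It assumes that "15 MB" means 15 × 1024 × 1024 bytes, which the documentation update does not spell out, and the function name is hypothetical.

```python
# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes; the source does not say.
LIMIT_BYTES = 15 * 1024 * 1024


def split_at_limit(body: bytes, limit: int = LIMIT_BYTES) -> tuple[bytes, bytes]:
    """Return (kept, dropped): the part that fits within the limit and the remainder."""
    return body[:limit], body[limit:]


# Usage: a 16 MiB body loses its last 1 MiB.
kept, dropped = split_at_limit(bytes(16 * 1024 * 1024))
print(len(kept), len(dropped))
```

Anything that lands in the dropped part, such as content or links near the end of a very large page, would not be forwarded to indexing.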
No, the limit does not prevent Googlebot from seeing your images or videos. Googlebot fetches videos and images that are referenced in the HTML with a URL (for example, <img src="https://example.com/images/puppy.jpg" alt="cute puppy looking very disappointed" />) separately, with consecutive fetches.
Yes, data URIs contribute to the HTML file size, since their content is embedded in the HTML file itself.
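To see why, compare an image embedded as a data URI with the same image referenced by URL. The sketch below builds a base64 data URI from placeholder bytes; the helper name and the 300 kB stand-in image are assumptions made for illustration.

```python
import base64


def data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


raw_image = bytes(300_000)  # placeholder for a 300 kB image file

inline_tag = f'<img src="{data_uri(raw_image)}" alt="puppy" />'
linked_tag = '<img src="https://example.com/images/puppy.jpg" alt="puppy" />'

# The inline tag carries the whole image, plus roughly a third of base64 overhead.
print(len(inline_tag), len(linked_tag))
```

Because base64 encodes every 3 bytes as 4 characters, an inlined image adds more to the HTML than the image's own file size, while the linked version adds only the length of the tag.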
There are a number of ways to look up the size of a page, but the easiest is probably using your own browser and its Developer Tools. Load the page as you normally would, then launch the Developer Tools and switch to the Network tab. Reload the page, and you should see all the requests your browser had to make to render the page. The top request is the one you're looking for; its Size column shows the byte size of the page.
For example, in Chrome Developer Tools the top request might show 150 kB in the Size column.
Alternatively, to approximate how much data Googlebot downloads when it fetches a page, you can use cURL from a command line. Run the following command:
curl \
  -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36" \
  -so /dev/null https://example.com/puppies.html -w '%{size_download}'
Replace "https://example.com/puppies.html" with the URL of the page you want to check. If you have more questions about this process, you can find more information on Twitter and in the Search Central Forums. You can also leave feedback on the documentation pages themselves if you need more clarification about something.
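If you would rather script the check, for example to run it over several URLs, the cURL measurement can be approximated with Python's standard library. This is a minimal sketch, not an official tool. The function names are hypothetical, the user agent is copied from the cURL example, and the limit is assumed to be 15 × 1024 × 1024 bytes.

```python
import urllib.request

# Assumption: "15 MB" is read as 15 * 1024 * 1024 bytes.
LIMIT_BYTES = 15 * 1024 * 1024

# Same browser user agent string as in the cURL example.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)


def fetch_size(url: str, user_agent: str = USER_AGENT) -> int:
    """Download the URL and return the size of the response body in bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return len(response.read())


def report(size: int, limit: int = LIMIT_BYTES) -> str:
    """Describe a body size relative to the limit."""
    status = "over" if size > limit else "within"
    share = size / limit * 100
    return f"{size} bytes, {share:.1f}% of the limit ({status})"
```

Calling `report(fetch_size("https://example.com/puppies.html"))` returns a one-line summary for that page. Like the cURL command, this measures only the initial request, not the resources the page references.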