The Earth’s population was about 7.9 billion people as of February 14, 2022.
You have to start with some assumptions:
- You’re using UTF-8 for your encoding, which can represent names in every script with a single encoding. UTF-8 uses 1 to 4 bytes per character, with most scripts needing 1 to 3 bytes.
- You are storing names in their native character sets: Hanzi for Chinese names, Devanagari/Punjabi/Tamil/etc for Indian names, Arabic for Arab names, Roman for European names, etc.
- You’re using a variable-width representation.
- You allow arbitrary numbers of “names”, as people can have anywhere from 1 to a dozen or more names in some parts of the world.
- You care about storage efficiency, so you aren’t going to use a verbose representation such as XML.
- You do NOT care about lookup speed. You can add indexes later to speed up lookups.
- You may need some metadata to indicate where the given name, surname, and other “extra” names happen to be.
- You may be able to do lots of optimizations, particularly with short Asian names; Chinese names, in particular, are rarely more than 4 characters for the entire name when written in Hanzi.
- At the other extreme are formal Spanish names, which can have several “parts”.
- Some names, such as some South Indian names, are quite long, so you have to allow for variable-length name representation.
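As a quick sanity check on the UTF-8 assumptions above, here’s a short Python snippet (the example names are just illustrative, not drawn from any dataset):

```python
# UTF-8 byte counts for example names in a few different scripts
names = {
    "Latin":      "María",   # mostly 1 byte/char; accented letters take 2
    "Hanzi":      "王伟",     # CJK characters take 3 bytes each
    "Devanagari": "अमित",     # Indic scripts take 3 bytes per character
    "Arabic":     "محمد",     # Arabic letters take 2 bytes each
}

for script, name in names.items():
    encoded = name.encode("utf-8")
    print(f"{script}: {len(name)} chars -> {len(encoded)} bytes")
```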
So, let’s say we have a representation such as
<nbytes:1 byte int><bytes>
for each name element. (Why length-prefixed? It’s easier to parse a length-prefixed string than to scan a string for a magic terminator character, which is why I’ve never liked NUL-terminated C strings for high-speed, very large data storage applications.)
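To illustrate the parsing advantage, here’s a minimal Python sketch of reading such length-prefixed fields (the helper name and sample data are my own, not part of any spec):

```python
def parse_length_prefixed(buf, offset=0):
    """Parse one <nbytes:1 byte><bytes> field; return (value, next_offset)."""
    nbytes = buf[offset]          # single length byte, no scanning needed
    start = offset + 1
    return buf[start:start + nbytes], start + nbytes

# Two name elements packed back-to-back
data = bytes([4]) + b"Anna" + bytes([5]) + b"Smith"
first, off = parse_length_prefixed(data)
second, off = parse_length_prefixed(data, off)
print(first, second)
```

Note that each field is parsed in O(1) for the length plus a single slice; there is no byte-by-byte search for a terminator.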
So, a reasonable representation for each name is something like
<nbytes-in-full-name:1 byte int>
This is the total number of bytes in the name, including metadata.
<nfields-in-name: 1 byte int>
This is the total number of “fields” in the name. Most English names would have three fields, but some languages have only one name, while others may have several fields. So, it’s best to allow for arbitrary numbers of names, up to 255…
<name-language:2 byte int>
The language the name is from. This would be useful for determining which of the fields is the given name, the surname, and other parts of the name.
The actual names…
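Putting the layout above together, here’s a rough Python sketch (the function names and the example language code are my own invention, and the byte accounting follows my reading of “including metadata”):

```python
import struct

def encode_name_record(lang_code, fields):
    """Pack one record: <nbytes:1><nfields:1><lang:2>, then
    <len:1><utf-8 bytes> for each name field."""
    body = struct.pack(">H", lang_code)          # 2-byte language field
    for field in fields:
        raw = field.encode("utf-8")
        body += bytes([len(raw)]) + raw          # length-prefixed name element
    total = 2 + len(body)                        # nbytes + nfields + body
    return bytes([total, len(fields)]) + body

def decode_name_record(buf):
    total, nfields = buf[0], buf[1]
    (lang_code,) = struct.unpack_from(">H", buf, 2)
    fields, off = [], 4
    for _ in range(nfields):
        n = buf[off]
        fields.append(buf[off + 1:off + 1 + n].decode("utf-8"))
        off += 1 + n
    return lang_code, fields

rec = encode_name_record(0x0409, ["Ada", "Lovelace"])
print(len(rec), decode_name_record(rec))
```

A two-field Latin-script name like this one comes out at 17 bytes: 4 bytes of metadata plus a length byte and the UTF-8 bytes for each field.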
My guess is that 64 bytes per name, including metadata, is a good upper bound, particularly since roughly 1.4 billion of the 7.9 billion people have Chinese names that could be represented in about 18 bytes or less in our format. Even though Hanzi take 3 bytes per character in UTF-8, Chinese names are at most 4 characters, and usually 2 or 3.
So, at 64 bytes per name * 7.9 billion people, you end up with 505,600,000,000 bytes, or a bit over 500 GB.
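The arithmetic, using the 7.9 billion population figure from the top of the post:

```python
bytes_per_name = 64              # estimated upper bound, including metadata
population = 7.9e9               # world population, early 2022
total_bytes = bytes_per_name * population
total_gb = total_bytes / 1e9     # decimal gigabytes
print(total_gb)
```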
This would fit in RAM in many newer database servers.
Note that indexing the name-list structure would add a whole bunch of extra bytes to the representation…
(Sure, storing this in RAM is kinda dumb. But I interpreted the question to be how many bytes – in RAM or a disk file – would it take to reasonably store all human names, and this is how I answered the question.)
Update: if you normalised for redundant names – or just used a good lossless compression algorithm – you’d probably get a pretty impressive space reduction. This data would be ideal for dictionary-based compression algorithms…
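To get a feel for how well highly repetitive name data compresses, here’s a quick experiment with Python’s zlib (the name list is a made-up stand-in for real data, so the exact ratio is illustrative only):

```python
import zlib

# Simulate a name list with heavy repetition, since a handful of
# given names and surnames cover a large share of any population.
common = [b"James", b"Maria", b"Mohammed", b"Wei", b"Garcia", b"Smith"]
data = b"\n".join(common[i % len(common)] for i in range(100_000))

compressed = zlib.compress(data, level=9)
print(f"{len(data)} -> {len(compressed)} bytes "
      f"({len(compressed) / len(data):.1%} of original)")
```

zlib also supports a preset dictionary via the `zdict` argument to `zlib.compressobj`, which is exactly the dictionary-based approach suggested above: seed the dictionary with the most common names and even unrepeated records compress well.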