@Felthry to start, it needs to be more than 40 kHz because of the Nyquist-Shannon sampling theorem (to reproduce a given frequency, you need a sampling rate of more than twice that frequency, and human hearing goes up to around 20 kHz)
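a rough python sketch of what goes wrong below that limit: a tone above half the sample rate folds back down (aliases) to a lower frequency. `alias_frequency` is a made-up helper for illustration, and it assumes ideal sampling of a pure tone with no anti-aliasing filter in front.

```python
import math


def alias_frequency(f_hz, sample_rate_hz):
    """Apparent frequency of a pure tone at f_hz after ideal sampling."""
    # Remove the nearest integer multiple of the sample rate; the
    # remainder is the frequency the samples actually represent.
    nearest_multiple = math.floor(f_hz / sample_rate_hz + 0.5)
    return abs(f_hz - nearest_multiple * sample_rate_hz)


# 20 kHz tone sampled at 44.1 kHz: below half the sample rate, so it survives.
print(alias_frequency(20_000, 44_100))

# 20 kHz tone sampled at 30 kHz: above half the sample rate (15 kHz), so it folds down.
print(alias_frequency(20_000, 30_000))
```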
as for the specific number, it seems to come from the early digital-audio solution of recording onto existing analog video cassettes with a VCR. with NTSC video you have 60 fields per second, and if you record 3 samples per video scanline at 245 active lines per field, you get a sample rate of 3*245*60 = 44100 Hz. you can use PAL video the same way: it runs at 50 fields per second, so with 294 active lines per field you get 3*294*50 = 44100 Hz as well
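here's that arithmetic as a quick python sketch. `sample_rate` and its parameter names are made up for illustration, and the 294-line PAL figure is taken from the 44,100 Hz wikipedia article rather than from anything specific to a particular recorder.

```python
def sample_rate(samples_per_line, active_lines_per_field, fields_per_second):
    """Audio samples per second stored in a video signal."""
    return samples_per_line * active_lines_per_field * fields_per_second


# NTSC: 60 fields per second, 245 active lines per field
ntsc = sample_rate(3, 245, 60)

# PAL: 50 fields per second, 294 active lines per field
pal = sample_rate(3, 294, 50)

print(ntsc, pal)
```

both video standards land on the same rate, which is the point: the same audio sample rate works with either kind of video gear.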
https://en.wikipedia.org/wiki/44,100_Hz and https://en.wikipedia.org/wiki/PCM_adaptor are good related reading