Data sources

Wikipedia

mlstatpy.data.wikipedia.download_dump(country, name, folder='.', unzip=True, timeout=-1, overwrite=False)

Downloads Wikipedia dumps.

Parameters:
  • country – country

  • name – name of the stream to download

  • folder – where to download

  • unzip – unzip the file

  • timeout – timeout

  • overwrite – overwrite the file if it already exists

mlstatpy.data.wikipedia.download_pageviews(dt, folder='.', unzip=True, timeout=-1, overwrite=False)

Downloads Wikipedia page counts for a precise date (down to the hour). The URL follows the pattern:

https://dumps.wikimedia.org/other/pageviews/%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz

Parameters:
  • dt – datetime of the hour to download

  • folder – where to download

  • unzip – unzip the file

  • timeout – timeout

  • overwrite – overwrite the file if it already exists

Returns:

filename

More information is available on the pageviews page.
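The URL pattern documented for download_pageviews uses `strftime` directives, so the address for a given hour can be sketched with the standard library alone. The helper name `pageviews_url` is hypothetical and is not part of mlstatpy; only the pattern string is taken from the description of download_pageviews. Nothing is downloaded here.

```python
from datetime import datetime

# Pattern copied from the download_pageviews description.
PATTERN = (
    "https://dumps.wikimedia.org/other/pageviews/"
    "%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz"
)


def pageviews_url(dt: datetime) -> str:
    """Hypothetical helper: expand the documented pattern for one hour."""
    # Minutes and seconds of dt are ignored; the pattern hard-codes "0000".
    return dt.strftime(PATTERN)


url = pageviews_url(datetime(2016, 5, 3, 14, 37))
print(url)
```

Because the pattern fixes minutes and seconds to zero, any datetime within the same hour maps to the same file.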

mlstatpy.data.wikipedia.download_titles(country, folder='.', unzip=True, timeout=-1, overwrite=False)

Downloads Wikipedia titles from dumps.wikimedia.org/frwiki/latest/latest-all-titles-in-ns0.gz.

Parameters:
  • country – country

  • folder – where to download

  • unzip – unzip the file

  • timeout – timeout

  • overwrite – overwrite

mlstatpy.data.wikipedia.enumerate_titles(filename, norm=True, encoding='utf8')

Enumerates titles from a file.

Parameters:
  • filename – filename

  • norm – normalize in the function

  • encoding – encoding
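To illustrate what enumerating titles from a file can look like, here is a minimal sketch that yields one title per non-empty line. The function `iter_titles` is a hypothetical stand-in, not the library's implementation, and the normalization it applies (underscores replaced by spaces, lowercasing) is an assumption, since the rule used by enumerate_titles is not documented here.

```python
import os
import tempfile
from typing import Iterator


def iter_titles(filename: str, norm: bool = True,
                encoding: str = "utf8") -> Iterator[str]:
    """Hypothetical stand-in: yield one title per non-empty line."""
    with open(filename, "r", encoding=encoding) as f:
        for line in f:
            title = line.strip()
            if not title:
                continue  # skip blank lines
            if norm:
                # Assumed normalization; the library may do something else.
                title = title.replace("_", " ").lower()
            yield title


# Small demonstration on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titles.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("Alan_Turing\n\nParis\n")
    titles = list(iter_titles(path))
    raw_titles = list(iter_titles(path, norm=False))

print(titles)
print(raw_titles)
```

Writing the function as a generator keeps memory use flat even for a title file with millions of lines.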