Preferred file types

Data at the Language Bank of Finland is usually distributed in the file types and formats described below. We accept that incoming data will not always be in our preferred formats, but publication will be faster if the data is already fully or partially in them.


Text

  • UTF-8 for text encoding; combining character sequences are normalized to single precomposed characters where possible
  • VRT
  • PDF for rendered text (will be provided alongside the original format where possible)
  • JSON and TEI are in preparation (will be generated from VRT)
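
The normalization mentioned in the UTF-8 item can be illustrated with a minimal sketch. It assumes that "normalized to single characters" corresponds to Unicode Normalization Form C (NFC), which the list above does not state explicitly; the function name to_nfc_utf8 is hypothetical and used only for illustration.

```python
import unicodedata


def to_nfc_utf8(text: str) -> bytes:
    """Compose combining sequences where possible, then encode as UTF-8.

    Assumption: NFC is the intended normalization form.
    """
    normalized = unicodedata.normalize("NFC", text)
    return normalized.encode("utf-8")


# 'a' followed by U+0308 (combining diaeresis) has a precomposed form, U+00E4.
decomposed = "a\u0308iti"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 5 4

# 'q' + U+0308 has no precomposed form, so NFC leaves the sequence unchanged.
print(unicodedata.normalize("NFC", "q\u0308") == "q\u0308")  # True
```

The second case shows why the normalization only applies "where possible": a combining sequence without a precomposed counterpart stays as it is.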

Time-aligned annotations of audio and video

  • Praat TextGrid (text/praat-textgrid)
  • EAF / ELAN (Eudico Annotation Format, IMDI document type: text/x-eaf+xml; MIME type: text/xml)


Audio

  • WAV for uncompressed audio
  • AAC for compressed audio
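
Before submitting audio, it may help to confirm that a file really is a WAV that standard tools can read. The following is only a sketch using Python's standard wave module, not a description of any validation the Language Bank performs; the function name describe_wav is hypothetical.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic header fields of a WAV file.

    Raises wave.Error if the file is not a WAV that the standard
    library module can read (for example, a compressed variant).
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

Calling describe_wav("recording.wav") on a readable file returns a small dictionary of header values; the file name here is a placeholder.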


Video

  • MP4 (MPEG-4) for compressed video