== Description ==

We trained 100-dimensional word embeddings using word2vec (Mikolov et al., 2013) on 116 million German sentences. word2vec was run in skip-gram mode with 5 negative samples for 10 full iterations over the input corpus.

The following six corpora, with a total of 116 million sentences, were used: German Wikipedia, the Leipzig Corpora Collection (Biemann et al., 2007), the SdeWaC corpus (Faaß and Eckart, 2013), the print archive of Spiegel (http://www.spiegel.de/spiegel/print/), the print archive of ZEIT (http://www.zeit.de/2014/index), and articles crawled from ZEIT Online (http://www.zeit.de).

Apart from tokenization, we performed the following pre-processing steps (see the usage sketches at the end of this file):

- Numbers are substituted by the special token 0.
- Diacritics are removed, except for German umlauts.
- All tokens are lowercased; the semantics of capitalization in German orthography can be captured by the capitalization feature (cf. section 4 in our paper).

Two special tokens are added to the vocab files: UNKNOWN and PADDING. The vectors for these two tokens were generated uniformly at random and can be found in the first two lines of each file.

Words with a count below a certain threshold were removed from the corpus, resulting in the following vocabulary sizes:

embeddings_5_min_count.vocab: 3,363,088 words
embeddings_50_min_count.vocab: 648,462 words
embeddings_100_min_count.vocab: 403,558 words

For further details see:

GermEval-2014: Nested Named Entity Recognition with Neural Networks
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/publikationen/2014/2014_GermEval_Nested_Named_Entity_Recognition_with_Neural_Networks.pdf

== Format ==

Each line of the .vocab files contains the token in the first position, followed by 100 floating-point numbers. The token and the individual entries of the vector are separated by a single space.

Using Python, you can read in the data via:

import gzip
import numpy as np

embeddings = []  # one 100-dimensional vector per vocabulary word
word2Idx = {}    # mapping of words to their row in the embeddings matrix

with gzip.open('embeddings.vocab.gz', 'rt', encoding='utf-8') as fIn:
    for idx, line in enumerate(fIn):
        split = line.strip().split(' ')
        embeddings.append(np.array([float(num) for num in split[1:]]))
        word2Idx[split[0]] = idx

embeddings = np.array(embeddings)  # shape: (vocabulary size, 100)

A sketch for looking up individual word vectors is given in the usage sketches at the end of this file.

== License ==

Feel free to distribute these word embeddings under the CC-BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

If you use these word embeddings in your research, please cite:

Nils Reimers, Judith Eckle-Kohler, Carsten Schnober, Jungi Kim, Iryna Gurevych: GermEval-2014: Nested Named Entity Recognition with Neural Networks. In: Workshop Proceedings of the 12th Edition of the KONVENS Conference, 2014.

@inproceedings{TUD-CS-2014-0973,
  author    = {Nils Reimers and Judith Eckle-Kohler and Carsten Schnober and Jungi Kim and Iryna Gurevych},
  title     = {GermEval-2014: Nested Named Entity Recognition with Neural Networks},
  booktitle = {Workshop Proceedings of the 12th Edition of the KONVENS Conference},
  editor    = {Gertrud Faa{\ss} and Josef Ruppenhofer},
  month     = oct,
  year      = {2014},
  pages     = {117--120},
  publisher = {Universit{\"a}tsverlag Hildesheim},
  address   = {Hildesheim},
  location  = {Hildesheim, Germany},
  url       = {http://www.uni-hildesheim.de/konvens2014/data/konvens2014-workshop-proceedings.pdf},
}
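
== Usage sketches ==

To match the vocabulary, input text has to be normalized in the same way as the training corpus (see the pre-processing steps in the Description). The following is a minimal sketch of such a normalization step; the function name and the exact number pattern are our own illustration, not part of the release:

import re
import unicodedata

def normalize_token(token):
    # Sketch of the pre-processing described above: lowercase, map
    # numbers to the special token 0, and strip diacritics while
    # keeping the German umlauts.
    token = token.lower()
    if re.fullmatch(r'\d+([.,]\d+)*', token):  # plain numbers -> token 0
        return '0'
    kept = []
    for ch in token:
        if ch in 'äöü':  # keep German umlauts as-is
            kept.append(ch)
        else:
            # Decompose the character and drop combining marks (diacritics).
            decomposed = unicodedata.normalize('NFD', ch)
            kept.append(''.join(c for c in decomposed
                                if not unicodedata.combining(c)))
    return ''.join(kept)

print(normalize_token('Café'))    # cafe
print(normalize_token('Müller'))  # müller
print(normalize_token('1.234'))   # 0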
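
Once the file has been read in as shown in the Format section, individual word vectors can then be retrieved as follows. The lookup helper is again our own sketch; it assumes the special tokens appear literally as UNKNOWN and PADDING in the first two lines of the vocab files, as described above:

def lookup(word, word2Idx, embeddings):
    # Fall back to the UNKNOWN vector for out-of-vocabulary tokens.
    idx = word2Idx.get(normalize_token(word), word2Idx['UNKNOWN'])
    return embeddings[idx]

vector = lookup('Deutschland', word2Idx, embeddings)
print(vector.shape)  # (100,)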