We used a data set previously described by Kloß et al. [14], which features common agents of urinary tract infection. It comprises a database of 11 important bacterial species:
Enterococcus faecalis (DSM 20478), Enterococcus faecium (DSM 20477), Staphylococcus epidermidis (DSM 20044), Staphylococcus haemolyticus (DSM 20263), Staphylococcus hominis (DSM 20328), Staphylococcus saprophyticus (DSM 20229), Staphylococcus aureus (ATCC 43300), two strains of Escherichia coli (DSM 10806 and ATCC 35218), Klebsiella pneumoniae (ATCC 700603), Pseudomonas aeruginosa (ATCC 27853), and
Proteus mirabilis (DSM 4479). The data set also contains an independent validation set and spectra of patient samples. In total, there were 2952 spectra for training the model, 514 independent spectra for validating it, and ten patient urine samples comprising 491 spectra. All strains were provided by the Institute of Medical Microbiology, Jena University Hospital, and were originally purchased from the German Collection of Microorganisms and Cell Cultures (DSMZ) and the American Type Culture Collection (ATCC), except for the isolates from patient samples. All Raman spectroscopic measurements were performed with the Raman microscope BioParticleExplorer (MicrobioID 0.5, RapID, Berlin, Germany) with 532 nm excitation. The spectral resolution was about 10 cm⁻¹ [14]. Although we were tempted to update the preprocessing to reflect more recent developments, the data set was preprocessed exactly as before to ensure comparability. This involved background correction with the statistics-sensitive nonlinear iterative peak-clipping (SNIP) algorithm [15], despiking using a robust variant of the upper-bound spectrum algorithm [16], and wavenumber calibration with acetaminophen [17]. After the spectra were cut to the ranges 450–1740 and 2610–3100 cm⁻¹, each spectrum consisted of 553 spectral points. The exact composition of the training and validation data sets is shown in Table 1.
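
To illustrate two of the preprocessing steps named above, the following is a minimal Python sketch of peak-clipping background estimation and of cutting a spectrum to the two wavenumber ranges. It is not the implementation used in [14] or [15]: the function names `snip_baseline` and `crop_to_ranges`, the iteration count of 40, and the synthetic test spectrum are assumptions made for illustration only. The sketch also omits the log-log-square-root transform that is often applied before clipping, as well as the despiking and wavenumber calibration steps.

```python
import numpy as np


def snip_baseline(intensities, iterations=40):
    """Estimate a smooth background by iterative peak clipping.

    Simplified SNIP-style sketch: for window half-widths p = 1..iterations,
    each interior point is replaced by the smaller of its current value and
    the mean of its two neighbours at distance p. The iteration count is a
    hypothetical default, not a value taken from the paper.
    """
    baseline = np.asarray(intensities, dtype=float).copy()
    n = baseline.size
    for p in range(1, iterations + 1):
        if 2 * p >= n:
            break  # window no longer fits inside the spectrum
        # Mean of the neighbours p points to the left and to the right.
        neighbour_mean = 0.5 * (baseline[:-2 * p] + baseline[2 * p:])
        # Clip peaks: keep whichever is lower at every interior point.
        baseline[p:-p] = np.minimum(baseline[p:-p], neighbour_mean)
    return baseline


def crop_to_ranges(wavenumbers, intensities,
                   ranges=((450.0, 1740.0), (2610.0, 3100.0))):
    """Keep only the points whose wavenumber lies in one of the ranges."""
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    mask = np.zeros(wavenumbers.shape, dtype=bool)
    for low, high in ranges:
        mask |= (wavenumbers >= low) & (wavenumbers <= high)
    return wavenumbers[mask], intensities[mask]


if __name__ == "__main__":
    # Synthetic spectrum (assumption): sloped background plus one narrow band.
    axis = np.linspace(300.0, 3200.0, 1200)
    index = np.arange(axis.size)
    raw = 50.0 + 0.01 * axis + 100.0 * np.exp(-0.5 * ((index - 400) / 3.0) ** 2)

    corrected = raw - snip_baseline(raw)
    cropped_axis, cropped = crop_to_ranges(axis, corrected)
    print(cropped_axis.size, "points remain after cropping")
```

The number of points that remain after cropping depends on the spacing of the wavenumber axis, so the synthetic axis here does not reproduce the 553 points of the real spectra.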