This is a free Windows application that uses AI to detect spam or viruses in a given email. It takes an e-mail (.eml) file and, using a previously trained ML model and some rules, detects whether it is spam or a virus. I trained the model on a public email dataset, but I also explain how to train it on your own emails to get the best results. This is the next version, the AI version, of my older virus scanner, so I recommend reading about it first. Written in .NET.
(use gear icon in video to switch to HD resolution)

Background
A few years ago I developed a virus scanner application that uses some heuristics to detect viruses in emails, as well as spam. However, over time I started getting a lot of spam emails, and the rules I used were no longer enough. So I decided to try Microsoft's machine learning library (Microsoft.ML) to train a model and then, based on that model, predict whether an incoming e-mail is spam or not. The application that relies on machine learning is called by my old application in cases when the heuristics could not confirm that an e-mail is spam or a virus. It returns 1 in case of virus/spam, 0 if the file is clean, and -1 if an error occurred.
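The hand-off described above (heuristics first, ML detector only when they are inconclusive, then a 1 / 0 / -1 result) can be sketched roughly as follows. This is an illustrative Python sketch, not the project's .NET code. It assumes the result is exposed as the process exit code; `detector_exe`, `Verdict`, `verdict_from_exit_code`, and `scan` are hypothetical names introduced only for this example.

```python
import subprocess
from enum import Enum


class Verdict(Enum):
    CLEAN = 0
    SPAM_OR_VIRUS = 1
    ERROR = -1


def verdict_from_exit_code(code: int) -> Verdict:
    """Map the documented return value (1, 0, -1) to a verdict."""
    # Exit codes can be reported as unsigned 32-bit values, so -1 may
    # arrive as 4294967295; fold it back to a signed value first.
    if code >= 2**31:
        code -= 2**32
    if code == 1:
        return Verdict.SPAM_OR_VIRUS
    if code == 0:
        return Verdict.CLEAN
    return Verdict.ERROR  # -1 and anything unexpected count as an error


def scan(eml_path: str, heuristics_flagged: bool, detector_exe: str) -> Verdict:
    """Trust the heuristics when they fire; otherwise ask the ML detector."""
    if heuristics_flagged:
        return Verdict.SPAM_OR_VIRUS
    try:
        # detector_exe is a hypothetical path to the console application,
        # which is assumed to take the .eml path as its single argument.
        completed = subprocess.run([detector_exe, eml_path], check=False)
    except OSError:
        return Verdict.ERROR  # the detector could not be started
    return verdict_from_exit_code(completed.returncode)
```

Treating every unexpected code as an error keeps the caller conservative: a file is only reported clean when the detector explicitly says so.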
Minimum System Requirements
- Windows 10 (or newer), Windows Server 2012-R2 (or newer)
- .NET 8.0
- 83 MB of disk space
License
- If you are using just the source code and/or training a model on your own dataset, then the MIT license applies:
MIT License
Copyright (c) 2025 F4CIO (spare-time projects only)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Download
GitHub
You can download the source code from my GitHub page. Feel free to contribute there by committing your improvements to this project.
Other Projects
See my other projects.
Video transcript
Hi, today I will show you how I used Microsoft's machine learning library to train a model and then detect spam emails. A few years ago I developed an application that uses some heuristics to detect viruses in emails, as well as spam. However, over time I started getting a lot of spam emails, and the rules I used were no longer enough. So I decided to try Microsoft's machine learning library (Microsoft.ML) to train a model and then, based on that model, predict whether an incoming e-mail is spam or not. The application that relies on machine learning is called by my old application in cases when the heuristics did not confirm that an e-mail is spam or a virus.
Here is the code for the old application: in case the rules did not result in spam or virus detection, it calls the new application. Here is the configuration for the old application, where a few settings were added, and this is the code for the new one. As the new application relies on a previously trained model, the code for spam detection is not complicated: it basically loads that model from the disk and then uses it to predict whether the e-mail is spam or not.
The challenge is not using a trained model, but finding and preparing a proper data set to train one. I first took this e-mail data set from them. It had a few thousand emails, some labeled as spam and some labeled as not spam (so-called ham), and it did result in a model, this one. However, when I tried it, I wasn't satisfied with the results: some emails were simply not detected properly. Then I decided to train a model on my own emails that I had received over the years, and I took a few hundred of them. The challenge is to properly label your own emails as spam or not spam, because we are talking about a few hundred emails here. However, you can use shortcuts to do that. One rule is that whenever you responded to an e-mail, it most probably wasn't spam.
So I took all e-mail addresses to which I had responded and marked the emails from those people as not spam. Because this program works with EML files, which are a standard file type for emails, you need to export your past emails into a folder. I use this one here. If you're using Thunderbird, there is an option to export selected emails into a folder, and they will be saved as EML files. Now, as I said, you need to label those emails as spam or not spam, so you can also export all emails that you responded to in order to mark them as not spam. In Thunderbird you can take your address book, go to Collected Addresses, and export those addresses as a comma-separated values file. Those people were definitely not spammers. That list will then help you move all your exported emails into either the ham or the spam folder.
Another rule you can use, besides the addresses of people to whom you responded, is to collect spam phrases over time, like "luxury watch" or "get rich quick", and then search all emails to find which ones contain those phrases. Of course, I did not do this manually. I wrote some routines to do this work automatically or semi-automatically, and you could use them as well. In the end, my emails ended up in either the spam or the ham folder.
Here are the routines that helped me prepare my emails for training. There are several steps, and in an ideal world you would build a process pipeline from those steps so you can run them in one click and repeat model training multiple times with some adjustments until you're satisfied with the results. And here is the method that does the actual training.
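The two labeling shortcuts described above (replied-to addresses mean ham, known spam phrases mean spam) could look roughly like this. It is a minimal Python sketch using only the standard library, not the author's .NET routines. The function names, the extra "unsure" bucket for manual review, and the choice to match only the From address are assumptions made for the example.

```python
import csv
import io
from email import message_from_string
from email.message import Message
from email.utils import parseaddr

SPAM_PHRASES = ("luxury watch", "get rich quick")  # the two examples from the talk


def load_known_addresses(csv_text: str) -> set[str]:
    """Collect every address-looking cell from an exported address book CSV."""
    known = set()
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            cell = cell.strip().lower()
            if "@" in cell:
                known.add(cell)
    return known


def body_text(msg: Message) -> str:
    """Concatenate the decoded text parts of a (possibly multipart) message."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


def label(eml_text: str, known: set[str]) -> str:
    """Return 'ham', 'spam', or 'unsure' for the raw text of one .eml file."""
    msg = message_from_string(eml_text)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in known:
        return "ham"  # we replied to this person before
    haystack = (msg.get("Subject", "") + "\n" + body_text(msg)).lower()
    if any(phrase in haystack for phrase in SPAM_PHRASES):
        return "spam"
    return "unsure"  # assumption: leave the rest for manual review
```

A driver script would read each exported .eml file, call `label`, and move the file into the matching folder. Checking the address rule before the phrase rule means a trusted correspondent who happens to mention a spam phrase still ends up in ham.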
When I ran it against around 1500 emails, it took less than 10 minutes and resulted in a zip file, which I saved. This line here does the actual training of the model. The important thing is to choose the right training algorithm, and in this case it wasn't a big challenge. As we are working with labeled data, for which we knew whether each e-mail was spam or ham, we are working with supervised learning. And as we expect the results to fall into one of several categories (in this case two), we are talking about classification, which is why I chose binary classification. I could experiment more with different algorithms and compare their results; however, I was satisfied with the result I got from this one.
The usual practice is to take your data set and set aside a percentage of the items for testing, sometimes around 15% of the total data set, then automate the whole process and measure the results. However, as I said, I just used my trained model, and over the next few days I was getting proper results, the expected behavior, from this application.
And here is the model that I got by training on my own emails. However, I did not include this model with the source code because it contains some personal information. So I included models trained on this dataset, so be aware of the license. As you can see, every such model contains some personal information, for example in this Vocabulary file: there are some emails in there.
The methods that helped me prepare my emails are implemented as test cases, so you can simply run them in Visual Studio one by one, as well as the training of the model itself. However, as I said, you could automate this and create pipelines for those processes.
And here is the execution. I have one simple spam e-mail that I'm using.
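The hold-out practice mentioned above (setting aside roughly 15% of the labeled emails for testing and measuring the result) can be illustrated with a small Python sketch. It is independent of Microsoft.ML and is not the author's training code; `holdout_split`, `accuracy`, and the `predict` callable are hypothetical names standing in for whatever model is being evaluated.

```python
import random
from typing import Callable, Sequence


def holdout_split(items: Sequence, test_fraction: float = 0.15, seed: int = 0):
    """Shuffle the items and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(predict: Callable[[str], bool],
             labeled: Sequence[tuple[str, bool]]) -> float:
    """Fraction of (text, is_spam) pairs that predict() classifies correctly."""
    if not labeled:
        raise ValueError("no test items")
    correct = sum(1 for text, is_spam in labeled if predict(text) == is_spam)
    return correct / len(labeled)
```

The model would be trained on the first list only and scored on the second, so the measured accuracy reflects emails it has not seen. With a spam/ham imbalance, accuracy alone can be misleading, so precision and recall are worth tracking as well.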
Note that I specified that sample spam e-mail in the debug options, so when you run this application it will target that file. Basically, I told it to check this file, and you can see the log of the execution. It tried some heuristics first; however, it had to call our new application, which loaded the model from the disk and predicted whether the e-mail is spam or not, and it saved that spam in our virus folder.
One thing to be careful about is to train your model on a data set of emails that are all in the same format. For that purpose I wrote some routines to strip out HTML from my emails, practically to simplify them, and they look something like this: for every e-mail I trained on, I have basic information plus the pure text content of the e-mail. As you can see, it's not perfect. There are still some HTML tags, and in the future I could strip them out as well. But the important thing is that all my emails are pre-formatted before being put into the training dataset.
To give you a little bit of info about the solution structure: all the magic happens here, so the training and prediction are implemented in the engine project. This spam detector app is just a console application, a shell for calling the engine. You specify practically a single parameter, the path to the .EML file, and it outputs either one or zero depending on the result. And as I said, the old application calls the new application. The SelfishBuidingBlocks library is a library with some helper methods; I explained it earlier. FodyWeavers is a NuGet package that I included in every project because it tells the compiler to produce a smaller number of DLLs in the output.
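The HTML-stripping step mentioned above, which brings every e-mail into the same plain-text form before training, might be sketched as follows. This Python example uses only the standard library's `html.parser` and is an assumption about one reasonable approach, not the author's .NET routine; `html_to_text` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the pieces and collapse all runs of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Because the text pieces are joined with spaces, a word split by inline markup comes out as two tokens. That is a simplification that seems acceptable when the goal is only a consistent format across the training set.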