Given this code:
// Decode the text string
string test = "Version 21.1.0 - 2021 Edition (22nd March 2021)";
string[] textitems = test.Split(' ');
// The text should split down like this:
// [0] Version
// [1] 21.1.0
// [2] -
// [3] 2021
// [4] Edition
// [5] (22nd
// [6] March
// [7] 2021)
I have created an enum to use:
enum UpdateInfo
{
    Version = 1,
    Edition = 3,
    Day = 5,
    Month = 6,
    Year = 7
}
The information I am interested in is the version, the edition and the date. Version and Edition are straightforward:
writer.WriteAttributeString("Version", textitems[(int)UpdateInfo.Version]);
writer.WriteAttributeString("Edition", textitems[(int)UpdateInfo.Edition]);
But the Date is not. I found out that I can't parse (e.g.):
(22nd March 2021)
I want the short date so I have come up with the following code after doing research:
// Rebuild date as short date
// Day - strip off "(" and "st", "nd", "rd" or "th"
string day = string.Empty;
for (int i = 0; i < textitems[(int)UpdateInfo.Day].Length; i++)
{
    if (Char.IsDigit(textitems[(int)UpdateInfo.Day][i]))
        day += textitems[(int)UpdateInfo.Day][i];
}
// Rebuilt long date
string datetest = day + " " + textitems[(int)UpdateInfo.Month] + " " + textitems[(int)UpdateInfo.Year];
// Remove trailing ")"
datetest = datetest.Trim(')');
// Now we can parse the long date string
DateTime date = DateTime.ParseExact(datetest, "d MMMM yyyy", CultureInfo.InstalledUICulture, DateTimeStyles.None);
if (date != null)
writer.WriteAttributeString("Date", date.ToShortDateString());
Is there a simpler way to achieve the same result without bloating the code?
Note:
<p class="rvps2">
<img alt="New Version Icon"
style="vertical-align: middle; padding : 1px; margin : 0px 5px;"
src="lib/IMG_NewVersion.png">
<span class="rvts16">Version 21.1.0 - 2021 Edition</span>
<span class="rvts15"> (22nd March 2021)</span>
</p>
So I actually have an HtmlNode (the p element).
I would not split by spaces; there are too many. I would split by "-" and then use a regex to extract the date part. Then it's easy with TryParseExact and a format such as d'nd' MMMM yyyy:
string[] textitems = test.Split('-');
string version = textitems[0].Trim();
string edition = textitems[1].Substring(0, textitems[1].IndexOf("(")).Trim();
string dateStr = Regex.Match(textitems[1], @"\(([^)]*)\)").Groups[1].Value;
string[] formats = { "d'st' MMMM yyyy", "d'nd' MMMM yyyy" };
bool validDate = DateTime.TryParseExact(dateStr, formats, CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime date);
I have also added d'st' MMMM yyyy, since I can imagine that this would be your next issue. Another option is to include the brackets in the format: "'('d'nd' MMMM yyyy')'".
You might want to add some code to validate the input first; I have omitted that.
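As a rough sketch of the bracketed-format option, the following extends the idea to all four ordinal suffixes. The class and method names (OrdinalDateDemo, TryParseBracketedDate) are made up for illustration, and InvariantCulture is an assumption that the month names are always English.

```csharp
using System;
using System.Globalization;

public static class OrdinalDateDemo
{
    // Hypothetical helper: parses text such as "(22nd March 2021)".
    // The brackets and the ordinal suffix are quoted literals in the format,
    // so no trimming of "(" / ")" or suffix stripping is needed beforehand.
    public static bool TryParseBracketedDate(string text, out DateTime date)
    {
        string[] formats =
        {
            "'('d'st' MMMM yyyy')'",
            "'('d'nd' MMMM yyyy')'",
            "'('d'rd' MMMM yyyy')'",
            "'('d'th' MMMM yyyy')'"
        };

        // Assumption: English month names, hence InvariantCulture.
        return DateTime.TryParseExact(text.Trim(), formats,
            CultureInfo.InvariantCulture, DateTimeStyles.None, out date);
    }

    public static void Main()
    {
        if (TryParseBracketedDate(" (22nd March 2021)", out DateTime date))
            Console.WriteLine(date.ToString("yyyy-MM-dd"));
        else
            Console.WriteLine("Not a recognised date");
    }
}
```

Note that this does not check that the suffix agrees with the day, so "(22th March 2021)" would also be accepted.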
For this I wouldn't even bother with splitting the text. You can do this with a regular expression and named groups.
string test = "Version 21.1.0 - 2021 Edition (22nd August 2021)";
var regex = new Regex(@"Version (?'version'[\d.]+) - (?'edition'\d+) Edition \((?'date'[^)]+)", RegexOptions.None);
var matches = regex.Matches(test);
var version = matches[0].Groups["version"].Value;
var edition = matches[0].Groups["edition"].Value;
var dateString = matches[0].Groups["date"].Value;
// remove date ordinal before parsing
dateString = Regex.Replace(dateString, @"^(\d+)(st|nd|rd|th)", "$1");
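// "d" accepts one- or two-digit days; "dd" would reject a single-digit day such as "1 August 2021"
var date = DateTime.ParseExact(dateString, "d MMMM yyyy", CultureInfo.CurrentCulture);
date.ToShortDateString().Dump(); // Dump() is a LINQPad extension method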
Normally I would use TryParseExact and handle a failed parse properly, instead of letting ParseExact throw.
You can get an explanation of the main regular expression here: https://regex101.com/r/Nzpa5h/1
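A minimal sketch of that more defensive variant might look like the following. UpdateLineParser and TryParse are hypothetical names, and InvariantCulture is an assumption that the month names are always English.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UpdateLineParser
{
    private static readonly Regex LinePattern = new Regex(
        @"Version (?'version'[\d.]+) - (?'edition'\d+) Edition \((?'date'[^)]+)\)");

    // Hypothetical helper: returns false instead of throwing when the text
    // does not match the pattern or the date part cannot be parsed.
    public static bool TryParse(string text, out string version, out string edition, out DateTime date)
    {
        version = string.Empty;
        edition = string.Empty;
        date = default(DateTime);

        Match match = LinePattern.Match(text);
        if (!match.Success)
            return false;

        version = match.Groups["version"].Value;
        edition = match.Groups["edition"].Value;

        // Remove the ordinal suffix ("22nd" becomes "22") before parsing.
        string dateString = Regex.Replace(match.Groups["date"].Value, @"^(\d+)(st|nd|rd|th)", "$1");

        // Assumption: English month names, hence InvariantCulture.
        return DateTime.TryParseExact(dateString, "d MMMM yyyy",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out date);
    }

    public static void Main()
    {
        string test = "Version 21.1.0 - 2021 Edition (22nd August 2021)";
        if (TryParse(test, out string version, out string edition, out DateTime date))
            Console.WriteLine(version + " / " + edition + " / " + date.ToString("yyyy-MM-dd"));
        else
            Console.WriteLine("Unexpected input: " + test);
    }
}
```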
I have come up with a solution that combines both approaches. Since the original data is actually an HtmlNode (as indicated at the bottom of the question) and is already split into two span elements, I decided to do it this way:
// The paragraph element should only have two "span" elements
var listSpan = itemParagraph.Descendants("span");
if (listSpan != null)
{
    if (listSpan.Count() == 2)
    {
        // The first "span" element should contain: Version 21.1.0 - 2021 Edition
        var regex = new Regex(@"Version (?'version'[\d.]+) - (?'edition'\d+) Edition", RegexOptions.None);
        var matches = regex.Matches(listSpan.ElementAt(0).InnerText.Trim());
        writer.WriteStartElement("Update");
        writer.WriteAttributeString("Version", matches[0].Groups["version"].Value);
        writer.WriteAttributeString("Edition", matches[0].Groups["edition"].Value);

        // The second "span" element should contain: eg. (22nd March 2021)
        string dateString = listSpan.ElementAt(1).InnerText.Trim(' ', '(', ')');
        string[] formats =
        {
            "d'st' MMMM yyyy",
            "d'nd' MMMM yyyy",
            "d'rd' MMMM yyyy",
            "d'th' MMMM yyyy"
        };
        if (DateTime.TryParseExact(dateString,
            formats, CultureInfo.CurrentUICulture, DateTimeStyles.None, out DateTime dateRevision))
        {
            writer.WriteAttributeString("Date", dateRevision.ToShortDateString());
        }
    }
}
I admit that I do not quite follow how this bit of code actually works:
var regex = new Regex(@"Version (?'version'[\d.]+) - (?'edition'\d+) Edition", RegexOptions.None);
var matches = regex.Matches(listSpan.ElementAt(0).InnerText.Trim());
The above code is modified from one of the supplied answers. But it works. :)
I decided to construct the date using the accepted answer approach as I understand what it is doing, as opposed to the regex suggestion.
@phuzi maybe you could add some explanations or pointers to flesh out your answer concerning the regex syntax?
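On how the pattern works, here is a rough breakdown as a sketch. NamedGroupDemo and TryExtract are hypothetical names used only for illustration; the pattern itself is the one from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // The pattern, piece by piece:
    //   "Version "            literal text that must be present
    //   (?'version'[\d.]+)    group named "version": one or more digits or dots
    //   " - "                 literal separator
    //   (?'edition'\d+)       group named "edition": one or more digits
    //   " Edition"            literal text
    private static readonly Regex Pattern =
        new Regex(@"Version (?'version'[\d.]+) - (?'edition'\d+) Edition");

    // Hypothetical helper: returns false when the text does not match.
    public static bool TryExtract(string text, out string version, out string edition)
    {
        Match match = Pattern.Match(text);
        version = match.Groups["version"].Value; // empty string when there is no match
        edition = match.Groups["edition"].Value;
        return match.Success;
    }

    public static void Main()
    {
        if (TryExtract("Version 21.1.0 - 2021 Edition", out string version, out string edition))
            Console.WriteLine(version + " / " + edition);
    }
}
```

Each (?'name'...) part captures whatever its inner pattern matched, and Groups["name"].Value reads it back. Using Matches(...)[0] as in the answer retrieves the same first match, but it throws if the text does not match at all, which is why the sketch checks match.Success instead.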