What's the correct format of a Java String regex to identify a DOI?

I am researching how to identify DOIs in free-format text.

I am using Java 8 and regular expressions.

I have found these regexes, which are supposed to fulfil my requirements:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i

The code I am trying is:

private static final Pattern pattern_one = Pattern.compile("/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i", Pattern.CASE_INSENSITIVE);

Matcher matcher = pattern_one.matcher("http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1");
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");
    System.out.println(matcher.group());
}

However, the matcher doesn't find anything.

Where have I gone wrong?

UPDATE

I have encountered a valid DOI that my set of regexes does not match.

Here's an example DOI: 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2

Why doesn't this pattern work?

/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
about 3 years ago · Santiago Trujillo
2 answers


Your pattern looks incorrect to me. You are currently using this:

/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i

But I think you intend to use this:

^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$

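Your pattern has several problems. It uses the /.../i delimiter-and-flag syntax from JavaScript (or a similar language), which Java treats as literal characters. It does not escape the dot, so . matches any character rather than a literal period. And the ^ anchor is out of place: in your input the DOI sits at the end of the URL, not at the start of the string.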

Code:

String pattern = "^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$";
String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(url);
if (m.find()) {
    System.out.println("Found value: " + m.group(0));
} else {
    System.out.println("NO MATCH");
}

Demo here:

Rextester

about 3 years ago · Santiago Trujillo


In Java, a regex is written as a String. In some other languages, such as JavaScript, a regex literal is delimited by /.../, with options like i given after the closing /. So what is written as /XXX/i elsewhere is written like this in Java:

// Using flags parameter
Pattern p = Pattern.compile("XXX", Pattern.CASE_INSENSITIVE);

// Using embedded flags
Pattern p = Pattern.compile("(?i)XXX");

In most languages, regexes are used to find a matching substring. Java can do that too, using the find() method (or any of the many replaceXxx() regex methods). However, Java also has the matches() method, which matches against the entire string, eliminating the need for the begin and end boundary matchers ^ and $.

Anyway, your problem is that the regex has both the ^ and $ boundary matchers, which means it will only work if the string is nothing but the text you want to match. Since you actually want to find a substring, remove those matchers.
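
To make the difference concrete, here is a minimal sketch comparing find() and matches() on the URL from the question. The class name FindVersusMatches, the helper methods, and the simplified DOI pattern are illustrative assumptions, not part of the original code.

```java
import java.util.regex.Pattern;

public class FindVersusMatches {
    // Simplified DOI pattern, assumed here for illustration only.
    static final String DOI = "10\\.\\d{4,9}/[-._;()/:A-Z0-9]+";

    // True if the regex matches somewhere inside the input (substring search).
    static boolean foundIn(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    // True only if the regex matches the input from its first to its last character.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        String bare = "10.1175/JPO3002.1";

        // Anchored with ^ and $: the URL does not start with "10.", so nothing is found.
        System.out.println(foundIn("^" + DOI + "$", url)); // false
        // Unanchored: find() locates the DOI inside the URL.
        System.out.println(foundIn(DOI, url));             // true
        // matches() behaves as if the pattern were anchored at both ends.
        System.out.println(matchesWhole(DOI, bare));       // true
        System.out.println(matchesWhole(DOI, url));        // false
    }
}
```

In short, use find() with an unanchored pattern to extract DOIs from longer text, and matches() when the whole input should be a DOI.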

To search for any one of multiple patterns, use the | alternation operator.

And finally, since a Java regex is given as a String literal, any character that is special in a string literal, most notably \, needs to be escaped. A regex \d therefore becomes "\\d" in Java source.
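
A related point is the dot itself: unescaped, it matches any character, so a literal period has to be written as "\\." in Java source. The following sketch shows the effect, assuming a shortened DOI prefix pattern and a made-up input string chosen only for this demonstration.

```java
import java.util.regex.Pattern;

public class EscapingDemo {
    // Unescaped dot: "." is a regex metacharacter that matches any character.
    static final Pattern LOOSE = Pattern.compile("10.\\d{4}/");
    // Escaped dot: the Java literal "\\." reaches the regex engine as \. (a literal period).
    static final Pattern STRICT = Pattern.compile("10\\.\\d{4}/");

    public static void main(String[] args) {
        String notADoi = "page 10x1175/abc"; // hypothetical text with 'x' where the period should be

        System.out.println(LOOSE.matcher(notADoi).find());        // true (false positive)
        System.out.println(STRICT.matcher(notADoi).find());       // false
        System.out.println(STRICT.matcher("10.1175/abc").find()); // true
    }
}
```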

So, to build a single regex that can find substrings matching any of the following:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i

You would write it like this:

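// Literal dots are escaped as \\. and the literal parentheses in the third pattern as \\( and \\).
String regex = "10\\.\\d{4,9}/[-._;()/:A-Z0-9]+" +
              "|10\\.1002/[^\\s]+" +
              "|10\\.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+\\.\\d+\\.\\w+;\\d" +
              "|10\\.1021/\\w\\w\\d++" +
              "|10\\.1207/[\\w\\d]+\\&\\d+_\\d+";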
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

String input = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println("Start index: " + m.start() +
                       " End index: " + m.end() +
                       " " + m.group());
}

Output

Start index: 37 End index: 54 10.1175/JPO3002.1
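
As for the DOI in the question's update, 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2, two things seem to be at play. In the pattern as posted, (\d+) is a capturing group rather than literal parentheses, so it cannot match the text "(2002)". And in an alternation, the first alternative that succeeds wins, so the general pattern can stop early at the < character. The sketch below illustrates both effects; the class name, constant names, and helper are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyDoiDemo {
    // General pattern: its character class does not include '<' or '>'.
    static final String GENERAL = "10\\.\\d{4,9}/[-._;()/:A-Z0-9]+";
    // Pattern as posted (dots escaped), with (\d+) acting as a capturing group.
    static final String LEGACY_UNESCAPED =
        "10\\.\\d{4}/\\d+-\\d+X?(\\d+)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+\\.\\d+\\.\\w+;\\d";
    // Same pattern with the parentheses escaped so they match literal '(' and ')'.
    static final String LEGACY =
        "10\\.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+\\.\\d+\\.\\w+;\\d";

    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String doi = "10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2";

        System.out.println(firstMatch(LEGACY_UNESCAPED, doi));       // null
        System.out.println(firstMatch(LEGACY, doi));                 // the full DOI
        // General alternative listed first: the match is cut off before '<'.
        System.out.println(firstMatch(GENERAL + "|" + LEGACY, doi)); // 10.1175/1520-0485(2002)032
        // Specific alternative listed first: the full DOI is matched.
        System.out.println(firstMatch(LEGACY + "|" + GENERAL, doi));
    }
}
```

If full matches of such DOIs matter, one option is to list the more specific alternatives before the general one.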
about 3 years ago · Santiago Trujillo