The fastest way to remove non-numeric characters from VARCHAR in SQL Server

I am writing an import utility that uses phone numbers as a unique key in an import.

I need to check that the phone number does not yet exist in my DB. The problem is that phone numbers in the database can have things like dashes and brackets, and possibly other things. I wrote a function to delete these things, the problem is that it is slow and with thousands of records in my database and thousands of records to import right away, this process can be unacceptably slow. I already made the column number of the phone number index.

I tried using the script from this entry:
T-SQL trim & nbsp (and other non-alphanumeric characters)

But that did not speed him up.

Is there a faster way to remove non-numeric characters? Something that may work well when you need to compare 10,000 to 100,000 records.

Whatever is done must be completed quickly .

Update
Given that people have responded, I think I will have to clear the fields before running the import utility.

To answer the question of what I am writing in the import utility, this is a C # application. Right now I am comparing BIGINT with BIGINT and you do not need to modify the DB data, and I am still doing a performance hit with a very small data set (about 2000 records).

Can comparing BIGINT with BIGINT slow down?

I optimized the code part of my application as much as I could (deleted regular expressions, deleted unnecessary database calls). Although I can no longer isolate SQL as the source of the problem, I still feel that it is.

+59
performance optimization sql sql-server
Sep 19 '08 at 22:42
source share
15 answers

I may misunderstand, but you have two sets of data to delete rows from one for the current data in the database, and then a new set when importing.

To update existing records, I would just use SQL, which should happen only once.

However, SQL is not optimized for this kind of operation, since you said that you were writing an import utility, I would make these updates in the context of the import utility itself, and not in SQL. It will be much better. What do you write the utility?

In addition, I can completely misunderstand this process, so I apologize if it is outside the database.

Edit:
For an initial upgrade, if you are using SQL Server 2005, you can try the CLR function. Here is quick using regex. Not sure how performance compares, I never used it myself, except for a quick test right now.

using System; using System.Data; using System.Text.RegularExpressions; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server; public partial class UserDefinedFunctions { [Microsoft.SqlServer.Server.SqlFunction] public static SqlString StripNonNumeric(SqlString input) { Regex regEx = new Regex(@"\D"); return regEx.Replace(input.Value, ""); } }; 

After deploying for an upgrade, you can simply use:

 UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber) 
+15
Sep 19 '08 at 22:46
source share

I saw this solution with T-SQL code and PATINDEX. I like it: -)

 CREATE Function [fnRemoveNonNumericCharacters](@strText VARCHAR(1000)) RETURNS VARCHAR(1000) AS BEGIN WHILE PATINDEX('%[^0-9]%', @strText) > 0 BEGIN SET @strText = STUFF(@strText, PATINDEX('%[^0-9]%', @strText), 1, '') END RETURN @strText END 
+98
Jun 30 2018-11-11T00:
source share

replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(string,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')*1 AS string ,

:)

+35
Jul 30 '13 at 14:59
source share

If you do not want to create a function or you need only one built-in call in T-SQL, you can try:

 set @Phone = REPLACE(REPLACE(REPLACE(REPLACE(@Phone,'(',''),' ',''),'-',''),')','') 

Of course, this is specific for removing the formatting of a phone number, and not for the general removal of all special characters from a string function.

+15
Aug 11 2018-11-11T00:
source share

Simple function:

 CREATE FUNCTION [dbo].[RemoveAlphaCharacters](@InputString VARCHAR(1000)) RETURNS VARCHAR(1000) AS BEGIN WHILE PATINDEX('%[^0-9]%',@InputString)>0 SET @InputString = STUFF(@InputString,PATINDEX('%[^0-9]%',@InputString),1,'') RETURN @InputString END GO 
+9
Mar 20 '14 at 11:49
source share
 create function dbo.RemoveNonNumericChar(@str varchar(500)) returns varchar(500) begin declare @startingIndex int set @startingIndex=0 while 1=1 begin set @startingIndex= patindex('%[^0-9]%',@str) if @startingIndex <> 0 begin set @str = replace(@str,substring(@str,@startingIndex,1),'') end else break; end return @str end go select dbo.RemoveNonNumericChar('aisdfhoiqwei352345234@#$%^$@345345%^@#$^') 
+6
Nov 03 '10 at 12:54
source share

You can delete them in night mode, saving them in a separate field, and then perform an update for the changed records immediately before starting the process?

Or, when pasting / updating, save the "numeric" format to reference later. A trigger would be an easy way to do this.

+1
Sep 19 '08 at 22:44
source share

Working with varchars is significantly slow and inefficient compared to working with numbers for obvious reasons. The functions you refer to in the original post will indeed be rather slow as they scroll through each character in the string to determine if it is a number. Do this for thousands of records and the process will be slow. This is ideal for regular expressions, but they are not supported in SQL Server. You can add support using the CLR function, but it's hard to say how slowly this will happen, but I did not expect it to be much faster than the cycle of each character of each phone number, however!

Once you get the phone numbers formatted in your database so that they are only numbers, you can switch to the numeric type in SQL, which would provide a lightning fast comparison with other numeric types. You may find that depending on how fast your new data is, performing trimming and converting to a numerical value on the database side is fast enough as soon as you compare, it is formatted correctly, but if possible, it would be better for you to disable the utility record .NET imports that take care of these formatting issues before deleting the database.

In any case, you will have a big problem with additional formatting. Even if your numbers are guaranteed to be only North American in origin, some people will put 1 in front of the full phone number with the full code, while others will not, which will make it possible for multiple entries of the same phone number. In addition, depending on what your data represents, some people will use their home phone number, which may have several people living there, so the only restriction on it will be only one member of the database in the household. Some of them will use their work number and have the same problem, and some will or will not include an extension that will again cause artificial uniqueness.

All of this may or may not affect you, depending on your specific data and customs, but it is important to remember!

+1
Sep 19 '08 at 23:16
source share

I will try the Scott CLR function first, but add a WHERE clause to reduce the number of updated records.

 UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber) WHERE phonenumber like '%[^0-9]%' 

If you know that the vast majority of your entries have non-numeric characters, this may not help.

+1
Sep 19 '08 at 23:47
source share

I know it's late in the game, but here is the function I created for T-SQL that quickly removes non-numeric characters. Note that I have a "String" scheme in which I put the utility functions for strings in ...

 CREATE FUNCTION String.ComparablePhone( @string nvarchar(32) ) RETURNS bigint AS BEGIN DECLARE @out bigint; -- 1. table of unique characters to be kept DECLARE @keepers table ( chr nchar(1) not null primary key ); INSERT INTO @keepers ( chr ) VALUES (N'0'),(N'1'),(N'2'),(N'3'),(N'4'),(N'5'),(N'6'),(N'7'),(N'8'),(N'9'); -- 2. Identify the characters in the string to remove WITH found ( id, position ) AS ( SELECT ROW_NUMBER() OVER (ORDER BY (n1+n10) DESC), -- since we are using stuff, for the position to continue to be accurate, start from the greatest position and work towards the smallest (n1+n10) FROM (SELECT 0 AS n1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d1, (SELECT 0 AS n10 UNION SELECT 10 UNION SELECT 20 UNION SELECT 30) AS d10 WHERE (n1+n10) BETWEEN 1 AND len(@string) AND substring(@string, (n1+n10), 1) NOT IN (SELECT chr FROM @keepers) ) -- 3. Use stuff to snuff out the identified characters SELECT @string = stuff( @string, position, 1, '' ) FROM found ORDER BY id ASC; -- important to process the removals in order, see ROW_NUMBER() above -- 4. Try and convert the results to a bigint IF len(@string) = 0 RETURN NULL; -- an empty string converts to 0 RETURN convert(bigint,@string); END 

Then use it to compare to insert, something like this;

 INSERT INTO Contacts ( phone, first_name, last_name ) SELECT i.phone, i.first_name, i.last_name FROM Imported AS i LEFT JOIN Contacts AS c ON String.ComparablePhone(c.phone) = String.ComparablePhone(i.phone) WHERE c.phone IS NULL -- Exclude those that already exist 
+1
Apr 7 '10 at 21:30
source share

I would recommend using a strict format for phone numbers in the database. I am using the following format. (Assuming US phone numbers)

Database: 555555555555x555

Display: (555) 555-5555 ext. 555

Input: 10 digits or more digits embedded in any line. (Regex replacement removes all non-numeric characters)

0
Sep 19 '08 at 22:55
source share

"Although I can no longer isolate SQL as the source of the problem, I still feel that it is."

Launch SQL Profiler and take a look. Take the requests received and check their execution plans to make sure this index is used.

0
Sep 19 '08 at 23:54
source share

Thousands of records versus thousands of records are usually not a problem. I used SSIS to import millions of deduplicated entries like this.

I would clear the database to remove non-numeric characters first and save them.

0
Sep 20 '08 at 3:13
source share

Looking for a super simple solution:

 SUBSTRING([Phone], CHARINDEX('(', [Phone], 1)+1, 3) + SUBSTRING([Phone], CHARINDEX(')', [Phone], 1)+1, 3) + SUBSTRING([Phone], CHARINDEX('-', [Phone], 1)+1, 4) AS Phone 
0
Jun. 07 '11 at 15:08
source share

I would use the Inline function in terms of performance, see below: Note that characters are like "+", "-", etc. will not be deleted

 CREATE FUNCTION [dbo].[UDF_RemoveNumericStringsFromString] ( @str varchar(100) ) RETURNS TABLE AS RETURN WITH Tally (n) as ( -- 100 rows SELECT TOP (Len(@Str)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n) CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) ) SELECT OutStr = STUFF( (SELECT SUBSTRING(@Str, n,1) st FROM Tally WHERE ISNUMERIC(SUBSTRING(@Str, n,1)) = 1 FOR XML PATH(''),type).value('.', 'varchar(100)'),1,0,'') GO /*Use it*/ SELECT OutStr FROM dbo.UDF_RemoveNumericStringsFromString('fjkfhk759734977fwe9794t23') /*Result set 759734977979423 */ 

You can define it with over 100 characters ...

0
Oct. 16 '16 at 12:20
source share



All Articles