Is there a way, using Scala, to filter a Spark DataFrame on a field that does not contain a given string?

I hope I'm missing something obvious and this will be easy.

I have a dataframe containing the columns 'url' and 'referrer'.

I want to extract all referrers that contain the domains "www.mydomain.com" and "www.mydomain.co".

I can use

 val filteredDf = unfilteredDf.filter($"referrer".contains("www.mydomain."))

However, this also pulls out URLs such as www.google.co.uk, whose referrer string happens to contain my domain (for example in a query parameter). Is there a way, using Scala, to additionally filter out anything containing "google" while keeping the correct results?

Thanks

Dean

2 answers

You can negate the predicate using either not or ! , so all that remains is to add another condition:

 import org.apache.spark.sql.functions.not

 df.where(
   $"referrer".contains("www.mydomain.") &&
   not($"referrer".contains("google"))
 )

or a separate filter:

 df
   .where($"referrer".contains("www.mydomain."))
   .where(!$"referrer".contains("google"))
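To see the combined predicate in isolation, the same logic can be checked on plain Scala strings. This is a minimal sketch; the sample referrer values are made up for illustration:

```scala
// Hypothetical sample referrers. The predicate mirrors the DataFrame
// filter above: keep values containing the domain but not "google".
val referrers = Seq(
  "http://www.mydomain.com/landing",
  "http://www.google.co.uk/search?q=www.mydomain.com",
  "http://www.mydomain.co/about"
)

val kept = referrers.filter(r =>
  r.contains("www.mydomain.") && !r.contains("google")
)
```

The Google referrer is dropped even though it contains "www.mydomain." in its query string, which is exactly the behaviour asked for.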

You can use a regular expression. There are good references on using regex in Scala, as well as tips on building a correct regular expression for URLs.

So in your case you will have something like:

 val pattern = "PUT_YOUR_REGEX_HERE"
 // something like (https?|ftp)://www\.mydomain\.com?(/[^\s]*)? should work

 // Column values can't be passed to Regex.findFirstIn directly;
 // use the Column's rlike method to apply the pattern instead.
 val filteredDf = unfilteredDf.filter($"referrer".rlike(pattern))

This solution requires a bit more work, but it is the most robust.
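Outside Spark, the same pattern can be exercised with Scala's own Regex API. A minimal sketch, assuming the suggested pattern above (the domain and sample URLs are illustrative):

```scala
// "com?" matches both "co" and "com", so .com and .co referrers pass.
// Anchoring on "(https?|ftp)://" means a domain that merely appears in
// a query string (as in the Google example) does not match.
val regex = """(https?|ftp)://www\.mydomain\.com?(/[^\s]*)?""".r

def matchesDomain(referrer: String): Boolean =
  regex.findFirstIn(referrer).isDefined
```

A quick check: `matchesDomain("http://www.google.co.uk/search?q=www.mydomain.com")` is false, because "www.mydomain.com" there is preceded by "q=" rather than a scheme.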

