How to extract lines in sframe where there is a joint condition and two separate conditions?

I have sframeas such:

+---------+------+-------------------------------+-----------+------------------+
| term_id | lang |            term_str           | term_type | reliability_code |
+---------+------+-------------------------------+-----------+------------------+
| IATE-14 |  ro  |    Agenție de aprovizionare   |  fullForm |        3         |
| IATE-84 |  bg  |    ... |  fullForm |        3         |
| IATE-84 |  cs  | příslušnost členských stát... |  fullForm |        3         |
| IATE-84 |  da  |     medlemsstatskompetence    |  fullForm |        3         |
| IATE-84 |  de  | Zuständigkeit der Mitglied... |  fullForm |        3         |
| IATE-84 |  el  | αρμοδιότητα των κρατών μελ... |  fullForm |        3         |
| IATE-84 |  en  | competence of the Member S... |  fullForm |        3         |
| IATE-84 |  es  | competencias de los Estado... |  fullForm |        3         |
| IATE-84 |  et  |     liikmesriikide pädevus    |  fullForm |        3         |
| IATE-84 |  fi  |   jäsenvaltioiden toimivalta  |  fullForm |        3         |
| IATE-84 |  fr  |  compétence des États membres |  fullForm |        3         |
| IATE-84 |  ga  |    inniúlacht na mBallstát    |  fullForm |        3         |
| IATE-84 |  hu  |       tagállami hatáskör      |  fullForm |        3         |
| IATE-84 |  it  | competenza degli Stati membri |  fullForm |        3         |
| IATE-84 |  lt  |  valstybių narių kompetencija |  fullForm |        2         |
| IATE-84 |  lv  |     dalībvalstu kompetence    |  fullForm |        3         |
| IATE-84 |  nl  |  bevoegdheid van de lidstaten |  fullForm |        3         |
| IATE-84 |  pl  | kompetencje państw członko... |  fullForm |        3         |
| IATE-84 |  pt  | competência dos Estados-Me... |  fullForm |        3         |
| IATE-84 |  ro  | competența statelor membre... |  fullForm |        3         |
+---------+------+-------------------------------+-----------+------------------+

I need to extract all the lines where lang == 'de' or lang == 'en', but the lines that I am extracting with lang == 'en'must have the appropriate lang == 'de'one so that they have the same one term_id.

I did this as such with graphlaband sframe:

sf = gl.SFrame.read_csv('iate.csv', delimiter='\t', quote_char='\0', column_type_hints=[str,str,unicode,str,int])
de = sf[sf['lang'] == 'de']
de_termids = de['term_id']

and de.print_rows(10):

+------------+------+-------------------------------+-----------+------------------+
|  term_id   | lang |            term_str           | term_type | reliability_code |
+------------+------+-------------------------------+-----------+------------------+
|  IATE-84   |  de  | Zuständigkeit der Mitglied... |  fullForm |        3         |
|  IATE-290  |  de  | Schutz der öffentlichen Ge... |  fullForm |        3         |
|  IATE-662  |  de  | mengenmäßigen Ausfuhrbesch... |  fullForm |        3         |
|  IATE-801  |  de  |     Eintragungshindernisse    |  fullForm |        2         |
| IATE-1326  |  de  | Sonderregelung für Reisebü... |  fullForm |        4         |
| IATE-1702  |  de  |          Erwerbslose          |  fullForm |        4         |
| IATE-2818  |  de  |    Verwaltungsvorschriften    |  fullForm |        3         |
| IATE-21139 |  de  |    frisches Obst und Gemüse   |  fullForm |        3         |
| IATE-21563 |  de  | chemische Erzeugnisse zur ... |  fullForm |        3         |
| IATE-21564 |  de  |         Mineralsäuren         |  fullForm |        3         |
+------------+------+-------------------------------+-----------+------------------+

And then:

en = sf[sf['lang'] == 'en']
en.print_rows(10)

[output]:

+------------+------+-------------------------------+--------------+------------------+
|  term_id   | lang |            term_str           |  term_type   | reliability_code |
+------------+------+-------------------------------+--------------+------------------+
|  IATE-84   |  en  | competence of the Member S... |   fullForm   |        3         |
|  IATE-254  |  en  | award of public works cont... |   fullForm   |        3         |
|  IATE-290  |  en  |    public health protection   |   fullForm   |        3         |
|  IATE-662  |  en  | quantitative restriction o... |   fullForm   |        3         |
|  IATE-801  |  en  |      grounds for refusal      |   fullForm   |        2         |
| IATE-1299  |  en  |              CEP              | abbreviation |        3         |
| IATE-1326  |  en  | special scheme for travel ... |   fullForm   |        3         |
| IATE-2818  |  en  |          regulations          |   fullForm   |        3         |
| IATE-7128  |  en  |          company name         |   fullForm   |        2         |
| IATE-21139 |  en  |  fresh fruits and vegetables  |   fullForm   |        3         |
+------------+------+-------------------------------+--------------+------------------+

I tried:

en_de = en[en['term_id'] in de_termids]

But I misunderstand the syntax giving me this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-9656091794b8> in <module>()
      1 en = sf[sf['lang'] == 'en']
----> 2 en_de = en[en['term_id'] in de_termids]

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __contains__(self, item)
    691 
    692         """
--> 693         return (self == item).any()
    694 
    695     def contains(self, item):

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __eq__(self, other)
    973                 return SArray(_proxy = self.__proxy__.vector_operator(other.__proxy__, '=='))
    974             else:
--> 975                 return SArray(_proxy = self.__proxy__.left_scalar_operator(other, '=='))
    976 
    977 

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Array size mismatch

How to filter sframe so that I get lines with enand deand corresponding term_id?

The resulting data file should look something like this:

+---------+-----------------+-------------+
| term_id |     term_str_en | term_str_de | 
+---------+-------------------------------+
| IATE-999 |    something    |  etwas      |
...
+---------+-----------------+-------------+

How do I do the same with pandas?

+4
4

, : df_en df_de. merge :

new_df = pd.merge(df_en[['term_id','term_str']], df_de[['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))

inner . merge pandas docs

Edit

(df - , , ):

new_df = pd.merge(df.loc[df['lang']=='en',['term_id','term_str']], df.loc[df['lang']=='de',['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
+2

pandas, SFrame, de_termids en :

en.filter_by(de_termids, 'term_id')
+1

, term_id lang 'de' ('de', 'en') "en"? ,

, lang = 'de' 'en' term_id, 'en' lang

0

, , .

.

de_terms = sf [sf ['lang'] == 'de'].set_index ('term_id')

term_id . en, .

Usually, the assignment determines which indices are included, for example, if you want to include only records with en, if they also have de, but all are 'de', combine the en column into a de-frame; indices only in the en-frame will be discarded. or you can do filtering to get logical combinations no nan.

0
source

All Articles