Skip to content

Conversation

@sinhrks
Copy link
Member

@sinhrks sinhrks commented Jul 22, 2016

  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

because factorize internally localize datetimetz, it raises when data contains DST boundary.

dti = pd.date_range('2016-11-06', freq='H', periods=5, tz='US/Eastern')
dti.factorize()
# AmbiguousTimeError: Cannot infer dst time from Timestamp('2016-11-06 01:00:00'), try using the 'ambiguous' argument

Skipped this localization to fix, also it improves perf.

dti = pd.date_range('2011-01-01', freq='H', periods=1000000, tz='Asia/Tokyo')
%timeit dti.factorize()
# on current master
#1 loop, best of 3: 475 ms per loop

# after this PR
#1 loop, best of 3: 262 ms per loop

asv:

   before     after       ratio
  [bb6b5e54] [a2c3370a]
-   22.46ms     9.97ms      0.44  timeseries.datetime_algorithm.time_dti_tz_factorize
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
@sinhrks sinhrks added Bug Performance Memory or execution speed performance Timezones Timezone data dtype labels Jul 22, 2016
@sinhrks sinhrks added this to the 0.19.0 milestone Jul 22, 2016
@codecov-io
Copy link

Current coverage is 84.56% (diff: 100%)

Merging #13750 into master will not change coverage

@@             master     #13750   diff @@
==========================================
  Files           141        141          
  Lines         51196      51196          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43296      43296          
  Misses         7900       7900          
  Partials          0          0          

Powered by Codecov. Last update bb6b5e5...a2c3370

@jreback
Copy link
Contributor

jreback commented Jul 22, 2016

lgtm.

I think this may solve some of these Ambiguous time issues: https://github.com/pydata/pandas/search?q=cannot+infer+dst&type=Issues&utf8=%E2%9C%93

maybe

@jreback jreback merged commit 42cc66d into pandas-dev:master Jul 24, 2016
@jreback
Copy link
Contributor

jreback commented Jul 24, 2016

thanks! see my comment above if you have some time.

@sinhrks
Copy link
Member Author

sinhrks commented Jul 24, 2016

yeah, i think we can remove most of tz_localize(None) logic from core.

@sinhrks sinhrks deleted the perf_factorize branch July 24, 2016 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Performance Memory or execution speed performance Timezones Timezone data dtype

3 participants