Python - Efficient time series data extract
I have a problem in Python that I am not sure how to solve efficiently. I have a large set of time series data that is read in as generators. As of now, when I call yield, each data point is given to me one by one. This is fine when every time series has the same index, i.e. each one starts on the same date and ends on the same date. The problem arises when the time series do not share the same start date or the same end date.
What is the best implementation whereby, when I query, it returns the values as of a specific date? That way I would not have to worry about the start date, only the point in time.
I use pandas, but I have no clue how to implement this efficiently.
The code I use to import the CSV files:
    import os
    import pandas as pd

    def _open_convert_csv_files(self):
        comb_index = None
        for s in self.symbol_list:
            print(s)
            # Load the CSV file with no header information, indexed on date
            self.symbol_data[s] = pd.read_csv(
                os.path.join(self.csv_dir, '%s.csv' % s),
                header=0, index_col=0, parse_dates=True,
                names=['date', 'open', 'high', 'low', 'close', 'total volume']
            ).sort_index()

            # Combine the index to pad forward values
            if comb_index is None:
                comb_index = self.symbol_data[s].index
            else:
                # Index.union returns a new index, so the result must be assigned
                comb_index = comb_index.union(self.symbol_data[s].index)

            # Initialise the latest symbol_data as an empty list
            self.latest_symbol_data[s] = []
        print('')

        # Reindex the dataframes
        for s in self.symbol_list:
            self.symbol_data[s] = self.symbol_data[s].reindex(
                index=comb_index, method='pad'
            ).iterrows()
As you can see, self.symbol_data[s] works fine when the time series all have the same start date, but when they don't, it won't work, because during the simulation I loop through each symbol inside the data loop. In other words, I need to take into account the cross-sectional price data for each date of the iteration.
I would love to hear what others are doing to achieve this.
I understand that I can line the series up side by side so that the dates match and then loop row by row, but when I have 100k different securities, this is slow and heavy on memory. Besides, each CSV file holds not a single column but multiple columns...
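For illustration only, a lazier alternative to full side-by-side alignment might look like the sketch below. It assumes each symbol's rows arrive from a generator already sorted by date (for example `DataFrame.iterrows()` on a date-indexed frame), merges the streams with `heapq.merge`, and groups the merged rows by date, so only one pending row per symbol is held at a time. The names `tag`, `cross_sections`, `streams`, and the sample symbols and prices are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def tag(symbol, rows):
    """Yield (date, symbol, row) so the streams can be merged on date."""
    for date, row in rows:
        yield date, symbol, row


def cross_sections(streams):
    """Lazily yield (date, {symbol: row}) from per-symbol sorted streams.

    `streams` maps symbol -> iterable of (date, row) pairs sorted by date.
    """
    tagged = [tag(sym, rows) for sym, rows in streams.items()]
    # Only the date is compared, so rows never need to be orderable.
    merged = heapq.merge(*tagged, key=itemgetter(0))
    for date, group in groupby(merged, key=itemgetter(0)):
        yield date, {sym: row for _, sym, row in group}


# Hypothetical sample: two symbols with different start and end dates.
streams = {
    "AAA": iter([("19991118", 28.72), ("19991119", 26.35)]),
    "BBB": iter([("19991119", 10.00), ("19991122", 11.00)]),
}
result = list(cross_sections(streams))
```

A symbol that has no row on a given date is simply absent from that date's dictionary, so the caller decides whether to skip it or carry its last value forward.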
thanks,
date      open         high         low          close        total volume
19991118  29.69620186  32.63318885  26.10655108  28.71720619  685497
19991119  28.02375093  28.06454241  25.98417662  26.3513      166963
19991122  26.96317229  28.71720619  26.14734257  28.71720619  72092
19991123  27.73821052  28.47245727  26.10655108  26.10655108  65492
19991124  26.18813405  27.37108715  26.10655108  26.80000634  53081
19991126  26.67763189  27.08554675  26.59604891  26.88158932  18955
Let's start with this:
    pd.read_csv(file_path, parse_dates=True, index_col=0)

                     open       high        low      close  total volume
    date
    1999-11-18  29.696202  32.633189  26.106551  28.717206        685497
    1999-11-19  28.023751  28.064542  25.984177  26.351300        166963
    1999-11-22  26.963172  28.717206  26.147343  28.717206         72092
    1999-11-23  27.738211  28.472457  26.106551  26.106551         65492
    1999-11-24  26.188134  27.371087  26.106551  26.800006         53081
    1999-11-26  26.677632  27.085547  26.596049  26.881589         18955
How is this not sufficient for your needs?
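If the goal is a point-in-time query, a date-indexed frame like this may already be enough. As a minimal sketch, assuming one sorted, date-indexed DataFrame per symbol, `DataFrame.asof` returns the last known row at or before a given date, and NaN when the series has not started yet. The `snapshot` helper, the `make_frame` helper, the symbols, and the prices below are hypothetical; the frames are built by hand rather than read from CSV so the example is self-contained.

```python
import pandas as pd


def make_frame(dates, closes):
    """Build a small date-indexed frame (stand-in for a parsed CSV file)."""
    index = pd.to_datetime(dates, format="%Y%m%d")
    return pd.DataFrame({"close": closes}, index=index).sort_index()


# Hypothetical symbols with different start and end dates.
frames = {
    "A": make_frame(["19991118", "19991119", "19991122"], [28.72, 26.35, 28.72]),
    "B": make_frame(["19991119", "19991123"], [10.00, 11.00]),
}


def snapshot(frames, when):
    """Return one row per symbol: the last known values at or before `when`."""
    when = pd.Timestamp(when)
    rows = {sym: df.asof(when) for sym, df in frames.items()}
    # Each value is a Series of columns; transpose so symbols become the index.
    return pd.DataFrame(rows).T


snap = snapshot(frames, "1999-11-22")
early = snapshot(frames, "1999-11-18")
```

Here `snap` holds the latest close for both symbols as of 1999-11-22, while `early` has NaN for "B" because its data starts a day later. No shared index has to be built up front, so differing start and end dates need no special handling.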